Digital Preservation and Processing Procedure / Workflow

This procedure is designed to provide a consistent and scalable workflow for preparing digital files for long-term preservation, online description, and online distribution.

It also describes the standards we follow, including any modifications to those standards.

Standards and Formats

Bagit

Setup and instructions

https://github.com/CarletonArchives/BagBatch

We use Bagit to store files in groups, in a predictable structure that is both machine- and human-readable.  The structure of our bags is slightly different from the most basic bags discussed in the standard but still follows its guidelines.

    • [bag name]/
      • bag-info.txt
      • bagit.txt
      • manifest-md5.txt
      • data/
        • originals/
        • dips/
        • meta/

The "data/originals" directory stores all the original files, usually in the order in which we received them.  The "data/dips" directory stores access copies of files in the originals directory.  The structure of dips will usually be the same as that of originals.  The "data/meta" directory stores any metadata about the originals and dips.  This can include format analysis reports, extracted tag information, import templates for upload to our database, and original box and folder lists.

OAIS

http://en.wikipedia.org/wiki/OAIS

These are guidelines for digital preservation that are designed to be general and flexible enough to apply to a number of institution types and record types.  Specifically, we now use the concepts of DIPS, SIPS, and AIPS in our workflow and storage practices.

DROID - Digital Record Object Identification

Reports for each bag stored in: [bag name]/data/meta

Setup and instructions

https://github.com/CarletonArchives/DROID-Batch-Processor

Exif Metadata Extraction Tool

Reports for each bag stored in: [bag name]/data/meta

Setup and instructions

https://github.com/CarletonArchives/ExifTool-Batch-Processor

 

Arrangement

Directories of files will typically remain in their original order.  We can reorder directory contents but only when necessary.   

Workflow

Virus Scan

Everything in this folder is unprocessed and the network systems routinely perform virus scans on this directory without needing direct action by anyone in the Archives Department.  Leave any new records in this folder for 3-4 weeks before moving them on to later stages in the processing workflow.  This will allow 1) all the contents of the new records to be scanned for viruses and 2) the virus scanners to be updated with the latest virus definitions for any threats that may be present in the records.  

Disk Image and transfer 

We currently make logical copies of disks, as opposed to forensic copies, both to save space and to protect the privacy of records donors.  We would consider making forensic copies if there were a need to preserve all aspects of a set of files.

Appraisal and selection

Begin assessing the records for preservation concerns, weeding and rearrangement if necessary.  Record a brief description of these actions in our archival management system when you accession the records.

Accessions should also be set up as "proto-bags" containing the basic directory structure we use at the Carleton Archives following the Bagit standard, but without the documents containing the manifest, tag manifest and the Bagit version; those documents will be created later using Bagit.  The proto-bag should follow this basic structure:

    • BAGNAME/
      • data/
        • dips/
        • originals/
        • meta/

All original files from the accession will be placed in the directory named "originals."  All access copies will be placed in "dips."  All administrative files and technical reports that do not go into our collection management system will be put in the "meta" directory; these include format analysis reports, directory tree reports, extracted technical metadata, PREMIS reports, and extracted tagged metadata.  Note: we do not call the folder "metadata" because it causes errors when running Bagit.

From this point forward, we refer to the parent directory for this accession as a "bag."

To do this step in batches, run the ProtoBagger tool here:

https://github.com/CarletonArchives/Proto-Bagger
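
As a rough illustration of the proto-bag layout described above, the following minimal Python sketch creates the three sub-directories for one accession.  It is not the Proto-Bagger code; the function name make_proto_bag and its parameters are hypothetical and only show the directory structure.

```python
from pathlib import Path

# Sub-directories of data/ that every proto-bag contains, per the structure above.
PROTO_BAG_SUBDIRS = ("dips", "originals", "meta")


def make_proto_bag(parent: Path, bag_name: str) -> Path:
    """Create BAGNAME/data/{dips,originals,meta} under parent and return the bag path.

    Hypothetical helper for illustration; no manifest or bagit.txt is written,
    because those documents are created later by Bagit.
    """
    bag = parent / bag_name
    for sub in PROTO_BAG_SUBDIRS:
        # exist_ok=True makes the call safe to repeat on a partly built bag.
        (bag / "data" / sub).mkdir(parents=True, exist_ok=True)
    return bag
```

Calling make_proto_bag(Path("/some/working/directory"), "20120604_LairdStadium") would leave an empty proto-bag ready to receive the accession's original files.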

Normalize File Names

Normalize the file names for each accession using our File Name Cleaner here:

https://github.com/CarletonArchives/Filename-Cleaner

This tool changes any character not included in the list "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890_." to an underscore.  It also logs those changes in BAGNAME/data/meta/renames.txt.
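
The renaming rule can be sketched in a few lines of Python.  This is an illustration of the rule only, not the Filename-Cleaner source: the function names, the choice to rename directories as well as files, and the tab-separated log format are all assumptions.

```python
import re
from pathlib import Path

# Any character outside the allowed list quoted above is replaced.
DISALLOWED = re.compile(r"[^a-zA-Z0-9_.]")


def clean_name(name: str) -> str:
    """Replace every disallowed character in a single name with an underscore."""
    return DISALLOWED.sub("_", name)


def clean_tree(root: Path, log_path: Path) -> list:
    """Rename files and directories under root and append the changes to log_path.

    Returns a list of (old path, new path) pairs.  The log format (one
    tab-separated pair per line) is an assumption made for this sketch.
    """
    renames = []
    # Deepest paths first, so a directory is renamed only after its contents.
    for path in sorted(root.rglob("*"), key=lambda p: len(p.parts), reverse=True):
        new_name = clean_name(path.name)
        if new_name == path.name:
            continue
        target = path.with_name(new_name)
        if target.exists():
            # Two different names can clean to the same result; stop rather than overwrite.
            raise FileExistsError(f"{target} already exists")
        path.rename(target)
        renames.append((str(path), str(target)))
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "a", encoding="utf-8") as log:
        for old, new in renames:
            log.write(f"{old}\t{new}\n")
    return renames
```

For example, clean_name("Laird Stadium (1927).tif") returns "Laird_Stadium__1927_.tif": the two spaces and both parentheses each become one underscore.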

Accession, rename parent directory and sub-directory structure

Accession the files as you would a collection of physical records, using the Archives' procedures for creating an accession ID, date, and description.
If the files are not in a single parent directory, create one and put all files for the accession in that directory.
Name the parent directory according to these rules:

    • Date of the accession or receipt of the files following this pattern.
      • yyyymmdd
    • Underscore delimiter and index # as needed if multiple accessions are being processed the same day.
    • Underscore delimiter and brief human-readable description or the directory's original name
    • For example:
      • 20120604_LairdStadium
      • 20120604_2_LairdStadium
      • 20120604_3_metricFootballGame
      • 20110926_LairdStadiumRenovation
      • The first three are accessions that came in on the same day.  The second Laird Stadium accession has a "2" after the date to distinguish it from the first.
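
The naming rules above can be expressed as a small Python helper.  This is a sketch for illustration; the function name bag_name and its parameters are hypothetical and not part of any Archives tool.

```python
from datetime import date
from typing import Optional


def bag_name(received: date, description: str, index: Optional[int] = None) -> str:
    """Build a parent directory name of the form yyyymmdd[_index]_description.

    index is only supplied when several accessions are processed on the same day.
    """
    parts = [received.strftime("%Y%m%d")]
    if index is not None:
        parts.append(str(index))
    parts.append(description)
    return "_".join(parts)
```

With this helper, bag_name(date(2012, 6, 4), "LairdStadium") gives "20120604_LairdStadium", and passing index=2 gives "20120604_2_LairdStadium".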

The descriptive part of the directory name adds some uniqueness, but it exists mainly for human browsing, so that the processor can quickly keep track of a folder they are working on.
For the location of the digital accession in our archival management system, select the pre-defined network drive location and add the bag name to the value of the content field.  The row, section, and shelf fields will be blank.  The extent will be an approximation of the size of the directory in gigabytes.
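
One way to approximate the extent is to sum the sizes of all files in the bag.  The sketch below assumes decimal gigabytes (10^9 bytes); the function name directory_size_gb is hypothetical.

```python
from pathlib import Path


def directory_size_gb(root: Path) -> float:
    """Return the total size of all files under root in gigabytes (10**9 bytes)."""
    total_bytes = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return total_bytes / 1e9
```

Rounding the result, for example round(directory_size_gb(bag), 2), is usually enough for the extent field.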

The ProtoBagger tool listed in the "Appraisal and selection" section should have already created an accession record that can be imported into Archon.  You can import this data sheet or enter it by hand.

Collection or Collection Content Records

Most digital accessions should be recorded in the database in one of four ways:

  1. As an accession to be processed when possible.
  2. An accession record only, possibly associated with an existing collection.  
    1. This has been used for periodicals like the Carletonian that have issues accessible through a separate digital collection, but the original accession needs recording to manage its location, size, and additional contents.
  3. A new collection entirely with no detailed inventory.  
    1. Create a new collection record following the methods used for all other collections in the archives.  
    2. Use the Location Manager to record the name of any bag that belongs with this collection, their locations, and sizes.  
    3. If the bag being processed contains more than one collection's material, add as many sub-folders to the bag name as needed to indicate which portion of the bag is for the collection being edited.  Ex. 20171006_PresidentsOffice/data/originals/personal_papers and 20171006_PresidentsOffice/data/originals/Annual_reports could each be associated with their own collection but be stored in the same bag.  
    4. Note: you do not need to include detailed information about the digital nature or makeup of the bag.  Technical metadata will primarily be stored either in the bag itself or in the original accession record.
    5. Note, the collection ID does not need to match the name of the bag you are processing. 
  4. A part of a collection’s contents such as a series, folder, or item.  
    1. Do not store the bag names in the Location Manager for the collection, instead use the UnitID field for any collection content record that contains a digital file that is part of a bag.  
    2. The title of the UnitID field should be “UnitID” and the value should be a relative path that includes the name of the bag and any subfolders or file names needed to navigate to the item(s) being referred to in the collection content record. 
      1. For example, if I scan photos from a collection content record about the construction of Laird Stadium at Carleton I might store those scans in a bag named "20120221_Laird_construction" in sub-folder "/data/originals/photos."  The UnitID value, in this case, would be "20120221_Laird_construction/data/originals/photos" to refer to any photos from this set that I scanned.  I could also separately refer to scanned construction drawings from this collection content record by adding another UnitID field containing "20120221_Laird_construction/data/originals/drawings."  

If you would like digital content to be displayed online, create a corresponding record with the Digital Library Manager and associate it with the appropriate collection or collection content.

Extract tagged metadata

Extract tagged metadata with ExifTool.  See our ExifTool Batch Processor GitHub repository for instructions on setting up and running this tool.  Run ExifTool on the entire directory and write the results to an XML file named exif.xml, following the instructions listed in our documentation on ExifTool.

Place the resulting file in the "meta" directory.

Setup instructions, commands, and a batch process for ExifTool reports are available here:

https://github.com/CarletonArchives/ExifTool-Batch-Processor
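
For a single bag, the ExifTool step might look like the Python sketch below.  It is not taken from the ExifTool Batch Processor.  It assumes that exiftool is on the PATH and that the "originals" directory is the one being scanned; both are assumptions, as are the function names.

```python
import subprocess
from pathlib import Path


def exiftool_command(bag: Path) -> list:
    """Build the ExifTool command line for one bag.

    -X asks ExifTool for RDF/XML output and -r recurses into sub-directories.
    Scanning data/originals (rather than the whole bag) is an assumption.
    """
    return ["exiftool", "-X", "-r", str(bag / "data" / "originals")]


def write_exif_report(bag: Path) -> int:
    """Run ExifTool and write its output to BAGNAME/data/meta/exif.xml.

    ExifTool can return a non-zero status when individual files cannot be
    read, so the status is returned to the caller rather than raised.
    """
    report = bag / "data" / "meta" / "exif.xml"
    with open(report, "wb") as out:
        result = subprocess.run(exiftool_command(bag), stdout=out)
    return result.returncode
```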

Format validation 

Create a format validation report using DROID and put all resulting files in the "meta" folder of the bag.  

Setup instructions, commands, and a batch process for DROID can be found here:

https://github.com/CarletonArchives/DROID-Batch-Processor

Run DROID on the entire directory.  When finished, view and export the results of the comprehensive report as a PDF named droid.pdf and an XML file named droid.xml.

  • The PDF is a high-level, human-readable report on the kinds of files in the record set.  Information in the report includes numbers of files by type, any files that are unreadable, and files that do not match any format profiles on record.  It can help to assess any special preservation or processing needs for these records.
  • The XML version of this report is a machine-readable version of the same information in the above comprehensive PDF report.  

Migrate Files

Using the DROID reports, determine if any of the files need migration to preservation formats and what to do with the original formats if migration is necessary.  Refer to the Archives Conceptual Framework for Digital Preservation for guidance on the disposition of original file formats after migration as well as when we migrate files and to what format.   

Store migrated files in the same location as their original copy in the "originals" directory.  

If the original files need to be maintained after migration and have a unique name, leave them in the directory they came in.

If you have to maintain the originals and the file name of the migrated version is the same as the original, you can either put the migrated file into a new sub-directory of its original location or give the migrated file a similar but unique name.
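
The "similar but unique name" option can be sketched as follows.  The "_migrated" marker and the function name migrated_path are hypothetical choices made for this illustration, not an Archives convention.

```python
from pathlib import Path


def migrated_path(original: Path, new_suffix: str, marker: str = "_migrated") -> Path:
    """Return a path for the migrated copy that does not clash with an existing file.

    The migrated file normally keeps the original's name with the new
    extension.  If that name is already taken, a marker (and then a counter)
    is added to the stem until the name is unique.
    """
    candidate = original.with_suffix(new_suffix)
    counter = 1
    while candidate.exists():
        tag = marker if counter == 1 else f"{marker}{counter}"
        candidate = original.with_name(f"{original.stem}{tag}{new_suffix}")
        counter += 1
    return candidate
```

For a file report.doc migrated to PDF this returns report.pdf in the same directory; if a file is migrated to a format with the same extension, such as scan.tif, it returns scan_migrated.tif.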

Create DIPS

Create your Dissemination Information Packages (DIPS), or web copies, of files that will be available online.  These files will be reduced in size and are often only a sample of the full set.  Examples include lower-resolution JPG versions of original TIFFs, smaller MP3 copies of original WAVE files, and smaller MPEG versions of large video files.

    • For instance, a set of digital photographs may include 500 images from the same event.  You may only make small jpg web copies for 5-10 of the best photos if many of the photos are of the same scene or were taken in rapid succession.  

If possible, have the directory structure of the "dips" directory match the directory structure of "originals" as closely as possible.
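
Mirroring the "originals" structure in "dips" amounts to swapping one path component and the file extension.  The following minimal sketch shows the mapping; the function name dip_path is hypothetical.

```python
from pathlib import Path


def dip_path(original: Path, bag: Path, dip_suffix: str) -> Path:
    """Map a file under BAGNAME/data/originals to its place under BAGNAME/data/dips.

    The sub-directory structure is kept and only the extension changes,
    for example .tif to .jpg.
    """
    relative = original.relative_to(bag / "data" / "originals")
    return (bag / "data" / "dips" / relative).with_suffix(dip_suffix)
```

For instance, a file at BAGNAME/data/originals/photos/img1.tif with dip_suffix ".jpg" maps to BAGNAME/data/dips/photos/img1.jpg.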

Regardless of original format, make DIPS using the following formats.  

    • Images
      • JPG
      • >100 KB-500 KB/photo
    • Documents
      • PDF
      • >20 MB/file
    • Audio
      • MP3
      • >100 MB/file
      • bitrate about 96-128 kbps
    • Video
      • MPEG-4 (.m4v).  Note: JWPlayer is the media player for our instance of Archon.  It does not currently support MP4 files in some browsers.
      • >100 MB/file
      • bitrate under about 300 kbps is preferable.  Maximum bitrate 700 kbps.
      • Recommended HandBrake settings 12/17/
        • Preset = ipad
        • Format = mp4, or m4v
        • Video codec = H.264
        • Video quality = RF 28
        • Large File size = off
        • Web Optimized = on
        • Audio = if speech only set to 96 otherwise leave default.

An application for creating DIPS in a bag is available here, but it is still in the development stage.

https://github.com/CarletonArchives/ContentConverter.py-in-development

Bagit

You will now create a completed bag of the contents of the entire directory using the Bagit standard.

Setup instructions, commands, and a batch process for Bagit can be found here:

https://github.com/CarletonArchives/BagBatch
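
To show what the manifest in a finished bag contains, here is a minimal Python sketch that writes manifest-md5.txt for an existing proto-bag using only the standard library.  It is not the BagBatch code and does not produce a complete bag: bagit.txt and bag-info.txt are still needed.  The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the MD5 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag: Path) -> Path:
    """Write BAGNAME/manifest-md5.txt listing every file under BAGNAME/data.

    Each line holds a checksum followed by the file path relative to the
    bag directory, written with forward slashes.
    """
    manifest = bag / "manifest-md5.txt"
    payload = sorted(p for p in (bag / "data").rglob("*") if p.is_file())
    with open(manifest, "w", encoding="utf-8", newline="\n") as out:
        for path in payload:
            out.write(f"{md5_of(path)}  {path.relative_to(bag).as_posix()}\n")
    return manifest
```

Recomputing the checksums later and comparing them with the manifest is how a bag's contents are verified for long-term storage.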

Upload DIPS

If adding these files to Archon, upload the bag and its "dips" folder to the web server.  We currently use FileZilla for this step.  Do not include the "originals" directory, the "meta" directory, or any of the text files created by Bagit.  For instance, a bag on the web server could look like:

    • 20120828_LairdRenovation/
      • data/
        • dips/
          • photos/
            • file1
            • file2
          • drawings/
            • file3
          • correspondence/
            • file4
            • file5

Change the permissions for the bag and all its contents to 755.
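
If you prefer to prepare the web copy locally before uploading, the selection and permission steps can be sketched as below.  This is only an illustration: the staging directory and the function name stage_web_copy are hypothetical, and the upload itself is still done separately.

```python
import os
import shutil
from pathlib import Path


def stage_web_copy(bag: Path, staging_root: Path) -> Path:
    """Copy only BAGNAME/data/dips into staging_root and set permissions to 755.

    The "originals" and "meta" directories and the Bagit text files are
    deliberately left out of the copy.
    """
    web_bag = staging_root / bag.name
    shutil.copytree(bag / "data" / "dips", web_bag / "data" / "dips")
    # Apply 755 to the bag directory and everything inside it.
    for path in [web_bag, *web_bag.rglob("*")]:
        os.chmod(path, 0o755)
    return web_bag
```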

Create digital library records for each set that you want represented in Archon and associate each digital library record with its corresponding collection or collection content record.  This can be done in batches with the following MySQL script, replacing BAGNAME1 and BAGNAME2 with the names of the bags being processed.

    INSERT INTO tblDigitalLibrary_DigitalContent (
      SELECT
        "" AS ID,
        "1" AS Browsable,
        tblCollections_Content.Title,
        tblCollections_Content.CollectionID,
        tblCollections_Content.ID AS CollectionContentID,
        tblCollections_UserFields.Value AS Identifier,
        tblCollections_Content.Description AS Scope,
        NULL AS PhysicalDesc,
        tblCollections_Content.Date,
        NULL AS Publisher,
        NULL AS Contributor,
        NULL AS RightsStatement,
        REPLACE(
          CONCAT('https://archivedb.carleton.edu/files/',
                 REPLACE(tblCollections_UserFields.Value, '.tif', '.jpg')),
          '/originals', '/dips') AS ContentURL,
        "0" AS Hyperlink,
        CURRENT_TIMESTAMP AS dateadded
      FROM `tblCollections_Content`
      LEFT JOIN tblCollections_UserFields
        ON tblCollections_Content.ID = tblCollections_UserFields.ContentID
      WHERE tblCollections_UserFields.Title LIKE '%UnitID%'
        AND (Value LIKE 'BAGNAME1%' OR Value LIKE 'BAGNAME2%')
    )
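
The ContentURL expression in the script is easier to follow in isolation.  The Python sketch below mirrors its two nested REPLACE calls so you can preview the URL a given UnitID value will produce; the function name content_url is hypothetical.

```python
BASE_URL = "https://archivedb.carleton.edu/files/"


def content_url(unit_id: str) -> str:
    """Mirror the script's ContentURL expression for one UnitID value.

    Inner step: '.tif' becomes '.jpg' in the UnitID value.
    Outer step: '/originals' becomes '/dips' in the full URL.
    """
    return (BASE_URL + unit_id.replace(".tif", ".jpg")).replace("/originals", "/dips")
```

For the UnitID "20120221_Laird_construction/data/originals/photos", this yields "https://archivedb.carleton.edu/files/20120221_Laird_construction/data/dips/photos".  Note that only the ".tif" extension is rewritten, so other original formats keep their extension in the URL.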

Run the digital content sync utility at "Archon/Database Manager/Synchrony II/Sync Digital Content."  This tool will look at all the ContentURL values in digital content records, go to that location on the server, index any files it finds at that location, create records for those files in the tblDigitalLibrary_Files table, and associate those files with the correct digital library record.

Store AIPS

Add the entire bag to the AIPS directory on our server for long-term storage.  
