
This procedure establishes a consistent, scalable workflow for preparing digital files for long-term preservation, online description, and online distribution.

I will also describe the standards we follow, including any modifications to those standards.

Standards and Formats


Setup and instructions

Bagit is used for storing files in groups, in a predictable structure that is both machine- and human-readable.  The structure of our bags is slightly different from the most basic bags discussed in the standard but still follows its guidelines.  

    • [bag name]/
      • bag-info.txt
      • bagit.txt
      • manifest-md5.txt
      • data/
        • originals/
        • dips/
        • meta/

The "data/originals" directory is for storing all the original files, usually in the order in which we received them.  The "data/dips" directory is for storing access copies of files in the originals directory.  The structure of dips will usually be the same as that of the originals.  The "data/meta" directory is for storing any metadata about the originals and dips.  This can include format analysis reports, extracted tag information, import templates for upload to our database, and original box and folder lists.  


These are guidelines for digital preservation that are designed to be general and flexible enough to apply to a number of institution types and record types.  Specifically, we now use the concepts of DIPS, SIPS, and AIPS in our workflow and storage practices.

DROID - Digital Record Object Identification

Reports for each bag stored in: [bag name]/data/meta

Setup and instructions

Exif Metadata Extraction Tool

Reports for each bag stored in: [bag name]/data/meta

Setup and instructions


Directories of files will typically remain in their original order.  We can reorder directory contents but only when necessary. 

If a directory covers many topics and you decide they need multiple records to describe the different topics, group the directory's contents into new sub-folders and catalog at that level. For example: 

  • 20151008_SportsPhotos/data/originals/127 is a ripped compact disk containing 150 unsorted photos of multiple sports, multiple teams, and multiple events
  • You want to create separate records for each of the different events and for the different teams.
  • Make 1 new sub-folder for each event or team group that you want to catalog, sort the photos into those new sub-folders and make a catalog record for each sub-folder. 
  • The final folder arrangement might look like 
    • 20151008_SportsPhotos/data/originals/127/mencrosscountry
    • 20151008_SportsPhotos/data/originals/127/womensoccer
    • 20151008_SportsPhotos/data/originals/127/menswimming
    • 20151008_SportsPhotos/data/originals/127/volleyball
  • Include the new relative paths above for the sub-folder being cataloged with the respective catalog record in the UnitID field.

Preservation Workflow

Files do not need to be processed in this order, but most of these steps should be applied to all bags.

Virus Scan

Everything in this folder is unprocessed and the network systems routinely perform virus scans on this directory without needing direct action by anyone in the Archives Department.  Leave any new records in this folder for 3-4 weeks before moving them on to later stages in the processing workflow.  This will allow 1) all the contents of the new records to be scanned for viruses and 2) the virus scanners to be updated with the latest virus definitions for any threats that may be present in the records.  

Disk Image and transfer 

We currently make logical copies of disks as opposed to forensic copies both to save space and to protect the privacy of records donors.  We would consider making forensic copies if there was a need to preserve all aspects of a set of files.  

Accession, rename parent directory and sub-directory structure

Accession the files as you would a collection of physical records, using the archives' procedures for creating an accession ID, date, and description.  
If the files are not in a single parent directory, create one and put all files for the accession in that directory.
Name the parent directory according to these rules. 

    • Date of the accession or receipt of the files following this pattern.
      • yyyymmdd
    • Underscore delimiter and index # as needed if multiple accessions are being processed the same day.
    • Underscore delimiter and a brief human-readable description or the directory's original name
    • For example:
      • 20120604_LairdStadium
      • 20120604_2_LairdStadium
      • 20120604_3_metricFootballGame
      • 20110926_LairdStadiumRenovation
      • The first 3 are accessions that came in on the same day.  The second Laird Stadium accession has a "2" after the date to make it distinct from the first.

The descriptive part of the directory name adds some uniqueness, but its main purpose is human browsing: it makes it easier for the processor to keep track of the folder they are working on.
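The naming rules above can be illustrated with a minimal Python sketch. The function name `accession_dir_name` and its parameters are hypothetical and are not part of any tool referenced in this procedure; the sketch only assembles the date, optional index, and description in the order described.

```python
from datetime import date
from typing import Optional


def accession_dir_name(received: date, description: str,
                       index: Optional[int] = None) -> str:
    """Build a parent directory name: yyyymmdd[_index]_description.

    Hypothetical helper for illustration only.
    """
    parts = [received.strftime("%Y%m%d")]
    if index is not None:
        # The index is only needed when several accessions share a date.
        parts.append(str(index))
    parts.append(description)
    return "_".join(parts)


if __name__ == "__main__":
    print(accession_dir_name(date(2012, 6, 4), "LairdStadium"))
    print(accession_dir_name(date(2012, 6, 4), "LairdStadium", index=2))
```

The description is passed through unchanged here; any character clean-up is left to the file name normalization step later in the workflow.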

For the location of the digital accession in our archival management system, select the pre-defined network drive location and add the bag name to the content field. The row, section, and shelf fields will be blank.  The extent will be an approximation of the size of the directory in gigabytes.  

The ProtoBagger tool listed in the Appraisal and Selection section can be used to create an accession record that can be imported into Archon.  You can import this data sheet or enter the record by hand.

Appraisal and selection

Begin assessing the records for preservation concerns, weeding and rearrangement if necessary.  Record a brief description of these actions in our archival management system when you accession the records.

Accessions should also be set up as "proto-bags" containing the basic directory structure we use at the Carleton Archives following the Bagit standard, but without the documents containing the manifest, tag manifest and the Bagit version; those documents will be created later using Bagit.  The proto-bag should follow this basic structure:

    • BAGNAME/
      • data/
        • dips/
        • originals/
        • meta/

All original files from the accession will be placed in the directory named "originals."  All access copies will be placed in "dips."  All administrative files and technical reports that do not go into our collection management system, such as format analysis reports, directory tree reports, extracted technical metadata, PREMIS reports, and extracted tagged metadata, will be put in the "meta" directory.  Note: we do not call the folder "metadata" because it causes errors when running Bagit.   

From this point forward we will refer to the parent directory for this accession as a "bag."

To do this step in batches, run the ProtoBagger tool here
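For a single accession, the proto-bag layout described above could be created by hand or with a few lines of Python. The following is a minimal sketch and is not the ProtoBagger tool; the function name `make_proto_bag` is hypothetical.

```python
from pathlib import Path

# Sub-directories of data/ used in the proto-bag structure.
PROTO_BAG_SUBDIRS = ("originals", "dips", "meta")


def make_proto_bag(parent: Path, bag_name: str) -> Path:
    """Create BAGNAME/data/{originals,dips,meta} under parent.

    Hypothetical helper; existing directories are left untouched.
    """
    bag = parent / bag_name
    for sub in PROTO_BAG_SUBDIRS:
        (bag / "data" / sub).mkdir(parents=True, exist_ok=True)
    return bag
```

Calling `make_proto_bag(Path("."), "20120604_LairdStadium")` would create the three empty directories; the original files still have to be moved into "originals" separately.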

Normalize File Names

Normalize the file names for each accession using our File Name Cleaner here:

This tool will change any character not included in this list "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890_." to an underscore.  It will also make a log of those changes in BAGNAME/data/meta/renames.txt
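The replacement rule can be sketched in Python as follows. This is an illustration of the rule only, not the File Name Cleaner itself: the function names, the log line format, the choice to rename directories as well as files, and the choice to skip a rename when the cleaned name already exists are all assumptions.

```python
import os
import re
from pathlib import Path

# Anything outside a-z, A-Z, 0-9, underscore, and period is replaced.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_.]")


def clean_name(name: str) -> str:
    """Replace every disallowed character with an underscore."""
    return _DISALLOWED.sub("_", name)


def normalize_tree(root: Path, log_path: Path) -> int:
    """Rename entries under root and append each change to log_path.

    Returns the number of renames. Hypothetical helper for illustration.
    """
    renamed = 0
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as log:
        # Walk bottom-up so a directory is renamed only after its contents.
        for dirpath, dirnames, filenames in os.walk(root, topdown=False):
            for name in dirnames + filenames:
                cleaned = clean_name(name)
                if cleaned == name:
                    continue
                source = Path(dirpath) / name
                target = Path(dirpath) / cleaned
                if target.exists():
                    # Assumption: never overwrite; record the conflict instead.
                    log.write(f"SKIPPED (target exists): {source}\n")
                    continue
                source.rename(target)
                log.write(f"{source} -> {target}\n")
                renamed += 1
    return renamed
```

For example, `clean_name("My Photo (1).JPG")` returns `"My_Photo__1_.JPG"`.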

Collection or Collection Content Records

Most digital accessions should be recorded in the database in 1 of 4 ways.  

  1. As an accession to be processed when possible.
  2. An accession record only, possibly associated with an existing collection.  
    1. This has been used for periodicals like the Carletonian that have issues accessible through a separate digital collection, but the original accession needs recording to manage its location, size, and additional contents.
  3. A new collection entirely with no detailed inventory.  
    1. Create a new collection record following the methods used for all other collections in the archives.  
    2. Use the Location Manager to record the name of any bag that belongs with this collection, their locations, and sizes.  
    3. If the bag being processed contains more than one collection's material, add as many sub-folders to the bag name as needed to indicate which portion of the bag is for the collection being edited.  Ex. 20171006_PresidentsOffice/data/originals/personal_papers and 20171006_PresidentsOffice/data/originals/Annual_reports could each be associated with their own collection but be stored in the same bag.  
    4. Note, you do not need to include detailed information about the digital nature or makeup of the bag and technical metadata will primarily be stored either in the bag itself or in the original accession record.  
    5. Note, the collection ID does not need to match the name of the bag you are processing. 
  4. Using the UnitID field to inventory it as part of a collection’s contents such as a series, folder, or item.  
    1. Do not store the bag names in the Location Manager for the collection, instead use the UnitID field for any collection content record that contains a digital file that is part of a bag.  
    2. The title of the UnitID field should be “UnitID” and the value should be a relative path that includes the name of the bag and any subfolders or file names needed to navigate to the item(s) being referred to in the collection content record. 
      1. For example, if I scan photos from a collection content record about the construction of Laird Stadium at Carleton I might store those scans in a bag named "20120221_Laird_construction" in sub-folder "/data/originals/photos."  The UnitID value, in this case, would be "20120221_Laird_construction/data/originals/photos" to refer to any photos from this set that I scanned.  I could also separately refer to scanned construction drawings from this collection content record by adding another UnitID field containing "20120221_Laird_construction/data/originals/drawings."  

Cataloging at the series, folder, or item level

When adding digital content to the database, there are a few typical scenarios you will encounter.

  • Digital file is a digital surrogate for a physical item in our collections
    • When adding scans of items we hold in the archives, avoid creating a new intellectual record in the database for the digital surrogates.  Instead use the existing metadata record for the original and add a location for the surrogate in the UnitID field.  The location value should contain a relative path, not an absolute path, to the item or folder described in the metadata record, starting with the container directory, referred to as a "bag" following the Bagit standard.
      • For example: Scans of Series 3 - Folder .001: May Fete, 1911/12 would have a UnitID field with a value of "20161220_mayfete/data/originals/1912", indicating there are digital versions stored on the server in a bag called 20161220_mayfete, in sub-folder data/originals/1912. 

  • Digital file is an image with no physical copy in the archives, such as born digital items or our only copy is digital.
    • Digital items can be filed as siblings of non-digital files or in collections of strictly digital items.  Use the Level/Container "Digital Items."  The item numbers are mostly arbitrary, but usually begin with "Digital Items 1" unless there is a reason for displaying items out of sequence.  If the digital records are entered as siblings of non-digital items, place the digital records after the non-digital records. For example:

      • Series 4: Circa 2000 to circa 2010

        • Folder 2005/06.001 Commencement, 2005/06

        • Folder 2005/06.002 Reunion, 2005/06

        • Folder 2005/06.003 Mid-winter Ball, 2005/06

        • Digital Items 2005/06.1 Hmong Student Coalition, 2005/06

        • Digital Items 2005/06.2 Mid-winter Ball, 2005/06 (for copies that only exist in digital form, possibly coming from a later donation)

    • When adding digital files, always add a UnitID value as described in the previous section, "Digital file is a digital surrogate for a physical item in our collections".
    • NOTE: Sets of photographs that arrive in the Archives with both physical and digital versions will be incorporated into Series 3 in the same way as other physical prints, with links to their digital versions. Part of the reason for creating the separate Level/Container "Digital Items" is to distinguish between folders that do and folders that do not have a physical presence in the collection. If digital and print versions of the same set of photos exist, their folder will have a physical presence and will therefore not need its Level/Container set to "Digital Items."
  • Incorporating digital items at the collection level only.

    • If digital records are being added to a collection that does not have collection content records, such as many PB collections, add information on the digital records to an accession record and associate the accession with the appropriate collection. 

    • Accessions of digital records are organized the same way as those for physical records with a few differences. 

      • The Received Extent, Unprocessed Extent and Location Extent should be in gigabytes rounded to the nearest 0.01 GB.
      • The Location Information/Location field should be the network storage volume where the master copies are stored. As of 2022, there is only 1 storage location option - Archives Network Storage 1.  Additional storage locations can be defined in the Collections/Location Manager.
      • The Location Information/Content field should contain the bag names for all bags associated with the accession. 1 entry/bag.
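The extent values can be estimated from the bag on disk. A minimal Python sketch follows; the function name `directory_size_gb` is hypothetical, and it assumes 1 gigabyte = 1,000,000,000 bytes, which this procedure does not specify.

```python
import os
from pathlib import Path

BYTES_PER_GB = 1_000_000_000  # assumption: decimal gigabytes


def directory_size_gb(root: Path) -> float:
    """Total size of all files under root, in GB rounded to 0.01."""
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total_bytes += (Path(dirpath) / name).stat().st_size
    return round(total_bytes / BYTES_PER_GB, 2)
```

Very small accessions will round to 0.0 with this approach, so the value may need to be adjusted by hand before it is entered in the accession record.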

Extract tagged metadata

Extract tagged metadata with ExifTool.  See our ExifTool Batch Processor GitHub repository for instructions on setting up and running this tool.  Run ExifTool on the entire directory and write the results to an XML file named exif.xml, following the instructions in our ExifTool documentation.  

Place the resulting file in the "meta" directory

Set up instructions, commands and a batch process for ExifTool reports are available here
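As a rough illustration of this step, the sketch below calls ExifTool from Python with its `-X` (XML output) and `-r` (recurse into sub-directories) options and writes the result to exif.xml in the "meta" directory. It is not our batch processor: the function names are hypothetical, it assumes `exiftool` is on the system path, and it assumes the "originals" directory is the one being scanned.

```python
import subprocess
from pathlib import Path
from typing import List


def exiftool_command(target: Path) -> List[str]:
    """Command line for a recursive XML report on target."""
    # -X: write RDF/XML output; -r: process sub-directories recursively.
    return ["exiftool", "-X", "-r", str(target)]


def write_exif_report(bag: Path) -> Path:
    """Run ExifTool on BAG/data/originals and save BAG/data/meta/exif.xml.

    Hypothetical helper; requires ExifTool to be installed.
    """
    report = bag / "data" / "meta" / "exif.xml"
    report.parent.mkdir(parents=True, exist_ok=True)
    with report.open("wb") as out:
        subprocess.run(exiftool_command(bag / "data" / "originals"),
                       stdout=out, check=True)
    return report
```

Keeping the command construction in its own function makes it easy to check or adjust the options before anything is run.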

Format validation 

Create a format validation report using DROID and put all resulting files in the "meta" folder of the bag.  

Set up instructions, commands and a batch process for DROID can be found here:

Run DROID on the entire directory.  When finished, view and export the results of the comprehensive report as a PDF named droid.pdf and an XML file named droid.xml.

  • The PDF is a high-level, human-readable report on the kinds of files in the record set.  Information in the report includes numbers of files by type, any files that are unreadable, and files that do not match any format profiles on record.  It can help to assess any special preservation or processing needs for these records.  
  • The XML version of this report is a machine-readable version of the same information in the above comprehensive PDF report.  

Migrate Files

Using the DROID reports, determine if any of the files need migration to preservation formats and what to do with the original formats if migration is necessary.  Refer to the Archives Conceptual Framework for Digital Preservation for guidance on the disposition of original file formats after migration as well as when we migrate files and to what format.   

Store migrated files in the same location as their original copy in the "originals" directory.  

If the original files need to be maintained after migration and have unique names, leave them in the directory they came in.  

If you have to maintain the originals and the migrated file would have the same name as its original, either put the migrated file in a new sub-directory of the original's location or give the migrated file a similar but unique name.

Create DIPS

Create your Dissemination Information Packages (DIPS), or web copies, of files that will be available online.  These files are reduced in size and are often only a sample of the full set.  Examples include lower-resolution JPG versions of original TIFFs, smaller MP3 copies of original WAVE files, and smaller MPEG versions of large video files.  

    • For instance, a set of digital photographs may include 500 images from the same event.  You may only make small jpg web copies for 5-10 of the best photos if many of the photos are of the same scene or were taken in rapid succession.  

If possible, have the directory structure of the "dips" directory match the directory structure of "originals" as closely as possible. 
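One way to keep the two structures aligned is to recreate the sub-directories of "originals" inside "dips" before any access copies are made. The following Python sketch does only that; the function name `mirror_directories` is hypothetical, and no files are copied or converted.

```python
import os
from pathlib import Path
from typing import List


def mirror_directories(originals: Path, dips: Path) -> List[Path]:
    """Recreate the directory tree of originals (without files) under dips.

    Hypothetical helper; returns the directories it ensured exist.
    """
    ensured = []
    for dirpath, _dirnames, _filenames in os.walk(originals):
        relative = Path(dirpath).relative_to(originals)
        target = dips / relative
        target.mkdir(parents=True, exist_ok=True)
        ensured.append(target)
    return ensured
```

Because only a sample of the originals usually receives a web copy, some mirrored directories may stay empty and can be removed afterwards.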

Regardless of original format, make DIPS using the following formats.  

    • Images
      • JPG
      • >100K-500K/photo
    • Documents
      • PDF 
      • >20MB/file
    • Audio 
      • MP3
      • >100MB/file
      • bitrate about 96-128 kbps
    • Video
      • MPEG-4 (.m4v).  Note: JWPlayer is the media player for our instance of Archon.  It does not currently support mp4's in some browsers.
      • >100MB/file
      • bitrate under about 300 is preferable.  Max bitrate 700.   
      • Recommended HandBrake settings as of 9/27/2019 
        • Preset = Very fast 720p30 or 480p30
        • Format = mp4, or m4v
        • Video codec = H.264
        • Video quality = RF 22-28
        • Large File size = off
        • Web Optimized = on
        • Audio = if speech only set to 96 otherwise leave default.

An application for creating DIPS in a bag is available here, but it is still in the development stage.


You will now create a completed bag of the contents of the entire directory using the Bagit standard.

Set up instructions, commands and a batch process for Bagit can be found here
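To show what the manifest-md5.txt file in a completed bag contains, the Python sketch below computes an MD5 checksum for every file under data/ and formats one line per file as checksum followed by the path relative to the bag. It is an illustration only and does not replace Bagit, which also writes bagit.txt and bag-info.txt; the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import List


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(bag: Path) -> List[str]:
    """One 'checksum  relative/path' line per file under BAG/data."""
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            relative = path.relative_to(bag).as_posix()
            lines.append(f"{md5_of(path)}  {relative}")
    return lines
```

Comparing freshly computed lines with a stored manifest is a simple way to spot files that have changed or gone missing since the bag was created.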

Upload DIPS

If adding these files to Archon, upload the bag and its "dips" folder to the web server.  We currently use FileZilla for this step.  Do not include the "originals" directory, the "meta" directory or any of the text files created by Bagit.  For instance, a bag on the web server could look like:

    • 20120828_LairdRenovation/
      • data/
        • dips/
          • photos/
            • file1
            • file2
          • drawings/
            • file3
          • correspondence/
            • file4
            • file5
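Before uploading, it can help to build a local copy of the bag that contains only the "dips" tree, so that nothing else is transferred by mistake. The Python sketch below does this; the function name `stage_dips_for_upload` and the idea of a separate staging directory are assumptions, not part of our documented procedure.

```python
import shutil
from pathlib import Path


def stage_dips_for_upload(bag: Path, staging_root: Path) -> Path:
    """Copy only BAG/data/dips into staging_root/BAGNAME/data/dips.

    Hypothetical helper; fails if the staged copy already exists.
    """
    source = bag / "data" / "dips"
    staged_bag = staging_root / bag.name
    # copytree creates the missing parent directories of the target.
    shutil.copytree(source, staged_bag / "data" / "dips")
    return staged_bag
```

The staged directory can then be uploaded as a whole with the file transfer client.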

Change the permissions for the bag and all its contents to 755.

Create digital library records for each set that you want represented in Archon and associate these digital library records with their corresponding collection or collection content records.  This can be done in batches with the following MySQL script. 

    INSERT INTO tblDigitalLibrary_DigitalContent (
      SELECT
        "" AS ID,
        "1" AS Browsable,
        tblCollections_Content.Title,
        tblCollections_Content.CollectionID,
        tblCollections_Content.ID AS CollectionContentID,
        tblCollections_UserFields.Value AS Identifier,
        tblCollections_Content.Description AS Scope,
        NULL AS PhysicalDesc,
        tblCollections_Content.Date,
        NULL AS Publisher,
        NULL AS Contributor,
        NULL AS RightsStatement,
        REPLACE( CONCAT( '', REPLACE( tblCollections_UserFields.Value, '.tif', '.jpg' ) ), '/originals', '/dips' ) AS ContentURL,
        "0" AS Hyperlink,
        CURRENT_TIMESTAMP AS dateadded
      FROM `tblCollections_Content`
      LEFT JOIN tblCollections_UserFields
        ON tblCollections_Content.ID = tblCollections_UserFields.ContentID
      WHERE tblCollections_UserFields.Title LIKE '%UnitID%'
        AND (Value LIKE 'BAGNAME1%' OR Value LIKE 'BAGNAME2%')
    )
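The ContentURL expression in the script above turns a UnitID value into the path of its web copy by swapping ".tif" for ".jpg" and "/originals" for "/dips". The same transformation is sketched in Python below so that it can be checked against a few UnitID values before the script is run; the function name `content_url_from_unit_id` is hypothetical.

```python
def content_url_from_unit_id(unit_id: str) -> str:
    """Mirror the two nested REPLACE calls used for ContentURL.

    Hypothetical helper; like the SQL, it replaces every occurrence
    and only handles the .tif to .jpg case.
    """
    return unit_id.replace(".tif", ".jpg").replace("/originals", "/dips")


if __name__ == "__main__":
    print(content_url_from_unit_id(
        "20120221_Laird_construction/data/originals/photos"))
```

UnitID values that point to other original formats, or that use an upper-case ".TIF" extension, pass through with only the directory swapped, so their ContentURL values would need to be corrected by hand.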

Store AIPS

Add the entire bag to the AIPS directory on our server for long-term storage.  
