Specifications for Bagging Accessions

1 Specifications for Bagging Accessions

1.1 Bagging accessions

The first step in bagging accessions for permanent storage in the repository is to organize the files on disk into a valid BagIt 0.97 directory. (BagIt was jointly developed by the Library of Congress and the California Digital Library as a standard for receiving, storing and retrieving digital content: http://www.digitalpreservation.gov/news/2008/20080602news_article_bagit.html; https://confluence.ucop.edu/display/Curation/BagIt) Metadata from the accessions database, from the EAD records store and from the files themselves is extracted into three text files, which are stored alongside the files.

1.1.1 Bag-info.txt file

This file is created by extracting metadata from the accessions database. Its purpose is redundancy of information: to ensure that the accessioning information is stored with the files themselves. The BagIt specification requires certain fields, and the following crosswalk defines where the information is retrieved from in the current system.

| Bagit field | Accessions form site field |
|---+---|
| Source-org | Record.department.name |
| Org-address | Record.department.name searched in LDAP |
| Contact-phone | Record.recordMaker.cnetid searched in LDAP |
| Contact-email | Record.recordMaker prepended to string '@uchicago.edu' |
| External-Description | Record.collection.description |
| Internal-Description | Record.summary |
| Bagging-date | Record.createDate |
| External-identifier | Record.receipt |
| Internal-identifier | Record.receipt |
| Bag-size | Sum(file.size) from file inner join record on record.id=file.accession where record.receipt=accession |
| Payload-Oxum | (Sum(file.size) from file inner join record on record.id=file.accession where record.receipt=accession)/8 prepended to string '.' prepended to count(file.name) from file inner join record on record.id=file.accession where record.receipt=accession |
| Bag-group-identifier | 'ark:/61001' |
| Internal-sender-identifier | Record.yourIdentifier |
| Internal-sender-description | Record.summary |
| Bag-count | Record.counter |
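
As an illustration of the crosswalk above, the following minimal sketch derives two of the computed fields and renders them as label/value lines. The function names and the size_divisor parameter are hypothetical, the database queries are replaced by plain Python arguments, and the division by 8 simply mirrors the Payload-Oxum row as written.

```python
from typing import Dict, List


def contact_email(record_maker: str) -> str:
    """Build Contact-email by appending the domain to Record.recordMaker."""
    return f"{record_maker}@uchicago.edu"


def payload_oxum(file_sizes: List[int], size_divisor: int = 8) -> str:
    """Return '<summed size / divisor>.<file count>' as in the crosswalk row.

    size_divisor is an assumption taken from the '/8' in the table.
    """
    total = sum(file_sizes) // size_divisor
    return f"{total}.{len(file_sizes)}"


def render_bag_info(fields: Dict[str, str]) -> str:
    """Render one 'Label: value' line per field, in insertion order."""
    return "".join(f"{label}: {value}\n" for label, value in fields.items())


if __name__ == "__main__":
    sizes = [800, 1600]  # hypothetical file.size values for one accession
    print(render_bag_info({
        "Contact-email": contact_email("jdoe"),  # 'jdoe' is a made-up id
        "Payload-Oxum": payload_oxum(sizes),
        "Bag-group-identifier": "ark:/61001",
    }))
```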

1.1.2 Manifest_<alg>.txt

The manifest file lists every file in the BagIt directory for which a checksum could be evaluated when the bagging process ran; files for which it could not are listed in Fetch.txt instead. Filepaths are relative to the top of the BagIt directory. Therefore, every file path in the manifest file starts with the string 'data/' and is followed by the remainder of the path of the file, including subdirectories. The algorithm used for these checksums is MD5.

| Manifest_<alg> field | Accessions form site field | Alternate measurement |
|---+---+---|
| Checksum | File.checksum | Evaluate checksum on the fly |
| Filepath | File.path split after receipt identifier | Filepath split after accession identifier |
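
The "on the fly" column can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and it assumes the receipt identifier appears as one component of the stored file path, so that everything after it becomes the path under 'data/'.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_path(file_path: str, receipt: str) -> str:
    """Split the path after the receipt identifier and prefix 'data/'."""
    _, found, remainder = file_path.partition(receipt)
    if not found:
        raise ValueError(f"receipt {receipt!r} not found in {file_path!r}")
    return "data/" + remainder.lstrip("/")


def manifest_line(checksum: str, file_path: str, receipt: str) -> str:
    """Format one manifest entry: checksum, a space, then the filepath."""
    return f"{checksum} {manifest_path(file_path, receipt)}"
```

For a hypothetical stored path '/storage/2018-001/box1/a.txt' and receipt '2018-001', manifest_path yields 'data/box1/a.txt'.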

1.1.3 Erc.txt

In the University of Chicago Library Digital Repository's implementation, Erc.txt adds a descriptive element to the BagIt specification by integrating the Kernel Metadata and Electronic Resource Citation (ERC) specification with the BagIt specification. This information is collection-level metadata defining the "who", "what", "when" and "where" for an accession.

| Erc field | Accessions form site field | EAD field |
|---+---+---|
| Who | 'University of Chicago Library' | archdesc/did/unitpublisher |
| What | Record.collection.title prepended to Record.createDate | archdesc/did/unittitle |
| When | Record.createDate | archdesc/did/unitdate |
| Where | 'ark:/61001/' prepended to Record.receipt | n/a |
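
A minimal sketch of how the accessions-database column of this crosswalk might be turned into an Erc.txt body is shown below. The function names are hypothetical, the single space joining title and date in "what" is an assumption (the table does not state a separator), and the 'erc:' header followed by 'label: value' lines is assumed from the ERC convention rather than confirmed by this specification.

```python
from typing import Dict


def erc_fields(title: str, create_date: str, receipt: str) -> Dict[str, str]:
    """Map accession record values onto the four ERC elements."""
    return {
        "who": "University of Chicago Library",
        # Separator between title and date is an assumption.
        "what": f"{title} {create_date}",
        "when": create_date,
        "where": f"ark:/61001/{receipt}",
    }


def render_erc(fields: Dict[str, str]) -> str:
    """Render an 'erc:' header followed by one 'label: value' line each."""
    lines = ["erc:"] + [f"{label}: {value}" for label, value in fields.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All three argument values are made up for illustration.
    print(render_erc(erc_fields("Smith Papers", "2018-03-01", "2018-001")))
```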

1.1.4 Fetch.txt

This file contains filepaths for files that could not have a checksum computed and therefore could not be transferred into the new bag. This file is optional.

| Fetch field | Field computed on the fly |
|---+---|
| URL | Filepath preceded by the string 'data/repository/ac' |
| Size | File size computed on-the-fly |
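
The two computed fields can be sketched as below. This is an assumption-laden example: the function name and the bag_path argument are hypothetical, the URL prefix is passed in rather than hard-coded, and the line layout (URL, length, filename, with '-' when the length is unknown) follows the general BagIt fetch.txt convention rather than anything stated in this specification.

```python
import os


def fetch_line(file_path: str, url_prefix: str, bag_path: str) -> str:
    """Build one fetch entry: URL, size in bytes (or '-'), path in the bag."""
    try:
        # Size is computed on the fly from the file on disk.
        size = str(os.path.getsize(file_path))
    except OSError:
        # The file could not be read, so the size is left unspecified.
        size = "-"
    return f"{url_prefix}{file_path} {size} {bag_path}"
```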

1.1.5 Directory structure of a bag

  1. Can directory
    1. Pa/ir/tr/ee directory tree
      1. Noid directory
        1. fetch.txt
        2. bag-info.txt
        3. manifest_md5.txt
        4. data directory
          1. All files in the accessions
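
The layout above can be sketched as a path computation. This is a hedged illustration: the function names, the root directory and the sample identifier are hypothetical, and the two-character split is assumed from the 'Pa/ir/tr/ee' pattern shown in the list.

```python
from pathlib import PurePosixPath
from typing import List


def pairtree_segments(identifier: str) -> List[str]:
    """Split an identifier into consecutive two-character segments."""
    return [identifier[i:i + 2] for i in range(0, len(identifier), 2)]


def bag_directory(can_root: str, noid: str) -> PurePosixPath:
    """Return <can root>/<pairtree segments>/<noid> for one bag."""
    return PurePosixPath(can_root, *pairtree_segments(noid), noid)


if __name__ == "__main__":
    # '/can' and 'abcd12' are made-up values for illustration.
    bag = bag_directory("/can", "abcd12")
    print(bag / "bag-info.txt")
    print(bag / "data")
```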

1.2 Interface for this tool

This tool is being designed as a background task. It will run as a cron job.

1.2.1 Auditing

| Audit statement | Who sees it | How it is communicated |
|---+---+---|
| List of unsuccessfully transferred files | repository administrator | email |
| Count of successful transfers vs total files in the accession | repository administrator and depositor | email |
| Date and time the processing started | repository administrator | email, accession forms database record update |
| Date and time the processing ended | repository administrator | email, accession forms database record update |
| Status of accession record | repository administrator | accessions forms database record update |
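
One way the emailed audit statements might be assembled is sketched below using the standard library's email.message module. The function name, argument names, subject wording and message layout are all hypothetical; sending the message and updating the database record are left out.

```python
from email.message import EmailMessage
from typing import List


def build_audit_email(recipient: str, receipt: str, failed_files: List[str],
                      total_files: int, started: str, ended: str) -> EmailMessage:
    """Compose an audit summary for one accession as a plain-text email."""
    succeeded = total_files - len(failed_files)
    body = [
        f"{succeeded} of {total_files} files transferred successfully",
        f"Processing started: {started}",
        f"Processing ended: {ended}",
        "Unsuccessfully transferred files:",
    ]
    body.extend(f"  {path}" for path in failed_files)

    message = EmailMessage()
    message["To"] = recipient
    message["Subject"] = f"Bagging audit for accession {receipt}"
    message.set_content("\n".join(body))
    return message
```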

1.2.2 Accessions database record update

All accession records are created with the status 'started'. A cron job that runs nightly checks for file records associated with each such accession record, creates a row in the report table for that accession, and changes the status to 'in process'.

The final cron job checks for any accession record with status 'in process' and creates a BagIt directory for it. Once bagging is complete, the status of the bagged accession record changes to 'processed'.
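
The status lifecycle described in this section can be sketched as a small transition table. The names NEXT_STATUS and advance are hypothetical, and the sketch assumes these three statuses are the only ones in use.

```python
# Allowed transitions, in the order the cron jobs apply them.
NEXT_STATUS = {
    "started": "in process",     # set by the nightly file-record check
    "in process": "processed",   # set once the bag has been created
}


def advance(status: str) -> str:
    """Return the status that follows the given one, or raise if none does."""
    if status not in NEXT_STATUS:
        raise ValueError(f"no transition defined from status {status!r}")
    return NEXT_STATUS[status]
```

A record that is already 'processed' has no further transition, so advance raises ValueError for it rather than silently leaving it unchanged.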

Author: Tyler Danstrom (repository@lib.uchicago.edu)
