Specifications for Bagging Accessions
1 Specifications for Bagging Accessions
1.1 Bagging accessions
The first step in bagging accessions for permanent storage in the repository is to organize the files on disk into a valid BagIt 0.97 directory. (BagIt was jointly developed by the Library of Congress and the California Digital Library as a standard for receiving, storing, and retrieving digital content: http://www.digitalpreservation.gov/news/2008/20080602news_article_bagit.html; https://confluence.ucop.edu/display/Curation/BagIt) Metadata is extracted from the accessions database, the EAD records store, and the files themselves, written to three text files, and stored alongside the files.
1.1.1 Bag-info.txt file
This file is created by extracting metadata from the accessions database. Its purpose is redundancy of information: to ensure that the accessioning information is stored with the files themselves. The BagIt specification defines a set of reserved fields for this file, and the following crosswalk shows where each value is retrieved from in the current system.
| BagIt field | Accessions form site field |
|---|---|
| Source-org | Record.department.name |
| Org-address | Record.department.name searched in LDAP |
| Contact-phone | Record.recordMaker.cnetid searched in LDAP |
| Contact-email | Record.recordMaker prepended to string '@uchicago.edu' |
| External-Description | Record.collection.description |
| Internal-Description | Record.summary |
| Bagging-date | Record.createDate |
| External-identifier | Record.receipt |
| Internal-identifier | Record.receipt |
| Bag-size | Sum(file.size) from file inner join record on record.id=file.accession where record.receipt=accession |
| Payload-Oxum | (Sum(file.size) from file inner join record on record.id=file.accession where record.receipt=accession)/8 prepended to string '.' prepended to count(file.name) from file inner join record on record.id=file.accession where record.receipt=accession |
| Bag-group-identifier | 'ark:/61001' |
| Internal-sender-identifier | Record.yourIdentifier |
| Internal-sender-description | Record.summary |
| Bag-count | Record.counter |
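The crosswalk above could be applied along the following lines. This is only a minimal sketch: the `record` dictionary and its key names are hypothetical stand-ins for the accessions database fields, the LDAP-derived fields are omitted, and the division by 8 for Payload-Oxum simply follows the crosswalk (presumably converting stored sizes to octets, which is an assumption).

```python
def build_bag_info(record, file_sizes):
    """Render bag-info.txt text from a (hypothetical) accession record dict.

    file_sizes holds one size per file in the accession, in whatever unit
    the accessions database stores in file.size.
    """
    total_size = sum(file_sizes)
    # The crosswalk divides the summed size by 8 for Payload-Oxum; the
    # reason (bits to octets?) is an assumption, not stated in the spec.
    oxum = f"{total_size // 8}.{len(file_sizes)}"
    fields = [
        ("Source-org", record["department_name"]),
        ("Contact-email", record["record_maker"] + "@uchicago.edu"),
        ("External-Description", record["collection_description"]),
        ("Internal-Description", record["summary"]),
        ("Bagging-date", record["create_date"]),
        ("External-identifier", record["receipt"]),
        ("Internal-identifier", record["receipt"]),
        ("Bag-size", str(total_size)),
        ("Payload-Oxum", oxum),
        ("Bag-group-identifier", "ark:/61001"),
    ]
    # bag-info.txt is a list of "Label: value" lines.
    return "".join(f"{label}: {value}\n" for label, value in fields)
```

Keeping the mapping in one ordered list makes it straightforward to compare the generated file against the crosswalk row by row.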
1.1.2 Manifest\_<alg>.txt
The manifest file lists every payload file in the BagIt directory for which a checksum could be computed when the bagging process ran. File paths are relative to the top of the BagIt directory, so every path in the manifest starts with the string 'data/', followed by the remainder of the file's path, including subdirectories. The algorithm used for these checksums is MD5.
| Manifest\_<alg> field | Accessions form site field | Alternate measurement |
|---|---|---|
| Checksum | File.checksum | Evaluate checksum on the fly |
| Filepath | File.path split after receipt identifier | Filepath split after accession identifier |
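The "evaluate checksum on the fly" alternative might look like the sketch below, which walks the `data/` directory of a bag and produces one manifest line per file. The function names are illustrative, and the two-space separator between checksum and path is a common convention rather than something this specification mandates.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(bag_root):
    """Return 'checksum  data/...' lines for every file under bag_root/data."""
    bag_root = Path(bag_root)
    lines = []
    for path in sorted((bag_root / "data").rglob("*")):
        if path.is_file():
            # Paths are written relative to the top of the bag, so each
            # one begins with 'data/'.
            relative = path.relative_to(bag_root).as_posix()
            lines.append(f"{md5_of(path)}  {relative}")
    return lines
```

Reading in chunks keeps memory use flat for large accession files.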
1.1.3 Erc.txt
In the University of Chicago Library Digital Repository's implementation, Erc.txt adds a descriptive element to the BagIt specification by integrating the Kernel Metadata and Electronic Resource Citation (ERC) specification with the BagIt specification. This information is collection-level metadata defining the "who", "what", "when" and "where" for an accession.
| Erc field | Accessions form site field | EAD field |
|---|---|---|
| Who | 'University of Chicago Library' | archdesc/did/unitpublisher |
| What | Record.collection.title prepended to Record.createDate | archdesc/did/unittitle |
| When | Record.createDate | archdesc/did/unitdate |
| Where | 'ark:/61001/' prepended to Record.receipt | n/a |
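As an illustration of the accessions-form column of this table, the sketch below assembles the four ERC elements into text. The function name, the `erc:` header line, the lowercase labels, and the single space joining the title and date are assumptions made for the example; the source does not specify the exact layout of Erc.txt.

```python
def build_erc(collection_title, create_date, receipt):
    """Render Erc.txt text from accession values (layout is an assumption)."""
    fields = [
        ("who", "University of Chicago Library"),
        # The crosswalk prepends the collection title to the create date;
        # the separating space is an assumption.
        ("what", f"{collection_title} {create_date}"),
        ("when", create_date),
        ("where", "ark:/61001/" + receipt),
    ]
    return "erc:\n" + "".join(f"{label}: {value}\n" for label, value in fields)
```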
1.1.4 Fetch.txt
This optional file lists the file paths of files for which no checksum could be computed and which therefore could not be transferred into the new bag.
| Fetch field | Field computed on the fly |
|---|---|
| URL | Filepath preceded by the string 'data/repository/ac' |
| Size | File size computed on-the-fly |
1.2 Interface for this tool
This tool is being designed as a background task. It will run as a cron job.
1.2.1 Auditing
| Audit statement | Who sees it | How it is communicated |
|---|---|---|
| List of unsuccessfully transferred files | repository administrator | |
| Count of successful transfers vs total files in the accession | repository administrator and depositor | |
| Date and time the processing started | repository administrator | email, accession forms database record update |
| Date and time the processing ended | repository administrator | email, accession forms database record update |
| Status of accession record | repository administrator | accession forms database record update |
1.2.2 Accessions database record update
All accession records are created with the status 'started'. A cron job that runs nightly checks for file records associated with each such accession record, creates a row in the report table for that accession, and changes the status to 'in process'.
A final cron job checks for any accession record with the status 'in process' and creates a BagIt directory for it. Once bagging is complete, the status of the accession record changes to 'processed'.
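The status lifecycle described here can be summarized as a small transition function. This is a sketch only: the function and parameter names are hypothetical, and it assumes the nightly job advances a record only when it actually finds associated file records.

```python
def next_status(status, has_file_records=False, bag_created=False):
    """Return the accession status after a cron run (illustrative only)."""
    # Nightly job: 'started' becomes 'in process' once file records exist
    # (assumed condition) and the report row has been created.
    if status == "started" and has_file_records:
        return "in process"
    # Final job: 'in process' becomes 'processed' once the bag is built.
    if status == "in process" and bag_created:
        return "processed"
    # Otherwise the record stays where it is.
    return status
```

Modelling the transitions explicitly makes it clear that a record never skips from 'started' directly to 'processed' in a single run.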
