Digital Repository Workflow

1. Go through each accession and determine whether
2. Go through each accession and look for high-risk file formats.
3. Place the ranked list of high-risk file format accessions at the top of the second list.
4. Begin iterating through these lists and performing the steps below.
5. Go through each accession directory and evaluate what exactly constitutes an object for each accession.
6. Go through each accession and determine what metadata format (if any) is available for each accession.
7. Investigate the files for every object.
8. Generate DCMI kernel metadata.
9. Use the tool which implements the specifications of the California Digital Library's Microservices for storage to finish processing the accession.

1 Go through each accession and determine whether

it is needed for an active digital collection (for example, Campus Publications or the Photographic Archive), as identified in the list of active collections provided by the DLDC. This is the first list.
it is marked as discoverable and accessible. This is the second list.
it is marked as discoverable but not accessible. This is the third list.
it is marked not discoverable and not accessible, or accessible but not discoverable. This is the fourth list.
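The four-way sort above can be sketched in a few lines of Python. This is only an illustration: the Accession class, its field names, and the priority_list function are hypothetical, and it assumes an accession needed for an active collection goes on the first list regardless of its discoverable/accessible flags.

```python
from dataclasses import dataclass


@dataclass
class Accession:
    # All field names here are hypothetical, chosen for illustration only.
    identifier: str
    collection: str
    discoverable: bool
    accessible: bool


def priority_list(accession, active_collections):
    """Return the number (1-4) of the list the accession belongs on."""
    if accession.collection in active_collections:
        return 1  # needed for an active digital collection
    if accession.discoverable and accession.accessible:
        return 2
    if accession.discoverable:
        return 3  # discoverable but not accessible
    return 4      # not discoverable, whether or not it is accessible


# The set of active collections would come from the DLDC's list.
active = {"Campus Publications", "Photographic Archive"}
example = Accession("acc-001", "Campus Publications", False, False)
print(priority_list(example, active))
```

Checking the active-collection condition first means the flags only matter for accessions outside the DLDC's list.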

2 Go through each accession and look for high-risk file formats.

Rank these accessions by degree of risk. A high-risk file format is any file format that is proprietary, has been deprecated, or is likely to be deprecated in the near future, where "near future" means within 3 years from now. The following resources may help with this.

2.1 DROID

"DROID is a software tool developed by The National Archives to perform automated batch identification of file formats."

2.2 PRONOM

PRONOM is the technical registry of The National Archives (U.K.).

2.3 JHOVE

JHOVE may be used to assist with file format identification and validation.

2.4 Library of Congress: Digital Preservation

See Sustainability of Digital Formats: Planning for Library of Congress Collections, including its Format Descriptions.
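Once formats have been identified with the tools above, the ranking itself might look like the following sketch. The workflow does not define "degree of risk", so the score used here (the fraction of an accession's files that are in a high-risk format) is an assumption, as are the FormatInfo class, the registry contents, and the placeholder format names. Formats missing from the registry are simply ignored in this sketch and would need manual review.

```python
from dataclasses import dataclass
from typing import Optional

NEAR_FUTURE_YEARS = 3  # "near future" as defined in this step


@dataclass
class FormatInfo:
    # Hypothetical record a processor might compile from PRONOM or the
    # Library of Congress format descriptions.
    proprietary: bool
    deprecated: bool
    years_until_deprecation: Optional[float] = None  # None = none expected


def is_high_risk(info):
    """Apply the definition of a high-risk format given in this step."""
    soon = (info.years_until_deprecation is not None
            and info.years_until_deprecation <= NEAR_FUTURE_YEARS)
    return info.proprietary or info.deprecated or soon


def rank_by_risk(accession_formats, registry):
    """Return ids of accessions with high-risk files, riskiest first.

    accession_formats maps accession id -> list of format names, one per file.
    """
    scores = {}
    for accession_id, formats in accession_formats.items():
        risky = sum(1 for name in formats
                    if name in registry and is_high_risk(registry[name]))
        if risky:
            scores[accession_id] = risky / len(formats)
    return sorted(scores, key=scores.get, reverse=True)


# Placeholder format names and values, not real assessments.
registry = {
    "open-format": FormatInfo(proprietary=False, deprecated=False),
    "legacy-format": FormatInfo(proprietary=True, deprecated=True),
    "fading-format": FormatInfo(False, False, years_until_deprecation=1),
}
accession_formats = {
    "acc-1": ["open-format", "open-format"],
    "acc-2": ["legacy-format", "open-format"],
    "acc-3": ["fading-format", "legacy-format"],
}
print(rank_by_risk(accession_formats, registry))
```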

3 Place the ranked list of high-risk file format accessions at the top of the second list.

4 Begin iterating through these lists and performing the steps below.

Take the first list and begin iterating through the following steps with that list.
When the first list is completed, begin iterating through the second list.
When the second list is completed, begin iterating through the third list.
When the third list is completed, begin iterating through the fourth list.
Repeat steps 1 and 2 for any accessions that arrived while lists 1-4 were being processed.
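Steps 3 and 4 together amount to building a single processing queue. The sketch below is one reading of them, not a prescribed implementation: build_queue is a hypothetical helper, and it assumes that a high-risk accession which also appears on another list is processed only once, at its earliest position.

```python
def build_queue(lists, high_risk_ranked):
    """Combine the four lists into one processing order.

    lists maps list number (1-4) -> accession ids in that list.
    high_risk_ranked holds the accession ids from step 2, riskiest first.
    """
    # Step 3: the ranked high-risk accessions go at the top of the second list.
    second = high_risk_ranked + [a for a in lists[2] if a not in high_risk_ranked]
    ordered = lists[1] + second + lists[3] + lists[4]

    # Keep only the first occurrence of each accession.
    seen = set()
    queue = []
    for accession_id in ordered:
        if accession_id not in seen:
            seen.add(accession_id)
            queue.append(accession_id)
    return queue


lists = {1: ["a"], 2: ["b", "c"], 3: ["d"], 4: ["e"]}
print(build_queue(lists, high_risk_ranked=["d", "c"]))
```

Accessions that arrive during processing would be run through steps 1 and 2 and then passed through the same function to produce a follow-up queue.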

5 Go through each accession directory and evaluate what exactly constitutes an object for each accession.

For example, in Campus Publications an object is a given issue of the magazine and all of the files (XML, PDF, TIFF, etc.) are connected to that particular issue; see Workflow for Campus Publications.
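As an illustration of grouping files into objects, the sketch below assumes a hypothetical accession layout in which each object (for example, one issue) has its own top-level subdirectory. Real accessions may be organized differently, and the paths and the group_by_object helper are invented for the example.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_object(relative_paths):
    """Group file paths (relative to the accession root) by object.

    Assumes each object is a top-level subdirectory; files sitting directly
    in the accession root are collected under "(unassigned)" for review.
    """
    objects = defaultdict(list)
    for path in relative_paths:
        parts = PurePosixPath(path).parts
        key = parts[0] if len(parts) > 1 else "(unassigned)"
        objects[key].append(path)
    return dict(objects)


paths = [
    "issue-001/xml/page1.xml",
    "issue-001/issue.pdf",
    "issue-002/tiff/page1.tif",
    "README.txt",
]
for object_id, files in group_by_object(paths).items():
    print(object_id, files)
```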

6 Go through each accession and determine what metadata format (if any) is available for each accession.

This should be done in the following stages.

Is there metadata available?
In what format is that metadata?
Is there a metadata record for every object (as defined above)?
If there is a metadata record for every object, is every metadata record in the same format?
Is every metadata record a valid instance of that format? This means: is it even possible to automate reading these metadata records or are there problems that will cause the automation to fail completely?
If there are errors in the metadata records, what are they and where are they? Is it an encoding issue? Are there bad field names? If the metadata is XML, is the XML malformed? These questions are not a complete list; they are a starting point for what the processor needs to investigate.
Fix any bad metadata.
Resolve any issues with missing metadata records.
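For XML metadata, the checks for missing and malformed records can be partly automated. The following sketch uses Python's standard xml.etree.ElementTree parser. It only tests well-formedness, not validity against a schema, and the check_records function and its input shape are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def check_records(records):
    """Report missing or malformed XML metadata records.

    records maps object id -> XML text, or None when no record was found.
    Returns a dict of object id -> description of the problem.
    """
    problems = {}
    for object_id, text in records.items():
        if text is None:
            problems[object_id] = "missing metadata record"
            continue
        try:
            ET.fromstring(text)
        except ET.ParseError as err:
            # The parser's message includes the line and column of the error.
            problems[object_id] = f"malformed XML: {err}"
    return problems


records = {
    "issue-001": "<record><title>A</title></record>",
    "issue-002": "<record><title>B</record>",  # mismatched closing tag
    "issue-003": None,
}
for object_id, problem in check_records(records).items():
    print(object_id, problem)
```

An empty result means every object has a record that a parser can at least read, which is the precondition for automating the later steps.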

7 Investigate the files for every object.

Are any files obviously missing from a particular object? For example, a Campus Publications issue missing its entire XML directory is the type of coarse-grained issue that would be discovered at this stage of the process.
Resolve any missing file issues.
Are there any mystery files in any particular object? For example, in APF some objects used to lack thumbnail files while many others had them. Some objects may also contain file formats that differ from those of most objects in the accession.
Resolve any mystery file issues.
Verify that every object is as complete as it can be and that any incomplete object has a reason that is understood and recorded.
Generate technical metadata for the files using an automated tool (JHOVE).
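One way to surface candidates for missing and mystery files is to compare each object's file extensions with what most objects in the accession contain. The sketch below is a heuristic under that assumption (an extension counts as "expected" when more than half of the objects have it). The extension_report function is hypothetical, and its output is a list of things to investigate, not a verdict.

```python
from collections import Counter
from pathlib import PurePosixPath


def extension_report(objects):
    """Flag objects whose file extensions differ from the majority pattern.

    objects maps object id -> list of file paths belonging to that object.
    """
    ext_sets = {
        object_id: {PurePosixPath(f).suffix.lower() for f in files}
        for object_id, files in objects.items()
    }
    # Count how many objects contain each extension at least once.
    counts = Counter(ext for exts in ext_sets.values() for ext in exts)
    expected = {ext for ext, n in counts.items() if n > len(objects) / 2}

    report = {}
    for object_id, exts in ext_sets.items():
        missing = expected - exts      # possible missing files
        unexpected = exts - expected   # possible mystery files
        if missing or unexpected:
            report[object_id] = {
                "missing": sorted(missing),
                "unexpected": sorted(unexpected),
            }
    return report


objects = {
    "obj-1": ["obj-1/a.tif", "obj-1/a.xml", "obj-1/thumb.jpg"],
    "obj-2": ["obj-2/b.tif", "obj-2/b.xml", "obj-2/thumb.jpg"],
    "obj-3": ["obj-3/c.tif", "obj-3/c.bmp"],
}
print(extension_report(objects))
```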

8 Generate DCMI kernel metadata.
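The Dublin Core Kernel consists of four elements: who, what, when, and where. A minimal sketch of writing them out as a simple label/value record follows. The "erc:" header and the plain "label: value" layout are assumptions about the desired output, and kernel_record and the sample values are hypothetical; the storage tool used in step 9 may expect a different form.

```python
def kernel_record(who, what, when, where):
    """Render the four kernel elements as a plain-text record."""
    fields = [("who", who), ("what", what), ("when", when), ("where", where)]
    lines = ["erc:"]
    for label, value in fields:
        lines.append(f"{label}: {value}")
    return "\n".join(lines) + "\n"


# Sample values are invented for illustration.
print(kernel_record(
    who="Example Creator",
    what="Example Title, issue 1",
    when="1999",
    where="example-identifier",
))
```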

9 Use the tool which implements the specifications of the California Digital Library's Microservices for storage to finish processing the accession.

For example: darling -v can addversion /data/repository/CAN 5s8x13nn9dw43 /data/repository/ac/5s8x13nn9dw43