Digital Repository Overview

2009-04-23

The University of Chicago Library's Digital Repository is a preservation repository for digital content for which the University of Chicago Library has assumed curatorial responsibility. Its primary purpose is to ensure that this content persists through time. Persistence in a digital context may require transformation of deposited content into new digital formats if the originally deposited formats are expected to become obsolete. At bottom, ensuring persistence requires two things: that bitstreams are physically safe (that the bits have not been corrupted or destroyed), and that bitstreams are logically safe (that the bits can be converted back into usable information by a machine, such as a desktop computer, that needs to consume the bitstreams and render them meaningfully). The core responsibility of digital repository management is to ensure these two kinds of persistence.
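Physical safety of a bitstream is commonly verified by comparing a checksum recorded at deposit time with one recomputed later. The following is a minimal sketch of that idea, not a description of the Repository's actual tooling; the choice of SHA-256 and the function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_physically_intact(path: Path, recorded_digest: str) -> bool:
    """Compare the current digest with the one recorded at deposit time."""
    return sha256_of(path) == recorded_digest.lower()
```

A check of this kind addresses only physical safety. Logical safety depends on whether the format can still be rendered, which a checksum cannot show.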

The University of Chicago Library's Digital Repository is managed by the Digital Library Development Center (DLDC). Currently, it consists of two mirrored computer systems. Nightly, content is copied from the primary system onto a second system, which can serve as a live backup to the first in case of need. From there, content is transferred to NSIT's centralized TSM tape-storage system, for disaster recovery. When content is deposited into the Repository (discussed below) it is inspected for at-risk digital formats (formats that are currently expected to become obsolete); if detected, content in these formats is converted into formats that are expected to persist for some time.
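The inspection for at-risk formats described above can be sketched as a lookup against a policy table. This is an illustrative sketch only: the table entries, the target formats, and the function name are hypothetical, and a production check would more likely identify formats from file contents than from filename extensions.

```python
from pathlib import PurePath

# Hypothetical policy table: at-risk extension -> preferred target extension.
# These entries are illustrative only and are not the Repository's actual list.
AT_RISK_FORMATS = {
    ".wpd": ".pdf",
    ".pct": ".tif",
}


def plan_conversions(filenames):
    """Return (filename, target_filename) pairs for files in at-risk formats."""
    plan = []
    for name in filenames:
        path = PurePath(name)
        # Extensions are compared case-insensitively.
        target_suffix = AT_RISK_FORMATS.get(path.suffix.lower())
        if target_suffix is not None:
            plan.append((name, str(path.with_suffix(target_suffix))))
    return plan
```

Files whose extension is not in the table are left out of the plan and would be stored as deposited.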

The Repository will also include a multi-terabyte storage array dedicated to scientific datasets from the Sloan Digital Sky Survey (SDSS). This array will be a mirror of one at Fermilab. A second mirror will exist at Johns Hopkins University.

It is assumed that digital content for which the University of Chicago Library has assumed curatorial responsibility is more or less analogous to non-digital content for which such responsibility has already been assumed. Examples include a digitally reformatted brittle book which is already part of the Library's collections, a map in digital form, an electronic thesis or dissertation, archival materials in digital form (e.g., letters, manuscripts, correspondence), and so on. However, the digital world is not precisely analogous to the physical world, and new types of objects, the persistence of which the Library might want to ensure, can be imagined, for example, a website, a weblog, a scientific dataset, and so on.

It is not envisioned at this time that the Digital Repository will contain all digital content for which the Library wants to assume curatorial responsibility. Instead, the Repository will preserve those digital objects for which other solutions do not (at least at present) exist. Examples of content which the Repository might not house include licensed resources being preserved by Portico, books digitized by Google and destined for HathiTrust, and so on. Digital Repository management therefore operates with an awareness of other relationships and agreements which the Library might enter into for this common purpose, and selection for the Repository must share that awareness.

Digital content entering the Repository has a life-cycle: (a) Deposit; (b) Accessioning; (c) Processing.

Deposit requires an interaction between the depositor and the Repository, and has the following components.

Who can deposit: Anyone authorized to select content for the Library's collections may deposit content into the Digital Repository. Conversely, all content included in the Repository must have been selected by such a selector.
What to deposit: Content that is appropriate for the Digital Repository is content that is unique or rarely held in its digital form. Examples include: the digital masterfile in TIFF format for an image file, or in WAV format for an audio file, either locally created, or created by a vendor--in either case, the process is expensive, and the digital resource is a valuable and unique asset; purchased and locally stored digital content--again, the process to acquire represents an expense which one does not want to duplicate, and the asset may be rarely held, e.g., a Soviet-era map of China. In addition to the digital content, the Repository requires that a description of the digital content also be deposited. For some types of discrete content, such as a digitally reformatted book or an electronic thesis or dissertation, description of the content usually takes the form of MARC cataloging. In these cases, the cataloging should be deposited in MARC communications format, which is easily machine readable. Other types of discrete content, such as digital images, might be accompanied by other kinds of metadata, for example Dublin Core or VRA Core, in tab-delimited, comma-separated values (CSV) or XML formats. Other types of content will be accompanied by collection-level metadata, for example, manuscript collections coming from the Special Collections Research Center, which are typically not described at the item level.
When to deposit: Content should be deposited before it becomes at risk of (physical) corruption or destruction, or (logical) unreadability. Some kinds of content, which are already being backed up from centralized computer storage to well-managed central backup systems, such as the TSM system managed by NSIT, are not immediately at risk physically. However, content stored in proprietary formats, even when backed up in this manner, becomes at risk of (logical) unreadability if the proprietary format becomes obsolete, or if the system creating the content in proprietary format is about to become unsupported. In these cases, materials should be deposited into the Repository to ensure that migration to a non-proprietary, long-term preservation format takes place promptly, unless such migration happens outside of the Repository as part of the normal course of Library business. Content not on centralized, centrally backed up storage media, such as content on CD-ROM, should either be moved to centralized storage with centralized backup, or else be deposited into the Repository according to established selection criteria (for example, the Repository is not a substitute for the @work storage supported by Administrative and Desktop Systems).
How to deposit: Currently, deposit is a mediated function. It is initiated by sending email to repository@lib.uchicago.edu. Depositors must be prepared to supply information about rights and permissions, including who owns the material: the person making the deposit is typically not the owner or rights- and permissions-holder. For some kinds of content, web forms are an appropriate method to initiate a deposit; these are being developed, but are not yet in production. For other kinds of content, such as multi-terabyte scientific datasets, web forms will never be appropriate. Currently content is transferred to the Repository in a variety of ways: using optical media (a CD-ROM or DVD), or an external hard-drive; putting the content on centralized storage, such as Monsoon, or storage managed by the DLDC; etc. These details are worked out after the deposit process has been initiated by sending email to repository@lib.uchicago.edu.
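Because every deposit must include a description of the content, a depositor preparing item-level metadata in CSV or tab-delimited form may want to confirm that each file has a matching row before transfer. The sketch below shows one way to do that. The column names "identifier" and "title" and both function names are assumptions for illustration, not a format the Repository prescribes.

```python
import csv
import io


def load_descriptions(text, delimiter=","):
    """Parse delimited metadata into a dict keyed by the 'identifier' column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return {row["identifier"]: row for row in reader}


def undescribed_files(filenames, descriptions):
    """Return the deposited filenames that have no metadata row."""
    return [name for name in filenames if name not in descriptions]
```

For tab-delimited metadata, pass delimiter="\t". Content described only at the collection level, such as manuscript collections, would not be checked this way.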

Accessioning and Processing are internal Repository functions. Accessioning means to move the deposit from the place where it was originally transferred into a place where it can be managed. Processing means to take the deposited and accessioned content, and the description of that content, and package it according to established standards for packaging digital content, such as METS, and best practices for the application of those standards.
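To make the packaging step more concrete, the following sketch builds a very small METS document that lists the files of one object and points to them from a flat structural map. It is an illustration under stated assumptions rather than the Repository's actual profile: the function name, the "master" file group, and the ID scheme are hypothetical, the descriptive metadata section is omitted, and the output has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(object_label, filenames):
    """Build a minimal METS document with a fileSec and a flat structMap."""
    mets = ET.Element(f"{{{METS_NS}}}mets", {"LABEL": object_label})
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    # The USE value and ID scheme below are illustrative assumptions.
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "master"})
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap", {"TYPE": "physical"})
    root_div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"LABEL": object_label})
    for index, name in enumerate(filenames, start=1):
        file_id = f"FILE{index:04d}"
        file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {"ID": file_id})
        # Each file records where its bitstream lives.
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": name},
        )
        # The structural map refers back to the file by its ID.
        ET.SubElement(root_div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return ET.tostring(mets, encoding="unicode")
```

A fuller package would also wrap the deposited description, for example MARC or Dublin Core, in a descriptive metadata section and link it to the structural map.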

In addition to its core function of ensuring the physical and logical persistence of the digital content it contains, and in addition to a Deposit function, a digital repository may support a Discovery function, and must support an Access (Delivery) function. Deposit, Discovery and Access functions all presuppose answers to the questions, Who may Deposit? Who may Discover? and Who may Access? Who may Deposit has been addressed above. Who may Discover and Who may Access are determined by the rights and permissions associated with the deposited content. Some materials also raise the question, When may these be accessed? For example, some archival materials are embargoed for some period of time (e.g., 25 years) before access is allowed. Because these rights and permissions issues considerably complicate automation, Discovery and Access functions for the Digital Repository are being implemented after the Deposit function. Currently, the Repository supports viewing simple lists of what has been deposited at http://repository.lib.uchicago.edu/; an RSS feed has also been implemented for convenience. Access is currently mediated, by sending email to repository@lib.uchicago.edu. More sophisticated Discovery functions for materials for which there are no rights and permissions issues are being built. The first of these will take the form of an OAI-PMH (Open Archives Initiative - Protocol for Metadata Harvesting) provider, allowing services that know how to harvest from OAI-PMH providers to extract metadata for freely available content; these metadata will have been created for such content from the originally deposited metadata.
More interactive Discovery mechanisms are being considered, but any forward motion on this path must be preceded by the question, How should Repository discovery interoperate with other Library discovery services such as LENS, or discovery services that are being contemplated as part of Project Bamboo? In other words, rushing to create yet another siloed interface, instead of thinking carefully about how a repository might best interoperate with what exists or is being contemplated, would be neither well thought out nor coordinated, and potentially not cost-effective. That would be true at any time, but it is especially true in the current economic climate, which restricts available resources for all of the Library's initiatives.
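The rule that only freely available content is exposed for harvesting, together with the embargo question raised above, can be expressed as a simple filter over deposit records. The sketch below is an assumption-laden illustration: the record fields, the class name, and the function names are hypothetical, and actual rights and permissions are likely to be more varied than a single flag.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DepositRecord:
    # Hypothetical fields; the Repository's real rights model is not shown here.
    identifier: str
    rights_restricted: bool
    embargo_until: Optional[date] = None


def is_openly_available(record, today):
    """True if the record has no rights restriction and no active embargo."""
    if record.rights_restricted:
        return False
    return record.embargo_until is None or today >= record.embargo_until


def harvestable_identifiers(records, today):
    """Return identifiers of records that could be exposed for harvesting."""
    return [r.identifier for r in records if is_openly_available(r, today)]
```

Records that fail the check would remain available only through mediated access.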

The University of Chicago Library's Digital Repository is not a so-called institutional repository. An institutional repository as currently construed is designed to hold the scholarly or research output of an institution, specifically faculty publications or pre-publications, but institutional repositories do not necessarily guarantee the persistence through time of the content they contain. They are designed primarily for public access and discovery. Though the Digital Repository may in future support public access and discovery for some materials, its primary purpose is to ensure the preservation of digital content selected by recognized selection processes. If the University were to impose a "self-archiving mandate," discovery and access for content deposited according to that mandate would have to be provided, but the need for the Digital Repository as a place to preserve content indefinitely would not go away; either the Repository would have to provide these functions itself for these materials, or it might serve as the preservation component of systems that already provide these functions, such as DSpace, EPrints, or Fedora.