HOWTO: Package files for staging on the Drop Server


Here is a checklist based on a write-up by Charles Johnson on 19 February 2021.

Check List

Here are some quick practical suggestions on how to package up files into a LOCKSS Archival Unit, and how to get the AU staged on the drop server for harvest and consumption by LOCKSS.

Start with a Folder

Step 1. The materials you will be preserving SHOULD start out as files packaged into a directory tree under one top-level folder.

  • EXAMPLE: ADAH is preserving high-quality digital masters from our ongoing scanning projects in a set of directories called Q-Numbers Masters files. Each directory contains 500 TIFF files (with poetic and evocative names like Q0000150001.tif), along with some additional files for metadata pertaining to that package of files.
  • FORMAT: The directory can be organized into any hierarchical structure of folders and subfolders you want, so long as it’s all stored underneath one top-level folder.
  • EXCEPTION: If any of the files in your subdirectories happen to be named index.html or index.htm, this is a special case that requires certain precautions to make sure that all your content gets correctly harvested. Check with me for details on how to deal with this.
  • RECOMMENDATION: The number and size of files is up to you, but there are some practical constraints based on network capacity. It’s probably best practice to divvy up assets into AUs that will contain LESS THAN 1,000,000 or so individual files, and LESS THAN 1 TiB of data per AU. (The LOCKSS network can in fact handle very, very large AUs, and ADPNet is currently preserving AUs that are larger than these suggested, fairly arbitrary limits. But (1) AUs like that take forever to upload and crawl, which means a much slower turnaround time for you before we can confirm that they are preserved in the network; and (2) AUs like that also make some practical systems administration tasks more of a pain in the neck for the people who run the preservation nodes.)
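
A quick way to check a folder against the suggestions above is to count its files and add up their sizes before packaging. The following is a minimal sketch using only the Python standard library; the function names are made up for illustration, and the limits are simply the suggested figures from the recommendation, not hard rules enforced by LOCKSS.

```python
import os

# Suggested (fairly arbitrary) ceilings from the recommendation above.
MAX_FILES = 1_000_000
MAX_BYTES = 1024 ** 4  # 1 TiB


def measure_tree(top):
    """Return (file_count, total_bytes) for all regular files under top."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                count += 1
                total += os.path.getsize(path)
    return count, total


def within_suggested_limits(top, max_files=MAX_FILES, max_bytes=MAX_BYTES):
    """True if the tree stays under both suggested ceilings."""
    count, total = measure_tree(top)
    return count < max_files and total < max_bytes


def find_index_files(top):
    """List files named index.html or index.htm (the special case noted above)."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.lower() in ("index.html", "index.htm"):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

Calling measure_tree on the top-level folder gives the two numbers to compare against the suggestions; a non-empty result from find_index_files means the EXCEPTION above applies and is worth asking about before upload.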

Name the Folder

Step 2. Your top-level folder can have almost any name, but the name MUST be unique among all the AUs you will ever upload.

  • Once you ingest an AU, you SHOULD NOT re-use that directory name unless you actually intend to replace the old materials with the new materials. You need a new name to ingest new AUs.
  • EXAMPLE: ADAH has a bunch of Q-Number Masters files to stage for ingest, so we give the directory a name unique to those contents, for example: Digitization-Masters-Q-numbers-Master-Q0000150001_Q0000150500m (so, next time we upload materials, we upload them under a new directory with the next numbers in the sequence, Digitization-Masters-Q-numbers-Master-Q0000150501_Q0000151000m).
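
One low-tech way to keep names unique is to record every name you have ever used in a plain text file and check new names against it. The sketch below assumes a local "ledger" file with one directory name per line; the ledger and the function names are hypothetical conveniences, not part of LOCKSS or the drop server.

```python
from pathlib import Path


def load_used_names(ledger_path):
    """Return the set of AU directory names already recorded in the ledger."""
    ledger = Path(ledger_path)
    if not ledger.exists():
        return set()
    lines = ledger.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def register_name(ledger_path, name):
    """Record a new AU directory name; raise ValueError if it was used before."""
    if name in load_used_names(ledger_path):
        raise ValueError("AU directory name already used: " + name)
    with open(ledger_path, "a", encoding="utf-8") as f:
        f.write(name + "\n")
```

Calling register_name before each new upload fails loudly if the name was already used, which is the cue to either pick a new name or confirm that you really do mean to replace the old materials.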

Bag the Folder

Step 3. Once you have your top-level folder prepared and named, you SHOULD enclose the folder in a BagIt formatted directory.

  • You can do this easily using an open-source Python script (BagIt-Python) or using the Bagger application (Bagger).
  • EXAMPLE: When a Q-numbers directory is ready to be bagged at ADAH, I open Windows PowerShell, then I run:
 python bagit.py ${DIRNAME}
  
  • This encloses the directory with BagIt preservation data. As a result, Digitization-Masters-Q-numbers-Master-Q0000150001_Q0000150500m (for example) is now reorganized so that the top-level folder contains a single “payload” subdirectory, called data, that contains the 500+ TIFFs and associated metadata files, and then a set of small text files (bag-info.txt, bagit.txt, manifest-sha256.txt, tagmanifest-sha256.txt, etc.) that provide a manifest and checksums for those payload files, along with some metadata about the packaging process.
  • If I want to validate the contents of the preservation package before I upload it, I can do that by running this on the same directory:
 python bagit.py --validate ${DIRNAME}
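
To see what the payload manifest actually records, here is a minimal sketch that re-computes SHA-256 checksums and compares them with manifest-sha256.txt, where each line holds a checksum followed by a path relative to the bag. It is an illustration only and not a substitute for bagit.py --validate: it ignores the tag manifests and other algorithms, and it assumes plain file names with no percent-encoding. The function names are made up for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_payload(bag_dir):
    """Return the manifest paths that are missing or whose checksum differs."""
    bag = Path(bag_dir)
    problems = []
    manifest = bag / "manifest-sha256.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each line: <hex checksum><whitespace><path relative to the bag>
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = bag / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            problems.append(rel_path)
    return problems
```

An empty list from check_payload means every file named in the manifest is present and matches its recorded checksum.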

Prepare a LOCKSS Manifest and Drop It Into the Top Level of the Bag

Step 4. Once you have your top-level folder prepared, named, and bagged, you MUST create a small HTML file named manifest.html and drop it into the top-level directory alongside bag-info.txt, bagit.txt, etc.

  • FORMAT: The manifest.html file needs to include a link to your AU’s location on the drop server, some boilerplate HTML, and some boilerplate language that gives the LOCKSS daemon permission to harvest content. This is a bit of a pain in the neck and the format is under-documented, but LOCKSS won’t ingest your AU unless it includes a file like this with the correct URL and the correct boilerplate language.
  • EXAMPLE: After I’ve bagged a Q-Numbers directory, I generate a manifest.html file using a script to fill in a standard template with information about the AU I’m about to upload. I place the file in the top level of the BagIt-formatted directory, alongside bag-info.txt, bagit.txt, etc. So now my directory looks like:
 Digitization-Masters-Q-numbers-Master-Q0000150001_Q0000150500m\
 - bagit.txt
 - bag-info.txt
 - manifest.html
 - manifest-sha256.txt
 - manifest-sha512.txt
 - tagmanifest-sha256.txt
 - tagmanifest-sha512.txt
 - data\
       - Q0000150001.tif
 […]
  • RECOMMENDATION: You can access a version of the templates I use if you go to this URL:
 archives.alabama.gov/Services/ADPNet/manifest.php
  • Fill in the form fields with your own information. The “Institution Code” is the username you’ve been assigned. The “Directory Name” is the unique name you chose in Step 2. The “Staging Area URL” field is filled in with a pretty universal default value and should not need to be changed if you are working with our drop server.
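
If you would rather script this step than use the form, the general idea of filling in a template can be sketched as follows. Everything specific here is an assumption made for illustration: the HTML skeleton is not the ADPNet template, the URL layout (staging area URL, then institution code, then directory name) is a guess, and the permission wording is deliberately left as a parameter that you would copy from a manifest produced by the form above. Treat the form's output as the source of truth.

```python
from html import escape


def au_url(staging_base_url, institution_code, directory_name):
    """Build the AU link. ASSUMPTION: base URL / institution code / directory name."""
    parts = [staging_base_url.rstrip("/"), institution_code, directory_name]
    return "/".join(parts) + "/"


def render_manifest(staging_base_url, institution_code, directory_name,
                    permission_text):
    """Return manifest.html text from a placeholder (hypothetical) template.

    permission_text is the boilerplate permission language; it is not
    invented here and must be copied from a known-good manifest.
    """
    url = au_url(staging_base_url, institution_code, directory_name)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head><title>{title}</title></head>\n"
        "<body>\n"
        "<p><a href=\"{url}\">{title}</a></p>\n"
        "<p>{permission}</p>\n"
        "</body>\n"
        "</html>\n"
    ).format(
        title=escape(directory_name),
        url=escape(url, quote=True),
        permission=escape(permission_text),
    )
```

The returned text would be written to a file named manifest.html in the top level of the bag, next to bagit.txt.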

Upload Your Archival Unit to Your Drop Server Staging Area

Step 5. At this point your folder should be packaged and ready to be treated as an Archival Unit (AU) by the LOCKSS daemon.

  • Use WinSCP (or any other SFTP tool that you like) to upload the whole packaged-up directory to drop.adpn.org, storing it under the drop_au_content_in_here subdirectory of your staging area.
  • Notify ADPNet TPC to let us know you’re ready to go ahead with the ingest.
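
Before starting the upload, it can help to confirm that the packaged directory has the pieces described in Steps 3 and 4. This is a minimal sketch with a made-up function name; it only checks that the expected files and the data folder exist at the top level, not that their contents are correct.

```python
from pathlib import Path


def missing_items(au_dir):
    """Return a list of expected top-level items that are absent from au_dir."""
    top = Path(au_dir)
    missing = []
    # Files every staged AU in this checklist should carry at the top level.
    for name in ("bagit.txt", "manifest.html"):
        if not (top / name).is_file():
            missing.append(name)
    # The BagIt payload directory.
    if not (top / "data").is_dir():
        missing.append("data/")
    # At least one payload manifest, e.g. manifest-sha256.txt.
    if not any(top.glob("manifest-*.txt")):
        missing.append("manifest-<algorithm>.txt")
    return missing
```

An empty list means the directory at least has the right shape to upload; anything listed points back to the step that still needs doing.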