Difference between revisions of "Archival Unit"

Revision as of 09:29, 12 April 2022

An Archival Unit (AU) is the basic unit of preservation on the ADPNet LOCKSS network. Each AU has a Manifest Page and a collection of digital assets that the manifest page links to.

Network Members

Preservation Node Managers

Technical Documentation

From LOCKSS Technical Manual

LOCKSS caches: The LOCKSS caches are the nodes in the network in which digital objects are preserved and monitored as archival units. One archival unit is typically a one-year collection (size: 1 GB to 20 GB). The size of an AU results from a trade-off between the processing overhead required by large AUs and the guarantee that the integrity of every AU can be checked regularly when there are many small AUs. The daemon is a Java application that collects digital objects through HTTP requests from the original website, stores them inside the cache as an Archival Unit, computes an SHA-1 checksum, and regularly monitors their integrity by comparing the preserved content with the other caches in the network using a specific voting protocol (LCAP). The AUs are collected at various moments in time by the different nodes in the network to reduce the risk of communication issues. The content is regularly recrawled from the original website if it is still available. If a new version of the AU is available, the previous version is kept, but only the most recent AUs will be checked for fixity.
—From LOCKSS Technical Manual: Basic Private LOCKSS Network infrastructure (Plnwiki)
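The checksum-and-compare idea described above can be illustrated with a very small sketch. This is not the LCAP protocol itself, which is considerably more involved; it only shows the general idea of hashing content with SHA-1 and treating the digest held by most caches as the reference. The cache names and the simple majority rule are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def sha1_digest(content: bytes) -> str:
    """Return the hexadecimal SHA-1 digest of some preserved content."""
    return hashlib.sha1(content).hexdigest()


def tally_poll(digests_by_cache):
    """Return (majority_digest, caches_that_disagree).

    Simplified stand-in for a fixity poll: the digest reported by the
    most caches wins. If there is a tie, the digest seen first wins,
    which a real protocol would not rely on.
    """
    counts = Counter(digests_by_cache.values())
    winner, _ = counts.most_common(1)[0]
    dissenting = sorted(
        name for name, digest in digests_by_cache.items() if digest != winner
    )
    return winner, dissenting


if __name__ == "__main__":
    good = b"postcard scan bytes"
    damaged = b"postcard scan bytez"  # simulated bit rot on one cache
    digests = {
        "cache-a": sha1_digest(good),  # hypothetical cache names
        "cache-b": sha1_digest(good),
        "cache-c": sha1_digest(damaged),
    }
    winner, dissenting = tally_poll(digests)
    print(dissenting)  # ['cache-c']
```

In this toy setup, a cache listed as dissenting would be a candidate for repair from a cache that agrees with the majority.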

From "Getting Started with LOCKSS" (June 2016)

Each Archival Unit (AU) must be Web-accessible so the LOCKSS daemon can get at it. That means that it needs to be put on a Web-accessible server computer. If you have a server, but it isn't Web-accessible or access to it is blocked (at the firewall, for example), the LOCKSS crawl won't work. You can use a firewall to protect your collection, but make sure the LOCKSS daemon has access. Check with your IT support person or system administrator to confirm that this is the case.
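As a rough way to test the Web-accessibility requirement above, a short script can request the AU's URL and report whether the server answers successfully. This is only a sketch: the function name is made up for this example, and it uses Python's standard `urllib` rather than anything supplied by LOCKSS. A check like this is only meaningful when run from a machine that sits where the preservation node sits on the network, because a firewall may admit your desktop while blocking the LOCKSS daemon, or the reverse.

```python
import http.client
import urllib.request


def is_web_accessible(url: str, timeout: float = 10.0) -> bool:
    """Return True if an HTTP GET for `url` succeeds with a 2xx status.

    Any connection failure, timeout, HTTP error status, or malformed
    URL is reported as False rather than raised.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers urllib.error.URLError/HTTPError and timeouts;
        # ValueError covers strings that are not URLs at all.
        return False
```

For example, `is_web_accessible("http://example.org/lockss/example_collection/")` (a placeholder URL) would return `False` if the folder is blocked at the firewall or the server is down.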
[...]
The next major part of getting collections ingested is preparing each collection’s manifest page. The manifest page is a basic HTML document that contains at least two things: the permission statement that gives the LOCKSS daemon the right to harvest the collection (at the foot of the document) and the base URL of the AU for the collection (or for a fraction of the collection).
Here is an example of a fairly simple manifest page for a fairly simple collection (Auburn University’s Alabama Postcards images collection). Feel free to use it as a template.
[Image: ExampleAlabamaPostcardsCollectionManifest.png]
The base URL isn't visible if you've opened the manifest in a browser window, because it's located inside the HTML <a href> tag. To see this tag, open the HTML file in Dreamweaver or Notepad++. If you're opening it in a browser window, go to View => Source to see the tags.
I usually put all the URLs for the sub-folders in my AU, but the LOCKSS folks tell me it isn't really necessary, as the daemon will go to all the folders inside the original unless you tell it not to. So you could leave out everything except the base URL. NOTE: These URLs are not the CONTENTdm collection URLs; they're the URLs for the collections (or AUs) on your Web server.
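Another way to see what is inside the <a href> tags, without opening an editor, is to list them with a few lines of Python. The sketch below uses only the standard `html.parser` module. The sample manifest, its example.org URL, and the wording of the permission statement are placeholders assumed for illustration, not a copy of the Alabama Postcards manifest.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def manifest_links(html_text: str) -> list:
    """Return all link targets found in a manifest page, in order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Hypothetical manifest content, for illustration only.
SAMPLE = """<html><body>
<h1>Example Collection</h1>
<a href="http://example.org/lockss/example_collection/">Example Collection AU</a>
<p>LOCKSS system has permission to collect, preserve, and serve this Archival Unit</p>
</body></html>"""

print(manifest_links(SAMPLE))  # ['http://example.org/lockss/example_collection/']
```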
The manifest page should be kept in the same folder as the AU so that the LOCKSS daemon can find it easily. If you break a collection up into digestible chunks (50 GB or so), you should have a separate manifest file for each AU chunk, with each AU chunk in a separate folder, along with its own manifest page. I also like to do a metadata export from the relevant CONTENTdm collection as a tab-delimited text file and put a copy of that in each folder also, in case the CONTENTdm collection needs to be reconstituted after a disaster.
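If a collection has to be broken up, the grouping can be planned before any files are moved. The sketch below is one simple way to do that, assuming you already know the size of each sub-folder: it walks the folders in order and starts a new AU chunk whenever adding the next folder would exceed the limit. The folder names, the sizes, and the exact 50 GB limit are illustrative assumptions.

```python
def plan_au_chunks(item_sizes, limit_bytes=50 * 10**9):
    """Group (name, size_in_bytes) pairs, in order, into AU-sized chunks.

    A new chunk starts whenever the next item would push the current
    chunk past `limit_bytes`. An item that is larger than the limit on
    its own still gets a chunk to itself.
    """
    chunks = []
    current, current_size = [], 0
    for name, size in item_sizes:
        if current and current_size + size > limit_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    GB = 10**9
    folders = [  # hypothetical sub-folders and sizes
        ("box01", 30 * GB),
        ("box02", 25 * GB),
        ("box03", 20 * GB),
        ("box04", 60 * GB),
    ]
    print(plan_au_chunks(folders))
    # [['box01'], ['box02', 'box03'], ['box04']]
```

Each resulting chunk would then become its own folder with its own manifest page.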
So, if you have one AU for a collection, you need one manifest with the appropriate base URL. If you have two AUs, you need two manifests, one in each AU folder, each with the appropriate base URL for that particular AU. Here is a screenshot of the folder directory for the Alabama Postcards AU, so you can see what is in the folder. The manifest has a little box around it, so you can't miss it.
[Image: ExampleAlabamaPostcardsCollectionDirectoryListing.png]
—From "Getting Started with LOCKSS" (June 2016) by Midge Coates
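The "one manifest per AU folder" rule described above lends itself to a small script. The following is a minimal sketch, not an official ADPNet or LOCKSS tool: the file name `manifest.html`, the folder names, the example.org base URLs, and the exact wording of the permission statement are all assumptions, so check them against the requirements of your own network before using anything like this.

```python
import html
import tempfile
from pathlib import Path

# Assumed wording; confirm the required statement with your network.
PERMISSION_STATEMENT = (
    "LOCKSS system has permission to collect, preserve, and serve "
    "this Archival Unit"
)


def build_manifest(title: str, base_url: str) -> str:
    """Return a basic manifest page: a base URL link, then the permission statement."""
    safe_title = html.escape(title)
    safe_url = html.escape(base_url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"<head><title>{safe_title} Manifest Page</title></head>\n"
        "<body>\n"
        f"<h1>{safe_title}</h1>\n"
        f'<p><a href="{safe_url}">{safe_title}</a></p>\n'
        f"<p>{PERMISSION_STATEMENT}</p>\n"
        "</body>\n"
        "</html>\n"
    )


def write_manifests(root: Path, aus: dict, filename: str = "manifest.html") -> list:
    """Write one manifest into each AU folder under `root`.

    `aus` maps an AU folder name to that AU's own base URL.
    """
    written = []
    for folder, base_url in aus.items():
        au_dir = root / folder
        au_dir.mkdir(parents=True, exist_ok=True)
        path = au_dir / filename
        path.write_text(build_manifest(folder, base_url), encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    # Hypothetical two-AU collection written to a throwaway directory.
    aus = {
        "postcards_part1": "http://example.org/lockss/postcards_part1/",
        "postcards_part2": "http://example.org/lockss/postcards_part2/",
    }
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_manifests(Path(tmp), aus):
            print(path.relative_to(tmp))
```

Two AUs in, two manifests out, each sitting in its own AU folder and pointing at that AU's own base URL.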