LOCKSS Basics

This page describes the very basics of LOCKSS for new members of the preservation network.

Lots of Copies Keeps Stuff Safe

LOCKSS -- Lots Of Copies Keeps Stuff Safe

LOCKSS is a preservation network that operates on the premise that the most reliable approach to long-term preservation is to create multiple replicas of data and distribute those replicas across geographic locations and hardware.

While one could make multiple copies of data on various media and mail those media to random corners of the globe, the media themselves are not interconnected. Invariably, when data is replicated across discrete media, discrepancies between sets of data will emerge. That is, without checking and monitoring for changes, one will find that two presumably identical sets of data are in fact not identical. At that point it would be a challenge to determine the correct data.

LOCKSS creates a preservation network where identical replicas of data are distributed across multiple servers, and the servers, while independent, are connected over a network. Each server holds a full copy of the preservation data. To monitor for discrepancies, the preservation servers participate in polls: each server creates a hash of the preserved content and sends that hash as a vote. LOCKSS has elaborate logic to determine whether action is taken, but it boils down to this:

  1. The votes are tallied to see whether a quorum exists.
  2. If a quorum exists and a preservation node finds itself in the minority, that node corrects its own copy of the content to align with the majority.
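
The two steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the actual LOCKSS polling protocol: the function names, the node names, the use of SHA-256, and the rule that a quorum is simply a minimum number of votes cast are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def content_hash(content: bytes) -> str:
    """Return a hex digest that stands in for a node's vote."""
    return hashlib.sha256(content).hexdigest()


def run_poll(copies: dict[str, bytes], quorum: int) -> dict[str, bytes]:
    """Tally one poll and return the copies held by each node afterwards.

    `copies` maps a hypothetical node name to that node's copy of the content.
    `quorum` is assumed here to be the minimum number of votes required.
    """
    votes = {node: content_hash(data) for node, data in copies.items()}

    # Step 1: without a quorum, no action is taken.
    if len(votes) < quorum:
        return dict(copies)

    tally = Counter(votes.values())
    winning_hash, winning_count = tally.most_common(1)[0]

    # Assumption: a strict majority is needed before anyone repairs.
    if winning_count * 2 <= len(votes):
        return dict(copies)

    # Step 2: nodes in the minority replace their copy with the majority's.
    good_copy = next(
        data for node, data in copies.items() if votes[node] == winning_hash
    )
    return {
        node: data if votes[node] == winning_hash else good_copy
        for node, data in copies.items()
    }


nodes = {"node-a": b"report v1", "node-b": b"report v1", "node-c": b"report v1 (damaged)"}
repaired = run_poll(nodes, quorum=3)
```

With three voters and a quorum of three, `node-c` ends up with the same bytes as the other two nodes. With a quorum of four, the same call leaves every copy untouched.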

How to get data into the network

In the 1.x version of LOCKSS, content is ingested by a Web crawler. Content to be preserved is placed on a server where it is accessible by URL. This is referred to as 'staging' content. To get the content into the preservation node, the LOCKSS software is given a start URL. The crawler opens the start URL, then recursively follows the URLs it discovers and collects the content found at each of them.
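
The recursive discovery described above can be illustrated with a small crawler. This is a minimal sketch and not the LOCKSS crawler: the `crawl` function, the `fetch` callback, and the rule that only URLs beneath the start URL are followed are assumptions made for the example, and the host name is a placeholder.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url: str, fetch: Callable[[str], str]) -> dict[str, str]:
    """Return {url: body} for everything reachable beneath start_url.

    `fetch` is a hypothetical callback that returns the body for a URL.
    """
    collected: dict[str, str] = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in collected:
            continue
        body = fetch(url)
        collected[url] = body

        parser = LinkParser()
        parser.feed(body)
        for href in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            # Assumption: stay inside the staged tree and skip parent links.
            if absolute.startswith(start_url) and absolute not in collected:
                queue.append(absolute)
    return collected


# An in-memory stand-in for a staging server with directory indexes.
pages = {
    "http://staging.example.org/data/": '<a href="../">Parent</a> <a href="a.txt">a.txt</a> <a href="sub/">sub/</a>',
    "http://staging.example.org/data/a.txt": "hello",
    "http://staging.example.org/data/sub/": '<a href="b.txt">b.txt</a>',
    "http://staging.example.org/data/sub/b.txt": "world",
}
result = crawl("http://staging.example.org/data/", pages.__getitem__)
```

Passing the fetch step in as a callback keeps the sketch runnable without network access. Against a real staging server, `fetch` could wrap `urllib.request.urlopen`.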

For a Private LOCKSS Network, the preferred method of 'staging' content is by organizing the content into directories and exposing the entire directory tree as a Web page by using directory indexes.

Care should be taken when choosing the URL for staged content.

The URL from which staged content is retrieved is inseparable from the content. The preservation node stores the URL in its file system together with the content. Two identical pieces of data presented through two different URLs are treated as two completely separate pieces of preservation data.
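
The consequence can be shown with a toy store that files each item under its URL. This is an illustration only and says nothing about how LOCKSS lays out its repository on disk. The class name, its methods, and the example host names are hypothetical.

```python
import hashlib


class UrlKeyedStore:
    """Toy store that files each preserved item under the URL it came from."""

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def add(self, url: str, content: bytes) -> None:
        self._items[url] = content

    def count(self) -> int:
        return len(self._items)

    def digests(self) -> dict[str, str]:
        """Return a SHA-256 digest for each stored URL."""
        return {
            url: hashlib.sha256(data).hexdigest()
            for url, data in self._items.items()
        }


store = UrlKeyedStore()
payload = b"identical bytes"

# The same bytes staged under two different host names.
store.add("http://staging.example.org/collection/report.pdf", payload)
store.add("http://files.example.org/collection/report.pdf", payload)
```

After both calls the store holds two items whose digests are identical, because the URL, not the bytes, determines identity.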

The URL used to stage the content should therefore be as persistent as possible.

Achieving persistence may require additional steps, for example a proxy, URL rewrites, or a DNS record.