Preservation of our collection as both an online and potentially offline resource is of the utmost importance to ensure continued use among our target users. XKCD itself represents an interesting collection with regard to two factors: it is entirely digital, and it grows regularly to include new content. Our preservation scheme must therefore incorporate these unique aspects when treating XKCD as a collection of documents.
In working with XKCD as a collection our group has decided to employ both incremental and complete backups of all aspects of the collection such that they can be made available both online and offline in the event that XKCD itself were to suddenly disappear from the internet. Additionally we employ a system which is fault tolerant in order to avoid situations in which catastrophic events might destroy large portions of our working data.
Finally, because XKCD is a constantly growing and evolving collection, we have attempted to automate as much of the preservation process as possible, while still inviting administrative assistance where it is beneficial or required. This allows us to offload much of the repetitive work involved in digital preservation while retaining a functional backup, and, where administrative intervention is required, to interface with our preservation procedures in a way that still meets all of our preservation goals.
The first step in our preservation procedures is putting in place a semi-automated system for backing up both primary and secondary sources of data. This process is roughly akin to the best backup practices a regular computer user might put into place for their own personal data, tailored to our specific data and made automation friendly.
We begin with the two sources of our collection's primary source material: the XKCD comic images themselves and their associated JSON database entries. XKCD updates (nearly) every Monday, Wednesday, and Friday, so these will be our primary activity days. On each update day, at approximately 4 AM CST (after the comic has been live for roughly five hours), a bash script on our server will automatically pull both the image and the JSON database entry into two separate local folders and rename them to include the date on which they were accessed. These two folders represent the incremental backup of our primary source material.
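As a rough sketch of that nightly pull (directory paths here are placeholders, and the real script linked below differs in detail), the logic looks something like the following, with `DRY_RUN=1` standing in for the actual network calls:

```shell
#!/usr/bin/env bash
# Sketch of the automated incremental pull. Paths are hypothetical;
# DRY_RUN=1 (the default here) prints the fetches instead of running them.
set -euo pipefail

STAMP=$(date +%Y-%m-%d)                    # access date used in the filename
JSON_DIR=${JSON_DIR:-/backup/dir/of/json}
IMG_DIR=${IMG_DIR:-/backup/dir/of/images}
DRY_RUN=${DRY_RUN:-1}

fetch() {  # fetch <url> <dest> -- thin wrapper around wget for dry-running
  if [ "$DRY_RUN" = 1 ]; then
    echo "would fetch $1 -> $2"
  else
    wget -q -O "$2" "$1"
  fi
}

# 1. Pull the JSON entry for the current comic, date-stamped on access.
fetch "https://xkcd.com/info.0.json" "$JSON_DIR/xkcd-$STAMP.json"

# 2. With a live copy, the image URL would be read from the JSON "img"
#    field (e.g. with jsawk) and fetched the same way:
# fetch "$IMG_URL" "$IMG_DIR/xkcd-$STAMP.png"
```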
Additionally, at this time the metadata associated with the comic will be processed for inclusion in our own active XML metadata database and prepared to accept our own metadata alongside what is provided by XKCD. This serves as a secondary backup copy of all of the information in the XKCD JSON database, increasing redundancy.
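A minimal toy version of that conversion step might look like the following. The element names, the sample JSON values, and the sed-based field extraction are all our own illustration; the production script does its JSON handling with jsawk.

```shell
# Toy conversion of one JSON entry into an XML record that can also hold
# our own cataloging metadata. Sample values are invented for the sketch,
# and sed-based extraction is for illustration only.
json='{"num": 1024, "title": "Error Code", "transcript": "..."}'

field() { printf '%s' "$json" | sed -n "s/.*\"$1\": *\"\{0,1\}\([^,\"]*\)\"\{0,1\}.*/\1/p"; }

num=$(field num)
title=$(field title)

cat <<EOF
<comic id="$num">
  <title>$title</title>
  <!-- locally added cataloging fields are appended here -->
</comic>
EOF
```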
This system of incremental backups allows us to generate locally stored copies of all data which is not currently included in any of our complete backups (the generation of which will be explained shortly).
The code to put this functionality into practice is available here: https://github.com/Falx4444/501Webpage/blob/master/xkcdXMLarchiveincremental.sh (note: requires recent versions of both wget and jsawk). And if you are interested in trying out the incremental backup functionality yourself, the script to seed the original XML database with all currently existing comics is located here: https://github.com/Falx4444/501Webpage/blob/master/xkcdXMLarchive.sh
Following the generation of our incremental backups, once every two weeks, on Sunday at approximately 4 AM CST, our incremental backups will be merged into our primary full backup, which will reside on a separate partition and physical drive of our server (optimally one in a RAID 5 configuration), via a cron-scheduled rsync task. In addition to housing local copies of both the comic images and their associated JSON database entries, a backup copy of our own XML database will also be taken at this time in order to preserve our own secondary data as well as the primary data drawn from the XKCD website. Such a task would likely resemble the following:
rsync -avzRb /backup/dir/of/json/* /backup/dir/of/images/* /complete/backup/dir
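Because cron has no native every-two-weeks field, one way to get the fortnightly cadence (an assumption on our part about the scheduling mechanics, not something the commands above dictate) is a weekly Sunday 4 AM crontab entry whose script bails out on odd ISO week numbers:

```shell
# Invoked from a weekly crontab line such as:  0 4 * * 0  /path/to/merge-if-due.sh
# The guard below turns the weekly trigger into an every-other-week merge.
merge_due() {  # true on even ISO week numbers, giving a two-week cadence
  week=$(date +%V)
  [ $(( 10#$week % 2 )) -eq 0 ]   # 10# avoids octal parsing of 08 and 09
}

if merge_due; then
  echo "even ISO week: would run the rsync merge into /complete/backup/dir"
else
  echo "odd ISO week: skipping the merge until next Sunday"
fi
```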
In performing this merge, in order to account for changes in metadata, an incremental backup of the existing complete backup will be created, and in cases where the incremental backup taken at a later date includes more information it will be preferred over the copy taken on the original date of publication. This allows us to account for cases in which the addition of the transcript lags behind the original publication of the comic on the website, a relatively common occurrence.
On the first of every month this folder would, via rsync, be transferred to a backup server physically housed in a different location than our active server. This command would resemble:
rsync -avzRb /complete/backup/dir user@remotebackupserver:/redundant/backup/dir
Thus, we would now have complete backups of both the primary source material and the secondary cataloging material we have added to the metadata in two physically separate locations, preparing us not only for data loss via hardware or software issues on one of our machines, but also for any potential disaster-related losses, such as those caused by fire, natural disasters, power loss, localized zombie invasions, etc. Additionally, the entirety of the current complete backup would be zipped into a single compressed file via
tar -vzcf `date +%m%d%g`.tar.gz /path/to/backup/directory

and transferred into two locations: a directory on another physical hard disk or array in our backup server, optimally configured in RAID 5, and a portable hard drive which will be inserted into the backup computer on the last day of the month and removed on the morning of the first day of every month to be returned to another safe location, such as a bank safety deposit box. This requires two rsync commands formatted exactly like the previous local copy command between the incremental and complete backups, replacing the source directory with the location of the now-zipped complete backup, and replacing the target directory, in the first case, with a directory located on the separate internal disk of the computer and, in the latter case, with a directory located on an external hard drive connected to the backup server.
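Put together, the monthly archive step might look like the sketch below. A temp directory stands in for the real complete-backup path, and the two rsync copies are shown but commented out since neither destination exists in this sketch:

```shell
set -eu
# Stand-in for /complete/backup/dir; the real path holds the images,
# the JSON entries, and our XML database.
SRC=$(mktemp -d)
echo "placeholder comic data" > "$SRC/sample.txt"

ARCHIVE="/tmp/$(date +%m%d%g).tar.gz"      # %m%d%g, e.g. 040125 for 1 Apr 2025
tar -zcf "$ARCHIVE" -C "$SRC" .            # compress the complete backup

# Copy 1: the second internal disk; copy 2: the external drive swapped monthly.
# rsync -avzb "$ARCHIVE" /second/internal/disk/
# rsync -avzb "$ARCHIVE" /mnt/external/drive/
echo "archived $(du -h "$ARCHIVE" | cut -f1) to $ARCHIVE"
```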
Thus, when all is said and done, our data exists in three separate places: our production server, a remote backup server, and external storage housed in a secure facility. Additions since the last potential cache wipe exist in the incremental backup; all data exists as a complete backup both on our active server and on the dedicated backup server (in two places); and finally on removable storage hardware stored in a facility such as a bank.
Laying out this backup scheme visually we get something that looks like this:
This backup scheme generates a sufficient amount of redundancy such that in the case of minor data loss, backups can easily be recreated from data located on the same machine, and if an entire machine were destroyed, its functionality could be replaced from the data stored elsewhere. Additionally, this backup scheme requires only three separate pieces of hardware (in reality only two that would need to be locally administered: the remote backup server and the external storage, assuming our production server were located in cloud hosting like EC2). This comparatively small hardware requirement makes our solution exceptionally easy and cheap to put into practice.
XKCD not infrequently publishes non-standard comics. Luckily for us, owing to the diligence of the author, Randall Munroe, these aberrations are well documented in the JSON provided for each comic and therefore do not break our automated solutions. Instead, these special cases will be monitored and acted on by those working with the collection. To accurately preserve a non-standard comic, we plan to take a full backup of the web page on which it is embedded and consider that, for the purposes of preservation, to be roughly equivalent to what would otherwise be an image.
Employing such a solution allows the rest of our backup scheme to behave as expected after the page has been manually added to the incremental backup.
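For those manual additions, the page capture itself can be sketched as a single wget invocation. The destination directory is hypothetical, and `DRY_RUN=1` prints the command rather than hitting the site:

```shell
# Mirror the page hosting a non-standard comic so the incremental backup
# holds a browsable copy rather than a lone image.
COMIC=${COMIC:-1110}          # 1110, "Click and Drag", is one such comic
DEST="/backup/dir/of/pages/$(date +%Y-%m-%d)"   # hypothetical path
DRY_RUN=${DRY_RUN:-1}

# -p: also fetch page requisites (CSS, scripts, embedded media)
# -k: rewrite links so the saved copy is browsable offline
cmd="wget -p -k -P $DEST https://xkcd.com/$COMIC/"
if [ "$DRY_RUN" = 1 ]; then
  echo "would run: $cmd"
else
  $cmd
fi
```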