Friday 30 January 2015

Webarchaeology: finding old pages still online

Here in the Netherlands, the legal framework is not very supportive of full domain harvests. This is one of the reasons that the KB, the National Library of the Netherlands, follows a selective approach. The owner of each selected site is asked for permission to have the site harvested, stored and made available by the KB.

Almost by definition, a selective approach does not yield as complete a representation of the national web as a domain harvest does. Selecting websites for the archive is a labour-intensive job, so only a limited number of sites can be included. As a result, potentially valuable parts of the Dutch web are at risk of being lost.

Regarding the history of the Dutch web, many early sites have disappeared before they were archived by the KB or other organisations like the Internet Archive. However, some remains of that early web are still online. But how to find those pieces to preserve them for the future?

In the days before Facebook, people used personal home pages to publish on the web. Many commercial providers that hosted the early home pages have gone. Other home pages disappeared because people moved to other providers and didn't bother to keep the home page. Others were simply removed, due to a lack of interest, privacy concerns or sadly because their owners died and stopped paying the bills.

In the non-commercial world, the situation has been a bit more stable. Most of the scientific institutions that witnessed the rise of the web still exist. Generally, those organisations have experienced little pressure to save money by removing the unused home pages of employees who moved on or retired. Payroll administration probably wasn't even tied to the administration of user accounts. Employees of some of those institutions often played a vital role in the early days of the web. Overall, this has created a beneficial situation for what might be called 'internet archaeology': using the internet to dig up material that was thought to be long gone.

Following this approach, a great many pages from the early web were found. Some great examples are the home pages found at CWI (site:homepages.cwi.nl) and at NIKHEF (Willem van Leeuwen's homepage).

Another method of finding old pages is to use another early site as 'bait'. One of the early Dutch websites was DDS.nl (see: Internet Archive), a highly successful Freenet. Overwhelmed by its success and lacking sufficient funds, DDS.nl stopped in 1999. Combined, this makes DDS.nl great bait for finding old pages: it was widely linked to, and pages that still link to it are likely to be old. A Google search for link:dds.nl -site:dds.nl will return many pages from before the turn of the century.

Tuesday 27 January 2015

Linkrot and the mset^H^H^H^H data-versiondate attribute

I run an archaeological website (http://vici.org/) that has a database of over 20000 records as its backend. Many of these records provide external links. Of course, every now and then linkrot creeps in: links stop working or lead to a page with content other than intended.

To overcome this, I've started creating a tool that will auto-archive all external links. When a user clicks on a link, a piece of JavaScript will invoke a tiny service that returns the HTTP status code of the requested page. If the page is not available (returning a 404), the user will be redirected to a web archive. The aim is that the site will eventually run its own web archive, auto-archiving each newly discovered link.
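The decision described above can be sketched in a few lines of Python. This is only an illustration, not the code running on the site: the function name `resolve_link` is made up, and it assumes the Internet Archive's Wayback Machine as the fallback archive, since the site's own archive does not exist yet.

```python
# Assumption: the Wayback Machine is the fallback archive. A URL of the form
# https://web.archive.org/web/<original-url> redirects to its latest capture.
WAYBACK_PREFIX = "https://web.archive.org/web/"


def resolve_link(url: str, status: int) -> str:
    """Return the address the visitor should be sent to.

    `status` is the HTTP status code that the (hypothetical) status-check
    service reported for `url`.
    """
    if status == 404:
        # The page is gone: hand the visitor over to the archive.
        return WAYBACK_PREFIX + url
    # Any other status: follow the original link unchanged.
    return url
```

Only 404 triggers the redirect here, as in the description above. Whether other codes (410, or server errors) should do the same is a design choice left open.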

When directing a user to an archived version of a page, ideally we link to the very version of the page the author had in mind when he created the link. So we need more information than just a hyperlink. This issue can be solved by following an approach originally suggested by Ryan Westphal, Herbert Van de Sompel and Michael L. Nelson in "The mset Attribute". Basically, it proposes to enrich hyperlinks with an attribute that provides temporal context, refers to a specific archived copy, or both. Their draft has now been superseded by the Memento Robust Links specification (Robust Links - Link Decoration, see also Robust Links - Motivation).

A hyperlink following this specification could look like:

<a href="http://www.w3.org/spec.html" data-versiondate="2014-03-17">HTML</a>


<a href="http://www.w3.org/spec.html" data-versiondate="2014-03-17"

I intend to implement the data-versiondate attribute in the CMS of the website. When a new link is added to a record, the CMS will insert a data-versiondate attribute using the current date.
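As a rough sketch of what that CMS step could look like: the function below adds a data-versiondate attribute to every external link in an HTML fragment that does not have one yet. The name `add_versiondate` and the regular-expression approach are assumptions for the sake of illustration. It only handles double-quoted href values, and a real CMS would more likely work on a parsed document.

```python
import re
from datetime import date

# Opening tag of an anchor whose href is an absolute http(s) URL.
# Simplification: only double-quoted attribute values are recognised.
ANCHOR_RE = re.compile(r'<a\b([^>]*\bhref="https?://[^"]*"[^>]*)>', re.IGNORECASE)


def add_versiondate(html: str, today: date) -> str:
    """Add data-versiondate to external links that do not carry one yet."""
    stamp = today.isoformat()  # YYYY-MM-DD, as in the example above

    def decorate(match):
        attrs = match.group(1)
        if "data-versiondate=" in attrs.lower():
            # Keep the date that was recorded when the link was first added.
            return match.group(0)
        return f'<a{attrs} data-versiondate="{stamp}">'

    return ANCHOR_RE.sub(decorate, html)
```

Calling `add_versiondate('<a href="http://www.w3.org/spec.html">HTML</a>', date(2014, 3, 17))` gives the decorated link shown earlier. Links that already have a date are left alone, so re-saving a record does not overwrite the original version date.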

Update 2015-01-27 17:09 (CET): added the Robust Link specs and changed examples accordingly.

PS: See also the W3.org community on Robustness and Archiving.

Sunday 25 January 2015

Validating WARCs?

Recently I started experimenting with Mat Kelly's Chrome extension WARCreate. It is an excellent idea to have a tool that enables professional curators and amateur web archivists alike to create WARCs on the fly from within a browser.

My first experiments seemed to run fairly well, even better than expected, but the WARC I created from Frankfurter Allgemeine couldn't be loaded in my OpenWayback engine. When creating a CDX, the tool complained that there were invalid characters in the WARC. The resulting CDX file was empty.

I validated the WARC using the JWAT Tools but the output didn't offer me much more insight. Here is what it returned:

./jwattools.sh test -e 20150123091322216.warc
Showing errors: true
Validate digest: true
Using 1 thread(s).
Output Thread started.
ThreadPool started.
Queued 1 file(s).
ThreadPool shut down.                                              
Output Thread stopped.
# Job summary
GZip files: 0
  +  Arc: 0
  + Warc: 0
 Arc files: 0
Warc files: 1
    Errors: 17
  Warnings: 0
RuntimeErr: 0
   Skipped: 0
      Time: 00:00:05 (5129 ms.)
TotalBytes: 9.9 mb
  AvgBytes: 1.9 mb/s
'WARC-Target-URI' value: 14
Data before WARC version: 1
Empty lines before WARC version: 1
Trailing newlines: 1

I had never used the JWAT Tools before, but reading about them I noticed that they can also create CDX files. To my surprise, this tool did create a CDX file and, indeed, now I was able to display Frankfurter Allgemeine in OpenWayback. I did contact Mat Kelly about the issue; perhaps it can be fixed. More verbose output from JWAT Tools would also help.
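In the meantime, the 14 'WARC-Target-URI' errors can be looked at by hand. The sketch below is not what JWAT does internally, and I don't know which characters it actually objected to. It simply assumes an uncompressed WARC, scans it line by line, and lists every WARC-Target-URI value that contains characters outside the set RFC 3986 allows in a URI. The function name `suspicious_target_uris` is made up for this example.

```python
import re

# Characters RFC 3986 permits in a URI: unreserved, reserved, and '%'.
URI_CHARS = re.compile(rb"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")
HEADER = b"warc-target-uri:"


def suspicious_target_uris(warc_bytes: bytes):
    """Yield (line_number, value) for Target-URI values with non-URI characters.

    Simplification: every line is inspected, so a payload line that happens
    to start with the header name would be reported as well.
    """
    for number, line in enumerate(warc_bytes.split(b"\n"), start=1):
        # WARC header names are case-insensitive.
        if line[:len(HEADER)].lower() != HEADER:
            continue
        value = line[len(HEADER):].strip()
        # Some writers wrap the URI in angle brackets; ignore those.
        if value.startswith(b"<") and value.endswith(b">"):
            value = value[1:-1]
        if not URI_CHARS.fullmatch(value):
            yield number, value
```

To use it, read the WARC in binary mode, for example `open("20150123091322216.warc", "rb").read()`, and print whatever the function yields. A file of 9.9 MB fits in memory without trouble.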