Friday 30 January 2015

Webarchaeology: finding old pages still online

Here in the Netherlands, the legal framework offers little support for doing a full domain harvest. This is one of the reasons that the KB, the National Library of the Netherlands, follows a selective approach. The owner of each selected site is asked for permission to have the site harvested, stored and made available by the KB.

Almost by definition, a selective approach does not result in as complete a representation of the national web as a domain harvest. Selecting websites for the archive is a labour-intensive job, so only a limited number of sites can be included. As a result, potentially valuable parts of the Dutch web are at risk of being lost.

As for the history of the Dutch web, many early sites disappeared before they could be archived by the KB or by other organisations such as the Internet Archive. However, some remains of that early web are still online. But how can we find those pieces and preserve them for the future?

In the days before Facebook, people used personal home pages to publish on the web. Many of the commercial providers that hosted the early home pages are gone. Other home pages disappeared because people moved to other providers and didn't bother to keep the home page. Others were simply removed because of a lack of interest, privacy concerns or, sadly, because their owners died and the bills were no longer paid.

In the non-commercial world, the situation has been a bit more stable. Most of the scientific institutions that witnessed the rise of the web still exist. Generally, those organisations have experienced little pressure to save money by removing the unused home pages of employees who moved or retired. Payroll administration probably wasn't even tied to the administration of user accounts. Employees of some of those institutions often played a vital role in the early days of the web. Overall, this has created a favourable situation for what might be called 'internet archaeology': using the internet to dig up stuff that was thought to be long gone.

Following this approach, a great many pages from the early web were found. Some great examples are the home pages found at CWI (search for site:homepages.cwi.nl) and, at NIKHEF, Willem van Leeuwen's homepage.

Another method of finding old pages is to use another early site as 'bait'. One of the early Dutch websites was DDS.nl (see: Internet Archive), a highly successful Freenet. Overwhelmed by its success and lacking sufficient funds, DDS.nl stopped in 1999. Together, this makes DDS.nl great bait for finding old pages: a page that still links to it was most likely written before then. A Google search for link:dds.nl -site:dds.nl will return many pages from before the turn of the century.
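
The two kinds of search used in this post can be sketched in a few lines of Python. This is only an illustration, not a tool used by the KB: the function names and the SEARCH_URL value are assumptions made for this example, and whether a given search engine still honours the link: operator is something the sketch does not check.

```python
from urllib.parse import urlencode

# Assumed search endpoint, used here only to build a clickable URL.
SEARCH_URL = "https://www.google.com/search"


def site_query(host: str) -> str:
    """Query that lists pages hosted on one (old) host."""
    return f"site:{host}"


def bait_query(bait_domain: str) -> str:
    """Query for pages that link to the bait domain,
    excluding pages on the bait domain itself."""
    return f"link:{bait_domain} -site:{bait_domain}"


def search_url(query: str) -> str:
    """Turn a query string into a URL that can be opened in a browser."""
    return SEARCH_URL + "?" + urlencode({"q": query})


if __name__ == "__main__":
    for query in (site_query("homepages.cwi.nl"), bait_query("dds.nl")):
        print(query)
        print(search_url(query))
```

Swapping in another early site as the argument to bait_query gives a new piece of bait without changing anything else.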

1 comment:

  1. Very interesting post. Recently I made an attempt myself to recover a bunch of newspaper articles that were removed by a major Dutch newspaper, after it turned out they were based on fabricated sources. Using something quite similar to the 'bait' approach you're describing here, I was able to locate many of the deleted articles in the Internet Archive's WayBack Machine. I wrote a blog post about it, which you can find here:

    Dutch newspaper wipes out articles citing fabricated sources - Internet Archive to the rescue!