Wednesday 21 October 2015

A beginner's guide to processing 'lots of' data

Working at a library, I often need to process a lot of data. Processing data, for example with XSLT or scripts, is one thing, but when there is 'a lot' of data to process, additional rules apply. I consider myself a beginner when it comes to processing larger amounts of data, but here are some basic rules I've learned so far:

1. Keep it simple, learn to use Unix-tools

For most processing tasks or data-analysis jobs, just a small set of very simple tools will do the job. Complex tasks can be executed efficiently by chaining ('piping') simple tools that are each very good at one specific task. Complex tools, on the other hand, often complicate the process unnecessarily, for example by demanding large amounts of system memory. To start with, tools that will bring you a long way are 'grep', 'sed', 'sort', 'uniq' and 'wc -l'.
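As a small, hypothetical example ('records.tsv' and its columns are invented here), counting the distinct values in one column of a tab-separated dump takes a single pipeline:

```shell
# Invented sample data: tab-separated records (id, type)
printf '1\tbook\n2\tbook\n3\tmap\n' > records.tsv

# Count the distinct values in column 2, most frequent first
cut -f2 records.tsv | sort | uniq -c | sort -rn
```

Each tool does one small job: 'cut' selects the column, 'sort' groups identical values, 'uniq -c' counts them and the final 'sort -rn' ranks the counts.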

2. Keep your job running, use 'screen'

Usually, data is not processed on the actual computer at your desk but on a remote machine. When you work on a computer through a remote connection, you don't want to run the risk that your processing job quits just because your console lost its connection. Well, there is a tool for that, called 'screen'. Before you start your job, start a 'screen' session. Your process will then just keep running if you lose the connection to the console. When that happens, just log in to the shell again and give the command 'screen -R'. Now you are back in your last session, with the processing job still running.

3. Keep your data in one file

When you need to process, say, one million records, your OS won't be happy if you store all these records as separate files on your filesystem. Even a simple 'ls' command will then take ages to complete. Storing the data in a database might be a good idea but has some downsides, one of them being that it complicates the required scripts and tools and is usually just technical overkill.
My approach is to store all data in one file. My data-harvesting script serializes the XML of each OAI-PMH record to a single line. Using a very simple shell script, I can then process each line separately.
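A minimal sketch of that per-line processing; the two-record input here is invented, standing in for the harvested file:

```shell
# One serialized record per line (stand-in for real OAI-PMH records)
printf '<record id="1"/>\n<record id="2"/>\n' > records.txt

# Treat every line as one record and process it independently
while IFS= read -r record; do
    printf '%s\n' "$record" | grep -o 'id="[0-9]*"'
done < records.txt
```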

4. Keep your cores working using 'parallel'

When processing data, make the best use of available processing power. This can be done 'manually' by splitting your task into as many jobs as your machine has cores and running the processes in the background (by adding '&' to the invocation). Besides making better use of available processing power, splitting input files into smaller ones will significantly speed up the work.
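A sketch of this manual variant, with invented file names and 'wc -l' standing in for a real processing script:

```shell
seq 1 100 > in.txt              # invented sample input: 100 lines
split -n l/4 in.txt chunk_      # 4 line-based chunks (GNU split)
for f in chunk_*; do
    wc -l < "$f" > "$f.out" &   # one background job per chunk
done
wait                            # block until all background jobs finish
cat chunk_*.out                 # four counts that sum to 100
```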
The great news is that parallelizing jobs and splitting input files can both be done with a single tool named GNU parallel. An example call:

<in.txt parallel --pipe --blocksize 3000000 './parse.sh > /tmp/out_{#}'

will process all the lines in 'in.txt' using the script 'parse.sh' (here assumed to be in the current directory) and store the results in separate files in /tmp/.

5. Your OS matters

I am experienced with Unix tools. MS Windows does not appear to be designed for the stuff you can easily do using Unix (or Linux or macOS). Fortunately, with Cygwin you can run most Unix tools on Windows. The downside is that, at least in my experience, scripts under Cygwin are much slower than on an equivalent Linux machine.

6. Your hardware matters

Maybe this goes without saying, but when it comes to I/O speed, systems with SSDs are vastly superior to systems with traditional drives. Personally, I will only be buying SSDs.

Friday 13 February 2015

Link rot: detecting soft-404s

Links rot and content drifts; we are all aware of that. But how can one actually detect link rot? Web servers do not always return a proper '404' HTTP status code when the requested page can't be found. Often, the replacement page that tells the user that the requested page was not found is accompanied by a '200' status code, signifying everything is OK. This is called a soft-404.

So we can't always rely on the HTTP status code to know whether a page is available. Since the robustify.js website add-on depends on knowing if a page can be found or not, I implemented an algorithm that attempts to detect these soft-404s.

I followed an approach that was suggested to me on Twitter. By sending the server a request with a random URL, we know that the returned page must be a '404'. With a technique known as fuzzy hashing, we can then compare this known '404' page with the requested page. If they are identical, or at least very similar, it is very likely that the requested page is also a '404'.
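A minimal sketch of that comparison step. Here a simple line-overlap ratio stands in for a real fuzzy hash (such as ssdeep), and two locally created files stand in for the fetched pages:

```shell
# Stand-ins for the page behind a random (nonexistent) URL and the
# actually requested page; here they are identical error pages.
printf 'Sorry,\nthe page\nwas not found\n' > random_url.html
printf 'Sorry,\nthe page\nwas not found\n' > requested.html

# Similarity = lines shared by both files / lines in the known 404
shared=$(sort random_url.html requested.html | uniq -d | wc -l)
total=$(wc -l < random_url.html)
echo "similarity: $shared/$total"

# A (near-)identical page is treated as a soft-404, despite the 200
[ "$shared" -eq "$total" ] && echo "soft-404 detected"
```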

There is room for optimizing this algorithm. To start with, the required level of similarity can be tweaked. Robustify.js currently performs the soft-404 test only if the original request results in one or more redirects. However, soft-404s can also be generated at the original request URL, without redirection, so with this approach we will miss some soft-404s. Further, if the random request actually does return a '404' status code, we might assume the server is configured well and skip the comparison.
Additionally, we might try to 'read' the page and check whether it contains strings like 'error' or '404'. Such an approach is clearly less elegant than fuzzy hashing and would require more training (internationalization) and maintenance. Perhaps it might work well on top of the hashing approach when the similarity between the page and the forced '404' is indecisive.

Improving the soft-404 detection algorithm will necessarily require a lot of manual testing. Without perfect soft-404 detection there is no easy way to create a test set, and without a large enough test set we can't be very effective at improving the algorithm. Since soft-404 detection is valuable not just for robustify.js but also for many other applications, the Heritrix crawler being a fine example, I do hope that with some community effort we can further improve it. A first step might be for all you crawl engineers out there to send suspect crawl artefacts to the statuscode.php service (part of robustify.js) with soft-404 detection enabled and see if it recognizes them as soft-404s.

For example, the test result for http://www.trouw.nl/tr/nl/4324/Nieuws/archief/article/detail/1593578/2010/05/12/Een-hel-vol-rijstkoeken-en-insecten.dhtml shows at the end of the JSON output a 100% match between this page and the forced '404' (thus recording a '404' status code).

Without soft-404 detection the script gives a 200 status code.

Monday 2 February 2015

robustify.js: Returning a Memento instead of a “404”

To end link rot on the archaeological site I run (Vici.org), I wanted a tool that would stop sending users to pages that return a “404 File not found” error. So I wrote a nifty little script called robustify.js.

robustify.js checks the validity of each link a user clicks. If the linked page is not available, robustify.js will try to redirect the user to an archived version of the requested page. In discovering an archived version of the page, the script implements Herbert Van de Sompel's Memento Robust Links - Link Decoration specification (part of the Hiberlink project). By default, it uses the Memento Time Travel service as a fallback. You can easily implement robustify.js on your own web pages so that it redirects to your preferred web archive.

robustify.js can be found on GitHub. A demo can be seen here.

Friday 30 January 2015

Webarchaeology: finding old pages still online

Here in the Netherlands, the legal framework isn't very supportive of doing a full domain harvest. This is one of the reasons that the KB, the National Library of the Netherlands, follows a selective approach: the owner of each selected site is asked for permission to have the site harvested, stored and made available by the KB.

Almost by definition, a selective approach does not result in as complete a representation of the national web as a domain harvest does. Selecting websites for the archive is a labour-intensive job. Because of this approach, potentially valuable parts of the Dutch web are at risk.

Regarding the history of the Dutch web: many early sites disappeared before they could be archived by the KB or by other organisations like the Internet Archive. However, some remains of that early web are still online. But how do we find those pieces, so we can preserve them for the future?

In the days before Facebook, people used personal home pages to publish on the web. Many of the commercial providers that hosted the early home pages are gone. Other home pages disappeared because people moved to other providers and didn't bother to keep the home page. Still others were simply removed due to a lack of interest, privacy concerns or, sadly, because their owners died and the bills stopped being paid.

In the non-commercial world, the situation has been a bit more stable. Most of the scientific institutions that witnessed the rise of the web still exist. Generally, those organisations have experienced little pressure to save money by removing the unused home pages of employees who moved on or retired; payroll administration probably wasn't even tied to the administration of user accounts. Employees of some of those institutions often played a vital role in the early days of the web. Overall, this has created a beneficial situation for what might be called 'internet archaeology': using the internet to dig up stuff that was thought to be long gone.

Following this approach, a great number of pages from the early web have been found. Some great examples are the home pages found at CWI (site:homepages.cwi.nl) and at NIKHEF (Willem van Leeuwen's home page).

Another method of finding old pages is to use another early site as 'bait'. One of the early Dutch websites was DDS.nl (see: Internet Archive), a highly successful Freenet. Overwhelmed by its success and lacking sufficient funds, DDS.nl stopped in 1999. Combined, this makes DDS.nl great bait for finding old pages. A Google search for link:dds.nl -site:dds.nl will turn up many pages from before the turn of the century.

Tuesday 27 January 2015

Linkrot and the mset^H^H^H^H data-versiondate attribute

I run an archaeological website (http://vici.org/) with a backend database of over 20,000 records. Many of these records provide external links. Of course, every now and then link rot creeps in: links stop working or direct to a page with content other than intended.

To overcome this I've started creating a tool that will auto-archive all external links. When a user clicks a link, a JavaScript routine invokes a tiny service that returns the HTTP status code of the requested page. If the page is not available (returning a 404), the user is redirected to a web archive. The aim is for the site to eventually run its own web archive, auto-archiving each newly discovered link.

When directing a user to an archived version of a page, ideally we link to the very version of the page the author had in mind when he created the link. So we need more information than just a hyperlink. This issue can be solved by following an approach originally suggested by Ryan Westphal, Herbert Van de Sompel and Michael L. Nelson in "The mset Attribute". Basically, it proposes enriching hyperlinks with an attribute that provides temporal context, refers to a specific archived copy, or both. Their draft has now been superseded by the Memento Robust Links specification (Robust Links - Link Decoration; see also Robust Links - Motivation).

A hyperlink following this specification could look like:

<a href="http://www.w3.org/spec.html" data-versiondate="2014-03-17">HTML</a>

I intend to implement the data-versiondate attribute in the CMS of the website. When a new link is added to a record, the CMS will insert a data-versiondate attribute using the current date.
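A sketch of that CMS step, reduced to a 'sed' one-liner over an invented snippet (a real CMS would do this when a record is saved):

```shell
today=$(date +%F)               # the current date, e.g. 2015-01-27
printf '<a href="http://example.org/">example</a>\n' > snippet.html

# Decorate plain anchors with a data-versiondate attribute
sed "s|<a |<a data-versiondate=\"$today\" |" snippet.html
```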

Update 2015-01-27 17:09 (CET): added the Robust Link specs and changed examples accordingly.

PS: See also the W3.org community on Robustness and Archiving.

Sunday 25 January 2015

Validating WARCs?

Recently I started experimenting with Mat Kelly's Chrome extension WARCreate: an excellent idea to have a tool that enables professional curators and amateur web archivists alike to create WARCs on-the-fly from within a browser.

My first experiments seemed to run fairly well, even better than expected, but the WARC I created from the Frankfurter Allgemeine couldn't be loaded into my OpenWayback engine. When creating a CDX, the tool complained that there were invalid characters in the WARC. The resulting CDX file was empty.

I validated the WARC using the JWAT Tools, but the output didn't offer me much more insight. Here is what it returned:

./jwattools.sh test -e 20150123091322216.warc
Showing errors: true
Validate digest: true
Using 1 thread(s).
Output Thread started.
ThreadPool started.
Queued 1 file(s).
ThreadPool shut down.                                              
Output Thread stopped.
# Job summary
GZip files: 0
  +  Arc: 0
  + Warc: 0
 Arc files: 0
Warc files: 1
    Errors: 17
  Warnings: 0
RuntimeErr: 0
   Skipped: 0
      Time: 00:00:05 (5129 ms.)
TotalBytes: 9.9 mb
  AvgBytes: 1.9 mb/s
'WARC-Target-URI' value: 14
Data before WARC version: 1
Empty lines before WARC version: 1
Trailing newlines: 1

I had never used the JWAT Tools before, but reading about them I noticed they also offer the ability to create CDX files. To my surprise, this tool did create a CDX file and indeed, now I was able to display the Frankfurter Allgemeine in OpenWayback. I have contacted Mat Kelly about the issue; perhaps it can be fixed. More verbose output from the JWAT Tools would also help.