Datatopia: 2017

Friday, 1 September 2017

Harvesting metadata from OAI-PMH repositories

OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) is a commonly used protocol for harvesting metadata. Many institutions, such as libraries, universities and archives, use OAI-PMH to offer public access to their metadata records.

To facilitate harvesting and processing metadata from an OAI-PMH repository, I wrote a bash shell script that should run on any Unix-based environment such as macOS and Linux, and probably even on Windows 10 using the Linux subsystem. A special feature of this script, oai2linerec.sh, available through GitHub, is that it stores the harvested records in a single file. Often, a single file will be much easier and faster to process than thousands of separate files. Storing the metadata in a database, on the other hand, would make processing and analysing the data much more complex.

The trick of oai2linerec.sh is that it serializes each XML metadata record to a single line. Such a file can be processed with a few lines of bash like:

#!/bin/bash

# set IFS to use only newlines as separator, store prev. IFS:
storedIFS=$IFS
IFS=$'\n'

# now walk through each line of input.txt
for line in $(cat input.txt) ; do
    # this example just sends the record to xmllint
    # to present it nicely formatted
    # (the quotes keep the whitespace inside the record intact):
    echo "$line" | xmllint --format -
done

#restore IFS
IFS=$storedIFS
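
How a multi-line XML record ends up on a single line can be sketched as follows. This is only an illustration and not necessarily how oai2linerec.sh does it: the function name flatten_record is hypothetical, and the sketch simply turns newlines and carriage returns into spaces, which is harmless between elements but would alter line breaks inside text content.

```shell
#!/bin/bash

# Hypothetical helper: read one XML record on stdin, print it as one line.
# Newlines and carriage returns become spaces, runs of spaces are squeezed.
flatten_record() {
    local record
    record=$(tr '\r\n' '  ' | tr -s ' ')
    # strip the single trailing space left behind by the final newline
    printf '%s\n' "${record% }"
}

# Example: a small, made-up record spread over several lines
printf '<record>\n  <title>Example</title>\n</record>\n' | flatten_record
# prints: <record> <title>Example</title> </record>
```

Because every record then occupies exactly one line, line-oriented tools such as grep, wc -l and split can be applied to the harvest file directly.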

To save space when harvesting big metadata repositories, oai2linerec.sh can optionally also compress each line (record) separately. Processing such a file is as easy as typing zcat instead of cat:

# now walk through each line of input.txt.gz
for line in $(zcat input.txt.gz) ; do
    echo "$line" | xmllint --format -
done
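
Compressing each record separately still yields a file that can be read in one go, because gzip streams can be concatenated: a file made up of several gzip members decompresses to the concatenation of their contents. The sketch below illustrates that idea under assumptions; the function name append_compressed and the temporary file are hypothetical, and oai2linerec.sh may implement this differently.

```shell
#!/bin/bash

# Hypothetical helper: append one record (a single line) to a file
# as its own gzip member.
append_compressed() {
    local record=$1 outfile=$2
    printf '%s\n' "$record" | gzip -c >> "$outfile"
}

# Example with two made-up records written to a temporary file
demo_file=$(mktemp)
append_compressed '<record><title>One</title></record>' "$demo_file"
append_compressed '<record><title>Two</title></record>' "$demo_file"

# gzip -dc reads all members back as plain lines, one record per line
gzip -dc < "$demo_file" | while IFS= read -r line ; do
    echo "record: $line"
done
rm -f "$demo_file"
```

Appending member by member means an interrupted harvest leaves a file that is still readable up to the last complete record.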

To further speed up the processing of the metadata, use tools like parallel; for more on this, have a look at A beginners guide to processing 'lots of' data.

Friday, 3 March 2017

Fixing Dockerfile issues when running on macOS

I have become a great fan of Docker as an easy way to test software and to run various development environments on the Mac. However, not all Docker images that are available through Docker Hub / Kitematic work out of the box on the Mac. The issues I've come across are generally linked to one of two causes:

Images need to explicitly expose a port when running on the Mac.
Exposed volumes don't always work as expected.

For both issues I've found a solution.

Exposing ports

As stated, Dockerfiles used on macOS need to explicitly expose the ports the container listens on. So the original Dockerfile MUST have a statement like

EXPOSE 3000

If it doesn't, you can fix this by creating your own Dockerfile and building your own image. In this example we'll do this to fix the ascdc/iiif-manifest-editor Dockerfile. First, create a folder to store the new Dockerfile. Then create a Dockerfile in that folder with just this content:

FROM ascdc/iiif-manifest-editor
EXPOSE 3000

That is all! Now just use this to build your image:

docker build -t myimage .

Then run a container from this image:

docker run -d -p 3000:3000 --name mycontainer myimage

The container will now be available at http://localhost:3000 in your browser on macOS.
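
The steps above can also be scripted. The sketch below wraps them in a hypothetical function, write_expose_dockerfile, that creates the folder and writes the two-line Dockerfile for a given base image and port. The function name and its argument order are assumptions made for illustration.

```shell
#!/bin/bash

# Hypothetical helper: write a Dockerfile that extends an existing image
# and adds an EXPOSE statement for the given port.
# Usage: write_expose_dockerfile <base-image> <port> <folder>
write_expose_dockerfile() {
    local image=$1 port=$2 dir=$3
    # create the folder for the new Dockerfile if it does not exist yet
    mkdir -p "$dir" || return 1
    printf 'FROM %s\nEXPOSE %s\n' "$image" "$port" > "$dir/Dockerfile"
}
```

Calling write_expose_dockerfile ascdc/iiif-manifest-editor 3000 iiif-fix (iiif-fix being an arbitrary folder name) writes the same Dockerfile as shown above, after which docker build -t myimage iiif-fix builds the image from that folder.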

Fixing volume-related issues

Directories inside a Docker container can be exposed on macOS through the VOLUME directive in the Dockerfile. Unfortunately, data in such a volume can't always be accessed properly. For example, tenforce/virtuoso exposes a volume that is actually a symbolic link to another folder, and Docker on macOS doesn't deal with that properly. My solution was to create my own Dockerfile (as explained above) that exposes not the symlinked directory but the original directory the symlink points to. For Virtuoso I also had to explicitly expose the ports, so it ended up like:

FROM tenforce/virtuoso
VOLUME /var/lib/virtuoso/db
WORKDIR /var/lib/virtuoso/db
EXPOSE 8890
EXPOSE 1111
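
To find out which directory a symlinked volume really points to, the path can be resolved with standard shell commands. This is a minimal sketch, assuming a POSIX shell is available where the path lives (for example in a shell inside the container); the function name resolve_dir is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: print the physical path of a directory,
# with all symbolic links resolved.
resolve_dir() {
    # the subshell keeps the caller's working directory unchanged
    ( cd -P -- "$1" 2>/dev/null && pwd -P )
}
```

The path it prints for the symlinked volume is the one to use in the VOLUME and WORKDIR statements of the new Dockerfile.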

Given this incompatibility between the filesystems used by Docker and macOS, you will probably run into problems when, for example, you try to have a MySQL container store its data files on the macOS side. Preferably, just keep them inside the container.