Friday, 1 September 2017

Harvesting metadata from OAI-PMH repositories

OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) is a commonly used protocol for harvesting metadata. Many institutions, such as libraries, universities and archives, use OAI-PMH to offer public access to their metadata records.
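
To give an idea of what a harvesting request looks like, here is a minimal sketch that builds the URL for a ListRecords request. The function name and the base URL https://example.org/oai are made up for illustration. The token is pasted in as-is, which assumes it contains no characters that need URL encoding.

```shell
#!/bin/bash

# Build the URL for an OAI-PMH ListRecords request.
# $1 = base URL of the repository (hypothetical in the calls below)
# $2 = metadataPrefix, for example oai_dc
# $3 = optional resumptionToken taken from the previous response
oai_list_records_url() {
    local base=$1 prefix=$2 token=${3:-}
    if [ -n "$token" ]; then
        # resumptionToken is an exclusive argument: no metadataPrefix here
        printf '%s?verb=ListRecords&resumptionToken=%s\n' "$base" "$token"
    else
        printf '%s?verb=ListRecords&metadataPrefix=%s\n' "$base" "$prefix"
    fi
}

# First request, then a follow-up request with a made-up token:
oai_list_records_url "https://example.org/oai" "oai_dc"
oai_list_records_url "https://example.org/oai" "oai_dc" "abc123"
```

A harvester keeps requesting the next page with the token from the previous response until the repository no longer returns one.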

To facilitate harvesting and processing metadata from OAI-PMH repositories, I wrote a bash shell script that should run on any Unix-based environment such as macOS and Linux, and probably even on Windows 10 using the Windows Subsystem for Linux. A special feature of this script, oai2linerec.sh, available through GitHub, is that it stores the harvested records in a single file. Often, a single file is much easier and faster to process than thousands of separate files. Storing the metadata in a database, on the other hand, makes processing and analysing the data much more complex.

The trick of oai2linerec.sh is that it serializes each XML metadata record to a single line. Such a single file can be processed with a few lines of bash like:

#!/bin/bash

# read input.txt line by line; IFS= and -r keep each record intact
# (no trimming of whitespace, no backslash interpretation):
cat input.txt | while IFS= read -r line ; do
    # this example just sends the record to xmllint
    # to present it nicely formatted:
    printf '%s\n' "$line" | xmllint --format -
done
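
The one-record-per-line idea itself can be sketched in a few lines. This is not taken from oai2linerec.sh; it is a simplified illustration that assumes it is acceptable to replace the line breaks inside a record with spaces. The function name and the two sample records are made up.

```shell
#!/bin/bash

# Flatten one XML record read from stdin to a single line.
# Assumption: replacing line breaks with spaces is acceptable for the
# metadata at hand; it does alter whitespace inside text content.
record_to_line() {
    tr '\r\n' '  '
    printf '\n'
}

# Append two hypothetical records to one line-record file:
out=$(mktemp)
printf '<record>\n  <title>First</title>\n</record>\n' | record_to_line >> "$out"
printf '<record>\n  <title>Second</title>\n</record>\n' | record_to_line >> "$out"

wc -l < "$out"    # counts one line per record
rm -f "$out"
```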

To save space when harvesting big metadata repositories, oai2linerec.sh can optionally also compress each line (record) separately. Processing such a file is as easy as typing zcat instead of cat:

# now walk through each line of input.txt.gz
zcat input.txt.gz | while IFS= read -r line ; do
    printf '%s\n' "$line" | xmllint --format -
done
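
Why a plain zcat is enough can be shown with a small experiment: gzip files may be concatenated, and decompressing the result yields the concatenation of the original contents. The sketch below assumes each record is appended to the file as its own gzip member. Whether oai2linerec.sh writes its output exactly this way is an assumption, and the function name and sample records are made up. It uses gzip -dc, which does the same as zcat but also works on systems where zcat expects .Z files.

```shell
#!/bin/bash

# Append one record (a single line on stdin) to a file as its own gzip member.
# $1 = file to append to
append_compressed_record() {
    gzip -c >> "$1"
}

archive=$(mktemp)
printf '%s\n' '<record><title>First</title></record>'  | append_compressed_record "$archive"
printf '%s\n' '<record><title>Second</title></record>' | append_compressed_record "$archive"

# gzip -dc decompresses all members in sequence, so the loop
# sees one record per line again.
gzip -dc "$archive" | while IFS= read -r line ; do
    printf 'record: %s\n' "$line"
done
rm -f "$archive"
```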

To further speed up the processing of the metadata, use tools like parallel; have a look at A beginners guide to processing 'lots of' data.
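
As a rough illustration of processing records in parallel, the sketch below uses xargs -P instead of parallel, simply because xargs is usually preinstalled. The function name, the number of workers (4) and the per-record command (counting bytes with wc -c) are placeholders. Each record is passed as a command-line argument, so very large records may hit the operating system's argument length limit, and the output order is not guaranteed.

```shell
#!/bin/bash

# Run a command on every line (record) of stdin, four records at a time.
# The per-record command here only counts bytes; replace it with
# something useful such as a call to xmllint.
process_records() {
    # turn newlines into NUL separators so that spaces and quotes
    # inside a record are passed on untouched
    tr '\n' '\0' | xargs -0 -n 1 -P 4 sh -c 'printf "%s\n" "$1" | wc -c' _
}

# Three made-up records:
printf '%s\n' '<record id="1"/>' '<record id="2"/>' '<record id="3"/>' | process_records
```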
