OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) is a commonly used protocol for harvesting metadata. Many institutions, such as libraries, universities and archives, use OAI-PMH to offer public access to their metadata records.
To facilitate harvesting and processing metadata from an OAI-PMH repository, I wrote a bash shell script that should run on any Unix-based environment such as macOS and Linux, and probably even on Windows 10 using the Linux subsystem. A special feature of this script, oai2linerec.sh (available through GitHub), is that it stores the harvested records in a single file. Often, a single file will be much easier and faster to process than thousands of separate files. Storing the metadata in a database, on the other hand, would make it much more complex to process and analyse the data.
The trick of oai2linerec.sh is that it serializes each XML metadata record to a single line. Such a file can then be processed with a few lines of bash, like:
    #!/bin/bash

    # set IFS to use only newlines as separator, store previous IFS:
    storedIFS=$IFS
    IFS=$'\n'

    # now walk through each line of input.txt
    for line in $(cat input.txt) ; do
        # this example just sends the record to xmllint
        # to present it nicely formatted:
        echo "$line" | xmllint --format -
    done

    # restore IFS
    IFS=$storedIFS

To save space when harvesting big metadata repositories, oai2linerec.sh can optionally also compress each line (record) separately. Processing such a file is as easy as typing zcat instead of cat:
    # now walk through each line of input.txt.gz
    for line in $(zcat input.txt.gz) ; do
        echo "$line" | xmllint --format -
    done

To further speed up the processing of the metadata, use tools like parallel; for details, have a look at A beginners guide to processing 'lots of' data.
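For very large harvests, note that a `for` loop over the output of cat or zcat expands the whole file in memory before the first record is processed. A `while read` loop streams the records one at a time instead. The following is only a minimal sketch of that idea: `process_records` is a hypothetical helper and not part of oai2linerec.sh, it recognizes compressed input purely by the `.gz` file extension, and it uses `gzip -dc` as a portable equivalent of zcat.

```shell
#!/bin/bash

# Hypothetical helper (not part of oai2linerec.sh):
# stream a line-record file and pipe each record into a command.
# Usage: process_records FILE COMMAND [ARGS...]
process_records() {
    local file=$1
    shift

    # Assumption: compressed input is recognized by its .gz extension only.
    case $file in
        *.gz) gzip -dc "$file" ;;
        *)    cat "$file" ;;
    esac | while IFS= read -r line || [ -n "$line" ]; do
        # IFS= and -r keep leading whitespace and backslashes intact;
        # the extra -n test also handles a last line without a newline.
        printf '%s\n' "$line" | "$@"
    done
}

# Example (assumes xmllint is installed):
# process_records input.txt.gz xmllint --format -
```

Because each record is passed to the command on standard input, any filter that reads from stdin can take the place of xmllint here, and there is no need to change and restore IFS for the rest of the script.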