Sunday 25 January 2015

Validating WARCs?

Recently I started experimenting with Mat Kelly's Chrome extension WARCreate. It is an excellent idea: a tool that lets professional curators and amateur web archivists alike create WARCs on the fly from within the browser.

My first experiments went fairly well, even better than expected, but the WARC I created from the Frankfurter Allgemeine website could not be loaded into my OpenWayback engine. When I tried to create a CDX, the tool complained that there were invalid characters in the WARC, and the resulting cdx-file was empty.

I validated the WARC using the JWAT Tools but the output didn't offer me much more insight. Here is what it returned:

./jwattools.sh test -e 20150123091322216.warc
Showing errors: true
Validate digest: true
Using 1 thread(s).
Output Thread started.
ThreadPool started.
Queued 1 file(s).
ThreadPool shut down.                                              
Output Thread stopped.
# Job summary
GZip files: 0
  +  Arc: 0
  + Warc: 0
 Arc files: 0
Warc files: 1
    Errors: 17
  Warnings: 0
RuntimeErr: 0
   Skipped: 0
      Time: 00:00:05 (5129 ms.)
TotalBytes: 9.9 mb
  AvgBytes: 1.9 mb/s
'WARC-Target-URI' value: 14
Data before WARC version: 1
Empty lines before WARC version: 1
Trailing newlines: 1
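
To get a feeling for what the error categories in this summary could mean, here is a minimal sketch in Python of the same kind of checks. It is not how the JWAT Tools work internally; the function name check_warc, the messages, and the rule that a WARC-Target-URI may only contain characters allowed by RFC 3986 are my own assumptions. It only handles uncompressed WARCs, which matches the file above (JWAT counted it under "Warc files", not "GZip files").

```python
import re
import sys

# Start of a WARC record: the version line, e.g. "WARC/1.0".
VERSION = re.compile(rb"WARC/\d+\.\d+\r\n")
# Characters RFC 3986 allows in a URI (unreserved, reserved, and '%').
URI_CHARS = re.compile(rb"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def check_warc(data):
    """Return a list of human-readable problems found in raw WARC bytes.

    Hypothetical helper: a rough approximation, not the JWAT algorithm.
    """
    issues = []
    pos = 0
    record = 0
    while pos < len(data):
        match = VERSION.search(data, pos)
        if match is None:
            # Nothing record-like is left; classify the remainder.
            if data[pos:].strip(b"\r\n"):
                issues.append(f"offset {pos}: unparsable data after last record")
            else:
                issues.append(f"offset {pos}: trailing newlines")
            break
        record += 1
        gap = data[pos:match.start()]
        if gap:
            kind = "data" if gap.strip(b"\r\n") else "empty lines"
            issues.append(f"record {record}: {kind} before WARC version")
        # Headers end at the first blank line after the version line.
        header_end = data.find(b"\r\n\r\n", match.end() - 2)
        if header_end == -1:
            issues.append(f"record {record}: header block is not terminated")
            break
        length = None
        for line in data[match.end():header_end].split(b"\r\n"):
            name, _, value = line.partition(b":")
            name, value = name.strip().lower(), value.strip()
            if name == b"content-length" and value.isdigit():
                length = int(value)
            elif name == b"warc-target-uri":
                # Some writers wrap the URI in angle brackets; ignore them.
                uri = value.strip(b"<>")
                if not URI_CHARS.fullmatch(uri):
                    issues.append(
                        f"record {record}: invalid WARC-Target-URI {uri!r}"
                    )
        if length is None:
            issues.append(f"record {record}: missing or invalid Content-Length")
            break
        # Skip the payload; a record should be closed by two CRLFs.
        pos = header_end + 4 + length
        if data[pos:pos + 4] == b"\r\n\r\n":
            pos += 4
        else:
            issues.append(f"record {record}: record is not followed by two CRLFs")
    return issues


if __name__ == "__main__" and len(sys.argv) > 1:
    # Reads the whole file into memory, which is fine for a WARC of a few MB.
    with open(sys.argv[1], "rb") as fh:
        for problem in check_warc(fh.read()):
            print(problem)
```

Saved under a name of your choice and run with the path of a WARC as its only argument, it prints one line per problem found. Because it skips each payload by its Content-Length, a wrong length in one record will cause follow-up errors in the records after it.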

I had never used the JWAT Tools before, but reading about them I noticed that they can also create cdx-files. To my surprise, the JWAT Tools did create a cdx-file, and with it I was able to display Frankfurter Allgemeine in OpenWayback. I have contacted Mat Kelly about the issue; perhaps it can be fixed. More verbose output from the JWAT Tools would also help.

2 comments:

  1. As remarked by @nclarkedk, the JWAT Tools do provide useful information. It is written to .out files in the directory from which the command was run.

  2. Update: warcvalid from the Hanzo warctools gives no output at all for the WARC that the JWAT Tools complained about.