HAdoop-based Web Archive Record Processing.
This project consists of a collection of tools to process web archive records using the Hadoop framework.
Several command-line applications and Hadoop jobs are aggregated as modules in this project:
Command-line application to convert ARC container files to the new ISO standard format WARC.
Hadoop job to convert ARC container files to the new ISO standard format WARC.
Hadoop job for identifying files using DROID (Digital Record Object Identification), version 6.1 (http://digital-preservation.github.io/droid/).
unpack2temp-identify is a tool to identify and/or characterise files packaged in container files, run either as a standalone Java application or as a Hadoop job.
Hadoop job for identifying files using Apache Tika, version 1.0.
tomar-prepare-inputdata is a tool that prepares web archive container files in the ARC format, stored in a Hadoop Distributed File System (HDFS), so that the individual files can be processed by the SCAPE Platform tool Tomar.
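
To give a rough idea of what the ARC to WARC conversion mentioned above involves at the record level, the following is a minimal sketch, not the code used by the modules in this project. It assumes an ARC version 1 record header line (URL, IP address, 14-digit archive date, content type, length) and assumes the record holds an unmodified HTTP response, so it is written as a WARC `response` record. The class name `ArcToWarcHeaderSketch`, the method `toWarcHeader`, and the example values are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative sketch only: maps one ARC v1 record header line to a WARC/1.0 record header.
class ArcToWarcHeaderSketch {

    // ARC stores the capture time as 14 digits; WARC-Date uses ISO 8601 in UTC.
    private static final DateTimeFormatter ARC_DATE =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");
    private static final DateTimeFormatter WARC_DATE =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'");

    // recordId is passed in by the caller (for example a freshly generated urn:uuid).
    static String toWarcHeader(String arcHeaderLine, String recordId) {
        // Assumed ARC v1 layout: URL IP-address archive-date content-type length
        String[] f = arcHeaderLine.trim().split(" ");
        if (f.length != 5) {
            throw new IllegalArgumentException("Expected 5 fields, got " + f.length);
        }
        String warcDate = LocalDateTime.parse(f[2], ARC_DATE).format(WARC_DATE);
        long length = Long.parseLong(f[4]);

        // Assumption: the record content is an HTTP response copied unchanged,
        // so the block length equals the ARC length. The ARC content type (f[3])
        // describes the payload and is not emitted in this sketch.
        return "WARC/1.0\r\n"
                + "WARC-Type: response\r\n"
                + "WARC-Target-URI: " + f[0] + "\r\n"
                + "WARC-Date: " + warcDate + "\r\n"
                + "WARC-IP-Address: " + f[1] + "\r\n"
                + "WARC-Record-ID: <" + recordId + ">\r\n"
                + "Content-Type: application/http; msgtype=response\r\n"
                + "Content-Length: " + length + "\r\n"
                + "\r\n";
    }

    public static void main(String[] args) {
        System.out.print(toWarcHeader(
                "http://example.org/ 192.0.2.1 20140101120000 text/html 1234",
                "urn:uuid:00000000-0000-0000-0000-000000000000"));
    }
}
```

A real migration also has to handle the ARC file header record, malformed header lines, and payload digests, none of which are covered by this sketch.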