hawarp

HAdoop-based Web Archive Record Processing.

This project consists of a collection of tools to process web archive records using the Hadoop framework.

Several command line applications are aggregated as modules in this project:

arc2warc-migration-cli

Command line application to convert ARC container files to the new ISO standard format WARC.

Documentation
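
To illustrate what such a migration involves at the record level, here is a minimal sketch, not hawarp's actual implementation, that maps one ARC v1 URL-record header line onto a WARC/1.0 header block. The function name `arc_header_to_warc_header` and the sample values are hypothetical. The sketch assumes the ARC record body is an HTTP response that is copied unchanged, so the block length stays the same.

```python
import uuid
from datetime import datetime, timezone


def arc_header_to_warc_header(arc_header_line: str) -> str:
    """Map one ARC v1 URL-record header line to a WARC/1.0 header block.

    Hypothetical helper for illustration only; not part of hawarp.
    """
    # ARC v1 URL-record header fields are space-separated:
    # URL IP-address Archive-date Content-type Archive-length
    url, ip, arc_date, _mime, length = arc_header_line.strip().split(" ")

    # ARC dates are 14 digits (YYYYMMDDhhmmss, UTC); WARC uses ISO 8601.
    parsed = datetime.strptime(arc_date, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    warc_date = parsed.strftime("%Y-%m-%dT%H:%M:%SZ")

    fields = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {warc_date}",
        f"WARC-IP-Address: {ip}",
        # A fresh record ID is generated because ARC records carry none.
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        # Assumption: the record block is a complete HTTP response.
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {int(length)}",
    ]
    # Header lines end with CRLF; an empty line separates header and block.
    return "\r\n".join(fields) + "\r\n\r\n"


if __name__ == "__main__":
    sample = "http://example.com/ 192.0.2.1 20140102030405 text/html 1234"
    print(arc_header_to_warc_header(sample))
```

A real converter also has to handle the ARC file's version block, malformed header lines, and payload digests, all of which this sketch leaves out.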

arc2warc-migration-hdp

Hadoop job to convert ARC container files to the new ISO standard format WARC.

Documentation

droid-identify

Hadoop job for identifying files using DROID (Digital Record Object Identification), version 6.1 (http://digital-preservation.github.io/droid/).

Documentation

unpack2temp-identify

unpack2temp-identify is a tool to identify and/or characterise files packaged in container files. It runs either as a standalone Java application or as a Hadoop job.

Documentation

tika-identify

Hadoop job for identifying files using Apache Tika, version 1.0.

Documentation

tomar-prepare-inputdata

tomar-prepare-inputdata is a tool that prepares web archive container files in the ARC format, stored in a Hadoop Distributed File System (HDFS), so that the individual files they contain can be processed with the SCAPE Platform tool Tomar.

Documentation

