ImageCatalog

This is an OODT RADIX application that uses Apache Solr, Apache Tika, and Apache OODT to ingest tens of millions of files (images, though it could be extended to other file types) in place, and to extract metadata and OCR text from those files using Tika and Tesseract OCR.

Shell Pre-Requisites

Some programs used by ImageCat require the use of the /bin/tcsh shell. You can usually install it on Linux via:

  1. yum install tcsh (RHEL/CentOS); or
  2. apt-get install tcsh (Debian/Ubuntu)

Python Pre-Requisites

  1. xmlrpclib (part of the Python 2.7 standard library, so no separate install is normally needed)
  2. pip install solrpy

Other Pre-Requisites

  1. Maven 3.x

Useful Environment Variables

The following environment variables are used in ImageCat. Set them in ~/.tcshrc

setenv JAVA_HOME `readlink -f /usr/bin/java | sed "s:bin/java::"`
setenv OODT_HOME ~/path_to_deploy_directory 
setenv GANGLIA_URL http://zipper.jpl.nasa.gov/ganglia/
setenv FILEMGR_URL http://localhost:9000
setenv WORKFLOW_URL http://localhost:9001
setenv RESMGR_URL http://localhost:9002
setenv WORKFLOW_HOME $OODT_HOME/workflow
setenv FILEMGR_HOME $OODT_HOME/filemgr
setenv PGE_ROOT $OODT_HOME/pge
setenv PCS_HOME $OODT_HOME/pcs
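The lines above assume tcsh. If your interactive shell is bash, the equivalent exports for ~/.bashrc (same values) are:

```shell
# bash equivalents of the tcsh setenv lines above (values unchanged)
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export OODT_HOME=~/path_to_deploy_directory
export GANGLIA_URL=http://zipper.jpl.nasa.gov/ganglia/
export FILEMGR_URL=http://localhost:9000
export WORKFLOW_URL=http://localhost:9001
export RESMGR_URL=http://localhost:9002
export WORKFLOW_HOME=$OODT_HOME/workflow
export FILEMGR_HOME=$OODT_HOME/filemgr
export PGE_ROOT=$OODT_HOME/pge
export PCS_HOME=$OODT_HOME/pcs
```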

Automated Install

  1. Navigate to desired location for imagecat
  2. git clone https://github.com/chrismattmann/imagecat.git
  3. cd imagecat
  4. cd auto
  5. chmod +x install.sh
  6. ./install.sh
  7. Wait for the install to finish
  8. Follow Manual Installation step #16 (below)
  9. cd ../../deploy
  10. Add the absolute paths of all images (one image path per line) in data/staging/roxy-image-list-jpg-nonzero.txt
  11. source bin/imagecatenv.sh
  12. ./start.sh
  13. Or run Manual Installation steps #17-#19 instead
  14. $OODT_HOME/bin/chunker
  15. #win
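Step 10 needs one absolute image path per line. A hedged sketch for generating that list (IMAGE_ROOT is a placeholder for your image tree; the -size +0c test matches the "nonzero" in the expected file name):

```shell
# Collect absolute paths of all non-empty .jpg files, one per line.
# IMAGE_ROOT is a placeholder; run from the deploy directory so the
# output lands in data/staging as the install expects.
IMAGE_ROOT=${IMAGE_ROOT:-/path/to/images}
LIST=${LIST:-data/staging/roxy-image-list-jpg-nonzero.txt}
mkdir -p "$(dirname "$LIST")"
find "$IMAGE_ROOT" -type f -name '*.jpg' -size +0c > "$LIST" 2>/dev/null || true
```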

Manual Installation

  1. mkdir deploy
  2. git clone https://github.com/chrismattmann/imagecat.git
  3. cd imagecat
  4. mvn install
  5. cp -R distribution/target/*.tar.gz ../deploy
  6. cd ../deploy && tar xvzf *.tar.gz
  7. cp -R ../imagecat/solr4 ./solr4 && cp -R ../imagecat/tomcat7 ./tomcat7
  8. Edit tomcat7/conf/Catalina/localhost/solr.xml and replace "--OODT_HOME--" with the path to your deploy dir.
  9. Edit bin/env.sh and bin/imagecatenv.sh in your deploy directory and make sure "--OODT_HOME--" is replaced with the path to your deploy dir.
  10. /bin/bash && source bin/imagecatenv.sh
  11. mkdir tomcat7/logs
  12. Copy cas-filemgr-VERSION.jar, cas-workflow-VERSION.jar, cas-crawler-VERSION.jar and cas-pge-VERSION.jar to the resmgr/lib directory. Grab each from its component's lib directory (e.g. cas-filemgr-VERSION.jar from filemgr/lib/).
  13. Copy solr4/example/lib/*.jar to tomcat/common/lib
  14. Copy solr4/example/resources/log4j.properties to tomcat/common/lib
  15. cp filemgr/lib/cas-filemgr-VERSION.jar workflow/lib
  16. Edit the top of tomcat/common/lib/log4j.properties to read:

    # Logging level
    solr.log=logs/
    log4j.rootLogger=INFO, CONSOLE
    log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
    log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
    log4j.appender.CONSOLE.layout.ConversionPattern=%-4r [%t] %-5p %c %x \u2013 %m%n
  17. cd $OODT_HOME/bin && ./oodt start
  18. cd $OODT_HOME/tomcat7/bin && ./startup.sh
  19. cd $OODT_HOME/resmgr/bin/ && ./start-memex-stubs
  20. Download roxy-image-list-jpg-nonzero.txt and place it in $OODT_HOME/data/staging
  21. $OODT_HOME/bin/chunker
  22. #win

Observing what's going on

ImageCat runs two Solr deployments and a full-stack OODT deployment (the File Manager, Workflow Manager, and Resource Manager URLs are the FILEMGR_URL, WORKFLOW_URL, and RESMGR_URL values listed in the environment variables above).

The recommended way to see what's going on is to check the OPSUI, and then periodically examine $OODT_HOME/data/jobs/crawl/*/logs (where the ingest-into-SolrCell jobs execute). By default, ImageCat uses 8 ingest processes, so 8 parallel ingests into SolrCell can run at a time, with 24 jobs on deck in the Resource Manager waiting to get in.

Each directory in $OODT_HOME/data/jobs/crawl/ is an independent, fully detached job that can be executed on its own, outside of OODT, to ingest 50K image files into SolrCell and to perform Tesseract OCR and EXIF metadata extraction.
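Conceptually, what each job does per file path is one SolrCell extract call. A minimal sketch of building such a call is below; the /extract endpoint and core name appear elsewhere in this README, but the parameter choices (e.g. using the file path as literal.id) are assumptions, not the actual solrcell_ingest code:

```python
# Sketch: build a SolrCell (ExtractingRequestHandler) request for one
# image path. The URL comes from this README; the parameters follow
# standard Solr /extract usage and are assumed, not taken from ImageCat.
try:
    from urllib import urlencode          # Python 2, as used by ImageCat
except ImportError:
    from urllib.parse import urlencode    # Python 3 fallback

SOLR_EXTRACT_URL = "http://localhost:8081/solr/imagecatdev/extract"

def extract_request(path):
    """Return (url, params) for ingesting one file in place."""
    params = {
        "literal.id": path,    # use the file path as the doc id (assumption)
        "stream.file": path,   # let Solr read the file from local disk
        "extractOnly": "false",
    }
    return SOLR_EXTRACT_URL + "?" + urlencode(params), params
```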

Note that images will sometimes fail to ingest, producing a message like the following in the Solr Tomcat logs:

INFO: on.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.jpeg.JpegParser@5c0bae4a
OUTPUT:         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
OUTPUT:         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
OUTPUT:         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
OUTPUT:         at org.apache.solr.core.RequestHandler
Apr 15, 2015 9:18:29 PM org.apache.oodt.commons.io.LoggerOutputStream flush

This is normal: the JpegParser will sometimes fail to parse an image.
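To get a rough count of these parse failures per job, you can grep the crawl job logs (a hypothetical one-liner based on the layout described above):

```shell
# Count TikaException occurrences in each crawl job's logs.
# Paths follow the layout described above; adjust OODT_HOME as needed.
grep -rc "TikaException" "$OODT_HOME"/data/jobs/crawl/*/logs 2>/dev/null || true
```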

Chunk Files

The overall workflow is as follows:

  1. OODT starts with the original large file that contains full file paths. It then chunks this file into sizeof(file) / ChunkSize files, where ChunkSize is configured in $OODT_HOME/workflow/policy/tasks.xml (property urn:id:memex:Chunker/ChunkSize).

  2. Each resultant ChunkFile is then ingested into OODT, by the OODT crawler, which triggers the OODT workflow manager to process a job called IngestInPlace.

  3. Each IngestInPlace job grabs its ingested ChunkFile (stored in $OODT_HOME/data/archive/chunks/) and then runs it through $OODT_HOME/bin/solrcell_ingest which sends the 50k full file paths to http://localhost:8081/solr/imagecatdev/extract (the ExtractingRequestHandler).

  4. 8 IngestInPlace jobs can run at a time.

  5. You can watch http://localhost:8081/solr/imagecatdev build up while this is going on. The document count will grow sporadically because $OODT_HOME/bin/solrcell_ingest ingests all 50k files in memory and then sends a single commit at the end for efficiency (resulting in 50k * 8 files every ~30-40 minutes).

Cleaning up and checking any failed ingestions

Sometimes, for whatever reason, ingests fail. For example, when I was first building ImageCat, I had an error in the solrcell_ingest script that didn't account for files with spaces in the directory path (fixed now). If anything happens that makes ingests fail, just run:

$OODT_HOME/bin/check_failed

This program will verify all ChunkFiles against Solr and make sure every path was ingested. For any that weren't, new ChunkFiles with the extension _missing.txt will be created and the remaining files will be ingested.

Questions, comments?

Send them to Chris A. Mattmann.

License

Apache License, version 2
