Skip to content

mccraigmccraig/opennlp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Introduction
============
The opennlp project is now the home of a set of java-based NLP tools 
which perform sentence detection, tokenization, pos-tagging, chunking and
parsing, named-entity detection, and coreference.  

In its previous life it was used to hold a common infrastructure code
for the opennlp.grok project.  The work previously done can be found in
the final release of that project available on the main project page.

Installing the build tools
==========================

The OpenNLP build system is based on Jakarta Ant, which is a Java
building tool originally developed for the Jakarta Tomcat project but
now used in many other Apache projects and extended by many
developers.

Ant is a little but very handy tool that uses a build file written in
XML (build.xml) as building instructions. For more information refer
to "http://jakarta.apache.org/ant/".

The only thing that you have to make sure of is that the "JAVA_HOME"
environment property is set to match the top level directory
containing the JVM you want to use. For example:

C:\> set JAVA_HOME=C:\jdk1.2.2

or on Unix:

% setenv JAVA_HOME /usr/local/java
  (csh)
> JAVA_HOME=/usr/java; export JAVA_HOME
  (ksh, bash)

That's it!


Building instructions
=====================

Ok, let's build the code. First, make sure your current working
directory is where the build.xml file is located. Then type

  ./build.sh (unix)

if everything is right and all the required packages are visible, this
action will generate a file called "opennlp-common-${version}.jar" in
the "./build" directory. Note, that if you do further development,
compilation time is reduced since Ant is able to detect which files
have changed an to recompile them at need.

Also, you'll note that reusing a single JVM instance for each task, increases
tremendously the performance of the whole build system, compared to other
tools (i.e. make or shell scripts) where a new JVM is started for each task.


Build targets
=============

The build system is not only responsible for compiling OpenNlp into a jar
file, but is also responsible for creating the HTML documentation in
the form of javadocs.

These are the meaningful targets for this build file:

 - package [default] -> creates ./build/opennlp-common.jar
 - compile -> compiles the source code
 - javadoc -> generates the API documentation in ./build/javadocs
 - clean -> restores the distribution to its original and clean state

For example, to build the Java API documentation, type

build.sh javadoc
(Unix)

To learn the details of what each target does, read the build.xml file.
It is quite understandable.

Downloading Models
==================
Models have been trained for various of the component and are required
unless one wishes to create their own models exclusivly from their own
annotated data.  These models can be downloaded clicking on the "Models"
link at opennlp.sourceforge.net.  The models are large (especially
the ones for the parser).  You may want to just fetch specific ones.
Models for the corresponding components can be found in the following
directories:

english/chunker	   - English-Penn-Treebank-style phrase chunker models.
english/coref      - MUC-style coreference.
english/namefind   - MUC-style named entity finder models.
english/parser	   - English-Penn-Treebank-style parser and pos-tag models.
english/sentdetect - English sentence detector.
english/tokenize   - English-Penn-Treebank-style tokenizer.

spanish/postag     - Spanish part-of-speech tagger.
spanish/sentdetect - Spanish sentence detector.
spanish/tokenize   - Spanish tokenizer.

Running the Tools
=================
To run any of these tools you need to have models.  Make sure you
look at the previous step before you try this.  Each of these classes
contains a main which will annotate text from standard in.  Some of
them require processing by the previous component.  Most of these take a
single argument which is the location of the model for this component.
The exceptions are the parser which requires a  model directory, and
the namefinder which takes a list of models.  Tools are separated into
packages by language.  Currently only two languages are supported (English
and Spanish), but we plan to support others in the future. 

English:

sentence detector:  opennlp.tools.lang.english.SentenceDetector
tokenizer:          opennlp.tools.lang.english.Tokenizer
pos-tagger:         opennlp.tools.lang.english.PosTagger
chunker:            opennlp.tools.lang.english.TreebankChunker
name finder:        opennlp.tools.lang.english.NameFinder
parser:             opennlp.tools.lang.english.TreebankParser
coreference:        opennlp.tools.lang.english.TreebankLinker

Spanish:

sentence detector:  opennlp.tools.lang.spanish.SentenceDetector
tokenizer:          opennlp.tools.lang.spanish.Tokenizer
token chunker:      opennlp.tools.lang.spanish.TokenChunker
pos-tagger:         opennlp.tools.lang.spanish.PosTagger

Examples: These example are simply that, example and are not a
recommendation that you run the tools this way.  It's in fact very
inefficient.  However, this should give you an idea of what the tools
can do and the kind of input they assume.  Developers should know to
look at the main's of these classes to see how to set up a particular
component for use.

Note: All these example assume that your CLASSPATH has been set to
include: opennlp-tools-1.3.0.jar, trove.jar, maxent-2.4.0.jar, and for
coreference: jwnl-1.3.3.jar.  The opennlp jar is located in the output
directory (once you've built it) and the others are located in the lib
directory.  Information about the jars in the lib directory can be found
in the lib/LIBNOTES file.  If you are un-certain about how to set your
classpath please google: java classpath where you will find many pages
on the subject.

English Phrase Chunking:
java opennlp.tools.lang.english.SentenceDetector \
  opennlp.models/english/sentdetect/EnglishSD.bin.gz < text |
java opennlp.tools.lang.english.Tokenizer \
  opennlp.models/english/tokenize/EnglishTok.bin.gz |
java opennlp.tools.lang.english.PosTagger -d \
  opennlp.models/english/parser/tagdict opennlp.models/english/parser/tag.bin.gz |
java opennlp.tools.lang.english.TreebankChunker \
  opennlp.models/english/chunker/EnglishChunk.bin.gz

English Parsing:
java opennlp.tools.lang.english.SentenceDetector \
  opennlp.models/english/sentdetect/EnglishSD.bin.gz < text |
java opennlp.tools.lang.english.Tokenizer \
  opennlp.models/english/tokenize/EnglishTok.bin.gz |
java -Xmx350m opennlp.tools.lang.english.TreebankParser -d \
  opennlp.models/english/parser
    
English Name Finding:
java opennlp.tools.lang.english.SentenceDetector \
  opennlp.models/english/sentdetect/EnglishSD.bin.gz < text |
java -Xmx200m opennlp.tools.lang.english.NameFinder  \
  opennlp.models/english/namefind/*.bin.gz
  
English Coreference:
java opennlp.tools.lang.english.SentenceDetector \
  opennlp.models/english/sentdetect/EnglishSD.bin.gz < text |
java opennlp.tools.lang.english.Tokenizer \
  opennlp.models/english/tokenize/EnglishTok.bin.gz |
java -Xmx350m opennlp.tools.lang.english.TreebankParser -d \
  opennlp.models/english/parser |
java -Xmx350m opennlp.tools.lang.english.NameFinder -parse \
  opennlp.models/english/namefind/*.bin.gz |
java -Xmx200m -DWNSEARCHDIR=$WNSEARCHDIR \
  opennlp.tools.lang.english.TreebankLinker opennlp.models/english/coref

In the example above $WNSEARCHDIR is the location of the "dict" directory for 
your WordNet 2.0 installation.  

Spanish Part-Of-Speech Tagging:

java opennlp.tools.lang.spanish.SentenceDetector \
  opennlp.models/spanish/sentdetect/SpanishSent.bin.gz < texto | 
java opennlp.tools.lang.spanish.Tokenizer \
  opennlp.models/spanish/tokenize/SpanishTok.bin.gz | 
java opennlp.tools.lang.spanish.TokenChunker \
  opennlp.models/spanish/tokenize/SpanishTokChunk.bin.gz | 
java opennlp.tools.lang.spanish.PosTagger \
  opennlp.models/spanish/postag/SpanishPOS.bin.gz

Training the Tools
==================
The main of the following classes can be used to train new models.
Look at the usage messages for these classes you are interested in
training new models.

sentence detector:  opennlp.tools.sentdetect.SentenceDetectorME
pos-tagger:         opennlp.tools.postag.POSTaggerME
chunker:            opennlp.tools.chunker.ChunkerME
name finder:        opennlp.tools.namefind.NameFinderME
parser:             opennlp.tools.parser.ParserME

The following modules currently only support training via the WordFreak
opennlp.plugin v1.3 (http://wordfreak.sourceforge.net/plugins.html).

tokenizer:          org.annotation.opennlp.OpenNlpTokenAnnotator
coreference:        org.annotation.opennlp.OpenNlpCoreferenceAnnotator

Note: In order to train a model you need all the training data.  There is
not currently a mechanism to update the models distributed with the project
with additional data.

Bug Reports
===========

Please report bugs at the bug section of the OpenNlp sourceforge site:

https://sourceforge.net/tracker/?group_id=3368&atid=103368

Note: Incorrect automatic-annotation on a specific piece of text does
not constitute a bug.  The best way to address such errors is to provide
annotated data on which the automatic-annotator/tagger can be trained
so that it might learn to not make these mistakes in the future.

Special Note
============

This README and the directory structure and the build system for this
project were taken directly from the JDOM project. Many thanks to
Jason Hunter and Brett McLaughlin for creating a very elegant way of
working with XML in Java.  See www.jdom.org for more details.

Releases

No releases published

Packages

No packages published