haakondr / NLP-Graphs

The system can be divided into two parts: preprocessing and graph matching. The preprocessing part parses plaintext documents and outputs dependency graphs in JSON format. The graph matching part takes these dependency graphs and applies graph edit distance to measure their similarity.

A thesis project on using dependency graphs to represent natural language text.
Each sentence is represented as a graph object whose tokens are tagged with part-of-speech tags and linked by the relations between them.
The similarity between two sentences is measured on these graphs and used for plagiarism detection.
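
To make the idea concrete, here is a minimal sketch of graph edit distance on two tiny sentence graphs. It is not the implementation in GraphEditDistance.java. The class name GedSketch, the "word/POS" node labels, the relation names (nsubj, det) and the unit cost for every edit are assumptions made for illustration only. The sketch tries every node mapping by brute force, so it is only practical for very short sentences.

```java
final class GedSketch {
    /** Sentence graph: one label per token; rel[h][d] is the relation from head h to dependent d, or null. */
    static final class Graph {
        final String[] labels;
        final String[][] rel;
        Graph(String[] labels) {
            this.labels = labels;
            this.rel = new String[labels.length][labels.length];
        }
        Graph edge(int head, int dep, String relation) {
            rel[head][dep] = relation;
            return this;
        }
    }

    /** Exact edit distance under unit costs, found by trying every node mapping. */
    static int distance(Graph a, Graph b) {
        return search(a, b, new int[a.labels.length], 0, new boolean[b.labels.length]);
    }

    private static int search(Graph a, Graph b, int[] map, int i, boolean[] used) {
        if (i == map.length) return cost(a, b, map, used);
        map[i] = -1;                                    // option 1: delete node i of a
        int best = search(a, b, map, i + 1, used);
        for (int j = 0; j < b.labels.length; j++) {     // option 2: map node i to an unused node j of b
            if (used[j]) continue;
            used[j] = true;
            map[i] = j;
            best = Math.min(best, search(a, b, map, i + 1, used));
            used[j] = false;
        }
        return best;
    }

    private static int cost(Graph a, Graph b, int[] map, boolean[] used) {
        int c = 0;
        for (int i = 0; i < map.length; i++) {
            if (map[i] < 0) c++;                                      // node deletion
            else if (!a.labels[i].equals(b.labels[map[i]])) c++;      // node substitution
        }
        for (boolean u : used) if (!u) c++;                           // node insertion
        int n = b.labels.length;
        boolean[][] covered = new boolean[n][n];
        for (int h = 0; h < map.length; h++) {
            for (int d = 0; d < map.length; d++) {
                if (a.rel[h][d] == null) continue;
                String other = (map[h] >= 0 && map[d] >= 0) ? b.rel[map[h]][map[d]] : null;
                if (other == null) {
                    c++;                                              // edge deletion
                } else {
                    covered[map[h]][map[d]] = true;
                    if (!other.equals(a.rel[h][d])) c++;              // edge relabelling
                }
            }
        }
        for (int h = 0; h < n; h++)
            for (int d = 0; d < n; d++)
                if (b.rel[h][d] != null && !covered[h][d]) c++;       // edge insertion
        return c;
    }

    public static void main(String[] args) {
        Graph cat = new Graph(new String[] {"The/DT", "cat/NN", "sleeps/VBZ"})
                .edge(2, 1, "nsubj").edge(1, 0, "det");
        Graph dog = new Graph(new String[] {"The/DT", "dog/NN", "sleeps/VBZ"})
                .edge(2, 1, "nsubj").edge(1, 0, "det");
        System.out.println(distance(cat, dog));   // one node substitution: prints 1
    }
}
```

A lower distance means more similar sentences. A real system would likely weight the edits differently (for example by part-of-speech tag) and use an approximate matching instead of exhaustive search.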

The most interesting part of the program is GraphEditDistance.java, which is the focus of the thesis.
--------------------------------

Dependencies: Java 7, Maven
A MongoDB database must be running at the location specified in app.properties (required for a full run, but not for calculating the graph edit distance between two sentences with GED.java).

Usage:

Modify app.properties to select the appropriate folders for the data set, then compile and run:

mvn compile
mvn exec:java