Implementation of the Chinese Whispers graph clustering algorithm

nicolaierbs/chinese-whispers

 
 


This is an implementation of the Chinese Whispers (CW) graph clustering algorithm. For an introduction to the algorithm, or if you need to cite it, see this paper: http://wortschatz.uni-leipzig.de/~cbiemann/pub/2006/BiemannTextGraph06.pdf
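For orientation, here is a minimal, self-contained sketch of the core idea of Chinese Whispers: every node starts in its own class, and nodes are then visited in random order, each adopting the class with the highest summed edge weight among its neighbors. This is not this project's API. The class name `ChineseWhispersSketch`, the methods `addEdge` and `cluster`, and the tie-breaking rule (keep the current class on a tie) are assumptions made for illustration only; the paper breaks ties randomly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration; not the classes shipped in this repository.
public class ChineseWhispersSketch {

    // Stores an undirected weighted edge in both directions.
    public static void addEdge(Map<String, Map<String, Double>> graph,
                               String a, String b, double weight) {
        graph.computeIfAbsent(a, k -> new HashMap<>()).put(b, weight);
        graph.computeIfAbsent(b, k -> new HashMap<>()).put(a, weight);
    }

    // Returns a map from node to class id.
    public static Map<String, Integer> cluster(Map<String, Map<String, Double>> graph,
                                               int maxIterations, long seed) {
        List<String> nodes = new ArrayList<>(graph.keySet());
        Collections.sort(nodes); // fixed start order so a given seed is reproducible

        // Step 1: every node starts in its own class.
        Map<String, Integer> label = new HashMap<>();
        for (int i = 0; i < nodes.size(); i++) {
            label.put(nodes.get(i), i);
        }

        Random random = new Random(seed);
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            boolean changed = false;
            Collections.shuffle(nodes, random); // Step 2: visit nodes in random order

            for (String node : nodes) {
                // Sum the edge weights per class over all neighbors.
                Map<Integer, Double> score = new HashMap<>();
                for (Map.Entry<String, Double> edge : graph.get(node).entrySet()) {
                    Integer neighborLabel = label.get(edge.getKey());
                    if (neighborLabel != null) {
                        score.merge(neighborLabel, edge.getValue(), Double::sum);
                    }
                }

                // Step 3: adopt the strongest class; the current class wins ties.
                int current = label.get(node);
                int best = current;
                double bestScore = score.getOrDefault(current, Double.NEGATIVE_INFINITY);
                for (Map.Entry<Integer, Double> candidate : score.entrySet()) {
                    if (candidate.getValue() > bestScore) {
                        best = candidate.getKey();
                        bestScore = candidate.getValue();
                    }
                }
                if (best != current) {
                    label.put(node, best);
                    changed = true;
                }
            }
            if (!changed) {
                break; // no node changed its class in a full pass
            }
        }
        return label;
    }
}
```

On a graph made of two disconnected triangles, for example, each triangle ends up in a single class of its own.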

This project uses the CW algorithm specifically for Word Sense Induction (WSI).

You can compile the code using Maven, and run the WSI algorithm from the command line.

Here's a quickstart guide:

    git clone https://github.com/johannessimon/chinese-whispers.git
    cd chinese-whispers && mvn package shade:shade
    java -cp target/chinese-whispers.jar de.tudarmstadt.lt.wsi.WSI

You can also use the CW algorithm directly from your own code.

To try the WSI algorithm, compile the code as shown above and download some example data. One option is this word similarity graph from the JoBimText project, extracted from a 120-million-line English news corpus: http://sourceforge.net/projects/jobimtext/files/data/models/en_news120M_stanford_lemma/LMI_p1000_l200.gz

The data is in ABC format: each row describes one edge of the graph and has three whitespace-separated columns: from, to, and the edge weight.
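As an illustration of the ABC format, the following sketch parses rows into edges. It is not part of this project: the names `AbcEdgeReader`, `Edge`, `parseLine` and `readGzip` are hypothetical, and it assumes that node names contain no whitespace and that the weight parses as a decimal number.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for reading ABC-formatted edge lists.
public class AbcEdgeReader {

    public static final class Edge {
        public final String from;
        public final String to;
        public final double weight;

        Edge(String from, String to, double weight) {
            this.from = from;
            this.to = to;
            this.weight = weight;
        }
    }

    // Parses one row of the form "from to weight" (columns split on whitespace).
    public static Edge parseLine(String line) {
        String[] columns = line.trim().split("\\s+");
        if (columns.length != 3) {
            throw new IllegalArgumentException(
                "Expected 3 columns but found " + columns.length + ": " + line);
        }
        return new Edge(columns[0], columns[1], Double.parseDouble(columns[2]));
    }

    // Reads all edges from a gzip-compressed ABC file, skipping blank lines.
    public static List<Edge> readGzip(Path path) throws IOException {
        List<Edge> edges = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(path)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    edges.add(parseLine(line));
                }
            }
        }
        return edges;
    }
}
```

Collecting every edge in a list is fine for small samples; for a file as large as the example data, processing rows one at a time would use far less memory.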

Then run the WSI algorithm on the data (making sure you assign enough memory to the VM):

    java -Xms4G -Xmx4G -cp target/chinese-whispers.jar de.tudarmstadt.lt.wsi.WSI \
        -in /path/to/LMI_p1000_l200.gz -n 100 -N 100 -out test-output.txt

The output (in our case test-output.txt) is then formatted as follows:

    word <TAB> cluster-id <TAB> cluster-label <TAB> cluster-node1 cluster-node2 ...
    word <TAB> cluster-id <TAB> cluster-label <TAB> cluster-node1 cluster-node2 ...
    ...
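To post-process the output, a row can be split along the template above. This is only a sketch under assumptions: the class `WsiOutputLine` and its `parse` method are hypothetical, the cluster id is kept as a string because its type is not specified here, and the cluster nodes are assumed to be separated by whitespace as the template suggests. Check the delimiter against your actual output file before relying on this.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical holder for one row of the WSI output file.
public class WsiOutputLine {
    public final String word;
    public final String clusterId;
    public final String clusterLabel;
    public final List<String> clusterNodes;

    private WsiOutputLine(String word, String clusterId,
                          String clusterLabel, List<String> clusterNodes) {
        this.word = word;
        this.clusterId = clusterId;
        this.clusterLabel = clusterLabel;
        this.clusterNodes = clusterNodes;
    }

    // Splits "word<TAB>cluster-id<TAB>cluster-label<TAB>node1 node2 ..." into fields.
    public static WsiOutputLine parse(String line) {
        // A limit of 4 keeps the whole node list together in the last column.
        String[] columns = line.split("\t", 4);
        if (columns.length < 4) {
            throw new IllegalArgumentException(
                "Expected 4 tab-separated columns but found " + columns.length);
        }
        String nodeColumn = columns[3].trim();
        // Assumption: nodes are whitespace-separated, as in the template above.
        List<String> nodes = nodeColumn.isEmpty()
            ? Collections.<String>emptyList()
            : Arrays.asList(nodeColumn.split("\\s+"));
        return new WsiOutputLine(columns[0], columns[1], columns[2], nodes);
    }
}
```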
