- Please clone this repo to your local machine.
- Download Wikipedia's database from here.
- Import this database into your MySQL server.
- Edit `DiscrMetaPath/src/main/java/edu/nd/dsg/util/ConnectionPool.java`, changing `URL`, `USER`, and `PASS` to your own values.
- Unzip `DiscrMetaPath/data.tar.gz` to `DiscrMetaPath/`. After this you should have all the data under `DiscrMetaPath/data`.
- Build the project with `make wikibuild`. The jar file will be generated under `DiscrMetaPath/target/`.
- Run the generated jar file with `java -jar JAR_FILE_YOU_GENERATED`.
The command line arguments are:
Usage
Generate paths: -GEN [-NoSQL cache types first to speedup] [-all get all paths instead of pathLength == 2] [-p build patent]
Translate paths: -TRANS [-a output all paths] [-nd do not get most discri/similar paths] [-oNum get NUM paths between discri&similar paths] [-p build patent]
Generate Term frequency: -TERM [-BuildWikiTF generate term frequency] [-BuildPatentTF generate term frequency] [-BuildWikiDF generate document frequency] [-BuildPatentDF generate document frequency]
Generate Cos distance frequency(sequential): -COS [-p build patent]
Generate BM25 score: -BM [-ACC accumulative (x,y),(x+y,z),...] [-NODE sequential (x,y),(y,z),...] [-p build patent]
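For the `ConnectionPool.java` step in the setup list, the three values usually follow standard JDBC conventions. The following is only a minimal sketch of what such settings look like, not the contents of the repo's `ConnectionPool.java`: the class name `ConnectionSettingsSketch`, the host, the port, and every placeholder value are assumptions to be replaced with your own.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical stand-in for the settings edited in ConnectionPool.java.
class ConnectionSettingsSketch {
    // Assumed placeholders: adjust host, port, and database name to your MySQL server.
    static final String URL = "jdbc:mysql://localhost:3306/YOUR_DATABASE";
    static final String USER = "YOUR_MYSQL_USER";
    static final String PASS = "YOUR_MYSQL_PASSWORD";

    // Opens one connection using the three settings above.
    static Connection open() throws SQLException {
        return DriverManager.getConnection(URL, USER, PASS);
    }

    public static void main(String[] args) {
        // Quick check that the settings are accepted by the server.
        try (Connection conn = open()) {
            System.out.println("Connected: " + !conn.isClosed());
        } catch (SQLException e) {
            System.out.println("Connection failed: " + e.getMessage());
        }
    }
}
```

Running this check requires a MySQL JDBC driver on the classpath. Without one, or with wrong credentials, it prints the failure message instead of throwing.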
If you are only interested in our results, you can get the data from the `result` folder. The data format for each file is:
- For CrowdFlower result files:

      _unit_id,
      _golden,
      _canary,
      _unit_state,
      _trusted_judgments,
      _last_judgment_at,
      choose_path,            // Path chosen by the human
      choose_path:confidence,
      end,                    // End article
      path_1,                 // Path between start and end, generated by our algorithm
      path_2,
      path_3,
      path_4,
      path_5,
      start                   // Start article
- For other csv files:

      groupId,  // Each unique groupId represents a CrowdFlower task
      pathId,   // Equivalent to CrowdFlower's path_*
      nodeId,   // Score of the node at position `i` in path_*
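To work with the CrowdFlower result files programmatically, one option is to look up columns by header name rather than by position. The sketch below assumes each file starts with a header line containing the column names listed above and uses ordinary CSV quoting. The class `CrowdFlowerRowSketch` and the sample row are invented for illustration and are not taken from the `result` folder.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the DiscrMetaPath sources.
class CrowdFlowerRowSketch {

    // Splits one CSV line, honouring double-quoted fields ("" is an escaped quote).
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // escaped quote inside a quoted field
                    i++;
                } else if (c == '"') {
                    inQuotes = false;    // closing quote
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;         // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Pairs each header name with the value at the same position in a data row.
    static Map<String, String> toRecord(String headerLine, String dataLine) {
        List<String> names = splitCsvLine(headerLine);
        List<String> values = splitCsvLine(dataLine);
        Map<String, String> record = new HashMap<>();
        for (int i = 0; i < names.size() && i < values.size(); i++) {
            record.put(names.get(i).trim(), values.get(i));
        }
        return record;
    }

    public static void main(String[] args) {
        String header = "_unit_id,_golden,_canary,_unit_state,_trusted_judgments,"
                + "_last_judgment_at,choose_path,choose_path:confidence,end,"
                + "path_1,path_2,path_3,path_4,path_5,start";
        // Made-up placeholder values, only to exercise the parser.
        String row = "1,false,,finalized,3,,CHOSEN,0.5,\"End, Article\","
                + "P1,P2,P3,P4,P5,Start Article";
        Map<String, String> record = toRecord(header, row);
        System.out.println(record.get("start") + " -> " + record.get("end")
                + ", chosen: " + record.get("choose_path"));
    }
}
```

Looking fields up by name keeps the code independent of column order, which matters if an export places the columns differently from the listing above.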