GitHub - yavcular/WebGraphConstruction

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
projects/nyu/cs/webgraph		projects/nyu/cs/webgraph
wg-bin		wg-bin
README		README

Repository files navigation

This is the repository includes source code for web graph processing. For the motivation details of the algorithm, please refer to "Scalable Manipulation of Archival Web Graphs" paper to be appeared at LSDS-IR workshop under CIKM in Oct '11.

The algorithm consists of three phases; Construction, Representation and Updates.

- Construction:
Starts with processing "raw data format" for consumption of replacement algorithm (step 1).
Replacement algorithm takes the input and assigns numeric values for every url (step 2).
At the end of this phase, data is called "initial graph format".

- Representation:
The data in initial graph format is processed and "layered graph representation" is generated (step 3).

- Updates:
Merges a new dataset in "raw data format" with existing data in "layered graph representation" format (step 4).

So according to these concepts explained above, which part of the repository corresponds to which phase?
- Source code is under projects, and wg-bin has shell scripts to run chained hadoop jobs.
- DELIS/ and archiveUK/ directories include data specific code for preparation step, which is step 1.
- main/ directory includes hadoop jobs for replacement and representation, which are step 2 and 3.
- updates/ includes set of hadoop jobs for updating exiting layered representation, which is step 4.
- MRhelpers/ and structures/ have classes used in previously mentioned jobs.

So, how do I run?
- First of all, you need to prepare the data for replacement algorithm. Replacement algorithm expects data in following format:
<source_url> <time> <dest_url> <dest_url> <dest_url> É
- Next, the run-all-for-replace.sh script can be used to handle replacement and representation steps. Please take a look at the script and make sure directories are setup correctly on your box.
- Additionally, if there is new data, merge-all-one-step.sh script can be used to merge datasets.

And the last comment - analysis/ directory has a bunch of Hadoop jobs to analyze layered data.