yavcular/WebGraphConstruction
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This is the repository includes source code for web graph processing. For the motivation details of the algorithm, please refer to "Scalable Manipulation of Archival Web Graphs" paper to be appeared at LSDS-IR workshop under CIKM in Oct '11. The algorithm consists of three phases; Construction, Representation and Updates. - Construction: Starts with processing "raw data format" for consumption of replacement algorithm (step 1). Replacement algorithm takes the input and assigns numeric values for every url (step 2). At the end of this phase, data is called "initial graph format". - Representation: The data in initial graph format is processed and "layered graph representation" is generated (step 3). - Updates: Merges a new dataset in "raw data format" with existing data in "layered graph representation" format (step 4). So according to these concepts explained above, which part of the repository corresponds to which phase? - Source code is under projects, and wg-bin has shell scripts to run chained hadoop jobs. - DELIS/ and archiveUK/ directories include data specific code for preparation step, which is step 1. - main/ directory includes hadoop jobs for replacement and representation, which are step 2 and 3. - updates/ includes set of hadoop jobs for updating exiting layered representation, which is step 4. - MRhelpers/ and structures/ have classes used in previously mentioned jobs. So, how do I run? - First of all, you need to prepare the data for replacement algorithm. Replacement algorithm expects data in following format: <source_url> <time> <dest_url> <dest_url> <dest_url> É - Next, the run-all-for-replace.sh script can be used to handle replacement and representation steps. Please take a look at the script and make sure directories are setup correctly on your box. - Additionally, if there is new data, merge-all-one-step.sh script can be used to merge datasets. And the last comment - analysis/ directory has a bunch of Hadoop jobs to analyze layered data.
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published