GitHub - hszhsz/commoncrawl-crawler: The CommonCrawl Crawler Engine and Related MapReduce code

This is the primary repository for the services & map-reduce jobs used to produce the CommonCrawl web corpus.

Tree Structure

org.commoncrawl.async - Utility code used to build asynchronous servers.
org.commoncrawl.hadoop.io - ARCInputFormat and related classes.
org.commoncrawl.hadoop.mergeutils - Support for merge-sorts outside the context of a Hadoop job.
org.commoncrawl.hadoop.template - Sample Hadoop Job.
org.commoncrawl.io - CommonCrawl IO library used by crawlers.
org.commoncrawl.mapred - Root for all MapReduce jobs. Also contains data structure definitions shared across jobs (database.jr).
org.commoncrawl.mapred.ec2.parser - Code used to generate ARCFiles and intermediate data on EC2 using EMR.
org.commoncrawl.mapred.ec2.postprocess.deduper - Code to support parallel deduplication using a 64-bit SimHash.
org.commoncrawl.mapred.ec2.postprocess.linkCollector - Code to merge metadata generated by the parser job.
org.commoncrawl.mapred.pipelineV3 - The start of the new Nutch-free MapReduce pipeline used to process crawl metadata and generate new crawl lists.
org.commoncrawl.mapred.segmenter - Support code used to generate crawl segments (URL lists consumed by the crawlers).
org.commoncrawl.protocol - Shared data structure and enum definitions (generated).
org.commoncrawl.rpc - CommonCrawl RPC library used to build distributed systems.
org.commoncrawl.server - CommonCrawl Server base class used by various services.
org.commoncrawl.service - All long-lived processes in the CommonCrawl system are housed under this package.
org.commoncrawl.service.crawler - The long-running crawler process (consumes crawl lists, writes content to HDFS).
org.commoncrawl.service.crawlhistory - A service that manages a crawler's crawl state in a BloomFilter.
org.commoncrawl.service.directory - A barebones service used to store and subscribe to lists via a path.
org.commoncrawl.service.dns - CommonCrawl DNS Service (used by crawlers to queue up DNS requests).
org.commoncrawl.service.listcrawler - A different type of list crawler that supports dynamic uploading and crawling of very large lists of URLs.
org.commoncrawl.service.pagerank - PageRank Master / Slave implementations (and related code) used to compute PageRank across the graph.
org.commoncrawl.service.parser - The beginnings of a distributed parser service that crawlers can use for on-demand link extraction.
org.commoncrawl.service.queryserver - The (deprecated) crawl metadata service.
org.commoncrawl.service.statscollector - Service that receives crawl stats.
org.commoncrawl.util - The catch-all repository of Utility classes used by the CommonCrawl system.
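
The deduper package listed above relies on a 64-bit SimHash. As a rough illustration of the general technique only (not the repository's actual code), the sketch below computes a 64-bit fingerprint from whitespace-separated tokens and compares two fingerprints by Hamming distance. The class name `SimHashSketch`, the whitespace tokenization, and the use of FNV-1a as the per-token hash are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration class; it is not part of the repository. */
final class SimHashSketch {

    // FNV-1a 64-bit constants. FNV-1a is a stand-in token hash chosen for this sketch.
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    /** Hashes one token to 64 bits with FNV-1a (Java long arithmetic wraps on overflow). */
    static long fnv1a64(String token) {
        long hash = FNV_OFFSET_BASIS;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= FNV_PRIME;
        }
        return hash;
    }

    /** Builds a 64-bit SimHash: each token votes +1 or -1 on every bit position. */
    static long fingerprint(String text) {
        int[] votes = new int[64];
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty token for empty or leading-space input
            }
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long result = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                result |= (1L << bit); // a bit is set when the positive votes win
            }
        }
        return result;
    }

    /** Number of differing bits between two fingerprints. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = fingerprint("the quick brown fox jumps over the lazy dog");
        long b = fingerprint("the quick brown fox jumped over the lazy dog");
        System.out.println("Hamming distance: " + hammingDistance(a, b));
    }
}
```

Documents whose fingerprints differ in only a few bits are treated as near-duplicates; the distance threshold is a tuning choice and is not specified in this README.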

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Contributors

Ahad Rana (ahad at commoncrawl.org)

