Skip to content

hszhsz/commoncrawl-crawler

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This is the primary repository for the services & map-reduce jobs used to produce the CommonCrawl web corpus.

Tree Structure

  • org.commoncrawl.async - Utility code used to build Async server.
  • org.commoncrawl.hadoop.io - ARCInputFormat and related classes.
  • org.commoncrawl.hadoop.mergeutils - Support for merge-sorts outside the context of a Hadoop job.
  • org.commoncrawl.hadoop.template - Sample Hadoop Job.
  • org.commoncrawl.io - CommonCrawl IO library used by crawlers.
  • org.commoncrawl.mapred - Root for all MapReduce jobs. Also contains data structure definitions shared across jobs (database.jr).
  • org.commoncrawl.mapred.ec2.parser - Code used to generate ARCFiles and intermediate data on EC2 using EMR.
  • org.commoncrawl.mapred.ec2.postprocess.deduper - Code to support a parallel dedupe using a 64bit Simhash.
  • org.commoncrawl.mapred.ec2.postprocess.linkCollector - Code to merge metadata generated by the parser job.
  • org.commoncrawl.mapred.pipelineV3 - The start of the new Nutch Free map-reduce pipeline used to process crawl metadata and generate new crawl lists.
  • org.commoncrawl.mapred.segmenter - Support code used to generate Crawl Segment (URL lists consumed by the crawlers).
  • org.commoncrawl.protocol - Shared data structure and enum definitions (generated).
  • org.commoncrawl.rpc - CommonCrawl RPC library used to build distributed systems.
  • org.commoncrawl.server - CommonCrawl Server base class used by various services.
  • org.commoncrawl.service - All long lived processes in the CommonCrawl system are house under this directory.
  • org.commoncrawl.service.crawler - The crawler long running process (Consumes Crawl Lists, writes content to HDFS).
  • org.commoncrawl.service.crawlhistory - A service that manages a crawler's crawl state in a BloomFilter.
  • org.commoncrawl.service.directory - A barebones service used to store and subscribe to lists via a path.
  • org.commoncrawl.service.dns - CommonCrawl DNS Service (used by crawlers to queue up DNS requests).
  • org.commoncrawl.service.listcrawler - A different type of list crawler that supports dynamic uploading a crawling of very large lists of URLS.
  • org.commoncrawl.service.pagerank - PageRank Master / Slave implementations (and related code) used to compute PageRank across the graph.
  • org.commoncrawl.service.parser - The beginnings of a distributed parser service that Crawlers can use to do on demand link extraction.
  • org.commoncrawl.service.queryserver - The (deprecated) crawl metadata service.
  • org.commoncrawl.service.statscollector - Service that receives crawl stats.
  • org.commoncrawl.util - The catch-all repository of Utility classes used by the CommonCrawl system.

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Contributors

Ahad Rana (ahad at commoncrawl.org)

About

The CommonCrawl Crawler Engine and Related MapReduce code

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Java 65.3%
  • C++ 19.9%
  • C 14.2%
  • Shell 0.3%
  • HTML 0.3%
  • Python 0.0%