
shenrubulijie/storm-crawler

 
 


storm-crawler

A collection of resources for building low-latency, scalable web crawlers on Apache Storm, available under the Apache License.

How to use

As a Maven dependency

Available from Maven Central with:

<dependency>
    <groupId>com.digitalpebble</groupId>
    <artifactId>storm-crawler-core</artifactId>
    <version>0.6</version>
</dependency>

Running in local mode

To get started with storm-crawler, it's recommended that you run the CrawlTopology in local mode.

NOTE: These instructions assume that you have Maven installed.

First, clone the project from GitHub:

git clone https://github.com/DigitalPebble/storm-crawler

Then:

cd storm-crawler/core
mvn clean compile exec:java -Dstorm.topology=com.digitalpebble.storm.crawler.CrawlTopology -Dexec.args="-conf crawler-conf.yaml -local"

to run the demo CrawlTopology in local mode.
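The local-mode steps can also be wrapped in a small script that fails early when a prerequisite is missing. This is only a sketch: `require_cmd` and `run_local` are hypothetical helper names, not part of storm-crawler, and it assumes that `git clone` creates a `storm-crawler` directory (git's default) containing the `core` module.

```shell
#!/bin/sh
# Hypothetical helper (not part of storm-crawler): succeed only if the
# named command can be found on PATH.
require_cmd() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "error: '$1' is required but was not found on PATH" >&2
        return 1
    }
}

# Hypothetical wrapper around the commands above: clone the project if it
# is not already present, then run the demo CrawlTopology in local mode.
run_local() {
    require_cmd git || return 1
    require_cmd mvn || return 1
    if [ ! -d storm-crawler ]; then
        git clone https://github.com/DigitalPebble/storm-crawler || return 1
    fi
    # The subshell keeps the caller's working directory unchanged.
    (
        cd storm-crawler/core || exit 1
        mvn clean compile exec:java \
            -Dstorm.topology=com.digitalpebble.storm.crawler.CrawlTopology \
            -Dexec.args="-conf crawler-conf.yaml -local"
    )
}
```

Sourcing the script and calling `run_local` performs the clone and the Maven run; defining the functions alone has no side effects.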

On a Storm cluster

Alternatively, generate an uberjar:

mvn clean package

and then submit the topology with storm jar:

storm jar target/storm-crawler-core-0.7-SNAPSHOT.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml

to run it in distributed mode.
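The jar name above contains the version being built, so it changes from release to release. One possible way to avoid hard-coding it is to look the jar up in the build directory. In this sketch, `find_topology_jar` is a hypothetical helper, and the `storm-crawler-core-*.jar` naming pattern is assumed from the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of storm-crawler): print the path of the
# storm-crawler-core jar found in a build directory (default: target).
find_topology_jar() {
    dir="${1:-target}"
    for jar in "$dir"/storm-crawler-core-*.jar; do
        # If nothing matches, the pattern stays unexpanded and is not a file.
        [ -f "$jar" ] || continue
        # Skip auxiliary artifacts that a build may place next to the main jar.
        case "$jar" in
            *-sources.jar|*-javadoc.jar) continue ;;
        esac
        printf '%s\n' "$jar"
        return 0
    done
    echo "error: no storm-crawler-core jar in '$dir'; run 'mvn clean package' first" >&2
    return 1
}
```

With the helper defined, the submit step could be written as `storm jar "$(find_topology_jar)" com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml`.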

Getting help

Mailing list: http://groups.google.com/group/digitalpebble

Or use the tag storm-crawler on Stack Overflow.


YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit Java Profiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.
