storm-crawler

A collection of resources for building low-latency, scalable web crawlers on Apache Storm, available under the Apache License.

How to use

As a Maven dependency

Available from Maven Central with:

<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-core</artifactId>
    <version>0.8</version>
</dependency>

Running in local mode

To get started with storm-crawler, it's recommended that you run the CrawlTopology in local mode.

NOTE: These instructions assume that you have Maven installed.

First, clone the project from GitHub:

git clone https://github.com/DigitalPebble/storm-crawler

Then:

cd storm-crawler/core
mvn clean compile exec:java -Dstorm.topology=com.digitalpebble.storm.crawler.CrawlTopology -Dexec.args="-conf crawler-conf.yaml -local"

to run the demo CrawlTopology in local mode.
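Since the steps above assume that git and Maven are installed, a quick check before cloning can save a confusing failure later. The following is a minimal sketch only; `missing_tools` is a hypothetical helper name and is not part of storm-crawler.

```shell
#!/bin/sh
# Sketch: report which of the given tools are missing from PATH.
# missing_tools is a hypothetical helper, not something storm-crawler ships.
missing_tools() {
    missing=""
    for tool in "$@"; do
        # command -v is the POSIX way to test whether a command is available
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Print the missing tools separated by single spaces (empty if none)
    echo "${missing# }"
}

# Warn (without exiting) if git or Maven cannot be found
need=$(missing_tools git mvn)
if [ -n "$need" ]; then
    echo "Missing required tools: $need" >&2
fi
```

If nothing is printed, both tools were found and the clone and `mvn` commands above should at least start.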

On a Storm cluster

Alternatively, generate an uberjar:

mvn clean package

and then submit the topology with storm jar:

storm jar target/storm-crawler-core-0.9-SNAPSHOT.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml

to run it in distributed mode.
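The jar name in the command above includes the version, so it changes with each release. As a rough sketch, the jar can be located with a glob instead of being hard-coded. `find_crawler_jar` is a hypothetical helper, and the `storm-crawler-core-*.jar` pattern is an assumption based on the file name shown above; if the build also produces other matching jars (for example a sources jar), the pattern would need to be tightened.

```shell
#!/bin/sh
# Sketch: locate the built jar instead of hard-coding its version.
# find_crawler_jar is a hypothetical helper; the file-name pattern is
# assumed from the jar name used in the storm jar command.
find_crawler_jar() {
    dir="${1:-target}"
    for jar in "$dir"/storm-crawler-core-*.jar; do
        # If the glob matched nothing, the loop sees the literal pattern; skip it
        if [ -f "$jar" ]; then
            echo "$jar"
            return 0
        fi
    done
    # No matching jar found
    return 1
}
```

It could then be used as `storm jar "$(find_crawler_jar)" com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml`, run from the directory that contains `target`.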

Getting help

Mailing list: http://groups.google.com/group/digitalpebble

Or use the tag storm-crawler on Stack Overflow.

DigitalPebble Ltd provide commercial support and consulting for Storm-Crawler.

Thanks

YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit Java Profiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.
