storm-crawler

A collection of resources for building low-latency, scalable web crawlers on Apache Storm, available under the Apache License.

How to use

NOTE: These instructions assume that you have Maven installed.

As a Maven dependency

You can pull the dependency from Maven Central by adding:

<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-core</artifactId>
    <version>0.8</version>
</dependency>

to the POM file of your project.

Maven archetype

Alternatively, you can generate a brand new StormCrawler-based project using:

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=0.8

This will not only create a fully formed project containing a POM with the dependency above but also the default resource files, a default CrawlTopology class and a configuration file. You can then compile and run the topology following the instructions below.
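By default the archetype command prompts for the coordinates of the new project. As a minimal sketch, the wrapper below runs the same command without prompts by using Maven's batch mode (`-B`) and the standard `groupId`, `artifactId` and `version` properties of the archetype plugin. The function name `generate_crawler_project` and the example coordinates are hypothetical, and the archetype may require further properties that are not listed here.

```shell
# Generate a StormCrawler project without interactive prompts.
# Usage: generate_crawler_project <groupId> <artifactId>
generate_crawler_project() {
  group_id="$1"      # hypothetical example: com.example
  artifact_id="$2"   # hypothetical example: my-crawler

  if [ -z "$group_id" ] || [ -z "$artifact_id" ]; then
    echo "usage: generate_crawler_project <groupId> <artifactId>" >&2
    return 1
  fi

  # -B (batch mode) makes Maven take its answers from the -D properties
  # below instead of prompting for them.
  mvn -B archetype:generate \
    -DarchetypeGroupId=com.digitalpebble.stormcrawler \
    -DarchetypeArtifactId=storm-crawler-archetype \
    -DarchetypeVersion=0.8 \
    -DgroupId="$group_id" \
    -DartifactId="$artifact_id" \
    -Dversion=1.0-SNAPSHOT
}
```

Calling `generate_crawler_project com.example my-crawler` should create the project in a `my-crawler` directory under the current one.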

Running in local mode

To get started with storm-crawler, it's recommended that you run the CrawlTopology in local mode.

First, clone the project from GitHub:

git clone https://github.com/DigitalPebble/storm-crawler

Then:

cd storm-crawler/core
mvn clean compile exec:java -Dstorm.topology=com.digitalpebble.storm.crawler.CrawlTopology -Dexec.args="-conf crawler-conf.yaml -local"

to run the demo CrawlTopology in local mode.

On a Storm cluster

Alternatively, generate an uberjar:

mvn clean package

and then submit the topology with storm jar:

storm jar target/storm-crawler-core-0.9-SNAPSHOT.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml

to run it in distributed mode.
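The two steps above can be combined in a small helper, sketched here on the assumption that it is run from the directory containing the module's POM and that the `storm` client is on the PATH. The function name `submit_crawl_topology` is hypothetical. The version is passed as an argument so that the jar name does not have to be edited by hand after each release.

```shell
# Build the uberjar and submit it to the cluster.
# Usage: submit_crawl_topology <version> [conf-file]
submit_crawl_topology() {
  version="$1"                      # e.g. 0.9-SNAPSHOT
  conf="${2:-crawler-conf.yaml}"    # passed to the topology via -conf

  if [ -z "$version" ]; then
    echo "usage: submit_crawl_topology <version> [conf-file]" >&2
    return 1
  fi

  jar="target/storm-crawler-core-${version}.jar"

  # Stop here if the build fails, so a stale jar is never submitted.
  mvn clean package || return 1

  if [ ! -f "$jar" ]; then
    echo "expected jar not found: $jar" >&2
    return 1
  fi

  storm jar "$jar" com.digitalpebble.storm.crawler.CrawlTopology -conf "$conf"
}
```

For example, `submit_crawl_topology 0.9-SNAPSHOT` builds the project and submits the resulting jar with the default configuration file.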

Getting help

Mailing list: http://groups.google.com/group/digitalpebble

Or use the tag storm-crawler on Stack Overflow.

DigitalPebble Ltd provides commercial support and consulting for Storm-Crawler.

Thanks

YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit Java Profiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.
