A collection of resources for building low-latency, scalable web crawlers on Apache Storm, available under the Apache License.
Available from Maven Central with:
<dependency>
  <groupId>com.digitalpebble.stormcrawler</groupId>
  <artifactId>storm-crawler-core</artifactId>
  <version>0.8</version>
</dependency>
To get started with storm-crawler, it's recommended that you run the CrawlTopology in local mode.
NOTE: These instructions assume that you have Maven installed.
First, clone the project from GitHub:
git clone https://github.com/DigitalPebble/storm-crawler
Then:
cd storm-crawler/core
mvn clean compile exec:java -Dstorm.topology=com.digitalpebble.storm.crawler.CrawlTopology -Dexec.args="-conf crawler-conf.yaml -local"
to run the demo CrawlTopology in local mode.
Alternatively, generate an uberjar:
mvn clean package
and then submit the topology with storm jar:
storm jar target/storm-crawler-core-0.9-SNAPSHOT.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml
to run it in distributed mode.
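Both commands above pass the same -conf option, and the local run adds -local. As an illustration only, here is a minimal sketch of how a launcher class could interpret those two flags. The class name LaunchOptions and its fields are hypothetical and are not part of storm-crawler; the real CrawlTopology may parse its arguments differently.

```java
/**
 * Hypothetical holder for the two command-line flags shown above.
 * This is an illustrative sketch, not storm-crawler's actual code.
 */
public class LaunchOptions {
    /** Path given after -conf, or null if the flag is absent. */
    public final String confFile;
    /** True when -local is present (run in-process instead of on a cluster). */
    public final boolean local;

    private LaunchOptions(String confFile, boolean local) {
        this.confFile = confFile;
        this.local = local;
    }

    /** Parses "-conf <file>" and the optional "-local" switch. */
    public static LaunchOptions parse(String[] args) {
        String conf = null;
        boolean local = false;
        for (int i = 0; i < args.length; i++) {
            if ("-conf".equals(args[i])) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-conf requires a file path");
                }
                conf = args[++i]; // consume the value that follows the flag
            } else if ("-local".equals(args[i])) {
                local = true;
            } else {
                throw new IllegalArgumentException("Unknown argument: " + args[i]);
            }
        }
        return new LaunchOptions(conf, local);
    }

    public static void main(String[] args) {
        LaunchOptions o = parse(args);
        System.out.println((o.local ? "local" : "distributed") + " mode, conf=" + o.confFile);
    }
}
```

Once compiled, running this sketch with the arguments -conf crawler-conf.yaml -local prints "local mode, conf=crawler-conf.yaml". Without -local it reports distributed mode, mirroring the difference between the two commands above.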
Mailing list: http://groups.google.com/group/digitalpebble
Or use the tag storm-crawler on Stack Overflow.
DigitalPebble Ltd provides commercial support and consulting for storm-crawler.
YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit Java Profiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.