A collection of resources for building low-latency, scalable web crawlers on Apache Storm, available under the Apache License.
NOTE: These instructions assume that you have Maven installed.
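A quick way to confirm the prerequisites is a small shell check like the one below. This is only a minimal sketch: the helper name `check_tool` is hypothetical, and the list of tools (Maven, Java, Git) is an assumption based on the commands used in the rest of these instructions.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: it reports whether a command is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: MISSING"
  fi
}

# Tools assumed from the commands used in this README.
for tool in mvn java git; do
  check_tool "$tool"
done
```

Any tool reported as MISSING needs to be installed before continuing.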
You can use the dependencies from Maven Central by adding:
<dependency>
  <groupId>com.digitalpebble.stormcrawler</groupId>
  <artifactId>storm-crawler-core</artifactId>
  <version>0.8</version>
</dependency>
to the POM file of your project.
Alternatively, you can generate a brand-new StormCrawler-based project using:
mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=0.8
This will not only create a fully formed project containing a POM with the dependency above but also the default resource files, a default CrawlTopology class and a configuration file. You can then compile and run the topology following the instructions below.
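The archetype command above prompts for the new project's coordinates. For scripted setups, the same call could be made non-interactive, as sketched below. `-DgroupId`, `-DartifactId`, `-Dversion` and `-DinteractiveMode=false` are standard Maven Archetype Plugin parameters. The values `com.example`, `my-crawler` and `1.0-SNAPSHOT` are placeholders, and `archetype_cmd` is a hypothetical helper. Whether this archetype needs further properties is not confirmed here.

```shell
#!/bin/sh
# Placeholder coordinates for the new project (assumptions, change to suit).
GROUP_ID="com.example"
ARTIFACT_ID="my-crawler"

# archetype_cmd is a hypothetical helper: it prints the full command
# instead of running it, so it can be reviewed first.
archetype_cmd() {
  echo "mvn archetype:generate" \
    "-DarchetypeGroupId=com.digitalpebble.stormcrawler" \
    "-DarchetypeArtifactId=storm-crawler-archetype" \
    "-DarchetypeVersion=0.8" \
    "-DgroupId=$GROUP_ID" \
    "-DartifactId=$ARTIFACT_ID" \
    "-Dversion=1.0-SNAPSHOT" \
    "-DinteractiveMode=false"
}

archetype_cmd
```

Once the printed command looks right, it can be run with `archetype_cmd | sh`. Maven creates the project in a directory named after the artifactId.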
To get started with storm-crawler, it's recommended that you run the CrawlTopology in local mode.
First, clone the project from GitHub:
git clone https://github.com/DigitalPebble/storm-crawler
Then:
cd storm-crawler/core
mvn clean compile exec:java -Dstorm.topology=com.digitalpebble.storm.crawler.CrawlTopology -Dexec.args="-conf crawler-conf.yaml -local"
to run the demo CrawlTopology in local mode.
Alternatively, generate an uberjar:
mvn clean package
and then submit the topology with storm jar:
storm jar target/storm-crawler-core-0.9-SNAPSHOT.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml
to run it in distributed mode.
Mailing list: http://groups.google.com/group/digitalpebble
Or use the tag storm-crawler on Stack Overflow.
DigitalPebble Ltd provide commercial support and consulting for Storm-Crawler.
YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit Java Profiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.