SparqlCrawlServer
=================

Version: 1.5
Author: netEstate GmbH, http://www.netestate.de/, brunni@netestate.de
License: Apache License, Version 2.0 <http://www.apache.org/licenses/>

SparqlCrawlServer is a Java web application that crawls a list of URLs
pointing to RDF documents and executes a SPARQL query over the resulting
graphs. Every document gets its own named graph, with the final document URL
(after any redirects) as the graph name. Already crawled documents are
cached in memory and on disk.
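
Because each document lands in its own named graph, a query can use SPARQL's
GRAPH keyword to recover which document a triple came from. A minimal
illustrative query (the graph and triple patterns are placeholders, not
taken from this repository):

  SELECT ?doc ?s ?p ?o
  WHERE {
    GRAPH ?doc { ?s ?p ?o }
  }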

The crawler supports RDF/XML, RDFa, N3, and Turtle, and opens at most one
HTTP connection per server IP at a time. It respects the Robots Exclusion
Standard (robots.txt).

HTTP POST request parameters (UTF-8 encoded):

  query: The SPARQL query to execute.
  graphs: A comma-separated list of URLs, each in angle brackets:
          <url1>,<url2>,...
  timeout: Number of seconds to wait for the RDF documents.

HTTP Response: application/sparql-results+json

If the timeout is reached, the SPARQL query is executed against the documents
crawled so far. RDF from HTTP requests that are still in flight is added to
the cache once they finish, so subsequent queries can use it.
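
As an illustration, here is a minimal Java 11+ client sketch for the
endpoint. It assumes the parameters travel as an ordinary
application/x-www-form-urlencoded POST body (the usual encoding for servlet
request parameters); the host, query, and graph URLs are placeholders:

  import java.net.URI;
  import java.net.URLEncoder;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.nio.charset.StandardCharsets;

  public class SparqlCrawlClient {
      public static void main(String[] args) throws Exception {
          // Placeholder endpoint; substitute your own host.
          String endpoint =
              "http://localhost:8080/SparqlCrawlServer/sparqlcrawl";
          String query =
              "SELECT ?g ?s WHERE { GRAPH ?g { ?s ?p ?o } } LIMIT 10";
          String graphs =
              "<http://example.org/a.rdf>,<http://example.org/b.ttl>";

          // UTF-8, form-encoded request parameters (assumed encoding).
          String body =
                "query=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
              + "&graphs=" + URLEncoder.encode(graphs, StandardCharsets.UTF_8)
              + "&timeout=10";

          HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
              .header("Content-Type",
                      "application/x-www-form-urlencoded; charset=UTF-8")
              .POST(HttpRequest.BodyPublishers.ofString(body))
              .build();

          // The response body is application/sparql-results+json.
          HttpResponse<String> response = HttpClient.newHttpClient()
              .send(request, HttpResponse.BodyHandlers.ofString());
          System.out.println(response.body());
      }
  }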

Build instructions:

Configure the path to your settings file in
SparqlCrawlServer/WebContent/WEB-INF/web.xml, then build by invoking ant.
The minimum Java version is 7; the minimum Servlet API version is 3.0.

Deploy the file SparqlCrawlServer.war in your Servlet container (e.g. Tomcat 7).

The servlet is mapped to the path /sparqlcrawl inside the web application,
so with the default context path the endpoint URL is
http://<your-server>/SparqlCrawlServer/sparqlcrawl.

Settings file:

The settings file is a Groovy file; see example-settings.groovy. The
following settings are supported (an illustrative sketch follows the list):

  cacheLifetimeOk: Number of seconds to cache successfully crawled RDF
                   documents.
  cacheLifetimeContentTooLong: Number of seconds to cache 'content-too-long'
                               crawl results (see maxRdfXmlSize).
  cacheLifetimeHttpError: Number of seconds to cache HTTP error crawl results.
  cacheLifetimeNetworkError: Number of seconds to cache network error crawl
                             results.
  cacheLifetimeNotAllowedByRobots: Number of seconds to cache 'disallowed by
                                   robots.txt' crawl results.
  cacheLifetimeParseError: Number of seconds to cache RDF parse error crawl
                           results.
  cacheLifetimeWrongContentType: Number of seconds to cache wrong content
                                 type crawl results.

  datasetCacheLifetimeComplete: Number of seconds to cache completely crawled
                                document lists in memory.
  datasetCacheLifetimeIncomplete: Number of seconds to cache incompletely
                                  crawled document lists in memory.

  logFilePath: Path to the log file.
  crawlThreads: Number of threads used for crawling.
  crawlThreadStackSize: Stack size for crawl threads.
  dnsThreads: Number of threads used for DNS lookups.
  dnsThreadStackSize: Stack size for DNS lookup threads.
  dnsTimeout: Timeout for DNS lookups.
  maxRdfXmlSize: Maximum size of crawled RDF documents; larger documents
                 produce a 'content-too-long' result.
  maxRobotsAge: Maximum cache time for robots.txt files.
  maxRobotsSize: Maximum allowed size of a robots.txt file.
  modelCachePath: Path to the on-disk cache folder.
  socketTimeout: Socket timeout for crawl connections.
  unixOs: Whether the underlying OS is Unix-like (true/false).
  userAgent: User-Agent header value for crawl requests.
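
For orientation, here is a hedged sketch of such a settings file. It assumes
plain Groovy 'key = value' assignments, as in a typical Groovy config
script; all values are illustrative, and the exact units for sizes and
timeouts should be checked against example-settings.groovy:

  // Illustrative values only; consult example-settings.groovy for the
  // authoritative template and units.
  cacheLifetimeOk                 = 86400
  cacheLifetimeContentTooLong     = 86400
  cacheLifetimeHttpError          = 3600
  cacheLifetimeNetworkError       = 3600
  cacheLifetimeNotAllowedByRobots = 86400
  cacheLifetimeParseError         = 86400
  cacheLifetimeWrongContentType   = 86400

  datasetCacheLifetimeComplete    = 600
  datasetCacheLifetimeIncomplete  = 60

  logFilePath          = "/var/log/sparqlcrawlserver.log"
  crawlThreads         = 32
  crawlThreadStackSize = 256 * 1024
  dnsThreads           = 8
  dnsThreadStackSize   = 256 * 1024
  dnsTimeout           = 5            // assumed seconds
  maxRdfXmlSize        = 1024 * 1024  // assumed bytes
  maxRobotsAge         = 86400        // assumed seconds
  maxRobotsSize        = 64 * 1024    // assumed bytes
  modelCachePath       = "/var/cache/sparqlcrawlserver"
  socketTimeout        = 30           // assumed seconds
  unixOs               = true
  userAgent            = "SparqlCrawlServer/1.5 (+http://www.netestate.de/)"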
