Shredder13/WebCrawler

Our webserver is implemented as described below:

- Our webserver's index.html has changed and now contains the crawler form fields and the crawling history. The form method is POST, and its action is "execResult.html".
- It also contains a placeholder for the crawling history, which is filled in dynamically.
- If a request to "execResult.html" or any crawling-result page is missing the "referer: localhost" header, we return a "403 Forbidden" error (a sketch follows this list).
- execResult.html loads its content dynamically. It has two placeholders, one for the message (success or failure) and another for the crawling history.
- The crawler functionality begins in the WebServerHttpResponse class, when it recognizes that the "execResult.html" page was requested.
	The response is built according to the crawler state, and reports whether the run succeeded or failed.
- The WebCrawler.start() method is where the crawling flow begins.
	PortScanner.java - a multi-threaded port scanner is started if requested.
	WebCrawler.handleRespectRobots() - fills a "black-list" and a "white-list" for crawling and sends them to the downloaders queue (sketched below, after the Referer example).
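
A minimal sketch of the Referer check described above, assuming hypothetical class and method names (the real logic lives in WebServerHttpResponse and may be organized differently):

```java
import java.util.Map;

// Sketch only: names here are assumptions, not the project's actual API.
class RefererGuard {
    // Returns true when the "referer" header points at localhost,
    // mirroring the 403 Forbidden rule described above.
    static boolean isAllowed(Map<String, String> headers) {
        String referer = headers.get("referer");
        return referer != null && referer.contains("localhost");
    }
}
```

A response builder would call something like RefererGuard.isAllowed(headers) before serving execResult.html or a crawling-result page, and answer with "403 Forbidden" otherwise.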

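A rough sketch of building a robots.txt "black-list", assuming a simplified line-by-line parse; the project's WebCrawler.handleRespectRobots() may handle more cases (for example, Allow rules feeding the white-list):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a simplified robots.txt disallow-list builder.
class RobotsListsSketch {
    static List<String> parseDisallowed(String robotsTxt) {
        List<String> blackList = new ArrayList<>();
        for (String line : robotsTxt.split("\r?\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) {
                    blackList.add(path);   // paths the downloaders should skip
                }
            }
        }
        return blackList;
    }
}
```
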
- The downloaders-analyzers loop is handled as follows:
	- Our thread pool from Lab1 has a job queue and a maxThreads limit.
	- So to create 10 downloaders, we start a thread pool (downloadersPool) with maxThreads = 10.
	- The same is done for analyzers - that pool is called analyzersPool.
	- When a downloader finishes a download (using CrawlerHttpConnection.java), and the result is an HTML file, it pushes an "AnalyzerTask" to the analyzersPool.
	- When an analyzer finishes parsing an HTML file (using HtmlParser.java), it pushes the content it found - each link as a "DownloaderTask" - to the downloadersPool.
	- When both thread pools are empty, the process finishes (a sketch of this termination scheme follows the list).
		Note: we used a counter to know how many downloaders are alive. It is increased when a DownloaderTask constructor is called (before the task is pushed to the downloadersPool),
			and decreased when the task is DONE, after it has submitted its task to the analyzersPool. That way we ensure that the process isn't finished too early.
		Note 2: the same is done with a counter for analyzers.
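
A rough sketch of that counter-based termination check, using AtomicInteger counters as stand-ins for the project's own bookkeeping (names are assumptions):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: illustrates the "don't finish too early" counters described above.
class CrawlCompletionSketch {
    static final AtomicInteger liveDownloaderTasks = new AtomicInteger(0);
    static final AtomicInteger liveAnalyzerTasks = new AtomicInteger(0);

    // Called from a DownloaderTask constructor, before the task is queued.
    static void downloaderTaskCreated()  { liveDownloaderTasks.incrementAndGet(); }
    // Called when a DownloaderTask is DONE, after it queued its AnalyzerTask.
    static void downloaderTaskFinished() { liveDownloaderTasks.decrementAndGet(); }

    // The analyzer counter is maintained the same way.
    static void analyzerTaskCreated()  { liveAnalyzerTasks.incrementAndGet(); }
    static void analyzerTaskFinished() { liveAnalyzerTasks.decrementAndGet(); }

    // The crawl is over only when no task of either kind is still alive.
    static boolean crawlFinished() {
        return liveDownloaderTasks.get() == 0 && liveAnalyzerTasks.get() == 0;
    }
}
```

Incrementing in the constructor (before queueing) and decrementing only after the follow-up task has been queued closes the window in which both pools could look empty while work is still in flight.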

- To perform network connections, the CrawlerHttpConnection class is used.
- The WebServer holds a CrawlerData instance, which aggregates the crawling statistics.
- NOTE: we count only pages & images that return HTTP 200 OK.
- When the crawling is finished, a statistics page is created (see StatisticsPageBuilder.java), and an email is sent using the javax.mail API (the jar is included in the sources & serverroot folder).
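
For reference, a minimal javax.mail sketch of the kind of "crawl finished" email described above; the SMTP host, addresses, and subject are placeholders, and the project's actual mailing code may differ:

```java
import java.util.Properties;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

// Sketch only: host, addresses and subject are placeholders, not project values.
class CrawlMailSketch {
    static void sendCrawlFinishedMail(String statisticsHtml) throws MessagingException {
        Properties props = new Properties();
        props.put("mail.smtp.host", "smtp.example.com");   // assumed SMTP server

        Session session = Session.getInstance(props);

        Message msg = new MimeMessage(session);
        msg.setFrom(new InternetAddress("crawler@example.com"));
        msg.setRecipients(Message.RecipientType.TO,
                InternetAddress.parse("admin@example.com"));
        msg.setSubject("Crawling finished");
        // The generated statistics page is sent as the mail body.
        msg.setContent(statisticsHtml, "text/html; charset=utf-8");

        Transport.send(msg);
    }
}
```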
