GitHub - aafi/Grumper-Search-Engine

1) the full names and SEAS login names of all the project members
Aayushi Dwivedi - aayushi
Ankit Mishra - mankit
Anwesha Das - anwesha
Deepti Panuganti - pdeepti

2) a description of all features implemented

Web Crawler - We implemented a distributed, incremental web crawler that follows the robots exclusion protocol. It includes a web UI that lets users monitor the status of all active crawler nodes and add seed URLs at any time.
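
As a rough illustration of what following the robots exclusion protocol involves, here is a minimal sketch of a robots.txt check. It is not the project's crawler code: the class RobotsRules, its methods, and the agent name "cis455crawler" are hypothetical, and the sketch handles only User-agent and Disallow lines with plain prefix matching (no Allow rules, wildcards, or crawl delays). It also merges the "*" group with any group naming the agent, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not taken from the project source.
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    // Parses robots.txt text and keeps the Disallow prefixes from every
    // group that names `agent` or the wildcard "*".
    static RobotsRules parse(String robotsTxt, String agent) {
        RobotsRules rules = new RobotsRules();
        boolean applies = false;      // does the current group apply to us?
        boolean lastWasAgent = false; // was the previous directive a User-agent line?
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip comments
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean match = value.equals("*") || value.equalsIgnoreCase(agent);
                // Consecutive User-agent lines belong to the same group.
                applies = lastWasAgent ? (applies || match) : match;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    rules.disallowed.add(value);
                }
            }
        }
        return rules;
    }

    // Returns true when no collected Disallow prefix matches the path.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String txt = "User-agent: *\nDisallow: /private/\n";
        RobotsRules rules = RobotsRules.parse(txt, "cis455crawler");
        System.out.println(rules.isAllowed("/index.html"));
        System.out.println(rules.isAllowed("/private/page.html"));
    }
}
```

A crawler node would typically fetch and parse robots.txt once per host and consult isAllowed before requesting each URL on that host.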

Indexer - We implemented an EMR-based, MapReduce-style indexer that builds inverted indexes over the crawled document corpus for unigrams, bigrams, trigrams, and metadata.
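
To make the idea of an n-gram inverted index concrete, the following single-machine sketch builds one in memory. It is only an approximation of the map and reduce phases, not the EMR job itself: the class NgramIndex, its method names, and the tokenization rule (lowercase, split on non-alphanumeric characters) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the MapReduce indexer.
class NgramIndex {
    // Splits text into lowercase alphanumeric tokens (assumed tokenization).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns every run of n consecutive tokens, joined by single spaces.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    // "Map" step: emit (ngram, docId) for each document.
    // "Reduce" step: group the document ids under each ngram.
    static Map<String, Set<String>> build(Map<String, String> docs, int n) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String gram : ngrams(tokenize(doc.getValue()), n)) {
                index.computeIfAbsent(gram, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc1", "Quick brown fox");
        docs.put("doc2", "Brown fox jumps");
        System.out.println(build(docs, 1)); // unigram index
        System.out.println(build(docs, 2)); // bigram index
    }
}
```

Calling build with n set to 1, 2, or 3 corresponds to the unigram, bigram, and trigram jobs respectively.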

PageRank - We implemented an EMR-based, MapReduce-style PageRank system that runs its iterations automatically.
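
For readers unfamiliar with the algorithm, one PageRank iteration can be sketched on a single machine as below. This is not the project's EMR job. The damping factor of 0.85, the update rule rank = (1 - d) + d * (sum of incoming shares), the initial rank of 1.0, and all class and method names are assumptions based on a common textbook formulation. The sketch also assumes every page appears as a key in the link map; pages with no out-links simply contribute nothing.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-machine sketch of iterative PageRank.
class PageRankSketch {
    static final double DAMPING = 0.85; // assumed value, not from the project

    // One iteration: each page splits its current rank evenly among its
    // out-links, then every page's new rank is (1 - d) + d * received share.
    static Map<String, Double> iterate(Map<String, List<String>> links,
                                       Map<String, Double> ranks) {
        Map<String, Double> received = new HashMap<>();
        for (String page : links.keySet()) {
            received.put(page, 0.0);
        }
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            List<String> out = e.getValue();
            if (out.isEmpty()) {
                continue; // dangling page: contributes nothing in this sketch
            }
            double share = ranks.get(e.getKey()) / out.size();
            for (String target : out) {
                received.merge(target, share, Double::sum);
            }
        }
        Map<String, Double> next = new HashMap<>();
        for (Map.Entry<String, Double> e : received.entrySet()) {
            next.put(e.getKey(), (1 - DAMPING) + DAMPING * e.getValue());
        }
        return next;
    }

    // Runs a fixed number of iterations, feeding each result into the next.
    static Map<String, Double> run(Map<String, List<String>> links, int iterations) {
        Map<String, Double> ranks = new HashMap<>();
        for (String page : links.keySet()) {
            ranks.put(page, 1.0);
        }
        for (int i = 0; i < iterations; i++) {
            ranks = iterate(links, ranks);
        }
        return ranks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("c"));
        links.put("c", Arrays.asList("a"));
        System.out.println(run(links, 20));
    }
}
```

In a MapReduce setting, the loop over out-links plays the role of the map phase and the per-page sum plays the role of the reduce phase; the ranks map is the state carried from one iteration to the next.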

Search Engine - We implemented a servlet-based web search engine that lets users submit queries and shows paginated results, with a preview of each page on hover.
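
Pagination of a ranked result list can be sketched as follows. This is an illustrative helper only: the class Paginator, its methods, and the 1-based page numbering are assumptions, and the project's servlet may slice results differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for slicing a ranked result list into pages.
class Paginator {
    // Returns the results for a 1-based page number, or an empty list
    // when the page lies beyond the end of the results.
    static <T> List<T> page(List<T> results, int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be >= 1");
        }
        long from = (long) (pageNumber - 1) * pageSize;
        if (from >= results.size()) {
            return Collections.emptyList();
        }
        long to = Math.min(from + pageSize, (long) results.size());
        return results.subList((int) from, (int) to);
    }

    // Number of pages needed to show `total` results.
    static int pageCount(int total, int pageSize) {
        return (total + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        List<Integer> results = new ArrayList<>();
        for (int i = 1; i <= 25; i++) {
            results.add(i);
        }
        System.out.println(page(results, 3, 10));
        System.out.println(pageCount(results.size(), 10));
    }
}
```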

3) any extra credit claimed

Web Crawler - We implemented message digesting: the content of each web document is hashed with SHA-1. We also split our document store between DynamoDB and S3 to support url -> hash and hash -> document lookups.
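
The url -> hash and hash -> document lookups can be illustrated with a small in-memory sketch. The two HashMaps below merely stand in for the two lookups and say nothing about which one lives in DynamoDB or S3; the class DigestStore and its method names are hypothetical. Only the SHA-1 hashing via java.security.MessageDigest is standard library behavior.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the split document store.
class DigestStore {
    private final Map<String, String> urlToHash = new HashMap<>();
    private final Map<String, String> hashToDocument = new HashMap<>();

    // Returns the SHA-1 digest of the content as 40 lowercase hex characters.
    static String sha1Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    // Records url -> hash and stores the body only if no identical body
    // has been stored before. Returns true when the content was new.
    boolean store(String url, String content) {
        String hash = sha1Hex(content);
        urlToHash.put(url, hash);
        return hashToDocument.putIfAbsent(hash, content) == null;
    }

    // Follows url -> hash -> document; returns null for an unknown URL.
    String fetch(String url) {
        String hash = urlToHash.get(url);
        return hash == null ? null : hashToDocument.get(hash);
    }

    public static void main(String[] args) {
        DigestStore store = new DigestStore();
        System.out.println(store.store("http://a.example/", "<html>same</html>"));
        System.out.println(store.store("http://b.example/", "<html>same</html>"));
        System.out.println(store.fetch("http://b.example/"));
    }
}
```

Because two URLs with identical content map to the same hash, the body is stored once, which is the main benefit of digesting the content.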

Indexer - We have implemented metadata indexing in our indexer.

PageRank - We implemented an iterative PageRank algorithm that stores its results in DynamoDB between iterations, so that each iteration can use the output of the previous one.

Search Engine - We designed our front end so that users can see a preview of the page for any search result.

4) a list of source files included

Web Crawler - The crawler code is on the anwesha branch, in the edu.upenn.cis455.project.crawler package.

Indexer - The most up-to-date indexer code is on the indexerdb branch.

PageRank - The PageRank code is on the pagerankfinal branch, in the edu.upenn.cis455.pagerank package.

Search Engine - The search engine code is on the indexerdb branch, in the edu.upenn.cis455.project.searchengine package.

All database accessors and related classes are in the edu.upenn.cis455.project.storage package on the respective branch.

5) detailed instructions on how to install and run the project.

Web Crawler - To run the crawler, copy master.war and worker.war into the jetty/webapps directory. An AWS credentials file must be in the same folder, in the same format as the default credentials file generated by the AWS SDK. Configure the worker WAR with the correct ip:port combination of the master servlet. The setup can be controlled through the master servlet UI at ip:port/master/status, where ip:port is the IP address and port of the master.

Indexer - There are four indexer jobs: unigram, bigram, trigram, and metadata. The most up-to-date code is on the indexerdb branch.
To run an indexer job:
1. Create a table in DynamoDB. The name must be Unigram, Bigram, Trigram, or Metadata, matching the job.
2. Create the jar for the indexer job using its .jardesc file.
3. Copy the jar to an S3 bucket.
4. Create a cluster and provide the location of this jar in the Custom Jar field. Add the input bucket name as one of the parameters and also specify an output bucket name (the indexed data is not written to S3).
5. Run the step.
The indexed data is written to the table created in step 1.

PageRank - PageRank is run through an EMR controller, PageRankEmrController.java, in the edu.upenn.cis455.project.emrcontroller package on the emrfinal branch. To execute the PageRank module, run the PageRankEmrController jar on an EC2 node. It automatically creates the DynamoDB table, adjusts capacity, creates the cluster, merges the input data, executes the EMR job, terminates the cluster, and reduces the capacity of the DynamoDB table.

Search Engine - To run the search engine, copy searchengine.war to the jetty/webapps folder along with the AWS credentials file. The HTML pages (grumper.html and results.html), the CSS stylesheets, and the images must be inside the jetty/webapps/root directory. To open the search engine in a browser, go to http://52.90.111.118:8080/results.html.
