aafi/Grumper-Search-Engine
1) the full names and SEAS login names of all the project members
Aayushi Dwivedi - aayushi
Ankit Mishra - mankit
Anwesha Das - anwesha
Deepti Panuganti - pdeepti


2) a description of all features implemented

Web Crawler - We have implemented a distributed, incremental web crawler that follows the robots exclusion protocol. It includes a web UI that lets users monitor the status of all crawler nodes currently working and add seed URLs at any time.
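
As an illustration of the robots exclusion check mentioned above, here is a minimal sketch in Java. It is not taken from the project source: the class name RobotsSketch and its methods are hypothetical, and it is deliberately simplified. It honors only `User-agent: *` groups and plain `Disallow` path prefixes, and ignores `Allow`, wildcards and `Crawl-delay`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not the project's crawler code.
final class RobotsSketch {
    /** Collects the Disallow prefixes from every group that names "User-agent: *". */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean applies = false; // does the current group include "User-agent: *"?
        boolean inRules = false; // has the current group already listed a rule line?
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (inRules) { // a User-agent line after rules starts a new group
                    applies = false;
                    inRules = false;
                }
                if (value.equals("*")) {
                    applies = true;
                }
            } else {
                inRules = true;
                if (applies && field.equals("disallow") && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** Returns true if no collected Disallow prefix matches the start of the path. */
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler worker would fetch /robots.txt once per host, cache the parsed prefixes, and call isAllowed before requesting each URL.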

Indexer - We have implemented an EMR-based, MapReduce-style indexer that creates an inverted index of the crawled document corpus for unigrams, bigrams, trigrams and metadata.
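
To make the unigram, bigram and trigram indexes concrete, the following Java sketch builds an n-gram inverted index in memory. It is an assumption-laden illustration rather than the project's MapReduce code: the class NGramIndexSketch, the tokenization rule (lowercase, split on non-alphanumeric characters) and the term -> document -> count layout are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch; the real indexer runs as MapReduce jobs on EMR.
final class NGramIndexSketch {
    /** Splits text into lowercase tokens and joins every run of n adjacent tokens. */
    static List<String> ngrams(String text, int n) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }

    /** Maps each n-gram to the documents containing it, with occurrence counts. */
    static Map<String, Map<String, Integer>> buildIndex(Map<String, String> docs, int n) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String gram : ngrams(doc.getValue(), n)) {
                index.computeIfAbsent(gram, k -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }
}
```

In a MapReduce setting, ngrams would correspond roughly to the map phase (emit one pair per n-gram and document) and the per-term merge to the reduce phase.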

PageRank - We have implemented an EMR-based, MapReduce-style PageRank system that runs its iterations automatically.

Search Engine - We have implemented a servlet-based web search engine that lets users submit search queries and shows paginated results, with page previews on hover.
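
The pagination of results can be sketched as a simple slice over the ranked result list. The class PaginationSketch, the 1-based page numbering and the page-size parameter are assumptions for illustration; the actual servlet may page differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not the project's servlet code.
final class PaginationSketch {
    /** Returns the results for a 1-based page number, or an empty list past the end. */
    static <T> List<T> page(List<T> results, int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be >= 1");
        }
        long from = (long) (pageNumber - 1) * pageSize; // long avoids int overflow
        if (from >= results.size()) {
            return Collections.emptyList();
        }
        int to = (int) Math.min(from + pageSize, results.size());
        return new ArrayList<>(results.subList((int) from, to));
    }

    /** Number of pages needed to show all results. */
    static int pageCount(int totalResults, int pageSize) {
        return (totalResults + pageSize - 1) / pageSize;
    }
}
```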

3) any extra credit claimed

Web Crawler - We have implemented message digesting. We use SHA-1 hashing to digest the content of each web document. Further, we have split our document store between DynamoDB and S3 to allow url -> hash and hash -> document lookups.
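
The digest-based lookup described above can be sketched as follows. This is a minimal Java illustration under stated assumptions: DigestStoreSketch is a hypothetical class, and the two in-memory maps are stand-ins for the DynamoDB and S3 stores. Only the SHA-1 digest call (java.security.MessageDigest) is a real library API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the maps below stand in for the DynamoDB/S3 document store.
final class DigestStoreSketch {
    private final Map<String, String> urlToHash = new HashMap<>();
    private final Map<String, byte[]> hashToDocument = new HashMap<>();

    /** Returns the lowercase hex SHA-1 digest of the document body. */
    static String sha1Hex(byte[] content) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-1").digest(content);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Records url -> hash and stores the body once per distinct hash.
     *  Returns true if this content had not been stored before. */
    boolean put(String url, byte[] body) {
        String hash = sha1Hex(body);
        urlToHash.put(url, hash);
        return hashToDocument.putIfAbsent(hash, body) == null;
    }

    /** Resolves url -> hash -> document, or null if the URL is unknown. */
    byte[] get(String url) {
        String hash = urlToHash.get(url);
        return hash == null ? null : hashToDocument.get(hash);
    }

    public static void main(String[] args) {
        DigestStoreSketch store = new DigestStoreSketch();
        byte[] body = "<html>same body</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(store.put("http://a.example/", body)); // new content
        System.out.println(store.put("http://b.example/", body)); // duplicate content
    }
}
```

Because two URLs with identical bodies map to the same hash, the document is stored only once, which is what makes the digest useful for detecting duplicate content during an incremental crawl.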

Indexer - We have implemented metadata indexing in our indexer.

PageRank - We have implemented an iterative PageRank algorithm that stores the results of each iteration in DynamoDB so that the next iteration can use them.
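
A single PageRank iteration can be sketched in Java as below. This is an in-memory illustration, not the project's EMR job: the class PageRankSketch is hypothetical, the damping factor 0.85 and the non-normalized update rank(p) = (1 - d) + d * sum(rank(q) / outdegree(q)) are assumptions, and the ranks map stands in for the DynamoDB table that carries results from one iteration to the next. Links to pages outside the map and pages with no outlinks contribute nothing in this sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch; the real system runs on EMR and persists ranks in DynamoDB.
final class PageRankSketch {
    static final double DAMPING = 0.85; // assumed damping factor

    /** Computes the next rank of every page from the ranks of the previous iteration. */
    static Map<String, Double> iterate(Map<String, List<String>> links,
                                       Map<String, Double> ranks) {
        Map<String, Double> next = new HashMap<>();
        for (String page : links.keySet()) {
            next.put(page, 1.0 - DAMPING); // base rank every page receives
        }
        for (Map.Entry<String, List<String>> entry : links.entrySet()) {
            List<String> outlinks = entry.getValue();
            if (outlinks.isEmpty()) {
                continue; // dangling page: its rank is not redistributed here
            }
            double share = DAMPING * ranks.get(entry.getKey()) / outlinks.size();
            for (String target : outlinks) {
                // Only pages present in the link map receive rank.
                next.computeIfPresent(target, (k, v) -> v + share);
            }
        }
        return next;
    }

    /** Runs a fixed number of iterations, starting every page at rank 1.0. */
    static Map<String, Double> run(Map<String, List<String>> links, int iterations) {
        Map<String, Double> ranks = new HashMap<>();
        for (String page : links.keySet()) {
            ranks.put(page, 1.0);
        }
        for (int i = 0; i < iterations; i++) {
            ranks = iterate(links, ranks); // output of one pass feeds the next
        }
        return ranks;
    }
}
```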

Search Engine - We have designed our front end so that users can see a preview of the page for any search result.

4) a list of source files included

Web Crawler - The code for the crawler can be found on the anwesha branch in the edu.upenn.cis455.project.crawler package.

Indexer - The most up-to-date indexer code can be found on the indexerdb branch.

PageRank - The PageRank code can be found on the pagerankfinal branch in the edu.upenn.cis455.pagerank package.

Search Engine - The code for the search engine can be found on the indexerdb branch in the edu.upenn.cis455.project.searchengine package.

All the database accessors and related classes are in the edu.upenn.cis455.project.storage package on the respective branch.

5) detailed instructions on how to install and run the project

Web Crawler - To run the crawler, copy the master.war and worker.war servlets into the jetty/webapps directory. The crawler needs an AWS credentials file in the same folder, in the same format as the default credentials file generated by the AWS SDK. The worker war must be configured with the correct ip:port combination of the master servlet. The setup can be controlled through the master servlet UI at ip:port/master/status, where ip:port is the IP address and port of the master.

Indexer - There are four indexer jobs: unigram, bigram, trigram and metadata. The most up-to-date code is on the indexerdb branch.
Follow these steps to run an indexer job:
1. Create a table in DynamoDB. The name must be Unigram, Bigram, Trigram or Metadata, matching the job.
2. Create the jar for the respective indexer job using the .jardesc files.
3. Copy the jar to an S3 bucket.
4. Create an EMR cluster and provide the location of this jar in the Custom Jar field. Add the input bucket name as one of the parameters and also specify an output bucket name (the data is not written to S3).
5. Run the step.
The indexed data is written to the table you created in step 1.

PageRank - PageRank is run through an EMR controller, PageRankEmrController.java, available in the edu.upenn.cis455.project.emrcontroller package on the emrfinal branch. To run the PageRank module, execute the PageRankEmrController jar on an EC2 node. It automatically creates the DynamoDB table, adjusts its capacity, creates the cluster, merges the input data, executes the EMR job, terminates the cluster and reduces the capacity of the DynamoDB table.

Search Engine - To run the search engine, copy searchengine.war to the jetty/webapps folder along with the AWS credentials file. The HTML pages (grumper.html and results.html), along with the CSS stylesheets and images, need to be inside the jetty/webapps/root directory. To open the search engine in a browser, go to http://52.90.111.118:8080/results.html.
