Jeremy-WEI/Mini-Google

# Mini-Google

  • A peer-to-peer implementation of a Google-style search engine, hosted on Amazon Web Services (AWS).

  • Our search engine ranks results using our own implementation of the PageRank algorithm together with a word index built from over 200,000 web documents. The Okapi BM25 algorithm gives each candidate document a relevance score, which we combine with the positions of the query terms in the document, the document’s PageRank, and the Alexa ranking of the document’s domain to produce the final ranking.

  • Our search engine contains four major components:

    • Distributed Crawler: consists of multiple worker nodes that crawl documents and a master node to which the workers report their status. Each worker is responsible for a defined subset of domains (based on the Java-assigned hash code of each URL’s domain) and forwards URLs outside its subset directly to the other crawlers.

    • Indexer: takes documents from the crawler and creates a lexicon, an inverted index, and other information (e.g. word position, hit type). We implemented separate indexers for different document types, including PDF, HTML, XML and TXT. The indexer runs on Amazon EMR.

    • PageRank: implemented as a MapReduce job that runs on Amazon EMR. The PageRank of each URL is calculated from a representation of the link structure generated by the crawler. We also add pre-processing and post-processing steps to handle sinks and hogs.

    • Search Engine: creates a database for the inverted index, PageRank values, Alexa rankings and document content. It uses multiple phases to produce search results, e.g. query parsing, tf-idf ranking, word position checking, and PageRank and Alexa rank integration. We also display a preview of each returned result.
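
The crawler's domain partitioning can be illustrated with a minimal sketch. It is written in Python for brevity and re-implements Java's `String.hashCode()` so the assignment matches what a Java worker would compute. The modulo mapping from hash to worker index, and the names `owner_of` and `route`, are assumptions for illustration; the README only states that assignment is based on the hash code of each URL's domain.

```python
from urllib.parse import urlsplit


def java_string_hash(text: str) -> int:
    """Replicate Java's String.hashCode() for a string of BMP characters."""
    h = 0
    for ch in text:
        # Keep only 32 bits, as Java int arithmetic does.
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return h - 0x100000000 if h >= 0x80000000 else h


def owner_of(url: str, num_workers: int) -> int:
    """Return the index of the worker assumed to own this URL's domain."""
    domain = (urlsplit(url).hostname or "").lower()
    # Python's % is non-negative for a positive modulus, so negative hashes are safe.
    return java_string_hash(domain) % num_workers


def route(urls, my_index: int, num_workers: int):
    """Split discovered URLs into those kept locally and those sent to peers."""
    local, outbound = [], {}
    for url in urls:
        owner = owner_of(url, num_workers)
        if owner == my_index:
            local.append(url)
        else:
            outbound.setdefault(owner, []).append(url)
    return local, outbound
```

Because the owner depends only on the domain, every URL from one site lands on the same worker, which keeps per-domain politeness state in one place.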
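
The ranking step can be sketched in the same spirit. `bm25_score` follows the standard Okapi BM25 formula with a non-negative IDF variant and the common defaults `k1=1.2`, `b=0.75`; none of these constants come from the project. `combined_score` is purely hypothetical: the README names the four signals but not how they are blended, so the linear form and the weights are made up.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query.

    doc_freq maps a term to the number of documents containing it.
    k1 and b are common defaults, not values taken from the project.
    """
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = counts.get(term, 0)
        if tf == 0:
            continue
        n = doc_freq.get(term, 0)
        # Non-negative IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
        idf = math.log(1.0 + (num_docs - n + 0.5) / (n + 0.5))
        # Length normalisation penalises documents longer than average.
        norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score


def combined_score(bm25, proximity, pagerank, alexa_rank,
                   weights=(1.0, 0.5, 0.5, 0.25)):
    """Hypothetical linear blend of the four signals; the weights are made up."""
    w_bm25, w_prox, w_pr, w_alexa = weights
    # A smaller Alexa rank means a more popular domain, so invert it on a log scale.
    alexa_signal = 1.0 / (1.0 + math.log(alexa_rank)) if alexa_rank >= 1 else 0.0
    return (w_bm25 * bm25 + w_prox * proximity
            + w_pr * pagerank + w_alexa * alexa_signal)
```

In practice the weights would be tuned against sample queries; the sketch only shows where each signal enters the final score.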
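
The PageRank computation can be sketched as a single-machine loop shaped like map and reduce phases. This is not the project's EMR job. The sink handling shown here (spreading the rank of pages with no outgoing links evenly over all pages) and the damping factor of 0.85 are assumptions; the README does not say how sinks and hogs are actually handled.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over {page: [outgoing pages]} using map and reduce steps."""
    # Pre-processing (assumed): make every link target a node and drop duplicate links.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    graph = {page: list(dict.fromkeys(links.get(page, []))) for page in pages}
    n = len(graph)
    if n == 0:
        return {}
    rank = {page: 1.0 / n for page in graph}

    for _ in range(iterations):
        # Map: each page emits an equal share of its rank to every page it links to.
        contributions = []
        sink_mass = 0.0
        for page, targets in graph.items():
            if targets:
                share = rank[page] / len(targets)
                contributions.extend((target, share) for target in targets)
            else:
                sink_mass += rank[page]  # a sink has nowhere to send its rank

        # Reduce: sum the contributions received by each page.
        received = dict.fromkeys(graph, 0.0)
        for target, share in contributions:
            received[target] += share

        # Spread the sinks' rank evenly over all pages, then apply damping.
        rank = {
            page: (1.0 - damping) / n + damping * (received[page] + sink_mass / n)
            for page in graph
        }
    return rank
```

Redistributing the sink mass keeps the ranks summing to 1 on every iteration, so scores stay comparable across runs.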
