HadoopWikiProject

The objective of this project is to analyze public Wikimedia logs with Apache Hadoop in order to identify traffic load patterns and popularity patterns. Apache Hadoop is an open-source software framework, licensed under the Apache License 2.0, that supports data-intensive distributed applications running on large clusters of commodity hardware. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. The dataset for this project comprises three days’ worth (January 1st, 2012 to January 3rd, 2012) of Wikimedia log entries, which amounts to about 5.6 GB of compressed data and about 20 GB after decompression.

The goals of this project are:
1.) Perform a temporal analysis of the total number of requests per hour.
2.) Find the most popular Wikimedia project based on total views per hour per project.
3.) Find the top 10 most popular pages during a given day.
4.) Find the top 10 pages that returned the most content during a given day.
5.) Determine whether this data obeys Zipf’s law in terms of popularity.
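
The aggregation behind goals 1 to 4 can be sketched locally in plain Python before it is expressed as Hadoop jobs. This is only an illustration, not the code in the goal folders. It assumes each log line has four whitespace-separated fields (project code, page title, view count, bytes returned); the function names `parse_line`, `aggregate`, and `top_n` are hypothetical. For the per-hour goals, the same counters would additionally be keyed by the hour each log file covers.

```python
from collections import Counter
import heapq


def parse_line(line):
    """Parse one log line; return None if it does not match the assumed format.

    Assumed format (not confirmed by this README):
    '<project> <page_title> <views> <bytes>'
    """
    parts = line.split()
    if len(parts) != 4:
        return None
    project, page, views, size = parts
    try:
        return project, page, int(views), int(size)
    except ValueError:
        return None


def aggregate(lines):
    """Sum views per project, views per page, and bytes per page."""
    views_per_project = Counter()
    views_per_page = Counter()
    bytes_per_page = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip malformed lines instead of failing the whole run
        project, page, views, size = record
        views_per_project[project] += views
        views_per_page[(project, page)] += views
        bytes_per_page[(project, page)] += size
    return views_per_project, views_per_page, bytes_per_page


def top_n(counter, n=10):
    """Return the n entries with the largest totals, largest first."""
    return heapq.nlargest(n, counter.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "en Main_Page 100 2000",
        "en Hadoop 5 9000",
        "de Hauptseite 40 800",
        "bad line",
    ]
    per_project, per_page, per_bytes = aggregate(sample)
    print(top_n(per_project, 1))
    print(top_n(per_page, 10))
    print(top_n(per_bytes, 10))
```

In a MapReduce setting, `parse_line` corresponds roughly to the map step (emit a key and a count) and the `Counter` updates to the reduce step (sum the counts per key); the top-10 selection would then run over the reduced totals.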

The "testwikilogs" folder contains a small snapshot of the data used for this project.
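
A small sample like this is also convenient for trying out the Zipf's law check from goal 5. Zipf's law says that the frequency of an item is roughly proportional to 1/rank^s, with s close to 1, so on a log-log plot of view count against rank the points fall near a straight line with slope about -1. The sketch below estimates that slope by ordinary least squares. It is one possible approach under these assumptions, not necessarily the method used in this project, and `zipf_slope` is a hypothetical helper name.

```python
import math


def zipf_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    Counts are ranked from largest to smallest; zero counts are dropped
    because their logarithm is undefined.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive counts to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic counts that follow 1/rank exactly, for illustration only.
    ideal = [1000.0 / rank for rank in range(1, 51)]
    print(zipf_slope(ideal))
```

A fitted slope near -1 is consistent with Zipf's law; a slope far from -1, or points that clearly bend away from a straight line on the log-log plot, suggests the data follows a different distribution.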
