therealshah/Thesis
This is the git repo for my Thesis. All of the code and datasets are presented here and open. I have also attached my Thesis in this repo, in case anyone wants to read it :).

Abstract

There are many cases in which a file is created with only minimal modifications from the previous version. This may occur in versioned document sets such as Wikipedia, where a newer version is created by inserting or deleting a paragraph, fixing spelling issues, or even simply correcting a grammar error. Instead of storing the newer version of the file in its entirety, it would be less expensive to store only the pieces of the file that differ from the previous version. In this paper, we examine and explain the current algorithms that are used to detect document similarity between files. The algorithms we examine are Karp-Rabin, Winnowing, TDDD, and 2Min. We run the algorithms on various datasets, such as the Internet Archive, gcc and emacs source files, and randomly generated files, to determine which algorithm finds the most document similarity. We run timing experiments to determine the speed of each algorithm. We also present two new algorithms, created by modifying the 2Min algorithm, which outperform the original in finding document similarity.
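To give a flavor of two of the algorithms named above, here is a minimal sketch of a Karp-Rabin rolling hash over byte windows, plus the winnowing selection rule that keeps only the rightmost minimum hash from each window of hashes. This is an illustrative sketch, not the thesis code: the base `B`, modulus `M`, and window sizes are assumptions chosen for readability.

```python
B = 257          # hash base (illustrative assumption)
M = 1_000_003    # prime modulus (illustrative assumption)

def karp_rabin_fingerprints(data: bytes, w: int) -> list[int]:
    """Return the rolling hash of every length-w window of data."""
    if len(data) < w:
        return []
    h = 0
    for byte in data[:w]:
        h = (h * B + byte) % M
    fingerprints = [h]
    top = pow(B, w - 1, M)  # weight of the byte that slides out
    for i in range(w, len(data)):
        h = (h - data[i - w] * top) % M   # drop the outgoing byte
        h = (h * B + data[i]) % M         # bring in the new byte
        fingerprints.append(h)
    return fingerprints

def winnow(hashes: list[int], t: int) -> list[tuple[int, int]]:
    """Winnowing: from each window of t consecutive hashes, keep the
    rightmost minimum as a (position, hash) fingerprint of the document."""
    selected = set()
    for i in range(len(hashes) - t + 1):
        window = hashes[i:i + t]
        m = min(window)
        j = max(k for k, h in enumerate(window) if h == m)
        selected.add((i + j, window[j]))
    return sorted(selected)
```

Because identical windows always hash to the same value, two versions of a file that share unchanged regions will share fingerprints there, which is what lets these schemes detect similarity without comparing the files byte by byte.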

About

This is the work I'm doing for my Master's Thesis.
