pthairu/find-duplicates

find-duplicates

Playing with various implementations of bloom filters ...

Cassandra's bloom filter implementation works best for my use case: it is backed by OpenBitSet, which handles sets of larger cardinality (64 * 2**32-1) and is also faster than BitSet.
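To illustrate why the backing bit set matters: `java.util.BitSet` is addressed with `int` indices, so it tops out near 2**31 bits, while a bit set built on a `long[]` and addressed with `long` indices can go well beyond that. The sketch below is a hypothetical `LongBitSet` written for this README, not Cassandra's OpenBitSet, and it assumes a single `long[]` is enough to hold the words.

```java
// Hypothetical long-indexed bit set; a sketch, not Cassandra's OpenBitSet.
final class LongBitSet {
    private final long[] words;   // each word holds 64 bits
    private final long numBits;

    LongBitSet(long numBits) {
        if (numBits <= 0) {
            throw new IllegalArgumentException("numBits must be positive");
        }
        long numWords = (numBits + 63) >>> 6;   // round up to whole 64-bit words
        if (numWords > Integer.MAX_VALUE - 8) {
            throw new IllegalArgumentException("too many bits for one long[]");
        }
        this.words = new long[(int) numWords];
        this.numBits = numBits;
    }

    void set(long index) {
        checkIndex(index);
        // index >>> 6 selects the word, index & 63 selects the bit inside it
        words[(int) (index >>> 6)] |= 1L << (int) (index & 63);
    }

    boolean get(long index) {
        checkIndex(index);
        return (words[(int) (index >>> 6)] & (1L << (int) (index & 63))) != 0;
    }

    long capacity() {
        return numBits;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= numBits) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }

    public static void main(String[] args) {
        // Small demo size so it runs with a default heap; indices are still longs.
        LongBitSet bits = new LongBitSet(1_000L);
        bits.set(999L);
        System.out.println(bits.get(999L) + " " + bits.get(998L));
    }
}
```

Addressing bits with a `long` is the point here: the same code accepts indices above `Integer.MAX_VALUE` once the heap is large enough to hold the words.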

Quick start

  1. Build with Maven

    mvn compile assembly:single

  2. Define bloom filter size and false positive probability properties

    bloom.elements=2000000000

    bloom.probability=0.0001

  3. Get jamm.jar, which is needed to report memory sizes (https://github.com/jbellis/jamm)

  4. Run it

    java -javaagent:./jamm-0.2.6-SNAPSHOT.jar -jar find-duplicates-jar-with-dependencies.jar -c <PROPERTIES FILE> -d <BASE DATA DIR> -r <FILE REGEX>
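For a feel of what the two properties in step 2 imply, here is a rough sizing sketch using the textbook bloom filter formulas: bits = -n ln(p) / (ln 2)^2 and hashes = (bits / n) ln 2. The class `BloomSizing` and its method names are hypothetical, and Cassandra's implementation may size its filter differently, so treat the numbers as an estimate only.

```java
// Hypothetical helper: textbook bloom filter sizing, not Cassandra's own calculation.
final class BloomSizing {

    // Number of bits for n elements at false positive probability p.
    static long optimalBits(long elements, double probability) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-elements * Math.log(probability) / (ln2 * ln2));
    }

    // Number of hash functions that minimizes the false positive rate.
    static int optimalHashes(long elements, long bits) {
        return Math.max(1, (int) Math.round((double) bits / elements * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 2_000_000_000L;   // bloom.elements
        double p = 0.0001;         // bloom.probability
        long bits = optimalBits(n, p);
        int k = optimalHashes(n, bits);
        System.out.printf("bits=%d (~%.2f GiB), hashes=%d%n",
                bits, bits / 8.0 / (1L << 30), k);
    }
}
```

With the values from step 2, this works out to roughly 19.2 bits per element, about 4.5 GiB of bit set and 13 hash functions, so the JVM heap likely needs to be sized with that in mind.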
