Experiments with various Bloom filter implementations.
Cassandra's Bloom filter implementation works best for my use case: it is backed by OpenBitSet, which handles sets of larger cardinality (up to 64 * 2**32-1) and is also faster than java.util.BitSet.
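Why the bit set matters: java.util.BitSet is indexed by int, so it tops out at about 2**31 bits, while a bit set backed by a long[] and indexed by long can address far more. A minimal sketch of that idea follows. LongIndexedBits is a hypothetical name used only for illustration; this is not the OpenBitSet or Cassandra code.

```java
// Hypothetical illustration only: a bit set addressed by long indices.
// Not the OpenBitSet source; it just shows the word/bit arithmetic.
final class LongIndexedBits {
    private final long[] words; // each long word stores 64 bits

    LongIndexedBits(long numBits) {
        if (numBits < 0) {
            throw new IllegalArgumentException("numBits must be >= 0");
        }
        long numWords = (numBits + 63) >>> 6; // round up to whole 64-bit words
        if (numWords > Integer.MAX_VALUE - 8) {
            throw new IllegalArgumentException("too many bits for a single long[]");
        }
        words = new long[(int) numWords];
    }

    // Set the bit at the given index: word = index / 64, bit = index % 64.
    void set(long index) {
        words[(int) (index >>> 6)] |= 1L << (index & 63);
    }

    // Return true if the bit at the given index is set.
    boolean get(long index) {
        return (words[(int) (index >>> 6)] & (1L << (index & 63))) != 0;
    }

    // Number of addressable bits (always a multiple of 64).
    long capacity() {
        return (long) words.length << 6;
    }

    public static void main(String[] args) {
        // Small demo size so it runs with a default heap.
        LongIndexedBits bits = new LongIndexedBits(1_000L);
        bits.set(999L);
        System.out.println(bits.get(999L) + " " + bits.get(998L));
    }
}
```

Because the index is a long, the limit becomes the length of the backing array rather than the range of int; a single long[] reaches roughly 64 * 2**31 bits.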
-
Build with Maven
mvn compile assembly:single
-
Define the Bloom filter size (expected number of elements) and false positive probability in a properties file
bloom.elements=2000000000
bloom.probability=0.0001
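To get a feel for what these two values imply, here is a rough sizing sketch using the textbook Bloom filter formulas m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2. BloomSizing and its methods are hypothetical names; Cassandra's implementation may derive its bucket and hash counts differently, so treat the numbers as an estimate only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper: estimates Bloom filter size from the two properties.
// Uses the textbook formulas, which are an assumption here, not Cassandra's code.
final class BloomSizing {

    // Optimal number of bits m for n elements at false positive probability p.
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    // Optimal number of hash functions k for m bits and n elements.
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) throws IOException {
        // Inline copy of the properties shown above, to keep the sketch self-contained.
        Properties props = new Properties();
        props.load(new StringReader(
                "bloom.elements=2000000000\nbloom.probability=0.0001\n"));

        long n = Long.parseLong(props.getProperty("bloom.elements"));
        double p = Double.parseDouble(props.getProperty("bloom.probability"));

        long m = optimalBits(n, p);
        int k = optimalHashes(n, m);
        System.out.printf("bits=%d (~%.1f GB), hashes=%d%n", m, m / 8.0 / 1e9, k);
        System.out.println("fits in java.util.BitSet: " + (m <= Integer.MAX_VALUE));
    }
}
```

With the values above the estimate comes out at roughly 38 billion bits (about 4.8 GB), well beyond what an int-indexed java.util.BitSet can address.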
-
You also need jamm.jar, loaded as a Java agent, to report memory sizes (https://github.com/jbellis/jamm)
-
Run it
java -javaagent:./jamm-0.2.6-SNAPSHOT.jar -jar find-duplicates-jar-with-dependencies.jar -c <PROPERTIES FILE> -d <BASE DATA DIR> -r <FILE REGEX>