A Java library of similarity and distance metrics, e.g. Levenshtein distance and cosine similarity. All similarity metrics return normalized values between 0 and 1 rather than unbounded similarity scores. Distance metrics return non-negative, unbounded scores.
For quick and easy use, StringMetrics contains a collection of well-known string metrics.
String str1 = "This is a sentence. It is made of words";
String str2 = "This sentence is similar. It has almost the same words";
StringMetric metric = StringMetrics.cosineSimilarity();
float result = metric.compare(str1, str2); //0.4767
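To make the difference between a distance and a normalized similarity concrete, here is a minimal plain-Java sketch that does not use the library. The class and method names are hypothetical. The normalization shown (one minus the distance divided by the longer length) is one common choice and is an assumption here, not necessarily what the library does internally.

```java
class DistanceVsSimilaritySketch {

    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    // The result is a non-negative integer with no fixed upper bound.
    static int levenshteinDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization for illustration: scale the distance by the
    // longer input so the result falls between 0 (nothing in common)
    // and 1 (identical).
    static float levenshteinSimilarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0f; // two empty strings are treated as identical
        }
        return 1.0f - (float) levenshteinDistance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(levenshteinDistance("kitten", "sitting"));   // 3
        System.out.println(levenshteinSimilarity("kitten", "sitting")); // about 0.571
    }
}
```

The distance grows with the length of the inputs, while the similarity stays comparable across inputs of different lengths.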
The StringMetricBuilder is a convenience tool for building string metrics. Any class implementing StringMetric, ListMetric, SetMetric or MultisetMetric can be used to build a string metric. The builder supports simplification, tokenization, token filtering, token transformation, and caching. For usage, see the examples section.
For a terse syntax, use import static org.simmetrics.builders.StringMetricBuilder.with;
String str1 = "This is a sentence. It is made of words";
String str2 = "This sentence is similar. It has almost the same words";
StringMetric metric =
        with(new CosineSimilarity<String>())
        .simplify(Simplifiers.toLowerCase(Locale.ENGLISH))
        .simplify(Simplifiers.replaceNonWord())
        .tokenize(Tokenizers.whitespace())
        .build();
float result = metric.compare(str1, str2); //0.5720
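The builder pipeline above (simplify, then tokenize, then compare) can be approximated in plain Java to show what each stage contributes. This is an illustrative sketch only, not the library's implementation: the class and method names are hypothetical, and it assumes that replaceNonWord substitutes a space for every non-word character, that the whitespace tokenizer drops empty tokens, and that cosine similarity is computed over token counts.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class CosinePipelineSketch {

    // Simplification: lower-case the input, then replace every non-word
    // character with a space (assumed behaviour of replaceNonWord).
    static String simplify(String s) {
        return s.toLowerCase(Locale.ENGLISH).replaceAll("\\W", " ");
    }

    // Tokenization: split on runs of whitespace and count each token.
    static Map<String, Integer> tokenize(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : s.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine similarity between two token-count vectors.
    static float cosine(Map<String, Integer> a, Map<String, Integer> b) {
        // Conventions chosen for this sketch to avoid dividing by zero.
        if (a.isEmpty() && b.isEmpty()) return 1.0f;
        if (a.isEmpty() || b.isEmpty()) return 0.0f;

        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int count = e.getValue();
            normA += count * count;
            dot += count * b.getOrDefault(e.getKey(), 0);
        }
        for (int count : b.values()) {
            normB += count * count;
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }

    // Full pipeline: simplify, tokenize, compare.
    static float compare(String a, String b) {
        return cosine(tokenize(simplify(a)), tokenize(simplify(b)));
    }

    public static void main(String[] args) {
        String str1 = "This is a sentence. It is made of words";
        String str2 = "This sentence is similar. It has almost the same words";
        System.out.println(cosine(tokenize(str1), tokenize(str2))); // about 0.477
        System.out.println(compare(str1, str2));                    // about 0.572
    }
}
```

Under these assumptions the sketch lands close to both values quoted in the examples. Without simplification, "sentence." and "sentence" count as different tokens, and so do "This" and "this", which is why the unsimplified score is lower.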