
gwhiteside/simmetrics

 
 



SimMetrics

A Java library of similarity and distance metrics, e.g. Levenshtein distance and cosine similarity. All similarity metrics return normalized values in the range 0 to 1 rather than unbounded similarity scores. Distance metrics return non-negative, unbounded scores.
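To make the difference between a distance and a normalized similarity concrete, here is a minimal sketch in plain Java. It is not SimMetrics' own code: the class name LevenshteinSketch is hypothetical, and dividing by the length of the longer string is an assumed normalization chosen for illustration.

```java
// Hypothetical illustration, not part of SimMetrics.
final class LevenshteinSketch {

    // Edit distance via dynamic programming: non-negative and unbounded,
    // since it grows with the length of the inputs.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1 - distance / length of the longer string.
    // The result always falls between 0 and 1 inclusive.
    static float similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        if (max == 0) {
            return 1.0f; // two empty strings are identical
        }
        return 1.0f - (float) distance(a, b) / max;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));   // 3
        System.out.println(similarity("kitten", "sitting")); // about 0.571
    }
}
```

The distance can grow without limit as the strings get longer, while the similarity stays within 0 and 1 and can be compared across pairs of different lengths.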

Usage

For quick and easy use, StringMetrics contains a collection of well-known string metrics.

	String str1 = "This is a sentence. It is made of words";
	String str2 = "This sentence is similar. It has almost the same words";
	
	StringMetric metric = StringMetrics.cosineSimilarity();
	
	float result = metric.compare(str1, str2); //0.4767

The StringMetricBuilder is a convenience tool for building string metrics. Any class implementing StringMetric, ListMetric, SetMetric or MultisetMetric can be used to build a string metric. The builder supports simplification, tokenization, token filtering, token transformation, and caching. For usage, see the examples section.

For a terse syntax, add the static import: import static org.simmetrics.builders.StringMetricBuilder.with;

	String str1 = "This is a sentence. It is made of words";
	String str2 = "This sentence is similar. It has almost the same words";

	StringMetric metric =
			with(new CosineSimilarity<String>())
			.simplify(Simplifiers.toLowerCase(Locale.ENGLISH))
			.simplify(Simplifiers.replaceNonWord())
			.tokenize(Tokenizers.whitespace())
			.build();

	float result = metric.compare(str1, str2); //0.5720
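The two results above (0.4767 and 0.5720) can be reproduced by hand. The following is a sketch in plain Java of what a simplify, tokenize, compare pipeline does. It is not the library's implementation: the class CosinePipelineSketch and its methods are hypothetical, and it assumes that the non-word simplifier replaces each character matching \W with a space and that cosine similarity is computed over token counts.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration, not part of SimMetrics.
final class CosinePipelineSketch {

    // Simplify: lower-case, then replace every non-word character with a space
    // (assumed behavior, for illustration only).
    static String simplify(String s) {
        return s.toLowerCase(Locale.ENGLISH).replaceAll("\\W", " ");
    }

    // Tokenize on runs of whitespace and count how often each token occurs.
    static Map<String, Integer> tokenCounts(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : s.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of the token-count vector.
    static double norm(Map<String, Integer> counts) {
        double sum = 0;
        for (int c : counts.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot product divided by the product of the norms.
    static float cosine(Map<String, Integer> a, Map<String, Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0f;
        }
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0f;
        }
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        return (float) (dot / (norm(a) * norm(b)));
    }

    // Whitespace tokens only, no simplification.
    static float compareRaw(String a, String b) {
        return cosine(tokenCounts(a), tokenCounts(b));
    }

    // Simplify first, then tokenize and compare.
    static float compareSimplified(String a, String b) {
        return cosine(tokenCounts(simplify(a)), tokenCounts(simplify(b)));
    }

    public static void main(String[] args) {
        String str1 = "This is a sentence. It is made of words";
        String str2 = "This sentence is similar. It has almost the same words";
        System.out.println(compareRaw(str1, str2));        // about 0.4767
        System.out.println(compareSimplified(str1, str2)); // about 0.5721
    }
}
```

Without simplification, "sentence." and "sentence" are different tokens, so only "This", "is", "It" and "words" are shared: the dot product is 5 and the score is 5/√110 ≈ 0.4767. After simplification "sentence" matches as well, giving 6/√110 ≈ 0.5721.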
