Text Similarity Metrics

GAICo provides several metrics for measuring how similar two text inputs are.

These metrics are useful for tasks such as text comparison, duplicate detection, and semantic similarity analysis.

gaico.metrics.text_similarity_metrics.JaccardSimilarity

Bases: TextualMetric

Jaccard Similarity implementation for text similarity using the formula: J(A, B) = |A ∩ B| / |A ∪ B|

Supports calculation for individual sentence pairs and for batches of sentences.
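To make the formula concrete, here is a minimal sketch of Jaccard similarity over word sets. It is not GAICo's implementation: the helper name `jaccard_similarity`, the lowercase-and-whitespace tokenization, and the choice to return 0.0 for two empty texts are all assumptions made for illustration.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B| over word sets."""
    # Assumption: tokens are lowercase, whitespace-separated words.
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty texts score 0.0 (avoids dividing by zero).
        return 0.0
    return len(set_a & set_b) / len(union)


# Shared words: {the, cat, on}; union has 7 distinct words, so the score is 3/7.
score = jaccard_similarity("the cat sat on the mat", "the cat lay on the rug")
print(round(score, 4))  # 0.4286
```

Because the metric works on sets, repeated words and word order do not affect the score.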

Functions

__init__

__init__(**kwargs: Any)

Initialize the Jaccard Similarity metric.

gaico.metrics.text_similarity_metrics.CosineSimilarity

Bases: TextualMetric

Cosine Similarity implementation for text similarity using cosine_similarity from scikit-learn. The class also uses the CountVectorizer from scikit-learn to convert text to vectors.

Supports calculation for individual sentence pairs and for batches of sentences.
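The idea behind count-based cosine similarity can be sketched with the standard library alone. This is an illustration rather than GAICo's code, which relies on scikit-learn as described above. The helper names are hypothetical, and the tokenization (lowercasing, tokens of two or more word characters) is an assumption meant to resemble CountVectorizer's defaults.

```python
import math
import re
from collections import Counter

# Assumption: lowercase tokens of 2+ word characters, similar to
# CountVectorizer's default settings.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_vector(text: str) -> Counter:
    """Map each token to the number of times it occurs in the text."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


def cosine_similarity_counts(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two token-count vectors."""
    vec_a, vec_b = count_vector(text_a), count_vector(text_b)
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: a text with no tokens has similarity 0.0.
        return 0.0
    return dot / (norm_a * norm_b)


# Dot product is 2*2 + 1 + 1 = 6 and both norms are sqrt(8), so 6 / 8.
score = cosine_similarity_counts("the cat sat on the mat", "the cat lay on the rug")
print(round(score, 4))  # 0.75
```

Unlike Jaccard similarity, this measure is sensitive to how often each word occurs, which is why the repeated "the" contributes more to the score.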

Functions

__init__

__init__(**kwargs: Any)

Initialize the Cosine Similarity metric.

Parameters:

Name | Type | Description | Default
kwargs | Any | Parameters for CountVectorizer. 'seed' is extracted for BaseMetric. | {}

gaico.metrics.text_similarity_metrics.LevenshteinDistance

Bases: TextualMetric

This class provides methods to calculate Levenshtein Distance for individual sentence pairs and for batches of sentences. It uses the distance and ratio functions from the Levenshtein package.
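For readers unfamiliar with the underlying measure, the following is a pure-Python sketch of the classic dynamic-programming edit distance. It is an illustration only: GAICo delegates to the Levenshtein package, and the function name `levenshtein_distance` here is hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `a` into `b`."""
    # previous[j] is the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
```

Keeping only two rows of the table holds memory proportional to the length of `b`, while the running time remains proportional to the product of the two lengths.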

Functions

__init__

__init__(calculate_ratio: bool = True, **kwargs: Any)

Initialize the Levenshtein Distance metric.

Parameters:

Name | Type | Description | Default
calculate_ratio | bool | Whether to calculate the ratio of the distance to the length of the longer string. | True
kwargs | Any | Additional parameters for BaseMetric, such as 'seed'. | {}

gaico.metrics.text_similarity_metrics.SequenceMatcherSimilarity

Bases: TextualMetric

This class calculates a similarity ratio between texts using the ratio() method of difflib.SequenceMatcher, which returns a float in the range [0, 1] indicating how similar the sequences are.

Supports calculation for individual sentence pairs and for batches of sentences.
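The ratio() call described above can be tried directly with the standard library. This is a minimal sketch of the underlying difflib call, not GAICo's wrapper. The helper name `sequence_similarity` is hypothetical, and comparing the raw strings character by character is an assumption.

```python
from difflib import SequenceMatcher


def sequence_similarity(text_a: str, text_b: str) -> float:
    """Similarity in [0, 1] computed as 2 * M / T, where M is the number of
    matched characters and T is the combined length of both strings."""
    # The first argument is the "junk" predicate; None disables it.
    return SequenceMatcher(None, text_a, text_b).ratio()


# "bcd" is the matching block: M = 3, T = 8, so the ratio is 0.75.
score = sequence_similarity("abcd", "bcde")
print(score)  # 0.75
```

A value of 1.0 means the sequences are identical, and 0.0 means they have nothing in common.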

Functions

__init__

__init__(**kwargs: Any)

Initialize the SequenceMatcher Similarity metric.