Text Similarity Metrics

GAICo provides several metrics for measuring how similar two text inputs are.

These metrics are useful for tasks such as text comparison, duplicate detection, and semantic similarity analysis.

gaico.metrics.text_similarity_metrics.JaccardSimilarity

Bases: TextualMetric

Jaccard Similarity implementation for text similarity using the formula: J(A, B) = |A ∩ B| / |A ∪ B|

Supports calculation for individual sentence pairs and for batches of sentences.
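As a rough illustration of the formula, the sketch below computes J(A, B) over word sets. It is not GAICo's implementation: the function name `jaccard_similarity` is hypothetical, and treating A and B as sets of lowercase, whitespace-separated tokens is an assumption made for this example.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity over lowercase, whitespace-separated tokens.

    The tokenization is an assumption for this sketch, not taken from GAICo.
    """
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    union = set_a | set_b
    if not union:
        # Both texts are empty; returning 0.0 is a convention chosen here.
        return 0.0
    return len(set_a & set_b) / len(union)


# Two of the four distinct tokens are shared, so the score is 2 / 4.
score = jaccard_similarity("the cat sat", "the cat ran")
print(score)
```

Because the metric works on sets, repeated words and word order have no effect on the score.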

__init__

__init__()

Initialize the Jaccard Similarity metric.

gaico.metrics.text_similarity_metrics.CosineSimilarity

Bases: TextualMetric

Cosine Similarity implementation for text similarity using cosine_similarity from scikit-learn. The class also uses the CountVectorizer from scikit-learn to convert text to vectors.

Supports calculation for individual sentence pairs and for batches of sentences.
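The idea behind count-based cosine similarity can be sketched with the standard library alone. This is a stand-in, not the class's code: `Counter` replaces scikit-learn's CountVectorizer, the regular expression only approximates that vectorizer's default tokenization (lowercased tokens of two or more word characters), and the names `bag_of_words` and `cosine_similarity_counts` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # Lowercase the text and keep tokens of two or more word characters,
    # approximating CountVectorizer's default behaviour (an assumption).
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def cosine_similarity_counts(text_a: str, text_b: str) -> float:
    vec_a, vec_b = bag_of_words(text_a), bag_of_words(text_b)
    # Dot product over the tokens of vec_a; missing tokens count as 0.
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # No usable tokens in one of the texts; 0.0 is a convention chosen here.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity_counts("the cat sat", "the cat ran"))
```

Unlike a set-based measure, the count vectors retain word frequency, so repeating a word changes the result.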

__init__

__init__(**kwargs)

Initialize the Cosine Similarity metric.

Parameters:

Name | Type | Description | Default
kwargs | Any | Parameters for the CountVectorizer | {}

gaico.metrics.text_similarity_metrics.LevenshteinDistance

Bases: TextualMetric

This class provides methods to calculate Levenshtein Distance for individual sentence pairs and for batches of sentences. It uses the distance and ratio functions from the Levenshtein package.

__init__

__init__(calculate_ratio=True)

Initialize the Levenshtein Distance metric.

Parameters:

Name | Type | Description | Default
calculate_ratio | bool | Whether to calculate the ratio of the distance to the length of the longer string. | True
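To make the two quantities concrete, here is a pure-Python sketch of the edit distance and of the ratio as the parameter description words it (distance divided by the length of the longer string). The function names are hypothetical and this is not the Levenshtein package's code; that package's own ratio function is a similarity score, so its values may differ from the ones produced here.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        # Keep the shorter string in the inner loop to limit row size.
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def distance_ratio(a: str, b: str) -> float:
    # Distance divided by the longer string's length, following the wording
    # of the parameter description (an assumption about the intended formula).
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0
    return levenshtein_distance(a, b) / longer


print(levenshtein_distance("kitten", "sitting"))
print(distance_ratio("kitten", "sitting"))
```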

gaico.metrics.text_similarity_metrics.SequenceMatcherSimilarity

Bases: TextualMetric

This class calculates the similarity ratio between two texts using the ratio() method of difflib.SequenceMatcher, which returns a float in the range [0, 1]: 1.0 means the sequences are identical and 0.0 means they have nothing in common.

Supports calculation for individual sentence pairs and for batches of sentences.
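The underlying standard-library call can be tried directly. The wrapper name `sequence_similarity` below is hypothetical and is only meant to show what ratio() returns; it does not reproduce how the class handles batches.

```python
from difflib import SequenceMatcher


def sequence_similarity(text_a: str, text_b: str) -> float:
    # The first argument is the "junk" predicate; None disables junk filtering.
    return SequenceMatcher(None, text_a, text_b).ratio()


# ratio() is 2 * M / T, where M is the number of matched characters and
# T is the combined length of both strings: here 2 * 3 / 8.
print(sequence_similarity("abcd", "bcde"))
```

Since the comparison is character-based and order-sensitive, rearranging words lowers the score even when the vocabulary is unchanged.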

__init__

__init__()

Initialize the SequenceMatcher Similarity metric.