Text Similarity Metrics
GAICo provides several metrics for measuring how similar two text inputs are.
These metrics are useful for tasks such as text comparison, duplicate detection, and semantic similarity analysis.
gaico.metrics.text_similarity_metrics.JaccardSimilarity
Bases: TextualMetric
Jaccard Similarity implementation for text similarity using the formula: J(A, B) = |A ∩ B| / |A ∪ B|
Supports calculation for individual sentence pairs and for batches of sentences.
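The formula can be illustrated with a minimal sketch in plain Python. It assumes A and B are the sets of lowercased, whitespace-separated tokens of each text; the tokenization GAICo actually applies may differ. The function names `jaccard_similarity` and `jaccard_batch` are hypothetical and not part of the GAICo API.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Return |A ∩ B| / |A ∪ B| over word sets (assumed tokenization)."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        # Assumption: two empty texts score 0.0 to avoid dividing by zero.
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def jaccard_batch(texts_a: list[str], texts_b: list[str]) -> list[float]:
    """Score sentence pairs position by position."""
    return [jaccard_similarity(a, b) for a, b in zip(texts_a, texts_b)]


# Shared words: the, cat, on (3). Distinct words overall: 7.
score = jaccard_similarity("the cat sat on the mat", "the cat lay on the rug")
print(round(score, 4))  # 0.4286
```

Because the score is computed over sets, repeated words and word order do not affect it.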
gaico.metrics.text_similarity_metrics.CosineSimilarity
Bases: TextualMetric
Cosine Similarity implementation for text similarity using cosine_similarity from scikit-learn.
The class also uses the CountVectorizer from scikit-learn to convert text to vectors.
Supports calculation for individual sentence pairs and for batches of sentences.
__init__
__init__(**kwargs)
Initialize the Cosine Similarity metric.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kwargs | Any | Parameters for the CountVectorizer | {} |
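The following sketch shows how the two scikit-learn pieces named above can be combined. It illustrates the technique and is not GAICo's implementation: the helper `count_cosine` is hypothetical, and forwarding the keyword arguments directly to CountVectorizer is an assumption based on the parameter description.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def count_cosine(text_a: str, text_b: str, **vectorizer_kwargs) -> float:
    """Cosine similarity between the count vectors of two texts."""
    # Assumption: keyword arguments are passed straight to CountVectorizer.
    vectorizer = CountVectorizer(**vectorizer_kwargs)
    # Fit on both texts so they share a single vocabulary.
    vectors = vectorizer.fit_transform([text_a, text_b])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])


# Default settings: lowercased word counts.
print(count_cosine("the cat sat on the mat", "the cat lay on the rug"))

# Adding bigrams makes the score sensitive to word order.
print(count_cosine("the cat sat on the mat", "the cat lay on the rug",
                   ngram_range=(1, 2)))
```

Options such as `ngram_range` or `stop_words` change the vocabulary, and therefore the score, without changing the cosine computation itself.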
gaico.metrics.text_similarity_metrics.LevenshteinDistance
Bases: TextualMetric
This class provides methods to calculate Levenshtein Distance for individual sentence pairs and for batches of sentences.
It uses the distance and ratio functions from the Levenshtein package.
__init__
__init__(calculate_ratio=True)
Initialize the Levenshtein Distance metric.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
calculate_ratio | bool | Whether to calculate the ratio of the distance to the length of the longer string, defaults to True. | True |
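For readers unfamiliar with the metric, here is a plain-Python sketch of the edit distance itself. The class is documented as delegating to the Levenshtein package, so this is only an illustration of what a distance function computes; `levenshtein_distance` is a hypothetical name, and the normalized score selected by `calculate_ratio` is not reproduced here.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
```

The raw distance grows with text length, which is why a normalized score is often easier to compare across sentence pairs.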
gaico.metrics.text_similarity_metrics.SequenceMatcherSimilarity
Bases: TextualMetric
This class calculates the similarity ratio between texts using the ratio() method from difflib.SequenceMatcher, which returns a float in the range [0, 1] indicating how similar the sequences are.
Supports calculation for individual sentence pairs and for batches of sentences.
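A minimal standard-library sketch of the underlying call follows. It compares the raw strings character by character, which is an assumption; the class may preprocess or tokenize its inputs first. The helper names `sequence_ratio` and `sequence_ratio_batch` are hypothetical.

```python
from difflib import SequenceMatcher


def sequence_ratio(text_a: str, text_b: str) -> float:
    """Similarity in [0, 1], computed as 2 * M / T, where M is the number
    of matched characters and T is the combined length of both texts."""
    # The first argument is the optional "junk" predicate; None disables it.
    return SequenceMatcher(None, text_a, text_b).ratio()


def sequence_ratio_batch(texts_a: list[str], texts_b: list[str]) -> list[float]:
    """Score sentence pairs position by position."""
    return [sequence_ratio(a, b) for a, b in zip(texts_a, texts_b)]


# "bcd" is the matched block: 2 * 3 / (4 + 4) = 0.75
print(sequence_ratio("abcd", "bcde"))  # 0.75
```

Note that ratio() can depend on argument order, so swapping the two texts may give a slightly different score.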