The Experiment Class¶
The Experiment class in GAICo provides a streamlined, integrated workflow for evaluating and comparing multiple Large Language Model (LLM) responses against a single reference answer. It simplifies common tasks such as:
- Calculating scores for multiple metrics across different LLM outputs.
- Applying custom or default thresholds to these scores.
- Generating comparative plots (bar charts for single metrics, radar charts for multiple metrics).
- Creating CSV reports summarizing the evaluation.
- Generating aggregated performance summaries (mean scores, pass rates).
- Dynamically registering and using custom evaluation metrics.
Why Use the Experiment Class?¶
While you can use GAICo's individual metric classes directly for fine-grained control (see Using Metrics Directly), the Experiment class is ideal when:
- You have responses from several LLMs (or different versions of the same LLM) that you want to compare against a common reference.
- You want a quick way to get an overview of performance across multiple metrics.
- You need to generate reports and visualizations with minimal boilerplate code.
- You want to quickly see aggregated statistics (like average scores and pass rates) for your models.
- You have custom metrics that you want to integrate into the streamlined Experiment workflow without modifying the core library.
It acts as a high-level orchestrator, making the end-to-end evaluation process more convenient.
Initializing an Experiment¶
To start, you need to instantiate the Experiment class. It requires two main pieces of information:
- llm_responses: A Python dictionary where keys are model names (strings) and values are the text responses generated by those models (strings).
- reference_answer: A single string representing the ground truth or reference text against which all LLM responses will be compared.
from gaico import Experiment
# Example LLM responses from different models
llm_responses = {
    "Google": "Title: Jimmy Kimmel Reacts to Donald Trump Winning the Presidential ... Snippet: Nov 6, 2024 ...",
    "Mixtral 8x7b": "I'm an AI and I don't have the ability to predict the outcome of elections.",
    "SafeChat": "Sorry, I am designed not to answer such a question.",
}
# The reference answer
reference_answer = "Sorry, I am unable to answer such a question as it is not appropriate."
# Initialize the Experiment
exp = Experiment(
    llm_responses=llm_responses,
    reference_answer=reference_answer
)
Comparing Models with compare()¶
The primary method for conducting the evaluation is compare(). This method orchestrates score calculation, plotting, threshold application, and CSV generation based on the parameters you provide.
# Basic comparison using default metrics, without plotting or CSV output
results_df = exp.compare()
print("Scores DataFrame (default metrics):")
print(results_df)
# A more comprehensive comparison:
# - Specify a subset of metrics
# - Enable plotting
# - Define custom thresholds for some metrics
# - Output results to a CSV file
results_df_custom = exp.compare(
    metrics=['Jaccard', 'ROUGE', 'Levenshtein'],  # Specify metrics, or None for all defaults
    plot=True,  # Generate and show plots
    output_csv_path="my_experiment_report.csv",  # Save a CSV report
    custom_thresholds={"Jaccard": 0.6, "ROUGE_rougeL": 0.35}  # Optional: override default thresholds
)
print("\nScores DataFrame (custom run):")
print(results_df_custom)
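If you passed output_csv_path as above, you can load the report back with plain pandas to inspect it. This snippet is a sketch of standard pandas usage, not part of the GAICo API; the column names follow the CSV description under "Key Parameters of compare()" below.
import pandas as pd
# Load the CSV report written by the compare() call above.
report_df = pd.read_csv("my_experiment_report.csv")
# The report contains generated_text, reference_text, per-metric *_score columns,
# and per-metric *_passed columns (see the parameter descriptions below).
print(report_df.columns.tolist())
print(report_df.head())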
Key Parameters of compare():¶
- metrics (Optional List[str]):
  - A list of base metric names (e.g., "Jaccard", "ROUGE", "BERTScore") to calculate.
  - If None (default), all registered default metrics are used (currently: Jaccard, Cosine, Levenshtein, SequenceMatcher, BLEU, ROUGE, JSD, BERTScore).
  - Note: For metrics like ROUGE or BERTScore that produce multiple sub-scores (e.g., ROUGE_rouge1, ROUGE_rougeL, BERTScore_f1), specifying the base name (e.g., "ROUGE") will include all of its sub-scores in the results.
- plot (Optional bool, default False):
  - If True, generates and displays plots using Matplotlib/Seaborn.
  - If one metric is evaluated, a bar chart comparing models on that metric is shown.
  - If multiple metrics are evaluated (at least 3, up to radar_metrics_limit), a radar chart is shown, providing a multi-dimensional comparison of models. If fewer than 3 (but more than 1) or more than radar_metrics_limit metrics are present, individual bar charts are shown for each metric.
- custom_thresholds (Optional Dict[str, float]):
  - A dictionary specifying custom pass/fail thresholds for metrics.
  - Keys can be base metric names (e.g., "Jaccard") or specific "flattened" metric names as they appear in the output DataFrame (e.g., "ROUGE_rouge1", "BERTScore_f1").
  - These thresholds override the library's default thresholds for the specified metrics and determine the pass/fail status in the CSV report.
- output_csv_path (Optional str):
  - If a file path is provided, a CSV report is written to this location.
  - The CSV report includes:
    - generated_text: The response from each LLM.
    - reference_text: The reference answer (repeated for each model).
    - A column for each metric's score (e.g., Jaccard_score, ROUGE_rouge1_score).
    - A column for each metric's pass/fail status based on the applied threshold (e.g., Jaccard_passed, ROUGE_rouge1_passed).
- aggregate_func (Optional Callable):
  - An aggregation function (e.g., numpy.mean, numpy.median) used for plotting when multiple scores may exist per model/metric. For the standard Experiment use case (one response per model), this typically does not change the scores but is available for plot customization. Defaults to numpy.mean.
- plot_title_suffix (Optional str, default "Comparison"):
  - A string suffix appended to the titles of generated plots.
- radar_metrics_limit (Optional int, default 12):
  - The maximum number of metrics shown on a single radar plot, to maintain readability. If more metrics are present and a radar plot is applicable, only the first radar_metrics_limit metrics are plotted on that radar chart.
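The earlier example covered metrics, plot, custom_thresholds, and output_csv_path. The following is a non-authoritative sketch of how the remaining plotting-related parameters fit together, reusing the exp object from the initialization example; the chosen values are arbitrary.
import numpy as np
# Illustrative values only; reuses `exp` from the initialization example above.
results_df_plots = exp.compare(
    metrics=['Jaccard', 'Cosine', 'Levenshtein', 'SequenceMatcher', 'BLEU'],  # >= 3 metrics, so a radar chart applies
    plot=True,
    aggregate_func=np.mean,         # aggregation used for plotting (also the default)
    plot_title_suffix="Demo Run",   # appended to plot titles
    radar_metrics_limit=8           # cap on how many metrics a single radar chart shows
)
print(results_df_plots.head())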
What compare() Returns¶
The compare() method returns a pandas DataFrame containing the calculated scores. The DataFrame typically has the following columns:
- model_name: The name of the LLM (from the keys of your llm_responses dictionary).
- metric_name: The specific metric calculated. This will be a "flattened" name if the base metric produces multiple scores (e.g., "Jaccard", "ROUGE_rouge1", "ROUGE_rougeL", "BERTScore_f1").
- score: The numerical score for that model and metric, typically normalized between 0 and 1.
This DataFrame is useful for any further custom analysis, filtering, or reporting you might want to perform.
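Because the DataFrame is in long format (one row per model/metric pair), standard pandas operations cover most follow-up analysis. A minimal sketch, assuming results_df_custom from the compare() example above:
# Pivot the long-format scores into a wide model-by-metric table.
wide_df = results_df_custom.pivot(index="model_name", columns="metric_name", values="score")
print(wide_df.round(3))
# Example filter: keep only models whose Jaccard score exceeds 0.5.
print(wide_df[wide_df["Jaccard"] > 0.5])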
Accessing Scores Separately with to_dataframe()¶
If you only need the scores in a pandas DataFrame without triggering plots or CSV generation, you can use the to_dataframe() method. This can be useful for programmatic access to the scores.
# Get scores for Jaccard and Levenshtein only
scores_subset_df = exp.to_dataframe(metrics=['Jaccard', 'Levenshtein'])
print("\nSubset of scores:")
print(scores_subset_df)
# Get scores for all default metrics (if not already computed, they will be)
all_scores_df = exp.to_dataframe() # Equivalent to exp.to_dataframe(metrics=None)
print("\nAll default scores:")
print(all_scores_df)
This method is efficient as it uses cached scores if they have already been computed (e.g., by a previous call to compare() or an earlier call to to_dataframe()). If scores for requested metrics haven't been computed yet, to_dataframe() will calculate them.
Summarizing Results with summarize()¶
Beyond raw scores, you often need aggregated insights into model performance, such as average scores and pass rates. The summarize() method provides a convenient way to generate a clean, aggregated DataFrame.
This method is particularly useful when evaluating models on a dataset (batch mode), as it automatically aggregates scores across all items.
from gaico import Experiment
import pandas as pd
# Example: Batch evaluation setup (assuming you have lists of responses)
llm_responses_batch = {
    "Model_A": ["response A1", "response A2", "response A3"],
    "Model_B": ["response B1", "response B2", "response B3"],
}
reference_answers_batch = ["ref 1", "ref 2", "ref 3"]
exp_batch = Experiment(
    llm_responses=llm_responses_batch,
    reference_answer=reference_answers_batch
)
# Scores are computed on demand: you can run exp_batch.compare(...) first,
# or let summarize() calculate any metrics that have not been computed yet.
# Get a summary of results, including mean, standard deviation, and pass rates
summary_df = exp_batch.summarize(
    metrics=['Jaccard', 'ROUGE'],  # Specify metrics to summarize
    custom_thresholds={"Jaccard": 0.6, "ROUGE_rouge1": 0.35},  # Optional: custom thresholds for pass rates
    agg_funcs=['mean', 'std', 'min', 'max']  # Optional: specify aggregation functions
)
print("\nAggregated Summary DataFrame:")
print(summary_df)
# Expected Output (example, actual values depend on data and metrics):
# Aggregated Summary DataFrame:
# model_name Jaccard_mean Jaccard_std Jaccard_min Jaccard_max ROUGE_rouge1_mean ROUGE_rouge1_std ROUGE_rouge1_min ROUGE_rouge1_max Jaccard_pass_rate ROUGE_rouge1_pass_rate
# 0 Model_A 0.750000 0.100000 0.650000 0.850000 0.700000 0.050000 0.650000 0.750000 66.666667 100.000000
# 1 Model_B 0.500000 0.050000 0.450000 0.550000 0.400000 0.020000 0.380000 0.420000 0.000000 33.333333
Key Parameters of summarize():¶
- metrics (Optional List[str]):
  - A list of base metric names (e.g., "Jaccard", "ROUGE") to include in the summary.
  - If None (default), all metrics that have been calculated by previous compare() or to_dataframe() calls are summarized; if none have been calculated, all default metrics are used.
- custom_thresholds (Optional Dict[str, float]):
  - A dictionary specifying custom pass/fail thresholds for metrics. These are used to calculate the _pass_rate columns.
  - Keys can be base metric names (e.g., "Jaccard") or specific "flattened" metric names (e.g., "ROUGE_rouge1").
- agg_funcs (Optional List[str], default ['mean', 'std']):
  - A list of aggregation function names (as strings) to apply to the scores. Common options include 'mean', 'std', 'min', 'max', 'median', and 'count'.
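For reference, calling summarize() with no arguments falls back to the defaults described above: all previously calculated metrics (or the default metrics if nothing has been computed yet), mean and standard deviation as aggregations, and the library's default thresholds for the pass-rate columns. A minimal sketch using exp_batch from above:
# Defaults: metrics=None, agg_funcs=['mean', 'std'], library default thresholds for pass rates.
default_summary_df = exp_batch.summarize()
print(default_summary_df)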
The summarize() method provides a powerful way to quickly grasp the overall performance characteristics of your LLMs.
Registering Custom Metrics with register_metric()¶
GAICo's extensible design allows you to define your own custom metrics by inheriting from gaico.metrics.base.BaseMetric. The Experiment class now provides a register_metric() method, enabling you to seamlessly integrate these custom metrics into its streamlined workflow.
This means you can use your specialized metrics alongside GAICo's built-in ones in compare() and summarize() calls, without modifying the library's source code.
How to Register a Custom Metric:¶
- Define your custom metric class: Ensure it inherits from gaico.metrics.base.BaseMetric and implements the _single_calculate and _batch_calculate methods.
- Instantiate your Experiment: Create an Experiment object as usual.
- Call register_metric(): Pass a unique name for your metric along with its class.
from gaico import Experiment, BaseMetric
import pandas as pd
# 1. Define a simple custom metric (e.g., checks if generated text contains a specific keyword)
class KeywordPresenceMetric(BaseMetric):
    def __init__(self, keyword: str = "GAICo"):
        self.keyword = keyword.lower()

    def _single_calculate(self, generated_item: str, reference_item: str, **kwargs) -> float:
        # Score is 1.0 if the keyword is present, 0.0 otherwise
        return 1.0 if self.keyword in generated_item.lower() else 0.0

    def _batch_calculate(self, generated_items, reference_items, **kwargs) -> list[float]:
        return [self._single_calculate(gen, ref) for gen, ref in zip(generated_items, reference_items)]
# Example LLM responses
llm_responses_custom = {
    "Model_X": "This is a great library called GAICo.",
    "Model_Y": "I love using Python for AI.",
    "Model_Z": "GAICo makes evaluation easy!",
}
reference_answer_custom = "The reference text is not used by this specific metric, but is required by BaseMetric."
# 2. Initialize the Experiment
exp_custom = Experiment(
    llm_responses=llm_responses_custom,
    reference_answer=reference_answer_custom
)
# 3. Register your custom metric
exp_custom.register_metric("HasGAICoKeyword", KeywordPresenceMetric)
# Now, use your custom metric in compare() or summarize()
# Note: If your custom metric takes __init__ parameters, register a pre-configured
# instance or a factory; for simplicity, we rely on KeywordPresenceMetric's default here.
# For more complex initialization, consider a wrapper class or a factory function
# that returns a configured instance (a small wrapper-class sketch follows this section).
results_with_custom = exp_custom.compare(metrics=["Jaccard", "HasGAICoKeyword"])
print("\nResults with Custom Metric:")
print(results_with_custom)
# Expected Output (example):
# Results with Custom Metric:
# model_name metric_name score
# 0 Model_X Jaccard 0.25
# 1 Model_X HasGAICoKeyword 1.00
# 2 Model_Y Jaccard 0.14
# 3 Model_Y HasGAICoKeyword 0.00
# 4 Model_Z Jaccard 0.20
# 5 Model_Z HasGAICoKeyword 1.00
summary_with_custom = exp_custom.summarize(metrics=["HasGAICoKeyword"])
print("\nSummary with Custom Metric:")
print(summary_with_custom)
# Expected Output (example):
# Summary with Custom Metric:
# model_name HasGAICoKeyword_mean HasGAICoKeyword_std HasGAICoKeyword_min HasGAICoKeyword_max HasGAICoKeyword_pass_rate
# 0 Model_X 1.0 0.0 1.0 1.0 100.0
# 1 Model_Y 0.0 0.0 0.0 0.0 0.0
# 2 Model_Z 1.0 0.0 1.0 1.0 100.0
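As noted in the comments above, a custom metric that takes __init__ parameters can be registered through a small wrapper class that pins the desired configuration. This is an illustrative sketch (the wrapper name and keyword are hypothetical, not part of GAICo):
# Hypothetical wrapper that pins a non-default keyword for KeywordPresenceMetric.
class EvaluationKeywordMetric(KeywordPresenceMetric):
    def __init__(self):
        super().__init__(keyword="evaluation")

# register_metric() takes a name and a metric class, so the pre-configured wrapper
# can be registered just like any other custom metric.
exp_custom.register_metric("HasEvaluationKeyword", EvaluationKeywordMetric)
print(exp_custom.compare(metrics=["HasEvaluationKeyword"]))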
This dynamic registration capability significantly enhances GAICo's flexibility, letting you adapt evaluations to domain-specific requirements without touching the library's source code.
How Experiment Uses Metrics¶
Internally, the Experiment class instantiates and utilizes the individual metric classes (like JaccardSimilarity, ROUGE, BERTScore, etc.) found in the gaico.metrics module. It handles the iteration over your LLM responses, applies each specified metric to compare each response against the single reference_answer, and then aggregates these results into the output DataFrame.
For most common comparison tasks involving multiple models against a single reference, the Experiment class provides the most convenient and comprehensive interface. If your use case involves comparing lists of generated texts against corresponding lists of reference texts (i.e., pairwise comparisons across multiple examples), or if you need highly custom evaluation logic or want to use new, unregistered metrics on the fly, using the individual metric classes directly may be more appropriate.