# The Experiment Class

The `Experiment` class in GAICo provides a streamlined, integrated workflow for evaluating and comparing multiple Large Language Model (LLM) responses against a single reference answer. It simplifies common tasks such as:
- Calculating scores for multiple metrics across different LLM outputs.
- Applying custom or default thresholds to these scores.
- Generating comparative plots (bar charts for single metrics, radar charts for multiple metrics).
- Creating CSV reports summarizing the evaluation.
## Why Use the `Experiment` Class?
While you can use GAICo's individual metric classes directly for fine-grained control (see Using Metrics Directly), the `Experiment` class is ideal when:
- You have responses from several LLMs (or different versions of the same LLM) that you want to compare against a common reference.
- You want a quick way to get an overview of performance across multiple metrics.
- You need to generate reports and visualizations with minimal boilerplate code.
It acts as a high-level orchestrator, making the end-to-end evaluation process more convenient.
## Initializing an Experiment
To start, you need to instantiate the `Experiment` class. It requires two main pieces of information:

- `llm_responses`: A Python dictionary where keys are model names (strings) and values are the text responses generated by those models (strings).
- `reference_answer`: A single string representing the ground truth or reference text against which all LLM responses will be compared.
```python
from gaico import Experiment

# Example LLM responses from different models
llm_responses = {
    "Google": "Title: Jimmy Kimmel Reacts to Donald Trump Winning the Presidential ... Snippet: Nov 6, 2024 ...",
    "Mixtral 8x7b": "I'm an AI and I don't have the ability to predict the outcome of elections.",
    "SafeChat": "Sorry, I am designed not to answer such a question.",
}

# The reference answer
reference_answer = "Sorry, I am unable to answer such a question as it is not appropriate."

# Initialize the Experiment
exp = Experiment(
    llm_responses=llm_responses,
    reference_answer=reference_answer,
)
```
## Comparing Models with `compare()`
The primary method for conducting the evaluation is `compare()`. This method orchestrates score calculation, plotting, threshold application, and CSV generation based on the parameters you provide.
```python
# Basic comparison using default metrics, without plotting or CSV output
results_df = exp.compare()
print("Scores DataFrame (default metrics):")
print(results_df)

# A more comprehensive comparison:
# - Specify a subset of metrics
# - Enable plotting
# - Define custom thresholds for some metrics
# - Output results to a CSV file
results_df_custom = exp.compare(
    metrics=['Jaccard', 'ROUGE', 'Levenshtein'],  # Specify metrics, or None for all defaults
    plot=True,  # Generate and show plots
    output_csv_path="my_experiment_report.csv",  # Save a CSV report
    custom_thresholds={"Jaccard": 0.6, "ROUGE_rougeL": 0.35},  # Optional: override default thresholds
)

print("\nScores DataFrame (custom run):")
print(results_df_custom)
```
### Key Parameters of `compare()`
- `metrics` (`Optional[List[str]]`):
    - A list of base metric names (e.g., `"Jaccard"`, `"ROUGE"`, `"BERTScore"`) to calculate.
    - If `None` (default), all registered default metrics are used (currently: Jaccard, Cosine, Levenshtein, SequenceMatcher, BLEU, ROUGE, JSD, BERTScore).
    - Note: For metrics like ROUGE or BERTScore that produce multiple sub-scores (e.g., `ROUGE_rouge1`, `ROUGE_rougeL`, `BERTScore_f1`), specifying the base name (e.g., `"ROUGE"`) will include all of its sub-scores in the results.
- `plot` (`Optional[bool]`, default `False`):
    - If `True`, generates and displays plots using Matplotlib/Seaborn.
    - If one metric is evaluated, a bar chart comparing models for that metric is shown.
    - If multiple metrics are evaluated (at least 3, up to `radar_metrics_limit`), a radar chart is shown, providing a multi-dimensional comparison of models. If fewer than 3 (but more than 1) or more than `radar_metrics_limit` metrics are present, individual bar charts are shown for each metric.
- `custom_thresholds` (`Optional[Dict[str, float]]`):
    - A dictionary to specify custom pass/fail thresholds for metrics.
    - Keys can be base metric names (e.g., `"Jaccard"`) or specific "flattened" metric names as they appear in the output DataFrame (e.g., `"ROUGE_rouge1"`, `"BERTScore_f1"`).
    - These thresholds override the library's default thresholds for the specified metrics.
    - The thresholds are used to determine the "pass/fail" status in the CSV report.
- `output_csv_path` (`Optional[str]`):
    - If a file path is provided, a CSV file is generated at this location (see the sketch after this parameter list for reading it back).
    - The CSV report includes:
        - `generated_text`: The response from each LLM.
        - `reference_text`: The reference answer (repeated for each model).
        - Columns for each metric's score (e.g., `Jaccard_score`, `ROUGE_rouge1_score`).
        - Columns for each metric's pass/fail status based on the applied threshold (e.g., `Jaccard_passed`, `ROUGE_rouge1_passed`).
- `aggregate_func` (`Optional[Callable]`):
    - An aggregation function (e.g., `numpy.mean`, `numpy.median`) used for plotting when multiple scores might exist per model/metric pair. For the standard `Experiment` use case (one response per model), this typically does not change the scores, but it is available for plot customization. Defaults to `numpy.mean`.
- `plot_title_suffix` (`Optional[str]`, default `"Comparison"`):
    - A string suffix added to the titles of generated plots.
- `radar_metrics_limit` (`Optional[int]`, default `12`):
    - The maximum number of metrics to display on a single radar chart, to maintain readability. If more metrics are present and a radar chart is applicable, only the first `radar_metrics_limit` metrics are plotted on that chart.
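The sketch below pulls these parameters together, reusing the `exp` object from the initialization example. It is a minimal illustration rather than a recommended configuration: the aggregation function, title suffix, CSV filename, and radar limit are arbitrary choices, and the `_passed` column filter relies on the CSV column naming described above.

```python
import numpy as np
import pandas as pd

# A fuller compare() call, combining the parameters documented above.
results_df_full = exp.compare(
    metrics=["Jaccard", "ROUGE", "Levenshtein"],
    plot=True,
    custom_thresholds={"Jaccard": 0.6},
    output_csv_path="experiment_report.csv",  # write a CSV report to this path
    aggregate_func=np.median,                 # aggregation used when plotting
    plot_title_suffix="Election QA",          # appended to plot titles
    radar_metrics_limit=8,                    # cap metrics shown on a radar chart
)

# Read the CSV report back and inspect only the pass/fail columns.
report = pd.read_csv("experiment_report.csv")
print(report.filter(like="_passed"))
```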
## What `compare()` Returns
The `compare()` method returns a pandas DataFrame containing the calculated scores. The DataFrame typically has the following columns:

- `model_name`: The name of the LLM (from the keys of your `llm_responses` dictionary).
- `metric_name`: The specific metric calculated. This will be a "flattened" name if the base metric produces multiple scores (e.g., `"Jaccard"`, `"ROUGE_rouge1"`, `"ROUGE_rougeL"`, `"BERTScore_f1"`).
- `score`: The numerical score for that model and metric, typically normalized between 0 and 1.
This DataFrame is useful for any further custom analysis, filtering, or reporting you might want to perform.
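Because the DataFrame is in long format (one row per model/metric pair), standard pandas operations cover most follow-up analysis. A minimal sketch, assuming the `results_df` returned by `compare()` above:

```python
# Pivot the long-format scores into a model x metric table for easier reading.
pivot_df = results_df.pivot(index="model_name", columns="metric_name", values="score")
print(pivot_df.round(3))

# Example filter: rows for the Jaccard metric where the score exceeds 0.5
# (assumes Jaccard was among the metrics computed in the run).
jaccard_df = results_df[results_df["metric_name"] == "Jaccard"]
print(jaccard_df[jaccard_df["score"] > 0.5])
```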
## Accessing Scores Separately with `to_dataframe()`
If you only need the scores in a pandas DataFrame, without triggering plots or CSV generation, you can use the `to_dataframe()` method. This can be useful for programmatic access to the scores.
```python
# Get scores for Jaccard and Levenshtein only
scores_subset_df = exp.to_dataframe(metrics=['Jaccard', 'Levenshtein'])
print("\nSubset of scores:")
print(scores_subset_df)

# Get scores for all default metrics (if not already computed, they will be)
all_scores_df = exp.to_dataframe()  # Equivalent to exp.to_dataframe(metrics=None)
print("\nAll default scores:")
print(all_scores_df)
```
`to_dataframe()` reuses any scores that have already been computed (for example, by a previous call to `compare()` or an earlier call to `to_dataframe()`). If scores for requested metrics haven't been computed yet, `to_dataframe()` will calculate them.
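A small sketch of that behaviour, continuing from the calls above (the comment reflects the reuse described here, not additional guarantees):

```python
# Jaccard was already computed by the earlier compare()/to_dataframe() calls,
# so it is reused; BERTScore has not been computed yet, so it is calculated now.
mixed_df = exp.to_dataframe(metrics=['Jaccard', 'BERTScore'])
print(mixed_df)
```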
## How `Experiment` Uses Metrics
Internally, the `Experiment` class instantiates and utilizes the individual metric classes (like `JaccardSimilarity`, `ROUGE`, `BERTScore`, etc.) found in the `gaico.metrics` module. It handles the iteration over your LLM responses, applies each specified metric to compare each response against the single `reference_answer`, and then aggregates these results into the output DataFrame.
For most common comparison tasks involving multiple models against a single reference, the `Experiment` class provides the most convenient and comprehensive interface. If your use case involves comparing lists of generated texts against corresponding lists of reference texts (i.e., pairwise comparisons across multiple examples), or if you need to implement highly custom evaluation logic or integrate new, unregistered metrics on the fly, using the individual metric classes directly might be more appropriate.
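As a rough sketch of that direct route, the example below calls a metric class on paired lists of generated and reference texts. The method name `calculate` and the list-based inputs are assumptions for illustration, not a confirmed signature; check the Using Metrics Directly guide for the exact API.

```python
from gaico.metrics import JaccardSimilarity  # individual metric class from gaico.metrics

generated_texts = [
    "Sorry, I am designed not to answer such a question.",
    "I cannot help with that request.",
]
reference_texts = [
    "Sorry, I am unable to answer such a question as it is not appropriate.",
    "I am not able to assist with this request.",
]

jaccard = JaccardSimilarity()
# NOTE: `calculate` and the batched (list) inputs are assumed here for
# illustration; see the Using Metrics Directly guide for the real signature.
scores = jaccard.calculate(generated_texts, reference_texts)
print(scores)
```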