The Experiment Class

The Experiment class in GAICo provides a streamlined and integrated workflow for evaluating and comparing multiple Large Language Model (LLM) responses against a single reference answer. It simplifies common tasks such as:

  • Calculating scores for multiple metrics across different LLM outputs.
  • Applying custom or default thresholds to these scores.
  • Generating comparative plots (bar charts for single metrics, radar charts for multiple metrics).
  • Creating CSV reports summarizing the evaluation.

Why Use the Experiment Class?

While you can use GAICo's individual metric classes directly for fine-grained control (see Using Metrics Directly), the Experiment class is ideal when:

  • You have responses from several LLMs (or different versions of the same LLM) that you want to compare against a common reference.
  • You want a quick way to get an overview of performance across multiple metrics.
  • You need to generate reports and visualizations with minimal boilerplate code.

It acts as a high-level orchestrator, making the end-to-end evaluation process more convenient.

Initializing an Experiment

To start, you need to instantiate the Experiment class. It requires two main pieces of information:

  1. llm_responses: A Python dictionary where keys are model names (strings) and values are the text responses generated by those models (strings).
  2. reference_answer: A single string representing the ground truth or reference text against which all LLM responses will be compared.

from gaico import Experiment

# Example LLM responses from different models
llm_responses = {
    "Google": "Title: Jimmy Kimmel Reacts to Donald Trump Winning the Presidential ... Snippet: Nov 6, 2024 ...",
    "Mixtral 8x7b": "I'm an Al and I don't have the ability to predict the outcome of elections.",
    "SafeChat": "Sorry, I am designed not to answer such a question.",
}

# The reference answer
reference_answer = "Sorry, I am unable to answer such a question as it is not appropriate."

# Initialize the Experiment
exp = Experiment(
    llm_responses=llm_responses,
    reference_answer=reference_answer
)

Comparing Models with compare()

The primary method for conducting the evaluation is compare(). This method orchestrates score calculation, plotting, threshold application, and CSV generation based on the parameters you provide.

# Basic comparison using default metrics, without plotting or CSV output
results_df = exp.compare()
print("Scores DataFrame (default metrics):")
print(results_df)

# A more comprehensive comparison:
# - Specify a subset of metrics
# - Enable plotting
# - Define custom thresholds for some metrics
# - Output results to a CSV file
results_df_custom = exp.compare(
    metrics=['Jaccard', 'ROUGE', 'Levenshtein'], # Specify metrics, or None for all defaults
    plot=True,                                  # Generate and show plots
    output_csv_path="my_experiment_report.csv", # Save a CSV report
    custom_thresholds={"Jaccard": 0.6, "ROUGE_rougeL": 0.35} # Optional: override default thresholds
)

print("\nScores DataFrame (custom run):")
print(results_df_custom)

Key Parameters of compare():

  • metrics (Optional List[str]):
    • A list of base metric names (e.g., "Jaccard", "ROUGE", "BERTScore") to calculate.
    • If None (default), all registered default metrics are used (currently: Jaccard, Cosine, Levenshtein, SequenceMatcher, BLEU, ROUGE, JSD, BERTScore).
    • Note: For metrics like ROUGE or BERTScore that produce multiple sub-scores (e.g., ROUGE_rouge1, ROUGE_rougeL, BERTScore_f1), specifying the base name (e.g., "ROUGE") will include all its sub-scores in the results.
  • plot (Optional bool, default False):
    • If True, generates and displays plots using Matplotlib/Seaborn.
    • If exactly one metric is evaluated, a bar chart comparing models on that metric is shown.
    • If three or more metrics are evaluated (up to radar_metrics_limit), a radar chart is shown, providing a multi-dimensional comparison of models. If only two metrics are evaluated, or more than radar_metrics_limit, individual bar charts are shown for each metric instead.
  • custom_thresholds (Optional Dict[str, float]):
    • A dictionary to specify custom pass/fail thresholds for metrics.
    • Keys can be base metric names (e.g., "Jaccard") or specific "flattened" metric names as they appear in the output DataFrame (e.g., "ROUGE_rouge1", "BERTScore_f1").
    • These thresholds override the library's default thresholds for the specified metrics.
    • The thresholds are used to determine the "pass/fail" status in the CSV report.
  • output_csv_path (Optional str):
    • If a file path is provided, a CSV file is generated at this location.
    • The CSV report includes:
      • generated_text: The response from each LLM.
      • reference_text: The reference answer (repeated for each model).
      • Columns for each metric's score (e.g., Jaccard_score, ROUGE_rouge1_score).
      • Columns for each metric's pass/fail status based on the applied threshold (e.g., Jaccard_passed, ROUGE_rouge1_passed).
  • aggregate_func (Optional Callable):
    • An aggregation function (e.g., numpy.mean, numpy.median) used for plotting when multiple scores might exist per model/metric. For the standard Experiment use case (one response per model), this typically doesn't change the outcome of scores but is available for plot customization. Defaults to numpy.mean.
  • plot_title_suffix (Optional str, default "Comparison"):
    • A string suffix added to the titles of generated plots.
  • radar_metrics_limit (Optional int, default 12):
    • The maximum number of metrics to display on a single radar plot, to keep it readable. If a radar plot is applicable but more metrics are present, only the first radar_metrics_limit metrics are plotted on that radar chart (see the usage sketch after this list).
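
For example, here is a minimal sketch combining several of these parameters, reusing the exp object from above. The file name full_report.csv and the choice of numpy.median are illustrative; the *_score and *_passed column names follow the CSV layout described above.

import numpy as np
import pandas as pd

# Illustrative: combine aggregation, plotting, and reporting parameters
results_df_full = exp.compare(
    metrics=None,                        # None -> all registered default metrics
    plot=True,                           # generate plots (radar chart when applicable)
    aggregate_func=np.median,            # aggregation used for plotting
    plot_title_suffix="Median Scores",   # appended to plot titles
    radar_metrics_limit=8,               # cap the number of axes on a radar chart
    output_csv_path="full_report.csv",   # also write the CSV report
)

# Inspect the generated CSV report and its pass/fail columns
report_df = pd.read_csv("full_report.csv")
print(report_df.columns.tolist())
print(report_df[[c for c in report_df.columns if c.endswith("_passed")]])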

What compare() Returns

The compare() method returns a pandas DataFrame containing the calculated scores. The DataFrame typically has the following columns:

  • model_name: The name of the LLM (from the keys of your llm_responses dictionary).
  • metric_name: The specific metric calculated. This will be a "flattened" name if the base metric produces multiple scores (e.g., "Jaccard", "ROUGE_rouge1", "ROUGE_rougeL", "BERTScore_f1").
  • score: The numerical score for that model and metric, typically normalized between 0 and 1.

This DataFrame is useful for any further custom analysis, filtering, or reporting you might want to perform.
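
Because this is a regular pandas DataFrame in long format, standard pandas operations apply directly. For example, a small sketch (using the results_df_custom DataFrame from the earlier compare() call) that pivots the scores into a models-by-metrics table:

# Pivot the long-format results into a models x metrics table
pivot_df = results_df_custom.pivot(index="model_name", columns="metric_name", values="score")
print(pivot_df.round(3))

# Example: rank models by ROUGE_rougeL (flattened column name, if present)
if "ROUGE_rougeL" in pivot_df.columns:
    print(pivot_df["ROUGE_rougeL"].sort_values(ascending=False))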

Accessing Scores Separately with to_dataframe()

If you only need the scores in a pandas DataFrame without triggering plots or CSV generation, you can use the to_dataframe() method. This can be useful for programmatic access to the scores.

# Get scores for Jaccard and Levenshtein only
scores_subset_df = exp.to_dataframe(metrics=['Jaccard', 'Levenshtein'])
print("\nSubset of scores:")
print(scores_subset_df)

# Get scores for all default metrics (if not already computed, they will be)
all_scores_df = exp.to_dataframe() # Equivalent to exp.to_dataframe(metrics=None)
print("\nAll default scores:")
print(all_scores_df)

This method is efficient as it uses cached scores if they have already been computed (e.g., by a previous call to compare() or an earlier call to to_dataframe()). If scores for requested metrics haven't been computed yet, to_dataframe() will calculate them.

How Experiment Uses Metrics

Internally, the Experiment class instantiates and utilizes the individual metric classes (like JaccardSimilarity, ROUGE, BERTScore, etc.) found in the gaico.metrics module. It handles the iteration over your LLM responses, applies each specified metric to compare each response against the single reference_answer, and then aggregates these results into the output DataFrame.

For most common comparison tasks involving multiple models and a single reference, the Experiment class provides the most convenient and comprehensive interface. If your use case involves comparing lists of generated texts against corresponding lists of reference texts (i.e., pair-wise comparisons across multiple examples), or if you need highly custom evaluation logic or want to integrate new, unregistered metrics on the fly, using the individual metric classes directly may be more appropriate.
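
As a rough illustration of that direct usage, the sketch below applies one of the metric classes named above to paired lists of generated and reference texts. The calculate() method name and the example texts are assumptions made here for illustration; see Using Metrics Directly for the exact API.

# Hypothetical sketch of direct, pair-wise metric usage
# (the `calculate` method name is assumed; check "Using Metrics Directly" for the real API)
from gaico.metrics import JaccardSimilarity

generated_texts = [
    "Sorry, I am designed not to answer such a question.",
    "I'm an AI and I don't have the ability to predict the outcome of elections.",
]
reference_texts = [
    "Sorry, I am unable to answer such a question as it is not appropriate.",
    "I cannot answer this question.",
]

jaccard = JaccardSimilarity()
scores = jaccard.calculate(generated_texts, reference_texts)  # assumed: one score per pair
print(scores)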