The Experiment Class

The Experiment class in GAICo provides a streamlined, integrated workflow for evaluating and comparing multiple Large Language Model (LLM) responses against a single reference answer. It simplifies common tasks such as:

  • Calculating scores for multiple metrics across different LLM outputs.
  • Applying custom or default thresholds to these scores.
  • Generating comparative plots (bar charts for single metrics, radar charts for multiple metrics).
  • Creating CSV reports summarizing the evaluation.
  • Generating aggregated performance summaries (mean scores, pass rates).
  • Dynamically registering and using custom evaluation metrics.

Why Use the Experiment Class?

While you can use GAICo's individual metric classes directly for fine-grained control (see Using Metrics Directly), the Experiment class is ideal when:

  • You have responses from several LLMs (or different versions of the same LLM) that you want to compare against a common reference.
  • You want a quick way to get an overview of performance across multiple metrics.
  • You need to generate reports and visualizations with minimal boilerplate code.
  • You want to quickly see aggregated statistics (like average scores and pass rates) for your models.
  • You have custom metrics that you want to integrate into the streamlined Experiment workflow without modifying the core library.

It acts as a high-level orchestrator, making the end-to-end evaluation process more convenient.

Initializing an Experiment

To start, you need to instantiate the Experiment class. It requires two main pieces of information:

  1. llm_responses: A Python dictionary where keys are model names (strings) and values are the text responses generated by those models (strings).
  2. reference_answer: A single string representing the ground truth or reference text against which all LLM responses will be compared.

from gaico import Experiment

# Example LLM responses from different models
llm_responses = {
    "Google": "Title: Jimmy Kimmel Reacts to Donald Trump Winning the Presidential ... Snippet: Nov 6, 2024 ...",
    "Mixtral 8x7b": "I'm an Al and I don't have the ability to predict the outcome of elections.",
    "SafeChat": "Sorry, I am designed not to answer such a question.",
}

# The reference answer
reference_answer = "Sorry, I am unable to answer such a question as it is not appropriate."

# Initialize the Experiment
exp = Experiment(
    llm_responses=llm_responses,
    reference_answer=reference_answer
)

Comparing Models with compare()

The primary method for conducting the evaluation is compare(). This method orchestrates score calculation, plotting, threshold application, and CSV generation based on the parameters you provide.

# Basic comparison using default metrics, without plotting or CSV output
results_df = exp.compare()
print("Scores DataFrame (default metrics):")
print(results_df)

# A more comprehensive comparison:
# - Specify a subset of metrics
# - Enable plotting
# - Define custom thresholds for some metrics
# - Output results to a CSV file
results_df_custom = exp.compare(
    metrics=['Jaccard', 'ROUGE', 'Levenshtein'], # Specify metrics, or None for all defaults
    plot=True,                                  # Generate and show plots
    output_csv_path="my_experiment_report.csv", # Save a CSV report
    custom_thresholds={"Jaccard": 0.6, "ROUGE_rougeL": 0.35} # Optional: override default thresholds
)

print("\nScores DataFrame (custom run):")
print(results_df_custom)

Key Parameters of compare():

  • metrics (Optional List[str]):
  • A list of base metric names (e.g., "Jaccard", "ROUGE", "BERTScore") to calculate.
  • If None (default), all registered default metrics are used (currently: Jaccard, Cosine, Levenshtein, SequenceMatcher, BLEU, ROUGE, JSD, BERTScore).
  • Note: For metrics like ROUGE or BERTScore that produce multiple sub-scores (e.g., ROUGE_rouge1, ROUGE_rougeL, BERTScore_f1), specifying the base name (e.g., "ROUGE") will include all its sub-scores in the results.
  • plot (Optional bool, default False):
  • If True, generates and displays plots using Matplotlib/Seaborn.
  • If a single metric is evaluated, a bar chart comparing models on that metric is shown.
  • If multiple metrics are evaluated (at least 3, up to radar_metrics_limit), a radar chart is shown, providing a multi-dimensional comparison of models. If only 2 metrics are evaluated, or if more than radar_metrics_limit metrics are present, individual bar charts are shown for each metric.
  • custom_thresholds (Optional Dict[str, float]):
  • A dictionary to specify custom pass/fail thresholds for metrics.
  • Keys can be base metric names (e.g., "Jaccard") or specific "flattened" metric names as they appear in the output DataFrame (e.g., "ROUGE_rouge1", "BERTScore_f1").
  • These thresholds override the library's default thresholds for the specified metrics.
  • The thresholds are used to determine the "pass/fail" status in the CSV report.
  • output_csv_path (Optional str):
  • If a file path is provided, a CSV file is generated at this location (a short sketch of loading this report appears after this list).
  • The CSV report includes:
    • generated_text: The response from each LLM.
    • reference_text: The reference answer (repeated for each model).
    • Columns for each metric's score (e.g., Jaccard_score, ROUGE_rouge1_score).
    • Columns for each metric's pass/fail status based on the applied threshold (e.g., Jaccard_passed, ROUGE_rouge1_passed).
  • aggregate_func (Optional Callable):
  • An aggregation function (e.g., numpy.mean, numpy.median) used for plotting when multiple scores might exist per model/metric. For the standard Experiment use case (one response per model), this typically doesn't change the outcome of scores but is available for plot customization. Defaults to numpy.mean.
  • plot_title_suffix (Optional str, default "Comparison"):
  • A string suffix added to the titles of generated plots.
  • radar_metrics_limit (Optional int, default 12):
  • The maximum number of metrics to display on a single radar plot to maintain readability. If more metrics are present and a radar plot is applicable, only the first radar_metrics_limit are plotted on that radar chart.
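
For example, here is a minimal sketch of inspecting the CSV report with pandas, assuming the compare() call above wrote my_experiment_report.csv and that Jaccard was among the selected metrics (the exact column set depends on the metrics you choose):

import pandas as pd

# Load the report written by compare() and peek at its columns
report = pd.read_csv("my_experiment_report.csv")
print(report.columns.tolist())
# e.g. ['generated_text', 'reference_text', 'Jaccard_score', 'Jaccard_passed', ...]

# Inspect each model's response alongside its score and pass/fail status
print(report[["generated_text", "Jaccard_score", "Jaccard_passed"]])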

What compare() Returns

The compare() method returns a pandas DataFrame containing the calculated scores. The DataFrame typically has the following columns:

  • model_name: The name of the LLM (from the keys of your llm_responses dictionary).
  • metric_name: The specific metric calculated. This will be a "flattened" name if the base metric produces multiple scores (e.g., "Jaccard", "ROUGE_rouge1", "ROUGE_rougeL", "BERTScore_f1").
  • score: The numerical score for that model and metric, typically normalized between 0 and 1.

This DataFrame is useful for any further custom analysis, filtering, or reporting you might want to perform.
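
Because this DataFrame is in long format (one row per model/metric pair), a common next step is to pivot it into a wide, model-by-metric table. A minimal sketch, reusing results_df_custom from the compare() call above:

# Pivot long-format scores into a model x metric table for side-by-side reading
pivot_df = results_df_custom.pivot(index="model_name", columns="metric_name", values="score")
print(pivot_df.round(3))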

Accessing Scores Separately with to_dataframe()

If you only need the scores in a pandas DataFrame without triggering plots or CSV generation, you can use the to_dataframe() method. This can be useful for programmatic access to the scores.

# Get scores for Jaccard and Levenshtein only
scores_subset_df = exp.to_dataframe(metrics=['Jaccard', 'Levenshtein'])
print("\nSubset of scores:")
print(scores_subset_df)

# Get scores for all default metrics (if not already computed, they will be)
all_scores_df = exp.to_dataframe() # Equivalent to exp.to_dataframe(metrics=None)
print("\nAll default scores:")
print(all_scores_df)

This method is efficient as it uses cached scores if they have already been computed (e.g., by a previous call to compare() or an earlier call to to_dataframe()). If scores for requested metrics haven't been computed yet, to_dataframe() will calculate them.

Summarizing Results with summarize()

Beyond raw scores, you often need aggregated insights into model performance, such as average scores and pass rates. The summarize() method provides a convenient way to generate a clean, aggregated DataFrame.

This method is particularly useful when evaluating models on a dataset (batch mode), as it automatically aggregates scores across all items.

from gaico import Experiment
import pandas as pd

# Example: Batch evaluation setup (assuming you have lists of responses)
llm_responses_batch = {
    "Model_A": ["response A1", "response A2", "response A3"],
    "Model_B": ["response B1", "response B2", "response B3"],
}
reference_answers_batch = ["ref 1", "ref 2", "ref 3"]

exp_batch = Experiment(
    llm_responses=llm_responses_batch,
    reference_answer=reference_answers_batch
)

# Scores can be calculated beforehand with compare() or to_dataframe();
# summarize() will compute any requested metrics that have not been calculated yet.

# Get a summary of results, including mean, standard deviation, and pass rates
summary_df = exp_batch.summarize(
    metrics=['Jaccard', 'ROUGE'], # Specify metrics to summarize
    custom_thresholds={"Jaccard": 0.6, "ROUGE_rouge1": 0.35}, # Optional: custom thresholds for pass rates
    agg_funcs=['mean', 'std', 'min', 'max'] # Optional: specify aggregation functions
)
print("\nAggregated Summary DataFrame:")
print(summary_df)

# Expected Output (example, actual values depend on data and metrics):
# Aggregated Summary DataFrame:
#   model_name  Jaccard_mean  Jaccard_std  Jaccard_min  Jaccard_max  ROUGE_rouge1_mean  ROUGE_rouge1_std  ROUGE_rouge1_min  ROUGE_rouge1_max  Jaccard_pass_rate  ROUGE_rouge1_pass_rate
# 0    Model_A      0.750000     0.100000     0.650000     0.850000           0.700000          0.050000          0.650000          0.750000          66.666667               100.000000
# 1    Model_B      0.500000     0.050000     0.450000     0.550000           0.400000          0.020000          0.380000          0.420000           0.000000                33.333333

Key Parameters of summarize():

  • metrics (Optional List[str]):
  • A list of base metric names (e.g., "Jaccard", "ROUGE") to include in the summary.
  • If None (default), the summary covers all metrics already calculated by previous compare() or to_dataframe() calls, or all default metrics if none have been calculated yet.
  • custom_thresholds (Optional Dict[str, float]):
  • A dictionary to specify custom pass/fail thresholds for metrics. These are used to calculate the _pass_rate columns.
  • Keys can be base metric names (e.g., "Jaccard") or specific "flattened" metric names (e.g., "ROUGE_rouge1").
  • agg_funcs (Optional List[str], default ['mean', 'std']):
  • A list of aggregation function names (as strings) to apply to the scores. Common options include 'mean', 'std', 'min', 'max', 'median', 'count'.

The summarize() method provides a powerful way to quickly grasp the overall performance characteristics of your LLMs.
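
Because the summary is a regular pandas DataFrame, ranking models on an aggregated column takes one line. A small sketch, reusing summary_df from above (column names such as Jaccard_mean and Jaccard_pass_rate follow the example output):

# Rank models by mean Jaccard score, using pass rate as a tie-breaker
ranked = summary_df.sort_values(by=["Jaccard_mean", "Jaccard_pass_rate"], ascending=False)
print(ranked[["model_name", "Jaccard_mean", "Jaccard_pass_rate"]])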

Registering Custom Metrics with register_metric()

GAICo's extensible design allows you to define your own custom metrics by inheriting from gaico.metrics.base.BaseMetric. The Experiment class now provides a register_metric() method, enabling you to seamlessly integrate these custom metrics into its streamlined workflow.

This means you can use your specialized metrics alongside GAICo's built-in ones in compare() and summarize() calls, without modifying the library's source code.

How to Register a Custom Metric:

  1. Define your custom metric class: Ensure it inherits from gaico.metrics.base.BaseMetric and implements the _single_calculate and _batch_calculate methods.
  2. Instantiate your Experiment: Create an Experiment object as usual.
  3. Call register_metric(): Pass a unique name for your metric and its class.

from gaico import Experiment, BaseMetric
import pandas as pd

# 1. Define a simple custom metric (e.g., checks if generated text contains a specific keyword)
class KeywordPresenceMetric(BaseMetric):
    def __init__(self, keyword: str = "GAICo"):
        self.keyword = keyword.lower()

    def _single_calculate(self, generated_item: str, reference_item: str, **kwargs) -> float:
        # Score is 1.0 if keyword is present, 0.0 otherwise
        return 1.0 if self.keyword in generated_item.lower() else 0.0

    def _batch_calculate(self, generated_items, reference_items, **kwargs) -> list[float]:
        return [self._single_calculate(gen, ref) for gen, ref in zip(generated_items, reference_items)]

# Example LLM responses
llm_responses_custom = {
    "Model_X": "This is a great library called GAICo.",
    "Model_Y": "I love using Python for AI.",
    "Model_Z": "GAICo makes evaluation easy!",
}
reference_answer_custom = "The reference text is not used by this specific metric, but is required by BaseMetric."

# 2. Initialize the Experiment
exp_custom = Experiment(
    llm_responses=llm_responses_custom,
    reference_answer=reference_answer_custom
)

# 3. Register your custom metric
exp_custom.register_metric("HasGAICoKeyword", KeywordPresenceMetric)

# Now, use your custom metric in compare() or summarize().
# Note: register_metric() receives the class, so the default
# KeywordPresenceMetric() configuration is used here. If your custom metric
# takes __init__ parameters, register a pre-configured instance, a factory,
# or a thin wrapper class with the parameters baked in (see the sketch after
# this example).

results_with_custom = exp_custom.compare(metrics=["Jaccard", "HasGAICoKeyword"])
print("\nResults with Custom Metric:")
print(results_with_custom)

# Expected Output (example):
# Results with Custom Metric:
#   model_name      metric_name  score
# 0    Model_X          Jaccard   0.25
# 1    Model_X  HasGAICoKeyword   1.00
# 2    Model_Y          Jaccard   0.14
# 3    Model_Y  HasGAICoKeyword   0.00
# 4    Model_Z          Jaccard   0.20
# 5    Model_Z  HasGAICoKeyword   1.00

summary_with_custom = exp_custom.summarize(metrics=["HasGAICoKeyword"])
print("\nSummary with Custom Metric:")
print(summary_with_custom)

# Expected Output (example):
# Summary with Custom Metric:
#   model_name  HasGAICoKeyword_mean  HasGAICoKeyword_std  HasGAICoKeyword_min  HasGAICoKeyword_max  HasGAICoKeyword_pass_rate
# 0    Model_X                   1.0                  0.0                  1.0                  1.0                      100.0
# 1    Model_Y                   0.0                  0.0                  0.0                  0.0                        0.0
# 2    Model_Z                   1.0                  0.0                  1.0                  1.0                      100.0
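
If your custom metric does take constructor arguments, one option noted in the comments above is a thin wrapper class that bakes them in, so the wrapper can be registered like any other class. A minimal sketch (the wrapper name and keyword below are illustrative, not part of GAICo):

# A thin wrapper that pre-configures KeywordPresenceMetric with a non-default
# keyword, so the zero-argument class can be passed to register_metric() as usual.
class PythonKeywordMetric(KeywordPresenceMetric):
    def __init__(self):
        super().__init__(keyword="Python")

exp_custom.register_metric("HasPythonKeyword", PythonKeywordMetric)
print(exp_custom.compare(metrics=["HasPythonKeyword"]))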

This dynamic registration capability significantly enhances GAICo's flexibility, allowing it to adapt to virtually any domain-specific evaluation requirement.

How Experiment Uses Metrics

Internally, the Experiment class instantiates and utilizes the individual metric classes (like JaccardSimilarity, ROUGE, BERTScore, etc.) found in the gaico.metrics module. It handles the iteration over your LLM responses, applies each specified metric to compare each response against the single reference_answer, and then aggregates these results into the output DataFrame.

For most comparison tasks involving multiple models and a single reference, the Experiment class provides the most convenient and comprehensive interface. If your use case involves comparing lists of generated texts against corresponding lists of reference texts (i.e., pair-wise comparisons over multiple examples), or if you need highly custom evaluation logic or want to use metrics without registering them, using the individual metric classes directly may be more appropriate; a brief sketch follows.
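
A minimal sketch of this pair-wise usage, assuming the metric classes expose a public calculate() method that wraps _single_calculate/_batch_calculate and accepts either single strings or equal-length lists (see Using Metrics Directly for the exact API):

from gaico.metrics import JaccardSimilarity

generated = ["response A1", "response A2", "response A3"]
references = ["ref 1", "ref 2", "ref 3"]

# Assumed entry point: calculate() scoring each generated/reference pair
jaccard = JaccardSimilarity()
scores = jaccard.calculate(generated, references)
print(scores)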