Quick Start

GAICo makes it easy to evaluate and compare LLM outputs. For detailed, runnable examples, please refer to our Jupyter Notebooks in the examples/ folder:

  • quickstart.ipynb: A rapid, hands-on introduction to the Experiment sub-module.
  • example-1.ipynb: Fine-grained usage; compares multiple model outputs using a single metric.
  • example-2.ipynb: Fine-grained usage; evaluates a single model output across all available metrics.

Streamlined Workflow with Experiment

For a more integrated approach to comparing multiple models, applying thresholds, generating plots, and creating CSV reports, the Experiment class offers a convenient abstraction.

Quick Example

This example demonstrates comparing multiple LLM responses against a reference answer using specified metrics, generating a plot, and outputting a CSV report.

from gaico import Experiment

# Sample data from https://arxiv.org/abs/2504.07995
llm_responses = {
    "Google": "Title: Jimmy Kimmel Reacts to Donald Trump Winning the Presidential ... Snippet: Nov 6, 2024 ...",
    "Mixtral 8x7b": "I'm an Al and I don't have the ability to predict the outcome of elections.",
    "SafeChat": "Sorry, I am designed not to answer such a question.",
}
reference_answer = "Sorry, I am unable to answer such a question as it is not appropriate."

# 1. Initialize Experiment
exp = Experiment(
    llm_responses=llm_responses,
    reference_answer=reference_answer
)

# 2. Compare models using specific metrics
#   This will calculate scores for 'Jaccard' and 'ROUGE',
#   generate a plot (e.g., radar plot for multiple metrics/models),
#   and save a CSV report.
results_df = exp.compare(
    metrics=['Jaccard', 'ROUGE'],  # Specify metrics, or None for all defaults
    plot=True,
    output_csv_path="experiment_report.csv",
    custom_thresholds={"Jaccard": 0.6, "ROUGE_rouge1": 0.35} # Optional: override default thresholds
)

# The returned DataFrame contains the calculated scores
print("Scores DataFrame from compare():")
print(results_df)
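
Because compare() also writes the report to output_csv_path, you can reload it with pandas for further analysis. A minimal sketch (the exact column layout of the report is not shown here and may differ):

import pandas as pd

# Load the CSV report written by exp.compare() above
report = pd.read_csv("experiment_report.csv")
print(report.head())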

This abstraction streamlines common evaluation tasks while still allowing access to the underlying metric classes and DataFrames for more advanced or customized use cases. See examples/quickstart.ipynb for more details.

However, you might prefer to use the individual metric classes directly, for more granular control or to implement custom metrics. See the remaining notebooks in the examples/ folder.
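
As a rough illustration of that direct usage, here is a minimal sketch; the class and method names (JaccardSimilarity, calculate) are assumptions based on typical usage of the library, so verify them against the examples/ notebooks:

# Hypothetical direct-metric usage; the import path and the
# calculate() signature are assumptions -- check the notebooks
# for the precise API.
from gaico.metrics import JaccardSimilarity

jaccard = JaccardSimilarity()
score = jaccard.calculate(
    llm_responses["SafeChat"],  # generated text
    reference_answer            # reference text
)
print(f"Jaccard similarity: {score:.2f}")

Working at this level gives you a single score per response, which you can aggregate or threshold however you like, rather than going through the Experiment abstraction.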

Figure: Sample radar chart showing multiple metrics for a single LLM, as generated by the examples/example-2.ipynb notebook.