Experiment Class

gaico.Experiment

An abstraction that simplifies comparing LLM responses against a reference answer using various metrics, with built-in support for plotting, applying thresholds, and generating CSV reports.

__init__

__init__(llm_responses, reference_answer)

Initializes the Experiment.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| llm_responses | Dict[str, Any] | A dictionary mapping model names (str) to their generated outputs (Any). | required |
| reference_answer | Optional[Any] | A single reference output (Any) to compare against. If None, the output from the first model in llm_responses will be used as the reference. | required |

Raises:

| Type | Description |
|------|-------------|
| TypeError | If llm_responses is not a dictionary. |
| ValueError | If llm_responses does not contain string keys, or if it is empty when reference_answer is None. |
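A minimal construction sketch. The model names, response strings, and reference answer below are illustrative, and the import assumes the class is exposed at the package root as gaico.Experiment:

```python
from gaico import Experiment

# Map each model name to its generated output.
llm_responses = {
    "gpt-4o": "The Eiffel Tower is located in Paris, France.",
    "llama-3": "The Eiffel Tower stands in Paris.",
}

# Compare every model against a single reference answer.
# If reference_answer were None, the first model's output would be used as the reference.
exp = Experiment(
    llm_responses=llm_responses,
    reference_answer="The Eiffel Tower is in Paris, France.",
)
```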

to_dataframe

to_dataframe(metrics=None)

Returns a DataFrame of scores for the specified metrics. If metrics is None, scores for all default metrics are returned.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| metrics | Optional[List[str]] | A list of base metric names (e.g., "Jaccard", "ROUGE"). Defaults to None. | None |

Returns:

| Type | Description |
|------|-------------|
| pd.DataFrame | A pandas DataFrame with columns "model_name", "metric_name", and "score". "metric_name" will contain flat metric names (e.g., "ROUGE_rouge1"). |
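A usage sketch, continuing the exp object from the constructor example above. The metric names follow the examples given in this documentation; which metrics are actually available depends on the installed metric registry:

```python
# Scores for all default metrics, in long format:
# one row per (model_name, metric_name) pair with its score.
all_scores = exp.to_dataframe()

# Restrict the output to a subset of base metrics.
subset = exp.to_dataframe(metrics=["Jaccard", "ROUGE"])
print(subset.head())
```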

compare

compare(metrics=None, plot=False, custom_thresholds=None, output_csv_path=None, aggregate_func=None, plot_title_suffix='Comparison', radar_metrics_limit=12)

Compares models based on specified metrics, optionally plotting and generating a CSV.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| metrics | Optional[List[str]] | List of base metric names. If None, uses all default registered metrics. | None |
| plot | bool | If True, generates and shows plots. Defaults to False. | False |
| custom_thresholds | Optional[Dict[str, float]] | Dictionary of metric names (base or flat) to threshold values. Overrides default thresholds. | None |
| output_csv_path | Optional[str] | If provided, path to save a CSV report of thresholded results. | None |
| aggregate_func | Optional[Callable] | Aggregation function (e.g., np.mean, np.median) used when plotting if multiple scores exist per model/metric (not typical for the default setup). | None |
| plot_title_suffix | str | Suffix for plot titles. | 'Comparison' |
| radar_metrics_limit | int | Maximum number of metrics in a radar plot, to maintain readability. | 12 |

Returns:

| Type | Description |
|------|-------------|
| Optional[pd.DataFrame] | A pandas DataFrame containing the scores for the compared metrics, or None if there are no valid metrics. |
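A fuller comparison sketch, again continuing the exp object from above. All parameters shown are documented in this section; the threshold value, output path, and title suffix are illustrative choices, not library defaults:

```python
import numpy as np

results_df = exp.compare(
    metrics=["Jaccard", "ROUGE"],            # base metric names to compare
    plot=True,                               # generate and show plots
    custom_thresholds={"Jaccard": 0.5},      # override the default Jaccard threshold (illustrative value)
    output_csv_path="experiment_report.csv", # save a CSV report of thresholded results (illustrative path)
    aggregate_func=np.mean,                  # aggregation for plotting when multiple scores exist per model/metric
    plot_title_suffix="Paris Benchmark",     # appended to plot titles (illustrative)
)

# compare() returns None when no valid metrics are available.
if results_df is not None:
    print(results_df)
```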