Experiment Class

gaico.Experiment

An abstraction to simplify plotting, applying thresholds, and generating CSVs for comparing LLM responses against reference answers using various metrics.

__init__

__init__(llm_responses, reference_answer)

Initializes the Experiment for single or batch evaluation.

Parameters:

    llm_responses (Dict[str, Any], required):
        A dictionary mapping model names (str) to their generated outputs.
        For batch evaluation, values should be lists of outputs, e.g.,
        {"ModelA": ["resp1", "resp2"], "ModelB": ["resp1", "resp2"]}.

    reference_answer (Optional[Any], required):
        A single reference output, or a list of references for batch
        evaluation. If None, the output(s) from the first model will be
        used as the reference.

Raises:

    TypeError:
        If llm_responses is not a dictionary.

    ValueError:
        If inputs are inconsistent (e.g., mixing single and list-like
        responses).
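For illustration, a minimal construction sketch for batch evaluation. The model names and strings are invented, and the import is guarded only so the snippet degrades gracefully when gaico is not installed:

```python
# Batch evaluation: each model maps to a list of outputs, and the
# reference is a list of the same length.
llm_responses = {
    "ModelA": ["The sky is blue.", "Water boils at 100 C."],
    "ModelB": ["The sky is teal.", "Water boils at 90 C."],
}
reference_answers = ["The sky is blue.", "Water boils at 100 C."]

# Consistency matters: mixing single strings with lists, or lists of
# differing lengths, raises ValueError as documented above.
assert all(len(v) == len(reference_answers) for v in llm_responses.values())

try:
    from gaico import Experiment  # top-level export assumed
    exp = Experiment(llm_responses=llm_responses,
                     reference_answer=reference_answers)
except ImportError:
    pass  # gaico not installed; the data shapes above still show the contract
```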

to_dataframe

to_dataframe(metrics=None)

Returns a DataFrame of scores for the specified metrics. If metrics is None, scores for all default metrics are returned.

Parameters:

    metrics (Optional[List[str]], default None):
        A list of base metric names (e.g., "Jaccard", "ROUGE"). If None,
        scores for all default metrics are returned.

Returns:

    pd.DataFrame:
        A pandas DataFrame with columns "model_name", "metric_name", and
        "score". "metric_name" will contain flat metric names (e.g.,
        "ROUGE_rouge1").
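To make the documented output shape concrete, here is a hand-built pandas frame mirroring the long ("tidy") format described above, with invented scores, plus a common follow-up pivot to one column per metric:

```python
import pandas as pd

# One row per (model, metric) pair, with flat metric names such as
# "ROUGE_rouge1" -- the shape to_dataframe() is documented to return.
df = pd.DataFrame({
    "model_name": ["ModelA", "ModelA", "ModelB", "ModelB"],
    "metric_name": ["Jaccard", "ROUGE_rouge1", "Jaccard", "ROUGE_rouge1"],
    "score": [0.82, 0.75, 0.64, 0.70],
})

# Pivot to a wide view: one row per model, one column per flat metric name.
wide = df.pivot(index="model_name", columns="metric_name", values="score")
```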

register_metric

register_metric(name, metric_class)

Registers a custom metric class for use in this Experiment instance. This allows users to extend GAICo with their own custom metrics and use them seamlessly with the Experiment's compare() and summarize() methods.

Parameters:

    name (str, required):
        The name to refer to this metric by (e.g., "MyCustomMetric").

    metric_class (type[BaseMetric], required):
        The metric class (must inherit from BaseMetric).

Raises:

    TypeError:
        If metric_class is not a subclass of gaico.BaseMetric.
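A sketch of registering a custom metric. The scoring logic is plain Python; the `calculate` method name is an assumption, since BaseMetric's exact abstract interface is not shown on this page, and the import is guarded so the snippet stays a sketch when gaico is absent:

```python
def exact_match_score(response: str, reference: str) -> float:
    """Scoring logic for a hypothetical custom metric: 1.0 on exact match."""
    return 1.0 if response.strip() == reference.strip() else 0.0

try:
    from gaico import BaseMetric, Experiment  # top-level exports assumed
except ImportError:
    BaseMetric = Experiment = None  # gaico not installed; registration is a sketch

if BaseMetric is not None:
    class ExactMatch(BaseMetric):
        # NOTE: BaseMetric's required hook names are not documented here;
        # `calculate` is an assumption for illustration only.
        def calculate(self, response, reference):
            return exact_match_score(response, reference)

    exp = Experiment({"ModelA": "hello"}, reference_answer="hello")
    exp.register_metric("ExactMatch", ExactMatch)
```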

summarize

summarize(metrics=None, custom_thresholds=None, agg_funcs=None)

Calculates and returns a summary DataFrame with aggregated scores and pass rates for each model and metric.

Parameters:

    metrics (Optional[List[str]], default None):
        List of base metric names to include in the summary. If None, uses
        all metrics that have been calculated or can be calculated.

    custom_thresholds (Optional[Dict[str, float]], default None):
        Optional dictionary mapping flat metric names (e.g., "Jaccard",
        "ROUGE_rouge1") or base metric names (e.g., "ROUGE") to custom
        threshold values. If provided, these override the default
        thresholds used for pass-rate calculation.

    agg_funcs (Optional[List[str]], default None):
        List of aggregation functions, as strings (e.g., 'mean', 'std',
        'min', 'max'), to apply to scores. Defaults to ['mean', 'std'].

Returns:

    pd.DataFrame:
        A summary DataFrame with aggregated scores and pass rates. Columns
        include 'model_name', followed by aggregated score columns (e.g.,
        'Jaccard_mean', 'ROUGE_rouge1_std') and pass-rate columns (e.g.,
        'Jaccard_pass_rate').
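To show how aggregated scores and pass rates relate, here is a pandas sketch that derives the documented summary columns by hand from long-format scores. The scores and the 0.75 threshold are invented, and summarize() itself may compute this differently:

```python
import pandas as pd

# Long-format scores like to_dataframe() produces (values are illustrative).
scores = pd.DataFrame({
    "model_name": ["ModelA"] * 3 + ["ModelB"] * 3,
    "metric_name": ["Jaccard"] * 6,
    "score": [0.9, 0.7, 0.8, 0.5, 0.6, 0.4],
})

threshold = 0.75  # e.g. custom_thresholds={"Jaccard": 0.75}

# Aggregated score columns, named as in the documented output.
summary = (
    scores.groupby("model_name")["score"]
    .agg(["mean", "std"])
    .rename(columns={"mean": "Jaccard_mean", "std": "Jaccard_std"})
)

# Pass rate: fraction of items whose score meets the threshold.
summary["Jaccard_pass_rate"] = (
    scores.assign(passed=scores["score"] >= threshold)
    .groupby("model_name")["passed"]
    .mean()
)
```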

compare

compare(metrics=None, plot=False, custom_thresholds=None, output_csv_path=None, aggregate_func=None, plot_title_suffix='Comparison', radar_metrics_limit=12)

Compares models based on specified metrics, optionally plotting and generating a CSV. Handles both single-item and batch (dataset) evaluations.

Parameters:

    metrics (Optional[List[str]], default None):
        List of base metric names. If None, uses all default registered
        metrics.

    plot (bool, default False):
        If True, generates and shows plots. For batch data, plots are
        aggregated.

    custom_thresholds (Optional[Dict[str, float]], default None):
        Dictionary mapping metric names to threshold values.

    output_csv_path (Optional[str], default None):
        If provided, the path at which to save a detailed CSV report.

    aggregate_func (Optional[Callable], default None):
        Aggregation function (e.g., np.mean) for plotting batch results.

    plot_title_suffix (str, default 'Comparison'):
        Suffix for plot titles.

    radar_metrics_limit (int, default 12):
        Maximum number of metrics shown on a radar plot.

Returns:

    pd.DataFrame:
        A pandas DataFrame containing the detailed scores for all items.
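A usage sketch tying the documented parameters together for a batch comparison. The data is invented, and the gaico import is guarded so the snippet remains a sketch when the package is not installed:

```python
import numpy as np

responses = {
    "ModelA": ["resp one", "resp two"],
    "ModelB": ["resp 1", "resp 2"],
}
references = ["resp one", "resp two"]

try:
    from gaico import Experiment  # top-level export assumed
    exp = Experiment(responses, reference_answer=references)
    results = exp.compare(
        metrics=["Jaccard", "ROUGE"],
        plot=False,                       # set True to render aggregated plots
        custom_thresholds={"Jaccard": 0.5},
        output_csv_path=None,             # e.g. "scores.csv" for a CSV report
        aggregate_func=np.mean,           # used when plotting batch results
    )
except ImportError:
    pass  # gaico not installed; the call above illustrates the signature
```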