Experiment Class

gaico.Experiment

An abstraction that simplifies comparing LLM responses against a reference answer using various metrics, with built-in support for plotting, applying thresholds, and generating CSV reports.

__init__

__init__(llm_responses, reference_answer)

Initializes the Experiment.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| llm_responses | Dict[str, Any] | A dictionary mapping model names (str) to their generated outputs (Any). | required |
| reference_answer | Optional[Any] | A single reference output (Any) to compare against. If None, the output from the first model in llm_responses will be used as the reference. | required |

Raises:

| Type | Description |
|------|-------------|
| TypeError | If llm_responses is not a dictionary. |
| ValueError | If llm_responses does not contain string keys, or if it is empty when reference_answer is None. |
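A minimal construction sketch. The model names, response strings, and reference answer below are illustrative, and the import assumes the class is exposed at the package root as gaico.Experiment:

```python
from gaico import Experiment

# Map each model name to its generated output.
llm_responses = {
    "gpt-4o": "The Eiffel Tower is located in Paris, France.",
    "llama-3": "The Eiffel Tower stands in Paris.",
}

# Compare every model against a single reference answer.
# If reference_answer were None, the first model's output would be used as the reference.
exp = Experiment(
    llm_responses=llm_responses,
    reference_answer="The Eiffel Tower is in Paris, France.",
)
```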

to_dataframe

to_dataframe(metrics=None)

Returns a DataFrame of scores for the specified metrics. If metrics is None, scores for all default metrics are returned.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| metrics | Optional[List[str]] | A list of base metric names (e.g., "Jaccard", "ROUGE"). Defaults to None. | None |

Returns:

| Type | Description |
|------|-------------|
| pd.DataFrame | A pandas DataFrame with columns "model_name", "metric_name", and "score". "metric_name" will contain flat metric names (e.g., "ROUGE_rouge1"). |
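A usage sketch, continuing the exp object from the constructor example above. The metric names follow the examples given in this documentation; which metrics are actually available depends on the installed metric registry:

```python
# Scores for all default metrics, in long format:
# one row per (model_name, metric_name) pair with its score.
all_scores = exp.to_dataframe()

# Restrict the output to a subset of base metrics.
subset = exp.to_dataframe(metrics=["Jaccard", "ROUGE"])
print(subset.head())
```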

compare

compare(metrics=None, plot=False, custom_thresholds=None, output_csv_path=None, aggregate_func=None, plot_title_suffix='Comparison', radar_metrics_limit=12)

Compares models based on specified metrics, optionally plotting and generating a CSV.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| metrics | Optional[List[str]] | List of base metric names. If None, uses all default registered metrics. | None |
| plot | bool | If True, generates and shows plots. Defaults to False. | False |
| custom_thresholds | Optional[Dict[str, float]] | Dictionary of metric names (base or flat) to threshold values. Overrides default thresholds. | None |
| output_csv_path | Optional[str] | If provided, path to save a CSV report of thresholded results. | None |
| aggregate_func | Optional[Callable] | Aggregation function (e.g., np.mean, np.median) used when plotting if multiple scores exist per model/metric (not typical for the default setup). | None |
| plot_title_suffix | str | Suffix for plot titles. | 'Comparison' |
| radar_metrics_limit | int | Maximum number of metrics in a radar plot, to maintain readability. | 12 |

Returns:

| Type | Description |
|------|-------------|
| Optional[pd.DataFrame] | A pandas DataFrame containing the scores for the compared metrics, or None if there are no valid metrics. |
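A fuller comparison sketch, again continuing the exp object from above. All parameters shown are documented in this section; the threshold value, output path, and title suffix are illustrative choices, not library defaults:

```python
import numpy as np

results_df = exp.compare(
    metrics=["Jaccard", "ROUGE"],            # base metric names to compare
    plot=True,                               # generate and show plots
    custom_thresholds={"Jaccard": 0.5},      # override the default Jaccard threshold (illustrative value)
    output_csv_path="experiment_report.csv", # save a CSV report of thresholded results (illustrative path)
    aggregate_func=np.mean,                  # aggregation for plotting when multiple scores exist per model/metric
    plot_title_suffix="Paris Benchmark",     # appended to plot titles (illustrative)
)

# compare() returns None when no valid metrics are available.
if results_df is not None:
    print(results_df)
```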