# Experiment Class

## gaico.Experiment

An abstraction that simplifies comparing LLM responses against reference answers using various metrics: it computes scores, applies thresholds, generates plots, and exports CSV reports.
### __init__

`__init__(llm_responses, reference_answer)`

Initializes the Experiment for single or batch evaluation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `llm_responses` | `Dict[str, Any]` | A dictionary mapping model names (`str`) to their generated outputs. For batch evaluation, values should be lists of outputs, e.g., `{"ModelA": ["resp1", "resp2"], "ModelB": ["resp1", "resp2"]}`. | *required* |
| `reference_answer` | `Optional[Any]` | A single reference output, or a list of references for batch evaluation. If `None`, the output(s) from the first model are used as the reference. | *required* |
Raises:

| Type | Description |
|---|---|
| `TypeError` | If `llm_responses` is not a dictionary. |
| `ValueError` | If inputs are inconsistent (e.g., mixing single and list-like responses). |
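A minimal construction sketch based on the signature above. The `from gaico import Experiment` import path is an assumption, and all strings are placeholder data:

```python
from gaico import Experiment  # import path assumed

# Single-item evaluation: one response per model, one reference.
exp = Experiment(
    llm_responses={
        "ModelA": "Paris is the capital of France.",
        "ModelB": "The capital of France is Paris.",
    },
    reference_answer="Paris is the capital of France.",
)

# Batch evaluation: response lists should be the same length across
# models, with a matching list of references.
batch_exp = Experiment(
    llm_responses={
        "ModelA": ["resp1", "resp2"],
        "ModelB": ["resp1", "resp2"],
    },
    reference_answer=["ref1", "ref2"],
)
```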
### to_dataframe

`to_dataframe(metrics=None)`

Returns a DataFrame of scores for the specified metrics. If `metrics` is `None`, scores for all default metrics are returned.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `Optional[List[str]]` | A list of base metric names (e.g., `"Jaccard"`, `"ROUGE"`). Defaults to `None`. | `None` |
Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | A pandas DataFrame with columns `"model_name"`, `"metric_name"`, and `"score"`. `"metric_name"` contains flat metric names (e.g., `"ROUGE_rouge1"`). |
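A brief usage sketch, continuing from the `exp` instance constructed above:

```python
# All default metrics, in long (tidy) format.
df = exp.to_dataframe()

# Restrict to specific base metrics; sub-scores such as "ROUGE_rouge1"
# appear as flat names in the "metric_name" column.
df = exp.to_dataframe(metrics=["Jaccard", "ROUGE"])
print(df.columns.tolist())  # ["model_name", "metric_name", "score"]
```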
### register_metric

`register_metric(name, metric_class)`

Registers a custom metric class for use in this Experiment instance. This allows users to extend GAICo with their own custom metrics and use them seamlessly with the Experiment's `compare()` and `summarize()` methods.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | The name to refer to this metric by (e.g., `"MyCustomMetric"`). | *required* |
| `metric_class` | `type[BaseMetric]` | The metric class (must inherit from `BaseMetric`). | *required* |
Raises:

| Type | Description |
|---|---|
| `TypeError` | If `metric_class` is not a subclass of `gaico.BaseMetric`. |
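A sketch of registering a custom metric. The `calculate` method name and signature used here are assumptions, not a confirmed part of the `BaseMetric` interface; consult the `BaseMetric` reference for the exact abstract methods to implement.

```python
from gaico import BaseMetric  # import path assumed

class ExactMatch(BaseMetric):
    """Scores 1.0 on an exact (whitespace-stripped) match, else 0.0."""

    # Method name and signature assumed; check the BaseMetric docs.
    def calculate(self, generated_text: str, reference_text: str) -> float:
        return 1.0 if generated_text.strip() == reference_text.strip() else 0.0

exp.register_metric("ExactMatch", ExactMatch)

# The metric is now addressable by name alongside built-in metrics:
exp.compare(metrics=["ExactMatch", "Jaccard"])
```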
### summarize

`summarize(metrics=None, custom_thresholds=None, agg_funcs=None)`

Calculates and returns a summary DataFrame with aggregated scores and pass rates for each model and metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `Optional[List[str]]` | List of base metric names to include in the summary. If `None`, uses all metrics that have been calculated or can be calculated. | `None` |
| `custom_thresholds` | `Optional[Dict[str, float]]` | Optional dictionary mapping flat metric names (e.g., `"Jaccard"`, `"ROUGE_rouge1"`) or base metric names (e.g., `"ROUGE"`) to custom threshold values. If provided, these override the default thresholds for pass-rate calculation. | `None` |
| `agg_funcs` | `Optional[List[str]]` | List of aggregation functions (as strings, e.g., `'mean'`, `'std'`, `'min'`, `'max'`) to apply to scores. Defaults to `['mean', 'std']`. | `None` |
Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | A summary DataFrame with aggregated scores and pass rates. Columns include `'model_name'`, followed by aggregated score columns (e.g., `'Jaccard_mean'`, `'ROUGE_rouge1_std'`) and pass-rate columns (e.g., `'Jaccard_pass_rate'`). |
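A usage sketch on the batch `batch_exp` instance from earlier; the threshold values are illustrative, not recommended defaults:

```python
summary = batch_exp.summarize(
    metrics=["Jaccard", "ROUGE"],
    custom_thresholds={"Jaccard": 0.8, "ROUGE_rouge1": 0.5},  # illustrative values
    agg_funcs=["mean", "std", "min"],
)

# Expect one row per model, with columns such as "Jaccard_mean",
# "Jaccard_pass_rate", and "ROUGE_rouge1_std".
print(summary)
```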
### compare

`compare(metrics=None, plot=False, custom_thresholds=None, output_csv_path=None, aggregate_func=None, plot_title_suffix='Comparison', radar_metrics_limit=12)`

Compares models on the specified metrics, optionally plotting the results and generating a CSV report. Handles both single-item and batch (dataset) evaluations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `Optional[List[str]]` | List of base metric names. If `None`, uses all default registered metrics. | `None` |
| `plot` | `bool` | If `True`, generates and shows plots. For batch data, plotted scores are aggregated. | `False` |
| `custom_thresholds` | `Optional[Dict[str, float]]` | Dictionary mapping metric names to threshold values. | `None` |
| `output_csv_path` | `Optional[str]` | If provided, the path at which to save a detailed CSV report. | `None` |
| `aggregate_func` | `Optional[Callable]` | Aggregation function (e.g., `np.mean`) for plotting batch results. | `None` |
| `plot_title_suffix` | `str` | Suffix appended to plot titles. | `'Comparison'` |
| `radar_metrics_limit` | `int` | Maximum number of metrics shown on a radar plot. | `12` |
Returns:

| Type | Description |
|---|---|
| `pd.DataFrame` | A pandas DataFrame containing the detailed scores for all items. |
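A sketch of a typical batch call, again using `batch_exp` from above; the threshold, output path, and title suffix are illustrative placeholders:

```python
import numpy as np

scores_df = batch_exp.compare(
    metrics=["Jaccard", "ROUGE"],
    plot=True,                                 # aggregated plots for batch data
    custom_thresholds={"Jaccard": 0.7},        # illustrative threshold
    output_csv_path="comparison_report.csv",   # illustrative output path
    aggregate_func=np.mean,                    # aggregation used for batch plots
    plot_title_suffix="Demo Comparison",
)

# The returned DataFrame holds per-item scores for every model and metric.
print(scores_df.head())
```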