Experiment Class

gaico.Experiment
An abstraction to simplify plotting, applying thresholds, and generating CSVs for comparing LLM responses against reference answers using various metrics.
__init__

`__init__(llm_responses, reference_answer)`

Initializes the Experiment.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `llm_responses` | `Dict[str, Any]` | A dictionary mapping model names (`str`) to their generated outputs (`Any`). | *required* |
| `reference_answer` | `Optional[Any]` | A single reference output (`Any`) to compare against. If `None`, the output from the first model in `llm_responses` is used as the reference. | *required* |
Raises:

| Type | Description |
| --- | --- |
| `TypeError` | If `llm_responses` is not a dictionary. |
| `ValueError` | If `llm_responses` does not contain string keys, or if it is empty when `reference_answer` is `None`. |
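A minimal construction sketch (the model names and response strings below are illustrative, not part of the API):

```python
from gaico import Experiment

# Hypothetical outputs from two models being compared.
llm_responses = {
    "model_a": "Paris is the capital of France.",
    "model_b": "The capital city of France is Paris.",
}

# Compare both outputs against a single reference answer.
exp = Experiment(
    llm_responses=llm_responses,
    reference_answer="Paris is the capital of France.",
)
```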
to_dataframe

`to_dataframe(metrics=None)`
Returns a DataFrame of scores for the specified metrics. If metrics is None, scores for all default metrics are returned.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `metrics` | `Optional[List[str]]` | A list of base metric names (e.g., "Jaccard", "ROUGE"). Defaults to `None`. | `None` |
Returns:

| Type | Description |
| --- | --- |
| `pd.DataFrame` | A pandas DataFrame with columns "model_name", "metric_name", and "score". "metric_name" contains flat metric names (e.g., "ROUGE_rouge1"). |
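For example, continuing the sketch above (the chosen metric subset is illustrative):

```python
# Scores for all default metrics, in long format.
df_all = exp.to_dataframe()

# Scores for a subset of base metrics only.
df_subset = exp.to_dataframe(metrics=["Jaccard", "ROUGE"])

# "metric_name" holds flat names such as "ROUGE_rouge1".
print(df_subset[["model_name", "metric_name", "score"]])
```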
compare

`compare(metrics=None, plot=False, custom_thresholds=None, output_csv_path=None, aggregate_func=None, plot_title_suffix='Comparison', radar_metrics_limit=12)`
Compares models based on specified metrics, optionally plotting and generating a CSV.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `metrics` | `Optional[List[str]]` | List of base metric names. If `None`, uses all default registered metrics. | `None` |
| `plot` | `bool` | If `True`, generates and shows plots. Defaults to `False`. | `False` |
| `custom_thresholds` | `Optional[Dict[str, float]]` | Dictionary of metric names (base or flat) to threshold values. Overrides default thresholds. | `None` |
| `output_csv_path` | `Optional[str]` | If provided, path to save a CSV report of thresholded results. | `None` |
| `aggregate_func` | `Optional[Callable]` | Aggregation function (e.g., `np.mean`, `np.median`) used for plotting when multiple scores exist per model/metric (not typical for the default setup). | `None` |
| `plot_title_suffix` | `str` | Suffix for plot titles. | `'Comparison'` |
| `radar_metrics_limit` | `int` | Maximum number of metrics shown on a radar plot, to maintain readability. | `12` |
Returns:

| Type | Description |
| --- | --- |
| `Optional[pd.DataFrame]` | A pandas DataFrame containing the scores for the compared metrics, or `None` if there are no valid metrics. |
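A usage sketch combining several of the options above (the threshold values and output path are illustrative):

```python
import numpy as np

# Compare two metrics, plot the results, and save a thresholded CSV report.
results_df = exp.compare(
    metrics=["Jaccard", "ROUGE"],
    plot=True,
    custom_thresholds={"Jaccard": 0.5, "ROUGE_rouge1": 0.4},  # base or flat metric names
    output_csv_path="comparison_report.csv",
    aggregate_func=np.mean,  # only relevant when multiple scores exist per model/metric
    plot_title_suffix="vs. Reference",
)

if results_df is not None:
    print(results_df.head())
```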