Leaderboard | LLM-Planning

Classification Performance Comparison

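This leaderboard compares automated approaches for categorizing LLM-planning papers on two datasets: D₁ contains 126 papers (up to Nov 2023) and D₂ contains 47 papers (up to Sep 2024). The single-label setup assigns exactly one category per paper; the multi-label setup allows several categories per paper. Scores are F1 (higher is better); "-" marks a setting with no reported result.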
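To make the difference between the two setups concrete, here is a minimal sketch that scores predictions as sets of categories per paper. It assumes micro-averaged F1; the leaderboard does not state which F1 averaging was used. The papers and predictions are hypothetical, and `micro_f1` is an illustrative helper, not the authors' evaluation code.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-paper label sets (assumed averaging)."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: three papers, one category each (single-label setup).
single_gold = [{"Replanning"}, {"Goal Decomposition"}, {"Replanning"}]
single_pred = [{"Replanning"}, {"Replanning"}, {"Replanning"}]

# Hypothetical data: the same papers, several categories allowed (multi-label setup).
multi_gold = [{"Replanning", "Goal Decomposition"}, {"Goal Decomposition"}, {"Replanning"}]
multi_pred = [{"Replanning"}, {"Goal Decomposition", "Replanning"}, {"Replanning"}]

single_score = micro_f1(single_gold, single_pred)
multi_score = micro_f1(multi_gold, multi_pred)
print(f"single-label F1: {single_score:.3f}")
print(f"multi-label F1:  {multi_score:.3f}")
```

In the single-label case every set has exactly one element, so a wrong prediction counts as both a false positive and a false negative. In the multi-label case a prediction can be partially correct.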

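Key Finding: No automated method dominates across setups. Decision Trees (DT) score highest in the multi-label setup (F1: 0.233 on D₁, 0.349 on D₂), while SVM leads the single-label setup (0.222 on D₁, 0.346 on D₂). Human-augmented classification, reported only for multi-label D₂, reaches 0.83, demonstrating the value of expert review for identifying emerging categories such as "Goal Decomposition" and "Replanning."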

Classifier Name	Single-Label Setup D₁	Single-Label Setup D₂	Multi-Label Setup D₁	Multi-Label Setup D₂
SVM	0.222	0.346	0.123	0.280
DT	0.124	0.258	0.233	0.349
RF	0.117	0.213	0.044	0.215
BERT	0.049	0.043	0.102	0.069
SciBERT	0.000	0.013	0.102	0.150
Human-augmented	-	-	-	0.83
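
The per-column winners among the automated methods can be checked with a short script. This is a sketch only: the scores are copied from the table above, the human-augmented row is left out because it has a single reported value, and the names `SCORES`, `COLUMNS`, and `best_per_column` are illustrative.

```python
# Scores copied from the leaderboard table (automated classifiers only).
COLUMNS = ["Single-Label D1", "Single-Label D2", "Multi-Label D1", "Multi-Label D2"]
SCORES = {
    "SVM": [0.222, 0.346, 0.123, 0.280],
    "DT": [0.124, 0.258, 0.233, 0.349],
    "RF": [0.117, 0.213, 0.044, 0.215],
    "BERT": [0.049, 0.043, 0.102, 0.069],
    "SciBERT": [0.000, 0.013, 0.102, 0.150],
}


def best_per_column(scores, columns):
    """Return {column: (classifier, score)} for the top score in each column."""
    best = {}
    for i, col in enumerate(columns):
        name = max(scores, key=lambda n: scores[n][i])
        best[col] = (name, scores[name][i])
    return best


BEST = best_per_column(SCORES, COLUMNS)
for col, (name, score) in BEST.items():
    print(f"{col}: {name} ({score:.3f})")
```

SVM comes out on top in both single-label columns and DT in both multi-label columns.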