Classification Performance Comparison
This leaderboard compares automated approaches for categorizing LLM-planning papers. D₁ contains 126 papers (up to Nov 2023) and D₂ contains 47 papers (up to Sep 2024). The single-label setup assigns one category per paper; the multi-label setup allows several categories per paper. All scores are F1.
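Key Finding: Among automated methods, Decision Trees (DT) achieve the highest score overall (F1: 0.349, multi-label setup on D₂) and lead both multi-label setups, while SVM leads both single-label setups. Human-augmented classification reaches 0.83 in the multi-label setup on D₂, demonstrating the value of expert review for identifying emerging categories such as "Goal Decomposition" and "Replanning."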
| Classifier Name | Single-Label Setup D₁ | Single-Label Setup D₂ | Multi-Label Setup D₁ | Multi-Label Setup D₂ |
|---|---|---|---|---|
| SVM | 0.222 | 0.346 | 0.123 | 0.280 |
| DT | 0.124 | 0.258 | 0.233 | 0.349 |
| RF | 0.117 | 0.213 | 0.044 | 0.215 |
| BERT | 0.049 | 0.043 | 0.102 | 0.069 |
| SciBERT | 0.000 | 0.013 | 0.102 | 0.150 |
| Human-augmented | - | - | - | 0.83 |
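
To make the per-setup comparison easier to check, here is a minimal Python sketch that copies the automated scores from the table and picks the top classifier in each column. The names `SETUPS`, `SCORES`, `HUMAN_AUGMENTED_MULTI_D2`, and `best_per_setup` are illustrative assumptions and do not come from the original study; the setup labels use ASCII `D1`/`D2` in place of D₁/D₂.

```python
# Scores copied from the table above (automated classifiers only).
# Column order: single-label D1, single-label D2, multi-label D1, multi-label D2.
SETUPS = ["single-label D1", "single-label D2", "multi-label D1", "multi-label D2"]

SCORES = {
    "SVM": [0.222, 0.346, 0.123, 0.280],
    "DT": [0.124, 0.258, 0.233, 0.349],
    "RF": [0.117, 0.213, 0.044, 0.215],
    "BERT": [0.049, 0.043, 0.102, 0.069],
    "SciBERT": [0.000, 0.013, 0.102, 0.150],
}

# The human-augmented score is reported only for the multi-label setup on D2.
HUMAN_AUGMENTED_MULTI_D2 = 0.83


def best_per_setup(scores, setups):
    """Return {setup: (classifier, score)} for the top automated score per column."""
    best = {}
    for col, setup in enumerate(setups):
        name = max(scores, key=lambda clf: scores[clf][col])
        best[setup] = (name, scores[name][col])
    return best


if __name__ == "__main__":
    best = best_per_setup(SCORES, SETUPS)
    for setup, (name, score) in best.items():
        print(f"{setup}: {name} ({score:.3f})")

    # Gap between human-augmented and the best automated score on multi-label D2.
    _, top_auto = best["multi-label D2"]
    print(f"human-augmented gap: {HUMAN_AUGMENTED_MULTI_D2 - top_auto:.3f}")
```

Under these assumptions, the sketch reproduces the pattern in the table: SVM leads both single-label columns, DT leads both multi-label columns, and the human-augmented score exceeds the best automated score on multi-label D₂ by roughly 0.48.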