Research Analysis · PDDL Planning · State Encoding Methods
On the Choice of Tokenizers for Generalized Planning via Learned Transition Models
A systematic evaluation of four graph-based state encoding methods — WL, Shortest Path, GraphBPE, SimHash — across four PDDL planning domains (Blocks, Gripper, Logistics, Visitall), two model architectures (LSTM, XGBoost), and two prediction modes (delta, state). All analyses focus on interpolation and extrapolation Solved%. Extrapolation is the primary measure of generalization to unseen problem scales.
Tokenizers
WL · Shortest Path · GraphBPE · SimHash
Models & Modes
LSTM · XGBoost · Δ delta · S state
Highest Extrap (any config)
87.2%
Shortest Path on Visitall — robust across 3/4 configs. WL+XGBoost+Δ also reaches 87.2% on Visitall.
Only tokenizer on Blocks
WL
WL is the sole tokenizer achieving any nonzero score on Blocks (up to 45% extrap). All others: 0%.
Logistics extrap (all 16 configs)
0%
Every tokenizer × model × mode combination fails to extrapolate on Logistics.
GraphBPE Visitall extrap
0%
59.5% interpolation on Visitall but zero extrapolation — the starkest case of misleading interpolation in the study.
01 —
Tokenizer Overview: Coverage and Generalization
Two aggregate views of each tokenizer: mean Solved% across all 16 configs (4 domains × 2 models × 2 modes) for interpolation and extrapolation separately, plus breadth (how many of 16 configs achieve nonzero performance). This answers: which tokenizer is most broadly capable, and which is most reliable at out-of-distribution generalization?
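Both aggregate views reduce to simple operations over per-config Solved% lists. A minimal sketch, assuming results are stored as a mapping from tokenizer to its list of per-config scores; the numbers below are illustrative placeholders, not the study's full table.

```python
def mean_solved(results):
    """Mean Solved% per tokenizer across its configs."""
    return {tok: sum(v) / len(v) for tok, v in results.items()}

def breadth(results):
    """Number of configs with nonzero Solved% (reliability, not peak)."""
    return {tok: sum(1 for s in v if s > 0) for tok, v in results.items()}

# Hypothetical subset of configs, for illustration only.
extrap = {
    "WL":            [87.2, 45.0, 3.2, 0.0],
    "Shortest Path": [87.2, 87.2, 82.7, 0.0],
}
print(mean_solved(extrap)["Shortest Path"])
print(breadth(extrap)["WL"])  # -> 3 nonzero configs
```

The breadth count deliberately ignores magnitude: a tokenizer with many small nonzero scores ranks above one with a single large peak.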
All tokenizers · mean Interp and Extrap Solved%
Mean Solved% by Tokenizer
Averaged across all 16 configs per tokenizer (4 domains × 2 models × 2 modes). Interpolation vs Extrapolation shown side-by-side.
WL has the highest mean interpolation (70.0%) but Shortest Path leads on mean extrapolation (22.7% vs WL's 16.8%), suggesting Shortest Path generalizes more reliably out-of-distribution. GraphBPE and SimHash trail significantly on both metrics.
All tokenizers · breadth of nonzero configs
Config Breadth: Nonzero Solved% Counts
How many of 16 configs achieve nonzero performance. Reveals reliability vs peak-only performance.
WL has the widest interpolation coverage (13/16 configs nonzero), but Shortest Path achieves 5 nonzero extrap configs with much higher values. GraphBPE's 2 nonzero extrap configs both come from a single domain–model pair (Gripper + LSTM, one per mode), revealing narrow specialization.
02 —
Which Tokenizer Works Best for Which Domain?
The central question of the paper. Each cell in the matrix shows the best achievable Solved% for that (tokenizer, domain) pair — the ceiling across all 4 configs (2 models × 2 modes). Two matrices: one for interpolation, one for extrapolation. Together they reveal where each tokenizer's representations are sufficient versus structurally limited.
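Each matrix cell is a ceiling: the max Solved% over the 4 (model, mode) configs for that (tokenizer, domain) pair. A minimal sketch of the construction; the rows below are illustrative placeholders drawn from values quoted elsewhere in this analysis, not the complete result set.

```python
from collections import defaultdict

def best_per_cell(rows):
    """rows: iterable of (tokenizer, domain, model, mode, solved_pct).
    Returns {(tokenizer, domain): best Solved% over all configs}."""
    cell = defaultdict(float)  # missing cells default to 0.0
    for tok, dom, _model, _mode, pct in rows:
        cell[(tok, dom)] = max(cell[(tok, dom)], pct)
    return dict(cell)

rows = [
    ("WL", "Visitall", "XGBoost", "delta", 87.2),
    ("WL", "Visitall", "LSTM",    "delta",  3.2),
    ("WL", "Blocks",   "LSTM",    "state", 45.0),
]
print(best_per_cell(rows)[("WL", "Visitall")])  # -> 87.2
```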
Best achievable across 4 configs per cell
Interpolation Matrix
Best Interp Solved% per (tokenizer × domain). Rows = tokenizers, columns = domains.
Best achievable across 4 configs per cell
Extrapolation Matrix ← primary metric
Best Extrap Solved% per (tokenizer × domain). Logistics column is zero for all tokenizers.
Tokenizer ranking per domain (extrap) with interp context
Domain-by-Domain Tokenizer Ranking
Ranked by best extrap Solved%. Best config (model+mode) shown in parentheses. ★ = domain winner.
03 —
Tokenizer Deep Dives: All Configs per Domain
For each tokenizer, all 16 individual configurations (4 domains × 2 models × 2 modes) are shown. Bars display both interpolation and extrapolation for every config, colored by prediction mode (Δ delta = sky blue, S state = orange), with model shown via bar pattern. This reveals the model×mode sensitivity within each tokenizer.
Tokenizer: WL
WL — All 16 Configurations
Bars show Interp (left, lighter) and Extrap (right, solid) per config. Colors = mode.
WL is the only tokenizer with nonzero Blocks performance. On Visitall, extreme model×mode variance: XGBoost+Δ reaches 87.2% extrap but LSTM+Δ only 3.2% — same tokenizer, same domain. Logistics achieves strong interpolation (up to 88.9%) but zero extrap universally.
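The WL tokenizer is based on Weisfeiler–Lehman color refinement, which relabels each node from its own color plus the multiset of neighbor colors. A minimal sketch of that refinement idea over an adjacency list; the paper's actual encoder may differ in hashing, iteration count, and feature extraction.

```python
def wl_refine(adj, colors, rounds=2):
    """One-dimensional WL refinement sketch.
    adj: {node: [neighbors]}; colors: {node: initial label}."""
    for _ in range(rounds):
        new = {}
        for v in adj:
            # Signature = own color plus sorted multiset of neighbor colors.
            new[v] = (colors[v], tuple(sorted(colors[u] for u in adj[v])))
        # Compress signatures to small integer color ids.
        ids = {sig: i for i, sig in enumerate(sorted(set(new.values())))}
        colors = {v: ids[new[v]] for v in adj}
    return colors

# Path graph a-b-c: endpoints a and c are structurally alike, b differs.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
colors = wl_refine(adj, {v: 0 for v in adj})
print(colors["a"] == colors["c"], colors["a"] != colors["b"])  # True True
```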
Tokenizer: Shortest Path
Shortest Path — All 16 Configurations
The most consistent tokenizer on Visitall — all 4 configs cluster in the 82–87% extrap band.
Shortest Path is uniquely consistent on Visitall — 3 configs at 87.21%, 1 at 82.65%, variance <5 points. Completely fails on Blocks (0% everywhere). Gripper extrap only via LSTM+Δ (18.75%). Logistics: modest interpolation, zero extrap.
Tokenizer: GraphBPE
GraphBPE — All 16 Configurations
Highly LSTM-dependent. XGBoost achieves zero extrap across all domains.
GraphBPE is uniquely specialized for Gripper via LSTM — 43.75% extrap, the best gripper result in the study. Visitall shows a stark misleading signal: 59.46% interp but 0% extrap for all 4 configs. XGBoost is completely ineffective with GraphBPE across all domains.
Tokenizer: SimHash
SimHash — All 16 Configurations
Anomaly: Gripper+LSTM+Δ extrapolates (31.25%) despite zero interpolation.
SimHash exhibits the only extrap-without-interp case: Gripper+LSTM+Δ scores 31.25% extrap but 0% interp. This suggests SimHash's hash-based encoding occasionally aligns with test-distribution structure despite failing on near-distribution instances. Broadly weak — zero on Blocks and Logistics extrap.
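For context on why hash-based encodings can behave this way: classic SimHash sums a signed bit vector per feature and keeps the sign pattern, so overlapping feature bags tend to share signature bits. A minimal sketch assuming states are bags of string features (e.g. ground atoms); this is the generic algorithm, not necessarily the paper's exact encoder.

```python
import hashlib

def simhash(features, bits=16):
    """Classic SimHash over a bag of string features."""
    acc = [0] * bits
    for f in features:
        # Deterministic per-feature hash supplies one signed bit vector.
        h = int(hashlib.md5(f.encode()).hexdigest(), 16)
        for i in range(bits):
            acc[i] += 1 if (h >> i) & 1 else -1
    # Keep the sign of each accumulated coordinate as one signature bit.
    return sum(1 << i for i, a in enumerate(acc) if a > 0)

a = simhash(["on(b1,b2)", "clear(b1)", "handempty"])
b = simhash(["on(b1,b2)", "clear(b1)", "holding(b3)"])
# Hamming distance between signatures; similar bags tend to score low.
print(bin(a ^ b).count("1"))
```

Because the signature depends only on the feature bag, not the planning semantics, it can happen to separate out-of-distribution states well while failing on near-distribution ones, consistent with the anomaly above.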
04 —
The Interpolation–Extrapolation Gap by Tokenizer
High interpolation does not imply high extrapolation. This section quantifies the gap for each (tokenizer × domain) pair using the best achievable configs for each split. A large gap signals a tokenizer whose representations fit near-distribution instances but fail to capture the structural regularities needed for out-of-distribution generalization.
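The gap is just the difference of the two best-achievable values per pair. A minimal sketch using the two pairs quoted in this section (WL on Logistics, GraphBPE on Visitall); other pairs would be filled in the same way.

```python
def generalization_gap(best_interp, best_extrap):
    """Gap = best interp minus best extrap per (tokenizer, domain) pair.
    Pairs missing from best_extrap count as 0% extrap."""
    return {pair: best_interp[pair] - best_extrap.get(pair, 0.0)
            for pair in best_interp}

best_interp = {("WL", "Logistics"): 88.9, ("GraphBPE", "Visitall"): 59.5}
best_extrap = {("WL", "Logistics"): 0.0,  ("GraphBPE", "Visitall"): 0.0}
gaps = generalization_gap(best_interp, best_extrap)
print(gaps[("WL", "Logistics")])  # -> 88.9
```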
Best Interp vs Best Extrap per (tokenizer × domain) — scatter
Interpolation vs Extrapolation: Best Achievable Configs
Each point = one (tokenizer × domain) pair. Points on the dashed diagonal would mean perfect interp→extrap transfer. Distance below diagonal = generalization gap. Colored by tokenizer.
Two clusters emerge: (1) bottom-left: failed pairs (both metrics zero); (2) top-right: successful pairs. Critically, several points hug the zero-extrapolation axis — high interp but near-zero extrap. The Logistics+WL point (89% interp, 0% extrap) and the Visitall+GraphBPE point (59% interp, 0% extrap) are the most misleading. Only Visitall+Shortest Path and Visitall+WL approach the diagonal.
WL · gap per domain
WL Gap
Shortest Path · gap per domain
Shortest Path Gap
GraphBPE · gap per domain
GraphBPE Gap
SimHash · gap per domain
SimHash Gap
Shortest Path has the smallest gap on Visitall (12.8%) — the best interp→extrap transfer in the study. WL's Logistics gap (88.9%) is the largest, signaling that high interpolation performance on Logistics is structurally uninformative about generalization. GraphBPE's Visitall gap (59.5%) represents nearly 60 points of interpolation performance that never convert to extrapolation.
05 —
Model and Mode Sensitivity per Tokenizer
The choice of model (LSTM vs XGBoost) and prediction mode (delta vs state) interacts strongly with tokenizer. This section asks: for each tokenizer, which model+mode combination generalizes best? And crucially: does the tokenizer impose a constraint on which model or mode can be used?
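The sensitivity views are grouped means over the full config table: fix (tokenizer, model) and average over the 4 domains × 2 modes, or fix (tokenizer, mode) and average over domains and models. A minimal sketch with a generic grouping helper; the rows are illustrative placeholders echoing values quoted in this section.

```python
from collections import defaultdict

def mean_by(rows, key):
    """Mean of the last field of each row, grouped by key(row)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        k = key(row)
        sums[k] += row[-1]
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}

# (tokenizer, model, mode, extrap Solved%) -- illustrative rows only.
rows = [("GraphBPE", "LSTM",    "delta", 43.75),
        ("GraphBPE", "LSTM",    "state", 43.75),
        ("GraphBPE", "XGBoost", "delta",  0.0)]
by_model = mean_by(rows, key=lambda r: (r[0], r[1]))
print(by_model[("GraphBPE", "XGBoost")])  # -> 0.0
```

Swapping the key function (`lambda r: (r[0], r[2])`) produces the mode-sensitivity view from the same table.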
LSTM vs XGBoost · mean extrap by tokenizer
Model Sensitivity by Tokenizer
Mean extrap across all 4 domains and 2 modes per (tokenizer × model). Reveals which tokenizers impose a model constraint.
GraphBPE and SimHash are LSTM-dependent — XGBoost achieves near-zero extrap with both. WL strongly favors XGBoost+Δ for extrapolation. Shortest Path is model-agnostic — both LSTM and XGBoost reach 87%+ on Visitall.
Delta vs State mode · mean extrap by tokenizer
Mode Sensitivity by Tokenizer
Mean extrap across all 4 domains and 2 models per (tokenizer × mode). Delta and state modes behave very differently for WL.
WL is highly mode-sensitive — XGBoost+Δ reaches 87.2% on Visitall but XGBoost+S drops to 8.2% (same domain, same model, same tokenizer). Shortest Path is mode-robust. GraphBPE is mode-neutral (both modes equally functional with LSTM).
WL: extrap by model+mode per domain
WL
Shortest Path: extrap by model+mode
Shortest Path
GraphBPE: extrap by model+mode
GraphBPE
SimHash: extrap by model+mode
SimHash
Key per-tokenizer model×mode findings: WL — XGBoost+Δ on Visitall (87.2%) vs LSTM+Δ on same (3.2%): 84pt gap, largest in study. Shortest Path — all 4 configs tightly cluster on Visitall (82–87%), most stable tokenizer. GraphBPE — LSTM-only effective (XGBoost extrap = 0 for all 8 configs). SimHash — only LSTM achieves any extrap; anomalous 31.25% extrap with 0% interp on Gripper+LSTM+Δ.
06 —
Complete Configuration Heatmap
All 64 individual configurations. Organized by domain, then by tokenizer (ranked best extrap first within each domain), then sorted by extrap descending within each tokenizer. Every result shown without collapsing. The ★ marks the best extrap config per tokenizer per domain.
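The three-level ordering described above can be sketched as a single sort key: domain first, then each tokenizer's best extrap within that domain (descending), then the config's own extrap (descending), with the tokenizer name as a grouping tiebreak. Rows below are illustrative placeholders.

```python
def order_rows(rows):
    """rows: (domain, tokenizer, config, extrap). Orders by domain, then
    tokenizer ranked by its best extrap in that domain, then extrap desc."""
    best = {}
    for dom, tok, _cfg, ext in rows:
        best[(dom, tok)] = max(best.get((dom, tok), 0.0), ext)
    return sorted(rows,
                  key=lambda r: (r[0], -best[(r[0], r[1])], r[1], -r[3]))

rows = [("Visitall", "WL",            "LSTM+d", 3.2),
        ("Visitall", "Shortest Path", "LSTM+d", 87.2),
        ("Visitall", "WL",            "XGB+d",  87.2)]
ordered = order_rows(rows)
print([r[1] for r in ordered])  # -> ['Shortest Path', 'WL', 'WL']
```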