Research Analysis · PDDL Planning · State Encoding Methods
On the Choice of Tokenizers for Generalized Planning via Learned Transition Models
A systematic evaluation of four graph-based state encoding methods — WL, Shortest Path, GraphBPE, SimHash — across four PDDL planning domains (Blocks, Gripper, Logistics, Visitall), two model architectures (LSTM, XGBoost), and two prediction modes (delta, state). All analyses focus on interpolation and extrapolation Solved%. Extrapolation is the primary measure of generalization to unseen problem scales.
Tokenizers
WL · Shortest Path · GraphBPE · SimHash
Models & Modes
LSTM · XGBoost · Δ delta · S state
Highest Extrap (any config)
87.2%
Shortest Path on Visitall — robust across 3/4 configs. WL+XGBoost+Δ also reaches 87.2% on Visitall.
Only tokenizer on Blocks
WL
WL is the sole tokenizer achieving any nonzero score on Blocks (up to 45% extrap). All others: 0%.
Logistics extrap (all 16 configs)
0%
Every tokenizer × model × mode combination fails to extrapolate on Logistics.
GraphBPE Visitall extrap
0%
59.5% interpolation on Visitall but zero extrapolation — the starkest case of misleading interpolation in the study.
01 —
Tokenizer Overview: Coverage and Generalization
Two aggregate views of each tokenizer: mean Solved% across all 16 configs (4 domains × 2 models × 2 modes) for interpolation and extrapolation separately, plus breadth (how many of 16 configs achieve nonzero performance). This answers: which tokenizer is most broadly capable, and which is most reliable at out-of-distribution generalization?
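Both aggregate views reduce to simple operations over per-config Solved% lists. A minimal sketch, assuming results are stored as a mapping from tokenizer to its list of per-config scores; the numbers below are illustrative placeholders, not the study's full table.

```python
def mean_solved(results):
    """Mean Solved% per tokenizer across its configs."""
    return {tok: sum(v) / len(v) for tok, v in results.items()}

def breadth(results):
    """Number of configs with nonzero Solved% (reliability, not peak)."""
    return {tok: sum(1 for s in v if s > 0) for tok, v in results.items()}

# Hypothetical subset of configs, for illustration only.
extrap = {
    "WL":            [87.2, 45.0, 3.2, 0.0],
    "Shortest Path": [87.2, 87.2, 82.7, 0.0],
}
print(mean_solved(extrap)["Shortest Path"])
print(breadth(extrap)["WL"])  # -> 3 nonzero configs
```

The breadth count deliberately ignores magnitude: a tokenizer with many small nonzero scores ranks above one with a single large peak.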
All tokenizers · mean Interp and Extrap Solved%
Mean Solved% by Tokenizer
Averaged across all 16 configs per tokenizer (4 domains × 2 models × 2 modes). Interpolation vs Extrapolation shown side-by-side.
WL has the highest mean interpolation (70.0%) but Shortest Path leads on mean extrapolation (22.7% vs WL's 16.8%), suggesting Shortest Path generalizes more reliably out-of-distribution. GraphBPE and SimHash trail significantly on both metrics.
All tokenizers · breadth of nonzero configs
Config Breadth: Nonzero Solved% Counts
How many of 16 configs achieve nonzero performance. Reveals reliability vs peak-only performance.
WL has the widest interpolation coverage (13/16 configs nonzero), but Shortest Path achieves 5 nonzero extrap configs with much higher values. GraphBPE's 2 nonzero extrap configs both come from a single domain–model pair (Gripper + LSTM, one per mode), revealing narrow specialization.
02 —
Which Tokenizer Works Best for Which Domain?
The central question of the paper. Each cell in the matrix shows the best achievable Solved% for that (tokenizer, domain) pair — the ceiling across all 4 configs (2 models × 2 modes). Two matrices: one for interpolation, one for extrapolation. Together they reveal where each tokenizer's representations are sufficient versus structurally limited.
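Each matrix cell is a ceiling: the max Solved% over the 4 (model, mode) configs for that (tokenizer, domain) pair. A minimal sketch of the construction; the rows below are illustrative placeholders drawn from values quoted elsewhere in this analysis, not the complete result set.

```python
from collections import defaultdict

def best_per_cell(rows):
    """rows: iterable of (tokenizer, domain, model, mode, solved_pct).
    Returns {(tokenizer, domain): best Solved% over all configs}."""
    cell = defaultdict(float)  # missing cells default to 0.0
    for tok, dom, _model, _mode, pct in rows:
        cell[(tok, dom)] = max(cell[(tok, dom)], pct)
    return dict(cell)

rows = [
    ("WL", "Visitall", "XGBoost", "delta", 87.2),
    ("WL", "Visitall", "LSTM",    "delta",  3.2),
    ("WL", "Blocks",   "LSTM",    "state", 45.0),
]
print(best_per_cell(rows)[("WL", "Visitall")])  # -> 87.2
```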
Best achievable across 4 configs per cell
Interpolation Matrix
Best Interp Solved% per (tokenizer × domain). Rows = tokenizers, columns = domains.
Best achievable across 4 configs per cell
Extrapolation Matrix ← primary metric
Best Extrap Solved% per (tokenizer × domain). Logistics column is zero for all tokenizers.
Tokenizer ranking per domain (extrap) with interp context
Domain-by-Domain Tokenizer Ranking
Ranked by best extrap Solved%. Best config (model+mode) shown in parentheses. ★ = domain winner.
03 —
Tokenizer Deep Dives: All Configs per Domain
For each tokenizer, all 16 individual configurations (4 domains × 2 models × 2 modes) are shown. Bars display both interpolation and extrapolation for every config, colored by prediction mode (Δ delta = sky blue, S state = orange), with model shown via bar pattern. This reveals the model×mode sensitivity within each tokenizer.
Tokenizer: WL
WL — All 16 Configurations
Bars show Interp (left, lighter) and Extrap (right, solid) per config. Colors = mode.
WL is the only tokenizer with nonzero Blocks performance. On Visitall, extreme model×mode variance: XGBoost+Δ reaches 87.2% extrap but LSTM+Δ only 3.2% — same tokenizer, same domain. Logistics achieves strong interpolation (up to 88.9%) but zero extrap universally.
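The WL tokenizer is based on Weisfeiler–Lehman color refinement, which relabels each node from its own color plus the multiset of neighbor colors. A minimal sketch of that refinement idea over an adjacency list; the paper's actual encoder may differ in hashing, iteration count, and feature extraction.

```python
def wl_refine(adj, colors, rounds=2):
    """One-dimensional WL refinement sketch.
    adj: {node: [neighbors]}; colors: {node: initial label}."""
    for _ in range(rounds):
        new = {}
        for v in adj:
            # Signature = own color plus sorted multiset of neighbor colors.
            new[v] = (colors[v], tuple(sorted(colors[u] for u in adj[v])))
        # Compress signatures to small integer color ids.
        ids = {sig: i for i, sig in enumerate(sorted(set(new.values())))}
        colors = {v: ids[new[v]] for v in adj}
    return colors

# Path graph a-b-c: endpoints a and c are structurally alike, b differs.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
colors = wl_refine(adj, {v: 0 for v in adj})
print(colors["a"] == colors["c"], colors["a"] != colors["b"])  # True True
```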
Tokenizer: Shortest Path
Shortest Path — All 16 Configurations
The most consistent tokenizer on Visitall — all 4 configs cluster in the 82–87% extrap band.
Shortest Path is uniquely consistent on Visitall — 3 configs at 87.21%, 1 at 82.65%, variance <5 points. Completely fails on Blocks (0% everywhere). Gripper extrap only via LSTM+Δ (18.75%). Logistics: modest interpolation, zero extrap.
Tokenizer: GraphBPE
GraphBPE — All 16 Configurations
Highly LSTM-dependent. XGBoost achieves zero extrap across all domains.
GraphBPE is uniquely specialized for Gripper via LSTM — 43.75% extrap, the best gripper result in the study. Visitall shows a stark misleading signal: 59.46% interp but 0% extrap for all 4 configs. XGBoost is completely ineffective with GraphBPE across all domains.
Tokenizer: SimHash
SimHash — All 16 Configurations
Anomaly: Gripper+LSTM+Δ extrapolates (31.25%) despite zero interpolation.
SimHash exhibits the only extrap-without-interp case: Gripper+LSTM+Δ scores 31.25% extrap but 0% interp. This suggests SimHash's hash-based encoding occasionally aligns with test-distribution structure despite failing on near-distribution instances. Broadly weak — zero on Blocks and Logistics extrap.
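For context on why hash-based encodings can behave this way: classic SimHash sums a signed bit vector per feature and keeps the sign pattern, so overlapping feature bags tend to share signature bits. A minimal sketch assuming states are bags of string features (e.g. ground atoms); this is the generic algorithm, not necessarily the paper's exact encoder.

```python
import hashlib

def simhash(features, bits=16):
    """Classic SimHash over a bag of string features."""
    acc = [0] * bits
    for f in features:
        # Deterministic per-feature hash supplies one signed bit vector.
        h = int(hashlib.md5(f.encode()).hexdigest(), 16)
        for i in range(bits):
            acc[i] += 1 if (h >> i) & 1 else -1
    # Keep the sign of each accumulated coordinate as one signature bit.
    return sum(1 << i for i, a in enumerate(acc) if a > 0)

a = simhash(["on(b1,b2)", "clear(b1)", "handempty"])
b = simhash(["on(b1,b2)", "clear(b1)", "holding(b3)"])
# Hamming distance between signatures; similar bags tend to score low.
print(bin(a ^ b).count("1"))
```

Because the signature depends only on the feature bag, not the planning semantics, it can happen to separate out-of-distribution states well while failing on near-distribution ones, consistent with the anomaly above.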
04 —
The Interpolation–Extrapolation Gap by Tokenizer
High interpolation does not imply high extrapolation. This section quantifies the gap for each (tokenizer × domain) pair using the best achievable configs for each split. A large gap signals a tokenizer whose representations fit near-distribution instances but fail to capture the structural regularities needed for out-of-distribution generalization.
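The gap is just the difference of the two best-achievable values per pair. A minimal sketch using the two pairs quoted in this section (WL on Logistics, GraphBPE on Visitall); other pairs would be filled in the same way.

```python
def generalization_gap(best_interp, best_extrap):
    """Gap = best interp minus best extrap per (tokenizer, domain) pair.
    Pairs missing from best_extrap count as 0% extrap."""
    return {pair: best_interp[pair] - best_extrap.get(pair, 0.0)
            for pair in best_interp}

best_interp = {("WL", "Logistics"): 88.9, ("GraphBPE", "Visitall"): 59.5}
best_extrap = {("WL", "Logistics"): 0.0,  ("GraphBPE", "Visitall"): 0.0}
gaps = generalization_gap(best_interp, best_extrap)
print(gaps[("WL", "Logistics")])  # -> 88.9
```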
Best Interp vs Best Extrap per (tokenizer × domain) — scatter
Interpolation vs Extrapolation: Best Achievable Configs
Each point = one (tokenizer × domain) pair. Points on the dashed diagonal would mean perfect interp→extrap transfer. Distance below diagonal = generalization gap. Colored by tokenizer.
Two clusters emerge: (1) bottom-left: failed pairs (both metrics zero); (2) top-right: successful pairs. Critically, several points hug the zero-extrapolation axis — high interp but near-zero extrap. The Logistics+WL point (89% interp, 0% extrap) and the Visitall+GraphBPE point (59% interp, 0% extrap) are the most misleading. Only Visitall+Shortest Path and Visitall+WL approach the diagonal.
WL · gap per domain
WL Gap
Shortest Path · gap per domain
Shortest Path Gap
GraphBPE · gap per domain
GraphBPE Gap
SimHash · gap per domain
SimHash Gap
Shortest Path has the smallest gap on Visitall (12.8%) — the best interp→extrap transfer in the study. WL's Logistics gap (88.9%) is the largest, signaling that high interpolation performance on Logistics is structurally uninformative about generalization. GraphBPE's Visitall gap (59.5%) represents nearly 60 points of interpolation performance that never convert to extrapolation.
05 —
Model and Mode Sensitivity per Tokenizer
The choice of model (LSTM vs XGBoost) and prediction mode (delta vs state) interacts strongly with tokenizer. This section asks: for each tokenizer, which model+mode combination generalizes best? And crucially: does the tokenizer impose a constraint on which model or mode can be used?
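The sensitivity views are grouped means over the full config table: fix (tokenizer, model) and average over the 4 domains × 2 modes, or fix (tokenizer, mode) and average over domains and models. A minimal sketch with a generic grouping helper; the rows are illustrative placeholders echoing values quoted in this section.

```python
from collections import defaultdict

def mean_by(rows, key):
    """Mean of the last field of each row, grouped by key(row)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        k = key(row)
        sums[k] += row[-1]
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}

# (tokenizer, model, mode, extrap Solved%) -- illustrative rows only.
rows = [("GraphBPE", "LSTM",    "delta", 43.75),
        ("GraphBPE", "LSTM",    "state", 43.75),
        ("GraphBPE", "XGBoost", "delta",  0.0)]
by_model = mean_by(rows, key=lambda r: (r[0], r[1]))
print(by_model[("GraphBPE", "XGBoost")])  # -> 0.0
```

Swapping the key function (`lambda r: (r[0], r[2])`) produces the mode-sensitivity view from the same table.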
LSTM vs XGBoost · mean extrap by tokenizer
Model Sensitivity by Tokenizer
Mean extrap across all 4 domains and 2 modes per (tokenizer × model). Reveals which tokenizers impose a model constraint.
GraphBPE and SimHash are LSTM-dependent — XGBoost achieves near-zero extrap with both. WL strongly favors XGBoost+Δ for extrapolation. Shortest Path is model-agnostic — both LSTM and XGBoost reach 87%+ on Visitall.
Delta vs State mode · mean extrap by tokenizer
Mode Sensitivity by Tokenizer
Mean extrap across all 4 domains and 2 models per (tokenizer × mode). Delta and state modes behave very differently for WL.
WL is highly mode-sensitive — XGBoost+Δ reaches 87.2% on Visitall but XGBoost+S drops to 8.2% (same domain, same model, same tokenizer). Shortest Path is mode-robust. GraphBPE is mode-neutral (both modes equally functional with LSTM).
WL: extrap by model+mode per domain
WL
Shortest Path: extrap by model+mode
Shortest Path
GraphBPE: extrap by model+mode
GraphBPE
SimHash: extrap by model+mode
SimHash
Key per-tokenizer model×mode findings: WL — XGBoost+Δ on Visitall (87.2%) vs LSTM+Δ on same (3.2%): 84pt gap, largest in study. Shortest Path — all 4 configs tightly cluster on Visitall (82–87%), most stable tokenizer. GraphBPE — LSTM-only effective (XGBoost extrap = 0 for all 8 configs). SimHash — only LSTM achieves any extrap; anomalous 31.25% extrap with 0% interp on Gripper+LSTM+Δ.
06 —
Complete Configuration Heatmap
All 64 individual configurations. Organized by domain, then by tokenizer (ranked best extrap first within each domain), then sorted by extrap descending within each tokenizer. Every result shown without collapsing. The ★ marks the best extrap config per tokenizer per domain.
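The three-level ordering described above can be sketched as a single sort key: domain first, then each tokenizer's best extrap within that domain (descending), then the config's own extrap (descending), with the tokenizer name as a grouping tiebreak. Rows below are illustrative placeholders.

```python
def order_rows(rows):
    """rows: (domain, tokenizer, config, extrap). Orders by domain, then
    tokenizer ranked by its best extrap in that domain, then extrap desc."""
    best = {}
    for dom, tok, _cfg, ext in rows:
        best[(dom, tok)] = max(best.get((dom, tok), 0.0), ext)
    return sorted(rows,
                  key=lambda r: (r[0], -best[(r[0], r[1])], r[1], -r[3]))

rows = [("Visitall", "WL",            "LSTM+d", 3.2),
        ("Visitall", "Shortest Path", "LSTM+d", 87.2),
        ("Visitall", "WL",            "XGB+d",  87.2)]
ordered = order_rows(rows)
print([r[1] for r in ordered])  # -> ['Shortest Path', 'WL', 'WL']
```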