Debugging Code World Models


Figure 1: (Left) Accuracy on $S_5$ permutation tracking across sequence lengths (8–128 swaps). CWM baseline generates both commands and states; CWM+TF (teacher forcing) receives ground-truth commands and only predicts states. GPT-5 degrades rapidly while CWM+TF maintains high accuracy, showing the model can track state when given correct actions. (Right) Composition accuracy on nested string functions (depths 1–5). Observed accuracy falls far below the theoretical baseline (dashed) computed from atomic accuracy. Flattening nested calls into sequential assignments (red) provides modest improvement but cannot close the gap.

1. World Models

World models (WMs) are a framework for prediction (simulation) and planning: you learn a model of how an environment evolves, then "roll it forward" to see the result of an action. WMs were first popularized for images and videos, e.g., Genie [1], in game-like settings where an agent interacts with an environment; the promise is that the agent can learn new abilities by interacting with the learned simulator, enabling more open-ended behavior (e.g., exploration, long-horizon planning, skill acquisition).

Several recent works treat language models trained on text as world models, including for text-based games [3], general game playing [4], and code generation guided by MCTS [5][6]. The key difference from a standard language model is the training interface: an explicit action + state format. For example, a code world model (CWM) [2] is trained on traces of the form:

action = ⟨code / command⟩
state = ⟨runtime variable values⟩

This turns code execution into a supervised "simulate the next state" problem. The CWM paper reports that this kind of training improves downstream performance on software-engineering style benchmarks (e.g., SWE-bench).
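To make the format concrete, here is a minimal sketch (not CWM's actual tracer, and simplified in its variable filtering) of how (action, state) pairs can be harvested from real execution with Python's sys.settrace:

import sys

def record_trace(code_str):
    """Pair each executed line (action) with the state after it runs."""
    lines = code_str.splitlines()
    pairs, last_action = [], None

    def tracer(frame, event, arg):
        nonlocal last_action
        state = {k: v for k, v in frame.f_locals.items()
                 if not k.startswith("__")}
        if event == "line":
            # 'line' fires *before* a line runs, so the locals seen here are
            # the post-state of the previously executed line.
            if last_action is not None:
                pairs.append((last_action, state))
            last_action = lines[frame.f_lineno - 1].strip()
        elif event == "return" and last_action is not None:
            pairs.append((last_action, state))  # final state of the block
        return tracer

    sys.settrace(tracer)
    try:
        exec(code_str, {})
    finally:
        sys.settrace(None)
    return pairs

for action, state in record_trace("a = 1\nb = 2\na, b = b, a"):
    print("action =", action, "| state =", state)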

The Connection to State-Tracking

This format is closely related to what the model-architecture community calls state tracking, often studied through finite-state automata and long-horizon length generalization [7][8]. Theoretical work has shown that log-precision transformers face fundamental limitations in simulating finite automata [7], while recurrent architectures can represent these transition systems more naturally. Recent work on linear RNNs and state space models [9] demonstrates that careful initialization of recurrence eigenvalues enables reliable state tracking over long sequences. These findings suggest that the choice of architecture, and its alignment with the supervision structure, determines whether models can faithfully track latent state.

World modeling should solve state-tracking. But Transformers can't track state. CWM is a Transformer-based world model. Something has to give...

  • Where does it break? On which code patterns does CWM's state-tracking fail?
  • Why? What mechanism explains both its successes and its failures?
  • Can we fix it? Can we recover accuracy by engineering around the limitations?
  • At what cost? What are the efficiency implications for training and inference?

In this post, we use a state-tracking lens to answer these questions.

2. Evaluation on Code Benchmarks

We evaluated CWM on two standard code execution benchmarks, CruxEval-O [11] (output prediction) and HumanEval [12] (function execution), as well as the Nesting dataset [10], which probes compositional execution of nested string-manipulation functions (e.g., upper(replace(strip(x)))). Table 1 summarizes the results.

Table 1: CWM accuracy on code execution benchmarks

| Benchmark | Samples | Baseline Accuracy | After Intervention |
|---|---|---|---|
| CruxEval-O | 800 | 85.1% | 90.4% |
| HumanEval | 723 | 91.4% | n/a |
| Nesting (depth=2) | 100 | 75% | 78% |
| Nesting (depth=3) | 100 | 58% | 63% |
| Nesting (depth=4) | 100 | 39% | 43% |
| Nesting (depth=5) | 100 | 25% | 28% |

These baselines are strong but reveal systematic failure patterns. Error analysis (detailed in the CruxEval, HumanEval, and Nesting reports) identifies three primary failure modes: compositionality (nested expressions hiding intermediate values), tokenization discontinuity (string operations breaking token boundaries), and trace truncation (loops generating traces that exceed the 32K token limit). We examine each below.

Figure 2 illustrates the first two challenges and a potential resolution strategy. On the left, nested function calls hide intermediate values that CWM never sees during trace generation. On the right, string operations require character-level iteration that changes tokenization entirely. In both cases, decomposing the expression into explicit steps can expose these hidden states, restoring the dense supervision CWM relies on. We explore these interventions in Section 4.

🔗 Composition

Original Code:
nums = [1, 2, 3, 2, 1]
n = 2
output = []
output.append((nums.count(n), n))

Decomposed:
_t0 = nums.count(n)   # hidden!
_t1 = (_t0, n)        # hidden!
output.append(_t1)

Issue: Nested calls hide intermediate values (_t0, _t1), so the effective reveal spacing is greater than 1.

📝 String Manipulation

Original Code:
text = "abracadabra"
char = "c"
pos = text.index(char)

Explicit Loop:
_t0 = list(text)
_t1 = -1
for _t2 in range(len(_t0)):
    if _t0[_t2] == char:
        _t1 = _t2
        break
pos = _t1

Issue: The string-to-list conversion changes tokenization, and the loop iterations create many hidden states.

Figure 2: Two challenges in real code and potential decomposition strategies. Left: Composition hides intermediate results in nested expressions. Right: String manipulation requires tokenization changes and implicit iteration.

2.1 Challenge: Compositionality

As shown in Figure 2 (left), when commands are composed (nested within other commands or combined into compound expressions), the reveal spacing effectively becomes greater than one. The intermediate states between sub-operations are never exposed to the model.

Experimental setup: Using the library of 25 deterministic string-manipulation functions from the Nesting dataset [10], we designed prompts for CWM and first evaluated on atomic (depth-1) calls. We selected the 15 functions achieving ≥90% atomic accuracy (average: 95.3%), then generated 100 test samples per depth for compositions from depth 1 to 5 (e.g., depth 3 = func_A(func_B(func_C(x)))).
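For illustration, a sketch of how such depth-k samples can be generated; the function pool below is a stand-in, not the Nesting library's actual contents:

import random

# Stand-in pool; the Nesting dataset's 25 functions are deterministic
# string manipulations of this general shape.
FUNCS = {"upper": str.upper, "lower": str.lower, "strip": str.strip,
         "swapcase": str.swapcase, "title": str.title}

def make_sample(x, depth, rng):
    names = rng.choices(list(FUNCS), k=depth)
    expr, val = "x", x
    for name in names:
        expr = f"{name}({expr})"   # nest one level deeper
        val = FUNCS[name](val)     # ground-truth output
    return expr, val

rng = random.Random(0)
print(make_sample("  Hello World  ", depth=3, rng=rng))
# -> (nested expression string, expected output)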


Figure 3: CWM atomic (depth-1) accuracy across all 25 string-manipulation functions. Functions achieving ≥90% accuracy (above green dashed line) were selected for the composition experiment, yielding 15 high-accuracy functions with average accuracy of 95.3%.

Results: At depth 5, observed accuracy (25%) is more than 50 percentage points below the theoretical baseline (78.6% = 0.953⁵) computed from atomic accuracy. The gap grows super-linearly with depth (Figure 1, right), indicating that compositional execution introduces errors beyond simple error propagation.
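The theoretical baseline assumes errors compound independently across nesting levels, as this snippet makes explicit:

# Independent-errors baseline: if each atomic call succeeds with
# probability p, a depth-d composition should succeed with about p**d.
p = 0.953                       # average atomic accuracy of the 15 functions
for d in range(1, 6):
    print(d, f"{p ** d:.1%}")   # depth 5 -> 78.6%, vs. 25% observed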

2.2 Challenge: Trace Truncation

Dense state supervision has a hidden cost: token explosion. When code contains loops, each iteration generates a full state snapshot. For O(n²) algorithms or large inputs, traces can exceed the 32K token limit, causing truncation before the final output is reached.

In CruxEval, 6 samples failed because their traces exceeded the 32K token limit. An additional 17 samples failed on loop/counter errors, where the model lost track during long iterations even without truncation. Together, these reveal a fundamental tension: the same dense supervision that enables reliable state-tracking also creates traces too long to process.
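A rough back-of-the-envelope estimate (our own, with assumed per-frame token counts) shows why loops blow past the limit:

# Each iteration emits a full state snapshot, so trace length grows with the
# iteration count regardless of how short the source program is.
def estimate_trace_tokens(iterations, n_vars, tokens_per_value=4, overhead=8):
    frame = overhead + n_vars * tokens_per_value   # one JSON state frame
    return iterations * frame

# An O(n^2) double loop over a 100-element input with 5 live variables:
print(estimate_trace_tokens(100 * 100, n_vars=5))  # 280000 tokens >> 32K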

2.3 Challenge: Tokenization Discontinuity

String manipulation creates a challenge unrelated to supervision density: minor semantic changes can cause dramatic shifts in tokenization. A single character edit might transform a string from one token into five or more, as illustrated in Figure 4.

Tokenization Discontinuity in String State-Tracking

Initial state: s = "abcdefghijklmnopqrstuvwxyz"
Tokenization: [abcdefghijklmnopqrstuvwxyz] (1 token)

Step 1: replace 'x' with 'X'
New state: s = "abcdefghijklmnopqrstuvwXyz"
Tokenization: [abcdefghijklmnop] [qrst] [uvw] [X] [yz] (5 tokens)

Step 2: insert a space at position 13
New state: s = "abcdefghijklm nopqrstuvwXyz"
Tokenization: [abcdefghijkl] [m] [ nop] [qrst] [uvw] [X] [yz] (7 tokens)

Figure 4: Tokenization discontinuity in string state-tracking. Minor edits (changing one character, inserting a space) cause radical changes in BPE token segmentation—from a single token to 5–7 tokens with entirely different boundaries. This makes string operations unpredictable for the model.

This is problematic for state-tracking: two strings that differ by a single character may have completely different token representations. The model cannot rely on surface-level similarity to predict the next state, making string operations a consistent source of errors.
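This is easy to reproduce; a minimal sketch using the tiktoken package (an assumption on our part: token counts depend on the tokenizer, so the exact splits in Figure 4 are illustrative):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
states = [
    "abcdefghijklmnopqrstuvwxyz",    # initial state
    "abcdefghijklmnopqrstuvwXyz",    # after replacing 'x' with 'X'
    "abcdefghijklm nopqrstuvwXyz",   # after inserting a space
]
for s in states:
    ids = enc.encode(s)
    print(len(ids), "tokens:", [enc.decode([i]) for i in ids])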

Key Insight: These challenges have different roots. Compositionality and trace truncation are both consequences of dense supervision: the first creates hidden states, the second creates traces too long to process. Tokenization discontinuity is a separate problem with how BPE tokenizers represent strings, independent of supervision strategy.

3. A State-Tracking Lens

Why does CWM succeed at state-tracking? The answer lies in its training format: dense state supervision. CWM tracks the state of all variables after each command, which we call reveal spacing = 1. Every operation is followed by a complete state snapshot, creating an extremely dense supervision signal.

This raises a natural question: can we achieve reliable state-tracking with sparser supervision? To answer this, we need a controlled benchmark that isolates state-tracking from other code complexities.

3.1 The Shell Game Benchmark

Imagine the classic shell game: three cups labeled A, B, C contain objects 1, 2, 3. A dealer shuffles them by swapping pairs of cups, and you must track which object ends up where. This is exactly what state-tracking requires: maintaining a mental model of system state through a sequence of operations.

The symmetric group $S_n$ can be realized in code as a variable-assignment problem: $n$ variables are initialized with random values and then have their values swapped by successive commands. Print statements provide partial reveals that can serve as supervision signals.
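A sketch of how such programs can be generated (the naming and structure are ours, not the benchmark's exact script, which wraps the body in a function):

import random

def make_swap_program(n_vars=5, n_swaps=8, seed=0):
    rng = random.Random(seed)
    names = [chr(ord("a") + i) for i in range(n_vars)]
    lines = [f"{v} = {rng.randint(1, 9)}" for v in names]
    for _ in range(n_swaps):
        perm = names[:]
        rng.shuffle(perm)   # a random element of S_n
        lines.append(f"{', '.join(names)} = {', '.join(perm)}")
    return "\n".join(lines)

print(make_swap_program())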


Figure 5: (Left) The shell game with three cups representing $S_3$ permutations: each swap changes which object is under which cup. (Right) The equivalent $S_n$ problem as Python code: variable assignments and swaps mirror the cup movements, where tracking the final values requires faithful state-tracking through each operation.

3.2 Evaluation Setup: GPT-5 vs CWM

We evaluated both GPT-5 and CWM on $S_5$ permutation sequences with varying lengths: $N \in \{8, 16, 32, 64, 128\}$ swap operations (Figure 1, left). The task: predict the final values of all five variables.

GPT-5 Setup: We use a structured chat format with a system prompt instructing the model to act as a Python code execution tracer. The model outputs final values in the format a=X,b=X,c=X,d=X,e=X. We use a maximum token limit of 16,384 and evaluate via exact string matching.
System Prompt

You are a Python code execution tracer. Your task is to trace through Python code that performs variable assignments and swaps, then determine the final values of ALL variables.

## Task Description

Given a Python function that:

  1. Initializes 5 variables (a, b, c, d, e) with integer values
  2. Performs a series of simultaneous variable swaps (e.g., a, b, c, d, e = c, e, b, a, d)

You must trace through all the operations step by step and provide the final values of ALL five variables.

## Example

Code:

def execute_repl_trace():
    a = 1
    b = 2
    c = 3
    d = 4
    e = 5
    a, b, c, d, e = c, e, b, a, d
    a, b, c, d, e = e, b, c, d, a

def main():
    execute_repl_trace()

Step-by-step trace:

  1. Initial: a=1, b=2, c=3, d=4, e=5
  2. After a, b, c, d, e = c, e, b, a, d: a=3, b=5, c=2, d=1, e=4
  3. After a, b, c, d, e = e, b, c, d, a: a=4, b=5, c=2, d=1, e=3

Answer: a=4,b=5,c=2,d=1,e=3

## Instructions

  • Trace through each assignment carefully
  • Remember that tuple unpacking in Python happens simultaneously (all right-hand values are evaluated before any assignment)
  • Provide the final values of ALL variables in the format: a=X,b=X,c=X,d=X,e=X
  • Do not include any explanation, just the comma-separated values
User Prompt (Example with 8 swap operations)

Trace through the following Python code and provide the final values of ALL variables.

def execute_repl_trace():
    """Execute the REPL trace operations."""
    a = 8
    b = 4
    c = 7
    d = 8
    e = 7
    a, b, c, d, e = c, e, b, a, d
    a, b, c, d, e = e, b, c, d, a
    a, b, c, d, e = b, e, a, c, d
    a, b, c, d, e = a, b, e, d, c
    a, b, c, d, e = b, c, e, a, d
    a, b, c, d, e = e, a, c, b, d
    a, b, c, d, e = a, e, c, b, d
    a, b, c, d, e = b, d, e, c, a

def main():
    execute_repl_trace()

What are the final values of all variables? Provide in the format: a=X,b=X,c=X,d=X,e=X

CWM Setup: Uses the model's native trace format with specialized tokens (<|trace_context_start|>, <|frame_sep|>, <|action_sep|>). Unlike GPT-5 which only predicts final values, CWM generates a complete execution trace with explicit variable states in JSON format at each step.
CWM Input Format
<|begin_of_text|><|trace_context_start|>
def execute_repl_trace():
    """Execute the REPL trace operations."""
    a = 8
    b = 4
    c = 7
    d = 8
    e = 7
    a, b, c, d, e = c, e, b, a, d
    a, b, c, d, e = e, b, c, d, a
    a, b, c, d, e = b, e, a, c, d
    a, b, c, d, e = a, b, e, d, c
    a, b, c, d, e = b, c, e, a, d
    a, b, c, d, e = e, a, c, b, d
    a, b, c, d, e = a, e, c, b, d
    a, b, c, d, e = b, d, e, c, a
    print(f"c = {c}")

def main(): # << START_OF_TRACE
    execute_repl_trace()
<|frame_sep|>
CWM Output Format (Execution Trace, abbreviated)
<|call_sep|>{}<|action_sep|>def main(): # << START_OF_TRACE<|frame_sep|>
<|call_sep|>{"a":8,"b":4,"c":7,"d":8,"e":7}<|action_sep|>def execute_repl_trace():<|frame_sep|>
<|line_sep|>{"a":7,"b":4,"c":8,"d":8,"e":7}<|action_sep|>a, b, c, d, e = c, e, b, a, d<|frame_sep|>
...
<|line_sep|>{"a":7,"b":7,"c":8,"d":8,"e":4}<|action_sep|>print(f"c = {c}")<|frame_sep|>
<|return_sep|>{"a":7,"b":7,"c":8,"d":8,"e":4}<|action_sep|>print(f"c = {c}")<|arg_sep|>"None"<|frame_sep|>

3.3 CWM Hallucinates Commands

As sequence length increases, CWM's accuracy drops. But why? Inspecting the errors reveals a surprising failure mode: the model hallucinates commands.

Rather than making small mistakes in state updates, CWM often generates an incorrect next command, predicting a swap that wasn't in the original sequence. Once this happens, all subsequent states become wrong because they're computed from a corrupted history (Figure 6).

CWM Baseline

Prompt:
a, b, c = 6, 6, 8
a, b, c = b, c, a
a, b, c = b, a, c
Completion:
<|frame_sep|> {"a": 6, "b": 6, "c": 8}
<|action_sep|> a, b, c = b, c, a
<|frame_sep|> {"a": 6, "b": 8, "c": 6}
<|action_sep|> a, b, c = c, a, b ← hallucinated!
<|frame_sep|> {"a": 6, "b": 6, "c": 8} ❌

CWM + Teacher Forcing

Prompt:
a, b, c = 6, 6, 8
a, b, c = b, c, a
a, b, c = b, a, c
Completion:
<|frame_sep|> {"a": 6, "b": 6, "c": 8}
<|action_sep|> a, b, c = b, c, a
<|frame_sep|> {"a": 6, "b": 8, "c": 6}
<|action_sep|> a, b, c = b, a, c ← forced
<|frame_sep|> {"a": 8, "b": 6, "c": 6} ✓

Figure 6: Action hallucination in CWM. (Left) Baseline: the model hallucinates command 2 (c, a, b instead of b, a, c), producing incorrect state. (Right) Teacher forcing: injecting ground-truth commands at each step yields correct state, demonstrating the model can track state when given correct actions.

3.4 Teacher Forcing: CWM Can Track State

To decouple command generation from state tracking, we use teacher forcing: we feed the model the correct command at each step and ask it only to predict the resulting state. Injecting the ground-truth command after each frame breaks the error chain that hallucinated commands would otherwise create.
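Schematically, the protocol looks like the following sketch (helper names and separator placement are illustrative, not CWM's exact decoding loop; parse_state is a hypothetical parser):

def teacher_forced_accuracy(model, header, commands, final_state):
    """Force ground-truth commands; the model only generates state frames."""
    context, frame = header, None
    for cmd in commands:
        context += f"<|action_sep|>{cmd}<|frame_sep|>"  # inject true command
        frame = model.generate(context, stop="<|action_sep|>")  # state JSON
        context += frame   # the model's own state prediction stays in context
    return parse_state(frame) == final_state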

With teacher forcing, CWM maintains high accuracy even at 128 commands! This mechanism explains the large accuracy gap observed at longer trace lengths: at 64+ commands, baseline accuracy drops to 0% while teacher forcing maintains ~90%.

This tells us that when commands are correct, the model can reliably propagate state over long horizons. The dominant failure in the baseline setting is generating incorrect commands, not an inability to update state.

Key Insight: CWM failures on long sequences are dominated by action hallucination (producing incorrect commands), not state-tracking errors. Teacher forcing isolates state propagation and shows the model can track state reliably when given correct actions.

3.5 Dense Supervision: Mechanism and Trade-offs

The teacher forcing results confirm it: CWM's state-tracking ability comes from dense supervision, not architectural innovation. With reveal spacing = 1, the model never has to track state for more than one step. The explicit state reveals act as "checkpoints" that reset any accumulated error.

Table 2: Supervision density comparison

| Approach | Reveal Spacing | Supervision Density | Token Cost |
|---|---|---|---|
| Standard LLM | ∞ (no reveals) | None | Low |
| Sparse reveals | 4–8 | Low | Medium |
| CWM | 1 | Maximum | High |
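Concretely, reveal spacing k can be simulated by thinning a dense trace so that a state frame survives only after every k-th command, as in this sketch (our formulation):

def thin_trace(commands, states, k):
    """commands[i] produced states[i]; reveal only every k-th state."""
    trace = []
    for i, (cmd, state) in enumerate(zip(commands, states)):
        trace.append(("action", cmd))
        if (i + 1) % k == 0:
            trace.append(("state", state))  # supervision checkpoint
    return trace

# k=1 reproduces CWM's dense format; k=8 leaves 7 consecutive
# commands with no state supervision in between.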

3.6 The Trade-off

Dense supervision works, but it comes at a cost: every operation is followed by a full state snapshot, so traces grow with execution length, inflating token cost at both training and inference time (Table 2) and producing the truncation failures of Section 2.2.

The deeper question: Can we achieve reliable state-tracking with sparser supervision? This is where architecture starts to matter...

3.7 Architecture Matters for Sparse Supervision

CWM's success with dense supervision (reveal spacing = 1) raises a natural question: what happens when we can't afford such dense state reveals? This is where architecture becomes critical.

We trained models from scratch on $S_5$ permutation traces with varying reveal spacings, comparing Transformers against linear RNNs with different eigenvalue ranges (Table 3):

Table 3: Architecture performance under sparse supervision

| Architecture | Reveal Spacing = 1 | Reveal Spacing = 4 | Reveal Spacing = 8 |
|---|---|---|---|
| Transformer | ✅ Good | ⚠️ Degrades | ❌ Fails |
| DeltaNet [0,1] | ✅ Good | ⚠️ Degrades | ❌ Fails |
| DeltaNet [-1,1] | ✅ Good | ✅ Good | ✅ Good |

The key finding: DeltaNet with extended eigenvalues [-1,1] can learn state-tracking even with sparse supervision, and extrapolates to longer sequences than seen during training. Transformers and standard DeltaNet [0,1] both collapse as reveal spacing increases.
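The eigenvalue range matters because negative eigenvalues let a recurrence represent sign flips and period-2 dynamics that nonnegative eigenvalues can only decay toward. A minimal sketch of the parameterization difference for a diagonal linear recurrence (ours, not DeltaNet's full update rule):

import torch

def diagonal_recurrence(x, a_logits, signed=True):
    """x: (T, d) inputs; a_logits: (d,) learned logits."""
    a = torch.sigmoid(a_logits)   # eigenvalues in (0, 1)
    if signed:
        a = 2.0 * a - 1.0         # extended range (-1, 1)
    h = torch.zeros(x.shape[1])
    states = []
    for t in range(x.shape[0]):
        h = a * h + x[t]          # h_t = A h_{t-1} + x_t, A diagonal
        states.append(h)
    return torch.stack(states)

# With an eigenvalue of -1 the state alternates sign every step (a parity-like
# computation); a recurrence confined to [0, 1] cannot express this.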

Takeaway: Architecture determines whether state-tracking generalizes under sparse supervision. CWM succeeds by avoiding this challenge entirely (dense reveals), but more efficient approaches require architectures designed for state-tracking.

4. Interventions

Given the identified challenges, we tested targeted code transformations to expose hidden intermediate states. These interventions aim to restore the dense supervision that CWM relies on.

4.1 Expression Decomposition (CruxEval)

We decomposed nested expressions into sequential assignments with explicit temporary variables, exposing intermediate values. For example, output.append((nums.count(n), n)) becomes three separate lines: _t0 = nums.count(n), _t1 = (_t0, n), and output.append(_t1). A sketch of the transform follows.
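This kind of rewrite can be automated with Python's ast module; the version below is our own illustration (it hoists only calls and tuples from a single statement, whereas a full rewriter handles more node types):

import ast

def decompose(src: str) -> str:
    """Hoist nested calls/tuples into _tN temporaries (single statement)."""
    stmt = ast.parse(src).body[0]
    prelude = []
    n = 0

    def visit(node):
        nonlocal n
        if not isinstance(node, ast.expr):
            return node
        hoist_children(node)
        if isinstance(node, (ast.Call, ast.Tuple)):
            tmp = f"_t{n}"
            n += 1
            prelude.append(ast.Assign(
                targets=[ast.Name(id=tmp, ctx=ast.Store())], value=node))
            return ast.Name(id=tmp, ctx=ast.Load())
        return node

    def hoist_children(node):
        # Recurse first so the innermost expressions are hoisted first.
        for field, value in ast.iter_fields(node):
            if isinstance(value, list):
                setattr(node, field, [visit(v) for v in value])
            elif isinstance(value, ast.expr):
                setattr(node, field, visit(value))

    hoist_children(stmt.value)   # keep the outermost call in place
    module = ast.Module(body=prelude + [stmt], type_ignores=[])
    return ast.unparse(ast.fix_missing_locations(module))

print(decompose("output.append((nums.count(n), n))"))
# _t0 = nums.count(n)
# _t1 = (_t0, n)
# output.append(_t1)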

Result: Of 119 failed samples, 37 were recovered (31% recovery rate), improving accuracy from 85.1% to 89.8% (+4.7pp).

4.2 String Decomposition (CruxEval)

For single-character string operations (e.g., text.index(char)), we converted to explicit character-level loops. This makes each character position visible during trace generation, bypassing BPE tokenization issues.

Result: An additional 5 unique samples recovered, bringing total accuracy to 90.4% (+5.3pp from baseline). However, loop-based decomposition risks token explosion, as 5 samples hit the 32K token limit.

5. Discussion & Conclusion

Transformer-based CWMs achieve strong baseline performance (85% on CruxEval, 91% on HumanEval) by using dense state supervision (full state reveals at every step). This sidesteps the need for sophisticated state-tracking mechanisms: the model never needs to track state across more than one operation. However, this approach has clear limitations: nested expressions hide intermediates, causing super-linear accuracy degradation; string operations break token boundaries unpredictably; and long loops cause trace truncation for O(n²) algorithms or large inputs.

The key insight is that expressivity is not learnability. When supervision becomes sparse—through composition, long loops, or tokenization artifacts—performance degrades predictably. Future code world models will need architectures designed for state-tracking under realistic code complexity.

The Path Forward

  1. Hybrid architectures: Combine transformer's associative recall with linear RNN's state-tracking
  2. Tokenization stability: Design tokenizers that maintain consistent boundaries across minor string edits

References

  1. Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, et al. Genie: Generative Interactive Environments. 2024. arXiv:2402.15391 [cs.LG].
  2. FAIR CodeGen Team, Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, et al. CWM: An Open-Weights LLM for Research on Code Generation with World Models. 2025. arXiv:2510.02387 [cs.SE].
  3. Minsoo Kim, Yeonjoon Jung, Dohyeon Lee, Seung-won Hwang. PLM-based World Models for Text-based Games. EMNLP 2022, pp. 1324–1341.
  4. Wolfgang Lehrach, Daniel Hennes, Miguel Lazaro-Gredilla, Xinghua Lou, Carter Wendelken, Zun Li, Antoine Dedieu, Jordi Grau-Moya, Marc Lanctot, Atil Iscen, et al. Code World Models for General Game Playing. 2025. arXiv:2510.04542.
  5. Nicola Dainese, Matteo Merler, Minttu Alakuijala, Pekka Marttinen. Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search. NeurIPS 2024, vol. 37, pp. 60429–60474.
  6. Hao Tang, Darren Key, Kevin Ellis. WorldCoder: A Model-Based LLM Agent for Building World Models by Writing Code and Interacting with the Environment. NeurIPS 2024, vol. 37, pp. 70148–70212.
  7. William Merrill, Ashish Sabharwal. The Parallelism Tradeoff: Limitations of Log-Precision Transformers. NeurIPS 2023. arXiv:2207.00729.
  8. Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang. Transformers Learn Shortcuts to Automata. ICLR 2023. arXiv:2210.10749.
  9. Antonio Orvieto, Samuel L. Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, Soham De. Resurrecting Recurrent Neural Networks for Long Sequences. ICML 2023. arXiv:2303.06349.
  10. Lifan Yuan, Weize Chen, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, Hao Peng. From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones. 2025. arXiv:2509.25123.
  11. Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang. CruxEval: A Benchmark for Code Reasoning, Understanding and Execution. 2024. arXiv:2401.03065.
  12. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating Large Language Models Trained on Code. 2021. arXiv:2107.03374.


Citation

@article{authors2026statetracking,
  title={Understanding State-Tracking in Linear RNNs for Code Execution},
  author={...},
  journal={ICLR},
  year={2026}
}