About This Research
The Paper
This blog accompanies our paper "Debugging code world models" (arXiv:2602.07672), which studies Code World Models (CWMs) through two complementary perspectives: local semantic execution and long-horizon state tracking.
Key Contributions
- We characterize two dominant failure regimes: token-budget exhaustion and string-valued state brittleness
- We show that long-horizon degradation is dominated by incorrect action generation (action hallucination)
- We show that teacher forcing isolates state propagation and yields strong long-horizon accuracy
- We connect string failures to subword tokenization instability rather than program structure
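
To make the distinction between action generation and state propagation concrete, here is a minimal toy sketch. It is not the evaluation harness from the paper: the integer-state environment, the action names, and every function here (`true_step`, `rollout`, `step_accuracy`, `toy_predict_action`) are hypothetical and chosen only for illustration. The toy model propagates state correctly but hallucinates one action, so a free-running rollout drifts after the bad step while a teacher-forced rollout stays correct.

```python
from typing import Callable, List

Action = str
State = int


def true_step(state: State, action: Action) -> State:
    """Ground-truth transition for a toy integer-state environment."""
    if action == "inc":
        return state + 1
    if action == "dec":
        return state - 1
    if action == "double":
        return state * 2
    raise ValueError(f"unknown action: {action}")


def reference_states(init: State, actions: List[Action]) -> List[State]:
    """States visited when the gold actions are executed exactly."""
    states, state = [], init
    for action in actions:
        state = true_step(state, action)
        states.append(state)
    return states


def rollout(
    init: State,
    gold_actions: List[Action],
    predict_action: Callable[[int, List[Action]], Action],
    predict_state: Callable[[State, Action], State],
    teacher_forcing: bool,
) -> List[State]:
    """Roll the model forward, either on its own actions or on gold actions."""
    state, history, states = init, [], []
    for t, gold in enumerate(gold_actions):
        # Teacher forcing supplies the gold action, so only state
        # propagation is being tested; otherwise the model picks the action.
        action = gold if teacher_forcing else predict_action(t, history)
        state = predict_state(state, action)
        history.append(action)
        states.append(state)
    return states


def step_accuracy(pred: List[State], ref: List[State]) -> float:
    """Fraction of steps whose predicted state matches the reference."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


GOLD = ["inc", "double", "inc", "dec", "double", "inc"]


def toy_predict_action(t: int, history: List[Action]) -> Action:
    """Hypothetical model: correct everywhere except one hallucinated action."""
    return "double" if t == 2 else GOLD[t]


ref = reference_states(1, GOLD)
free = rollout(1, GOLD, toy_predict_action, true_step, teacher_forcing=False)
forced = rollout(1, GOLD, toy_predict_action, true_step, teacher_forcing=True)

free_acc = step_accuracy(free, ref)
forced_acc = step_accuracy(forced, ref)
print(f"free-running accuracy:  {free_acc:.2f}")
print(f"teacher-forced accuracy: {forced_acc:.2f}")
```

In this toy setup a single wrong action at step 2 corrupts every later state in the free-running rollout, even though the state-transition function itself is perfect. Supplying the gold actions removes that error source, which is the sense in which teacher forcing isolates state propagation.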
Resources
Contact
For questions about this research, please contact the authors.