Long-Context Policy Problem: Explained with the Game of Pokémon
Why Vision-Language-Action models struggle with tasks requiring historical dependencies, and what the 'Gemini plays Pokémon' experiment teaches us about robotics.
Eashan Vytla
On February 15th, 2024, Google released Gemini 1.5, a model with a 1-million-token context window. Although how well Gemini actually handles such a large context is up for debate, the capability is remarkable and pushed many excited engineers to start building. One of those engineers, Joel Zhang, put Gemini to the real test: Pokémon.
In Joel’s experiment, an emulator feeds the game’s screen and internal state to Gemini, and Gemini outputs the next button press. This viral project showcases behaviors that are critical to using LLM-based systems for robotics problems. By bridging the gap between Pokémon and robotics, I aim to highlight a fundamental limitation: Vision-Language-Action (VLA) models struggle with tasks requiring reasoning about historical dependencies because they are trained and deployed under the Markov assumption, the idea that the future depends only on the present. However, many real-world robotics tasks are not Markovian: the correct action heavily depends on the history of states. By connecting VLAs to the “Gemini Plays Pokémon” experiment, I aim to explain why we cannot simply append the entire environment history to the model’s context to solve this bottleneck. This issue is actively being explored in the robot learning field and is known as the long-context policy problem.
What are VLAs?
Vision-Language-Action (VLA) models represent a significant frontier in embodied AI. Recent models like Physical Intelligence's PI-0 demonstrate impressive capabilities on complex, long-horizon tasks.
A VLA is a multi-modal, end-to-end, transformer-based policy network. Its primary function is to learn a direct mapping from proprioceptive, visual, and text inputs to joint deltas (action commands). These actions are effectively a time-conditioned trajectory that a traditional inverse-kinematics solver and PID controller can follow for the next h timesteps, as shown in the diagram below.
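To make this interface concrete, here is a minimal sketch in Python of a Markov VLA policy step. The class, shapes, and names are illustrative assumptions rather than PI-0's actual API; the key point is that the policy consumes only the current observation and emits a chunk of future actions.

```python
# Minimal, hypothetical sketch of the Markov VLA interface described above.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    image: np.ndarray            # (H, W, 3) camera frame
    proprioception: np.ndarray   # (num_joints,) current joint positions
    instruction: str             # natural-language task description

class VLAPolicy:
    """Maps a single observation to a chunk of future actions (Markov interface)."""

    def __init__(self, num_joints: int, horizon: int):
        self.num_joints = num_joints
        self.horizon = horizon  # h: number of timesteps per predicted action chunk

    def act(self, obs: Observation) -> np.ndarray:
        # Stand-in for the transformer forward pass. Note that only the *current*
        # observation is consumed; no history is available to the model.
        return np.zeros((self.horizon, self.num_joints))  # (h, num_joints) joint deltas

# The (h, num_joints) trajectory would then be tracked by a low-level controller.
policy = VLAPolicy(num_joints=7, horizon=50)
action_chunk = policy.act(Observation(
    image=np.zeros((224, 224, 3)),
    proprioception=np.zeros(7),
    instruction="pick up the red block",
))
```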
However, a critical flaw underpins this architectural paradigm. Despite their advanced abilities, VLA models like PI-0 are limited by their reliance on the Markov assumption, which posits that the future is independent of the past given the present state. This assumption is a severe bottleneck that hinders the models’ ability to solve a vast range of real-world robotic problems.
The common practice for addressing this problem in reinforcement learning is to condition the action on a history of states rather than just the current state. To do this, we need the model to parameterize the distribution shown below.
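In standard notation (my own formulation rather than an equation from the original post), the Markov policy and its history-conditioned counterpart can be written as:

```latex
% Markov policy: the action depends only on the current observation
\pi_\theta(a_t \mid o_t)

% History-conditioned policy: the action depends on the last k observations
% (or, in the limit, the entire history o_{1:t})
\pi_\theta(a_t \mid o_{t-k:t})
```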
To accomplish this, we can append the previous states to the context window of an LLM. Implementing this solution in Pokémon is a simple yet effective test of how an AI agent can interact with an environment the way a robot would. Because VLAs use a VLM backbone, there are many parallels between the two settings. Specifically, I will be diving into the various issues and augmentations discussed in “The Making of Gemini Plays Pokémon,” written by Joel Zhang.
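As a minimal sketch of what this naive history-append approach looks like (the helper names and prompt format are hypothetical, and `query_llm` is a stub standing in for a real LLM API such as Gemini):

```python
# Naive history append: serialize every past state and action into the prompt.
import json

def query_llm(prompt: str) -> str:
    return "A"  # stand-in for a real call to Gemini or another LLM

history: list[dict] = []  # grows without bound as the episode continues

def build_prompt(history: list[dict], current_state: dict) -> str:
    return (
        "You are playing Pokémon. Choose the next button press.\n"
        f"Past states and actions:\n{json.dumps(history)}\n"
        f"Current state:\n{json.dumps(current_state)}\n"
        "Next button:"
    )

def step(current_state: dict) -> str:
    action = query_llm(build_prompt(history, current_state))
    history.append({"state": current_state, "action": action})
    return action
```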
I believe all of these issues are relevant to a robotic system, so I recommend checking out his blog post. However, the components I am most interested in are the Summary System, the Goals System, and Map Memory.
Summary System
As Joel describes in his blog, his initial plan was to capitalize on the long context window by appending all state observations and the full game history to the prompt. However, as the prompt size increased, the model fell into a pattern-repeating mode that prevented novel decision-making and exploration. In the discussion, Joel observes that clearing the context pulls the model out of this repetitive loop, but doing so also makes the system forget key elements of the history. To solve this, he adds a summary system that summarizes the history after a certain number of states.
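A minimal sketch of this summarize-and-clear pattern (my assumed behavior, not Joel's exact implementation; `query_llm` is again a stub for the model call):

```python
# Periodically compress the recent history into a running summary and clear it,
# so the prompt stays short without losing key events. Illustrative only.
import json

SUMMARY_INTERVAL = 100  # summarize after this many new states (illustrative value)

def query_llm(prompt: str) -> str:
    return "A"  # stand-in for a real LLM call

def summarize(summary_so_far: str, recent: list[dict]) -> str:
    prompt = (
        "Summarize the gameplay so far, preserving important events and items.\n"
        f"Previous summary: {summary_so_far}\n"
        f"Recent states and actions: {json.dumps(recent)}"
    )
    return query_llm(prompt)

def step(current_state: dict, summary: str, recent: list[dict]):
    prompt = (
        f"Summary of earlier gameplay: {summary}\n"
        f"Recent states and actions: {json.dumps(recent)}\n"
        f"Current state: {json.dumps(current_state)}\n"
        "Next button:"
    )
    action = query_llm(prompt)
    recent.append({"state": current_state, "action": action})
    if len(recent) >= SUMMARY_INTERVAL:
        summary = summarize(summary, recent)  # compress the recent window...
        recent = []                           # ...and drop it from the prompt
    return action, summary, recent
```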
Because VLA systems are LLM-backed, once we graduate from pick-and-place challenges, we can reasonably expect to see these same issues. Thus, if we simply append past state observations and environment history to the prompt of our VLA, we will likely fall into a repetitive loop of actions. Although images of environment history could produce a rich VLM-based summary, it is unclear to me how rich a summary can be for low-level robotic observations and actions. This may be a critical failure point that needs to be properly benchmarked and addressed.
Goals System
Because the summary system requires a full wipe of the game history and observation states, the agent often forgets intermediate goals. Joel solves this by extending the summarization system’s prompt so that it also outputs a hierarchy of current goals: a primary goal, a secondary goal, and a tertiary goal.
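One way to picture this (an illustrative structure, not Joel's exact prompt) is a small goal hierarchy that every summarization pass is asked to restate, so the goals survive each context wipe:

```python
# Illustrative goal hierarchy carried alongside the running summary.
from dataclasses import dataclass

@dataclass
class GoalHierarchy:
    primary: str    # long-horizon objective, e.g. "Earn the Boulder Badge"
    secondary: str  # intermediate objective, e.g. "Reach Pewter City"
    tertiary: str   # immediate objective, e.g. "Exit Viridian Forest"

def summarization_prompt(summary_so_far: str, goals: GoalHierarchy) -> str:
    # Each summarization pass is asked to re-emit the goals so they survive context wipes.
    return (
        f"Previous summary: {summary_so_far}\n"
        f"Current goals -> primary: {goals.primary}; "
        f"secondary: {goals.secondary}; tertiary: {goals.tertiary}\n"
        "Update the summary and restate the primary, secondary, and tertiary goals."
    )
```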
This system is familiar: the PI-0 follow-up (PI-0.5) uses a similar “goal head” during model training. The goal head is incorporated into the model’s training objective by scoring how well it infers the currently attempted subtask from the visual input. This allows Physical Intelligence to achieve long-horizon task planning even though its system is Markovian and does not maintain a history context. Although the settings differ slightly, the example shows how closely the two problems are related and supports the extrapolations in this discussion.
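As a loose illustration of this kind of auxiliary objective (not Physical Intelligence's actual recipe or architecture), a subtask-prediction head can be trained alongside the usual action objective:

```python
# Hedged sketch of an auxiliary subtask-prediction objective (illustrative only).
import torch
import torch.nn as nn

class SubtaskHead(nn.Module):
    """Predicts which subtask is currently being attempted from visual features."""

    def __init__(self, feature_dim: int, num_subtasks: int):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_subtasks)

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        return self.classifier(visual_features)  # (batch, num_subtasks) logits

# The subtask loss is added to the action-prediction loss, so the backbone is
# pushed to encode "what step of the task am I on".
head = SubtaskHead(feature_dim=512, num_subtasks=20)
visual_features = torch.randn(8, 512)        # stand-in for backbone features
subtask_labels = torch.randint(0, 20, (8,))  # annotated subtask for each sample
subtask_loss = nn.functional.cross_entropy(head(visual_features), subtask_labels)
# total_loss = action_loss + subtask_weight * subtask_loss  (action loss omitted here)
```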
Map Memory
The most relevant component of the Gemini-Pokémon agent system for robotics is the map memory. The problem is that even though observation and environment history are provided, Gemini is unable to maintain a memory grounded in spatial relationships. Joel adds this capability with another augmentation that keeps track of a fog-of-war-style mini-map, serialized into a JSON string and appended to the prompt.
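A minimal sketch of such a map memory (an illustrative representation, not Joel's exact schema):

```python
# Fog-of-war map memory serialized to JSON for inclusion in the prompt.
import json

class MapMemory:
    """Remembers every tile the agent has seen, keyed by (x, y) position."""

    def __init__(self):
        self.tiles: dict[tuple[int, int], str] = {}  # e.g. "grass", "wall", "door"

    def observe(self, x: int, y: int, tile_type: str) -> None:
        self.tiles[(x, y)] = tile_type  # newly seen tiles lift the fog of war

    def to_json(self) -> str:
        # JSON keys must be strings, so encode coordinates as "x,y"
        return json.dumps({f"{x},{y}": t for (x, y), t in self.tiles.items()})

memory = MapMemory()
memory.observe(10, 4, "grass")
memory.observe(11, 4, "wall")
prompt_fragment = f"Known map so far: {memory.to_json()}"
```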
VLAs face a similar challenge in maintaining spatial memory. For example, suppose I need a robot to complete a task that requires an object from the garage, but the robot is in the kitchen. The agent must first use its memory of previously explored areas to navigate. If the object’s location is unknown, it must explore logical areas, remembering this new spatial information for future recall. This is the crux of the long-context policy issue! Many researchers are attempting to solve this issue through various approaches, which I have laid out in the next section.
Research Direction
There are several directions currently being explored in the field to address this issue:
- Building augmentations with structured memory, such as semantic scene graphs. This approach aims to give the model an explicit, queryable database of its environment, moving beyond simple observation history to a more robust and relational understanding of the world.
- Developing Meta-RL approaches that use reinforcement learning to train the agent on "how to think" about its history. The goal is to have the model learn an information bottleneck for memory management and utilization rather than being hand-designed.
- Modifying training recipes, as demonstrated in the paper "Learning Long-Context Diffusion Policies via Past-Token Prediction." These methods alter the model's training process to better incentivize attending to and utilizing long-term historical information (a rough sketch follows this list).
- Comparing the architectural performance of models like Transformers against newer architectures like State Space Models (SSMs). This research explores whether different foundational model designs are inherently better at efficiently compressing and recalling information from long historical contexts.
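As a rough sketch of the past-token-prediction idea referenced above (my loose reading of the general approach, not the cited paper's method, and using a simple recurrent encoder rather than a diffusion policy), an auxiliary head can be trained to reconstruct earlier actions from the encoded history, pushing the encoder to actually retain that history:

```python
# Loose illustration of a past-token-prediction style auxiliary loss.
import torch
import torch.nn as nn

class HistoryPolicy(nn.Module):
    """Encodes a history of observations and predicts both future and past actions."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden, batch_first=True)
        self.future_head = nn.Linear(hidden, act_dim)  # next action (the real objective)
        self.past_head = nn.Linear(hidden, act_dim)    # reconstruct an earlier action

    def forward(self, obs_history: torch.Tensor):
        _, h = self.encoder(obs_history)  # final hidden state: (1, batch, hidden)
        h = h.squeeze(0)
        return self.future_head(h), self.past_head(h)

policy = HistoryPolicy(obs_dim=16, act_dim=4)
obs_history = torch.randn(8, 32, 16)  # batch of 32-step observation histories
future_action = torch.randn(8, 4)     # ground-truth next action
past_action = torch.randn(8, 4)       # ground-truth action from earlier in the history
pred_future, pred_past = policy(obs_history)
loss = nn.functional.mse_loss(pred_future, future_action) \
     + 0.5 * nn.functional.mse_loss(pred_past, past_action)  # auxiliary past prediction
```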
I hope this journey through Pokémon provided a fun and comprehensive look at the long-context policy problem in robotics. I'm always looking to improve my ability to communicate complex topics, so any feedback on the writing itself is greatly appreciated.
Thanks for reading. If you're also working on or thinking about these problems, I'd love to connect—feel free to reach out at vytla [dot] 4 [at] osu [dot] edu!