Something for your inner 12-year-old to feel superior about, as published in Ars Technica. Anthropic’s Claude AI, pretty helpful at summarizing documents and parsing transcripts, can’t play at a college-graduate level (or even an elementary-school level) when it comes to advancing in a good old Game Boy Pokémon game:
Last month, Anthropic presented its “Claude Plays Pokémon” experiment as a waypoint on the road to that predicted AGI future. It’s a project the company said shows “glimmers of AI systems that tackle challenges with increasing competence, not just through training but with generalized reasoning.” Anthropic made headlines by trumpeting how Claude 3.7 Sonnet’s “improved reasoning capabilities” let the company’s latest model make progress in the popular old-school Game Boy RPG in ways “that older models had little hope of achieving.”
While Claude models from just a year ago struggled even to leave the game’s opening area, Claude 3.7 Sonnet was able to make progress by collecting multiple in-game Gym Badges in a relatively small number of in-game actions. That breakthrough, Anthropic wrote, was because the “extended thinking” by Claude 3.7 Sonnet means the new model “plans ahead, remembers its objectives, and adapts when initial strategies fail” in a way that its predecessors didn’t. Those things, Anthropic brags, are “critical skills for battling pixelated gym leaders. And, we posit, in solving real-world problems too.”
…
In the weeks since Claude Plays Pokémon was first made public, thousands of Twitch viewers have watched Claude struggle to make consistent progress in the game. Despite long “thinking” pauses between each move—during which viewers can read printouts of the system’s simulated reasoning process—Claude frequently finds itself pointlessly revisiting completed towns, getting stuck in blind corners of the map for extended periods, or fruitlessly talking to the same unhelpful NPC over and over, to cite just a few examples of distinctly sub-human in-game performance.
…
For Claude Plays Pokémon, though, project developer and Anthropic employee David Hershey says he started with an unmodified, generalized Claude model that wasn’t specifically trained or tuned to play Pokémon games in any way.
“This is purely the various other things that [Claude] understands about the world being used to point at video games,” Hershey told Ars. “So it has a sense of a Pokémon. If you go to claude.ai and ask about Pokémon, it knows what Pokémon is based on what it’s read… If you ask, it’ll tell you there’s eight gym badges, it’ll tell you the first one is Brock… it knows the broad structure.”
In addition to directly monitoring certain key (emulated) Game Boy RAM addresses for game state information, Claude views and interprets the game’s visual output much like a human would. But despite recent advances in AI image processing, Hershey said Claude still struggles to interpret the low-resolution, pixelated world of a Game Boy screenshot as well as a human can. “Claude’s still not particularly good at understanding what’s on the screen at all,” he said. “You will see it attempt to walk into walls all the time.”
Hershey said he suspects Claude’s training data probably doesn’t contain many overly detailed text descriptions of “stuff that looks like a Game Boy screen.” This means that, somewhat surprisingly, if Claude were playing a game with “more realistic imagery, I think Claude would actually be able to see a lot better,” Hershey said.
…
The current model has a “context window” of 200,000 tokens, limiting the amount of relational information it can store in its “memory” at any one time. When the system’s ever-expanding knowledge base fills up this context window, Claude goes through an elaborate summarization process, condensing detailed notes on what it has seen, done, and learned so far into shorter text summaries that lose some of the fine-grained details.
This can mean that Claude “has a hard time keeping track of things for a very long time and really having a great sense of what it’s tried so far,” Hershey said.
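To make that summarize-when-full behavior concrete, here is a toy Python sketch of the general idea. None of it comes from Anthropic or the Ars piece: `NoteMemory`, `count_tokens`, and `truncate_summary` are made-up names, the word-count "tokenizer" is a crude stand-in, and the summarizer just chops each note short, where a real system would presumably ask the model to write the summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def truncate_summary(notes: List[str], max_words_per_note: int = 5) -> str:
    """Hypothetical placeholder summarizer: keep only the first few words of each note.

    It exists only to make the loss of fine-grained detail visible.
    """
    return " | ".join(" ".join(n.split()[:max_words_per_note]) for n in notes)


@dataclass
class NoteMemory:
    """Toy note store with a fixed token budget (an assumption, not Claude's design)."""

    budget: int
    summarize: Callable[[List[str]], str] = truncate_summary
    notes: List[str] = field(default_factory=list)

    def used(self) -> int:
        """Total tokens currently held across all notes."""
        return sum(count_tokens(n) for n in self.notes)

    def add(self, note: str) -> None:
        """Store a note; if the budget is exceeded, compress all older notes."""
        self.notes.append(note)
        if self.used() > self.budget and len(self.notes) > 1:
            # Collapse everything except the newest note into one lossy summary.
            older, newest = self.notes[:-1], self.notes[-1]
            self.notes = [self.summarize(older), newest]


memory = NoteMemory(budget=20)
memory.add("Talked to the NPC in the northwest house and learned nothing useful at all")
memory.add("Walked into the wall south of the gym three times before finding the door")
print(memory.notes[0])  # the first note, now cut down to its first five words
```

After the second note pushes the store over budget, the first note survives only as "Talked to the NPC in". The part about learning nothing useful is gone, which is roughly how you end up talking to the same unhelpful NPC again.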