Time (not a science magazine, but…) has a story on the Metaculus forecasting cup (not a scientific experiment, but…), which offers a $5,000 prize to members who successfully predict geopolitical events. In a new development, the human competitors have been losing to an AI:
Forecasters estimate the probabilities of the events occurring—a more informative guess than a simple “yes” or “no”—weeks to months in advance, often with remarkable accuracy. Metaculus users correctly predicted the date of the Russian invasion of Ukraine two weeks in advance and put a 90 percent chance of Roe v. Wade being overturned almost two months before it happened.
Still, one of the top 10 finishers in the Summer Cup, whose winners were announced Wednesday, was surprising even to the forecasters: an AI. “It’s actually kind of mind blowing,” says Toby Shevlane, CEO of Mantic, the recently announced UK-based startup that developed the AI. When the competition opened in June, participants predicted that the top bot’s score would be 40% of the top human performers’ average. Instead, Mantic achieved over 80%.
…
But forecasting broad geopolitical questions is notoriously hard. Forecasts from top forecasters can take days and cost tens of thousands of dollars for a single question. For organizations like RAND, tracking multiple topics across many geopolitical zones, “it would take months to have human forecasters do an initial forecast on all those questions, let alone update them regularly,” said [Forecasting Initiative co-director Anthony] Vassalo.
…
The training that AIs receive is showing up in the rankings. In June, the top-ranked bot, built by Metaculus on top of OpenAI’s o1 reasoning model, came 25th in the cup. This time, Mantic is eighth out of 549 contestants—the first time a bot has placed in the top 10 in the competition series.
The result should be taken with a grain of salt, according to Ben Wilson, an engineer at Metaculus who runs comparisons of AIs and humans on forecasting challenges. The contest contains a relatively small sample of 60 questions. Moreover, most of the 549 contestants are amateurs, some of whom predict only a handful of questions in the tournament, leaving them with a low score.
Finally, the machines have an unfair advantage. Participants win points not only for accuracy but also for “coverage”: how early they make predictions, how many questions they predict on, and how often they update their estimates. An AI that is less accurate than its human competitors can still do well in the rankings by constantly updating its estimates in response to emerging news, in a way that is infeasible for humans.
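To make the coverage point concrete, here is a minimal sketch in Python of a time-averaged scoring scheme of the general kind the article is gesturing at. The day-by-day averaging, the Brier-style accuracy term, and the function names are my own illustrative assumptions, not Metaculus's actual scoring rules; the point is just that a bot that forecasts from day one can outscore a more accurate human who only answers late.

```python
"""Illustrative sketch of coverage-weighted, time-averaged scoring.
Assumed formula for illustration only, not Metaculus's actual rules."""

from dataclasses import dataclass


@dataclass
class Forecast:
    day: int     # day (since the question opened) the probability was submitted
    prob: float  # forecast probability that the event happens


def time_averaged_score(forecasts: list[Forecast], horizon_days: int, outcome: bool) -> float:
    """Average a simple accuracy score over the question's lifetime.

    Days before the first forecast contribute nothing, so forecasting early
    and on every question raises "coverage" even if the probabilities
    themselves never improve.
    """
    y = 1.0 if outcome else 0.0
    forecasts = sorted(forecasts, key=lambda f: f.day)
    total = 0.0
    for day in range(horizon_days):
        # Use the most recent forecast on or before this day, if any.
        active = [f for f in forecasts if f.day <= day]
        if not active:
            continue  # no forecast yet: this day contributes zero
        p = active[-1].prob
        total += 1.0 - (p - y) ** 2  # 1 minus a Brier-type penalty; higher is better
    return total / horizon_days


# A bot posting a mediocre probability on day 0 beats a more accurate human
# who only answers on day 25, because it "covers" almost the whole horizon.
human = [Forecast(day=25, prob=0.9)]  # accurate but late
bot = [Forecast(day=0, prob=0.7)]     # less accurate but immediate
print(time_averaged_score(human, horizon_days=30, outcome=True))  # ~0.17
print(time_averaged_score(bot, horizon_days=30, outcome=True))    # ~0.91
```

Under this toy scheme the less accurate bot wins simply by being present from the start and never leaving a question blank, which is the "unfair advantage" Wilson is describing.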