If you’re a longtime Scientific American reader, or just a geek of a certain age, you’ll remember Martin Gardner’s “Mathematical Games” puzzle column. (Douglas Hofstadter eventually took it over and, via anagram, retitled it “Metamagical Themas.”) For the most part, the puzzles are pretty funny … if you’ve got a sense of humor that can find something to giggle about in a Fibonacci sequence, say. ChatGPT, on the other hand, passes IQ tests with flying colors and can even make up jokes on command. But as the magazine has now discovered, the AI absolutely fails at solving puzzles Gardner wrote, or new ones posed in his quirky, human style:
The third puzzle I tried came from a March 1964 Gardner column on prime numbers: “Using each of the nine digits once, and only once, form a set of three primes that have the lowest possible sum. For example, the set 941, 827 and 653 sum to 2,421, but this is far from minimal.”
A prime is a whole number greater than 1 that cannot be evenly divided by any number besides 1 and itself. It’s relatively easy to check small candidates, such as 3, 5, 7 and 11. But the larger a number gets, the more difficult it becomes to determine whether it is prime or composite.
Gardner offered a particularly elegant solution the following month: “How can the nine digits be arranged to make three primes with the lowest possible sum? We first try numbers of three digits each. The end digits must be 1, 3, 7 or 9 (this is true of all primes greater than 5). We choose the last three, freeing 1 for a first digit. The lowest possible first digits of each number are 1, 2 and 4, which leaves 5, 6 and 8 for the middle digits. Among the 11 three-digit primes that fit these specifications it is not possible to find three that do not duplicate a digit. We turn next to first digits of 1, 2 and 5. This yields the unique answer 149 + 263 + 587 = 999.”
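Gardner’s reasoning is easy to double-check by brute force. The sketch below is my own illustration, not anything from Gardner or the magazine: it tries every way of arranging the digits 1 through 9 into three three-digit numbers and keeps the all-prime arrangement with the smallest sum. (Any grouping that produces a four-digit number already sums past 1,000, so only the three-by-three shape matters.)

```python
from itertools import permutations

def is_prime(n: int) -> bool:
    """Trial division -- more than enough for three-digit numbers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

best_sum, best_set = None, None
for perm in permutations("123456789"):
    # Split the permutation into three 3-digit numbers.
    trio = [int("".join(perm[i:i + 3])) for i in (0, 3, 6)]
    if all(is_prime(n) for n in trio):
        total = sum(trio)
        if best_sum is None or total < best_sum:
            best_sum, best_set = total, sorted(trio)

print(best_set, best_sum)  # Gardner's answer: [149, 263, 587], summing to 999
```

The whole search covers just 9! = 362,880 permutations, a trivial workload for any computer.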
I was genuinely impressed by the AI’s first answer: 257, 683 and 941—all primes, representing all nine digits and summing to 1,881. This is a respectably low total, even though it’s higher than Gardner’s solution. But unfortunately, when I asked ChatGPT to explain its work, it offered a verbose path to a different solution: the numbers 109, 1,031 and 683—all primes, but a poor fit for the prompt’s other requirements.
Upon being reminded of its initial answer, ChatGPT offered a daft explanation, including a claim that “we cannot use 1, 4, or 6 as the first digit of a three-digit prime, as the resulting numbers would be divisible by 3.” This is patently false: a number is divisible by 3 only when its digits sum to a multiple of 3, so the leading digit by itself tells you nothing.
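The rule is trivial to sanity-check. A throwaway snippet (my own illustration, not the article’s) shows that 149 and 467, primes that begin with the supposedly forbidden digits 1 and 4, have digit sums that are not multiples of 3, while 423 (digit sum 9, and indeed 3 × 141) is flagged right away:

```python
# Divisibility by 3 depends on the digit sum, not the leading digit.
# 149 and 467 are primes starting with 1 and 4; 423 = 3 * 141 is composite.
for n in (149, 467, 423):
    digit_sum = sum(int(d) for d in str(n))
    print(f"{n}: digit sum {digit_sum}, divisible by 3: {digit_sum % 3 == 0}")
```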
I attempted a pep talk, noting that there was a better solution and suggesting ChatGPT imagine it was a math professor, but it next offered 2, 3 and 749, which leaves four of the nine digits unused. It then stumbled to 359, 467 and 821—another valid trio of primes, totaling 1,647—better than its first solution but still not as elegant as Gardner’s.
Alas, it was the best I would get. Six more answers were riddled with nonprime numbers and missing or excess digits. And then ChatGPT once again offered 257, 683 and 941.
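To make the scorecard concrete, here is a small checker, again my sketch rather than the magazine’s, that tests each of ChatGPT’s candidate answers against the puzzle’s actual constraints: every number must be prime, and the digits 1 through 9 must each appear exactly once across the set.

```python
def is_prime(n: int) -> bool:
    """Trial division; every candidate here is small."""
    return n > 1 and all(n % f for f in range(2, int(n ** 0.5) + 1))

def check(candidate):
    """Report primality, digit coverage and the sum for one proposed answer."""
    digits = "".join(str(n) for n in candidate)
    return (all(is_prime(n) for n in candidate),   # all prime?
            sorted(digits) == list("123456789"),   # each of 1-9 exactly once?
            sum(candidate))

for answer in ([257, 683, 941], [109, 1031, 683], [2, 3, 749], [359, 467, 821]):
    print(answer, check(answer))
# [257, 683, 941]  -> all prime, all nine digits, sum 1881
# [109, 1031, 683] -> all prime, but a zero and repeated digits, sum 1823
# [2, 3, 749]      -> all prime, but four digits never used, sum 754
# [359, 467, 821]  -> all prime, all nine digits, sum 1647
```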
All these failures reflect what [University of Michigan computer scientist Kentaro] Toyama says is a key property of these sorts of AI systems. “ChatGPT excels at the humanlike,” he says. “It’s mastered the style of being linguistically human, but it doesn’t have explicit programming to do exactly the things that computers have so far been very good at, which is very recipelike, deductive logic.” It isn’t solving the problem, or necessarily even trying to—it’s just showing approximately what a solution might look like.
Throughout the attempts, I was also struck that nothing seemed to fluster the AI. But Toyama says that’s also a reflection of ChatGPT’s creation and the material it was fed. “The vast majority of the data it was trained on, you could imagine the average tone of all of that text—probably that average tone is quite confident,” he says.