The spiky mathematical intelligence of LLMs
This is my first post as an Asterisk AI Fellow.
It was big news when OpenAI and Google DeepMind turned in gold medal performances at the International Mathematical Olympiad. What stood out was how human-like those performances were. Both models solved the five easy-to-medium problems and, like almost every human contestant, failed on the one “brutal” problem. As a result, the models tied for 27th place, alongside 46 of the world’s best high school math students, all of whom followed the same pattern: nearly full points on the first five problems, no more than a single point on the brutal sixth. (Although the students came in at far lower inference cost.)
I’m a mathematics professor, and I’ve spent the past few years trying to use LLMs for mathematics. What stands out to me is not a steady, human-like profile in which harder problems are simply harder, but sharp peaks and valleys. Even the consumer versions of these models can solve some problems that take humans years of study to approach. And yet, those same models frequently stumble on basic logic.
This sort of performance is completely unlike what I see in my students. Human ability is also uneven, of course, but it follows a more predictable shape. Once a student can handle advanced material, they rarely trip over the basics. In fact, the only way to reach advanced mathematics is by internalizing the logical foundations so deeply that simple missteps become almost impossible. They just feel wrong. You can forget a theorem or misremember a formula, but outside of deliberate trolling you aren’t going to mess up basic logical principles.
A failure
To illustrate, I’m going to pick on ChatGPT-5-Instant. Yes, the Thinking and Pro versions are much stronger at mathematics, and I’ll talk about them later. All of these models have spiky mathematical intelligence, but 5-Instant lets me demonstrate this with a very accessible problem.
The problem I’ll use isn’t obscure. In fact, it’s been on math.stackexchange for over a decade, with a correct answer posted 8 minutes after it was asked. I assign it to freshmen in introduction-to-proofs courses:
Prove that if n+1 integers are chosen from the set {1, 2, ..., 2n+1}, then at least two of the numbers will be relatively prime.
Integers are relatively prime if they don’t share a divisor other than ±1, so if n=2, the problem says that no matter how you select 3 integers from the set {1,2,3,4,5}, at least two of them will not share a divisor other than ±1. In this special case, there are 10 ways to select the integers, and you could imagine checking them all. Pick {1,2,4}, and 1 and 2 are relatively prime. Choose {2,4,5}, and 2 and 5 are relatively prime (as are 4 and 5). Of course, we don’t want to check every possible choice; we want a proof that works for all values of n.
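Brute force is no substitute for a proof, but if you’d like to watch the claim hold for small values of n, here is a short Python check (my own sanity check, not part of the original problem) that enumerates every possible selection and confirms each one contains a relatively prime pair:

```python
from itertools import combinations
from math import comb, gcd

def has_coprime_pair(nums):
    """True if at least two elements of nums are relatively prime."""
    return any(gcd(a, b) == 1 for a, b in combinations(nums, 2))

# Exhaustively check the claim for a few small n: every choice of n+1
# integers from {1, ..., 2n+1} should contain a relatively prime pair.
for n in range(1, 9):
    universe = range(1, 2 * n + 2)
    assert all(has_coprime_pair(s) for s in combinations(universe, n + 1))
    print(f"n = {n}: all {comb(2 * n + 1, n + 1)} selections contain a relatively prime pair")
```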
If you’d like to try proving it yourself before I spoil the solution, you can get feedback on your proof at hallmos.com. Full disclosure: I’m one of the people behind that site. I promise the model there will be kinder to your efforts than I’m about to be to ChatGPT’s.
It’s striking how uniform 5-Instant’s attempts are. The key idea, which the model always seems to get, is that consecutive integers are guaranteed to be relatively prime. So if you ever pick {1,2} or {4,5}, you’re done. From there, it almost always splits the 2n+1 numbers up into the same n+1 groups:
{1,2}, {3,4}, {5,6}, ..., {2n-1, 2n}, {2n+1}.
Then it goes for the pigeonhole principle. This principle says that if you stuff n+1 pigeons into only n holes, then at least one hole gets two (or even more) pigeons. Unfortunately, we can only guarantee this collision of pigeons when we have more pigeons than holes, and 5-Instant has made exactly as many holes as pigeons. There is still something we can say, though: with n+1 pigeons in n+1 holes, either we end up with two pigeons in the same hole, or we end up with one pigeon in every hole. (For anyone worried about the metaphorical bird abuse: “pigeonholes” are just boxes. It should be “objects in pigeonholes”, but I can’t resist the “pigeons in holes” version for the wtf reaction it elicits.)
5-Instant is remarkably consistent in its proof attempts up to this point, but from here it tends to take one of two wrong turns.
First, it often just forgets about 2n+1 entirely. Here’s such an argument: “There are n pairs plus one leftover {2n+1}. We are choosing n+1 numbers. By the pigeonhole principle, if you choose n+1 numbers from these n pairs, at least one pair must contribute two numbers.” It’s almost as if, by labeling 2n+1 a “leftover”, the model decides it can stick it in the fridge and forget about it.
Second, it sometimes spots the problem, writing something like “by the pigeonhole principle, at least one pair must be chosen completely, or the leftover 2n+1 is chosen together with something else”, but then it argues something like “but 2n+1 is odd and hence relatively prime to any even number”. You might notice this is, technically speaking, bullshit: 9 is odd but is not at all relatively prime to 6. The model, however, delivers its claim with the calm authority of a cursed textbook from the department of miseducation.
Perhaps the most frustrating part is that Instant is so close. There are many correct ways to finish, but in dozens of tries I’ve never seen 5-Instant manage the landing. You’re welcome to let your eyes glaze over the following details; the point is not the mechanics, but how little is left. Here’s one way to finish. Either you choose both numbers from one of the n pairs, in which case you are done because those two numbers are relatively prime, or you choose exactly one number from each pair plus the leftover 2n+1, in which case you are also done: your choice from the pair {1,2} is either 1 or 2, and 2n+1 is relatively prime to both. An even cleaner proof follows from taking the initial sets to be {1}, {2,3}, {4,5}, ..., {2n,2n+1}, since 1 is relatively prime to every integer, but I’ve never seen ChatGPT give that version either.
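For what it’s worth, the case analysis really is that small. Here’s a Python illustration (mine, not the model’s) that classifies every selection for small n into the two cases above and confirms each case delivers a relatively prime pair:

```python
from itertools import combinations
from math import gcd

def finish_case(selection, n):
    """Which case of the finish does this (n+1)-element selection fall into?"""
    chosen = set(selection)
    pairs = [(2 * k - 1, 2 * k) for k in range(1, n + 1)]
    # Case 1: both members of some consecutive pair {2k-1, 2k} were chosen.
    if any(a in chosen and b in chosen for a, b in pairs):
        return "two consecutive numbers"
    # Case 2: each pair contributes exactly one number, so the leftover
    # 2n+1 is forced in, alongside either 1 or 2 from the first pair.
    assert 2 * n + 1 in chosen
    assert all((a in chosen) != (b in chosen) for a, b in pairs)
    return "2n+1 plus one number from each pair"

for n in range(1, 7):
    for selection in combinations(range(1, 2 * n + 2), n + 1):
        finish_case(selection, n)
        # Either way, the selection really does contain a relatively prime pair.
        assert any(gcd(a, b) == 1 for a, b in combinations(selection, 2))
    print(f"n = {n}: both cases verified for every selection")
```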
It’s true that 5-Thinking does better on this problem. But the way it does so is bizarre. From what I can tell from the thinking summaries in my experiments, it starts by writing one of the same flawed proofs that 5-Instant writes. Then it double-checks, spots the flaw, tries and fails to patch it, scraps the attempt completely, starts over with a different approach, and, usually over the course of a couple minutes, succeeds in giving a correct proof. This amount of pure grit is definitely impressive, in the same way tunneling out of prison with a spoon is impressive, but it’s so alien. Human mathematicians don’t usually solve freshman problems this way. I would expect a human to sense that they are close, and only abandon the first approach if they felt it couldn’t work.
A success
Compare that failure of basic logic to what 5-Instant can do. Ask it this problem:
Compute with proof the natural density of primes of the form x² + 3y² for integers x and y.
This is not the sort of problem I’d put on a freshman exam. Solving it requires years of coursework. And yet, 5-Instant produces a solution outline that hits all the right notes: quadratic residues, splitting of primes in number fields, Eisenstein integers, Dirichlet’s theorem.
I won’t bore you with all the details; the point is that the model doesn’t just recall the right buzzwords, it puts them together in the right order to make a coherent and correct argument. It identifies the relevant number field (the rationals adjoined with the square root of -3), classifies how primes behave in it, invokes Dirichlet’s theorem on primes in arithmetic progressions, and arrives at the correct answer: half of all primes are of the form x² + 3y².
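That answer is also easy to sanity-check numerically. Here’s a quick Python experiment (again my addition, not the model’s output) that counts how many primes up to a bound can be written as x² + 3y²; the fraction sits very close to one half:

```python
from math import isqrt

def primes_up_to(limit):
    """Sieve of Eratosthenes."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[:2] = b"\x00\x00"
    for p in range(2, isqrt(limit) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(sieve[p * p :: p]))
    return [p for p in range(limit + 1) if sieve[p]]

def is_x2_plus_3y2(p):
    """Brute force: can p be written as x^2 + 3*y^2 with integers x and y?"""
    for y in range(isqrt(p // 3) + 1):
        rest = p - 3 * y * y
        x = isqrt(rest)
        if x * x == rest:
            return True
    return False

primes = primes_up_to(200_000)
hits = sum(is_x2_plus_3y2(p) for p in primes)
print(f"{hits} of {len(primes)} primes up to 200000 have the form x^2 + 3y^2")
print(f"fraction = {hits / len(primes):.4f}  (the density is 1/2)")
```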
Proofs versus proof-shaped filler
These are two examples of why I call LLM mathematical ability spiky and alien. I can’t imagine that any human who had sat through enough math classes to learn the algebraic number theory needed to classify primes in ℚ(√-3) would also confidently turn in a proof saying that “any odd number is relatively prime to any even number.” People just don’t fail in that way, but LLMs do.
In teaching proofs, one of the hardest but most important lessons is that the sentences you write actually mean things; they’re not just proof-shaped filler. Before students have internalized this, they often go through a phase where they can mimic the vocabulary and structure of mathematical arguments while losing track of what their logical moves accomplish. The breakthrough comes when these students develop an intuitive sense for when an argument feels right or wrong; when they start to notice that things just don’t feel right, even if they can’t immediately articulate why or how to fix it.
LLMs seem to lack this sense entirely. They can master the high-level architecture of mathematical reasoning, deploying advanced concepts with impressive fluency. But they appear to operate without the basic sanity-checking that humans develop: the nagging feeling that you’ve said something that doesn’t actually make sense. It’s as if they’ve learned to write mathematically sophisticated sentences without ever learning to listen to whether those sentences are true.


