In brief
- Frontier AI models blew up betting on real-world football markets.
- They knew the right strategy, but failed to execute it.
- A simple 1990s model beat most of them.
General Reasoning just gave frontier AI its worst report card yet. Eight top models, including Claude, Grok, Gemini, and GPT-5.4, were each given a virtual bankroll and asked to build a machine learning betting strategy across a full 2023-24 English Premier League season.
Every single one lost money. Several went completely bankrupt.
The benchmark is called KellyBench, named after the Kelly criterion, a 1956 formula that tells you exactly how much to bet when you have an edge over the market. Every model could recite the Kelly formula. None of them could actually use it.
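For context, the Kelly criterion sizes a bet as the fraction of bankroll that maximizes long-run growth when your probability estimate beats the odds on offer. Here is a minimal sketch of that formula in Python, including the fractional-Kelly scaling practitioners typically apply; the function name and numbers are illustrative, not taken from the benchmark:

```python
def kelly_stake(bankroll, p_win, decimal_odds, fraction=0.25):
    """Return a stake sized by the Kelly criterion.

    p_win: your estimated probability the bet wins.
    decimal_odds: bookmaker decimal odds (total payout per unit staked).
    fraction: scale down full Kelly to reduce variance ("fractional Kelly").
    """
    b = decimal_odds - 1.0              # net profit per unit staked if the bet wins
    edge = p_win * b - (1.0 - p_win)    # expected profit per unit staked
    if edge <= 0:
        return 0.0                      # no edge, no bet
    kelly_fraction = edge / b           # full-Kelly fraction of bankroll
    return bankroll * kelly_fraction * fraction

# Example: £100,000 bankroll, an estimated 55% win chance at decimal odds of 2.0
print(kelly_stake(100_000, 0.55, 2.0))  # quarter-Kelly stake of £2,500
```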
xAI’s Grok 4.20 failed all three runs, going fully bankrupt in one, forfeiting mid-season in the other two. Google’s Gemini Flash forfeited two of three runs after placing a single wager of roughly £273,000 on a three-percentage-point historical win-rate edge—and losing it. Claude Opus 4.6, Anthropic’s best model, lost 11% on average and somehow came out looking like the responsible adult in the room.
The paper notes that the Dixon-Coles model, a statistical approach dating to the late 1990s, outperformed most of the frontier models evaluated, finishing ahead of six of the eight even with limited data.
“Dixon-Coles is an outdated 2000s baseline which doesn’t utilise all available data or account for non-stationarity in a principled way,” the researchers note. “It is therefore even more surprising that many frontier models, such as Gemini 3.1 Pro, are unable to beat or match it on KellyBench.”
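For the curious, Dixon-Coles is essentially a pair of Poisson goal models with per-team attack and defence ratings, a home-advantage term, and a correction for low-scoring draws. The sketch below shows only the Poisson core, assuming the ratings have already been fitted; the team ratings and home-advantage value are made up for illustration:

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def match_probabilities(home_attack, home_defence, away_attack, away_defence,
                        home_advantage=1.3, max_goals=10):
    """Home win / draw / away win probabilities from independent Poisson goal counts.

    A full Dixon-Coles model also adjusts the 0-0, 1-0, 0-1 and 1-1 scorelines
    and down-weights older matches; this sketch omits both.
    """
    lam_home = home_attack * away_defence * home_advantage  # expected home goals
    lam_away = away_attack * home_defence                    # expected away goals
    p_home = p_draw = p_away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                p_home += p
            elif h == a:
                p_draw += p
            else:
                p_away += p
    return p_home, p_draw, p_away

# Illustrative ratings only: a slightly stronger home side
print(match_probabilities(1.4, 0.9, 1.1, 1.0))
```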
This matters beyond football. Earlier this year, AI benchmarks showed that Claude could dominate business simulations through price-fixing, cartel agreements, and strategic deception.
Those simulations involved static competition, a limited set of opponents, and clear scoring. KellyBench is the opposite: 120 matchdays, constantly shifting data, a market that gets smarter every week, and promoted teams with no historical record to learn from.
The researchers call the core problem a “knowledge-action gap.” It is exactly what it sounds like.
Business simulations mostly run on fixed conditions, while a sports betting market is fluid and constantly repricing, which is where these models struggle. “KellyBench requires agents to maintain coherent intent across potentially thousands of sequential decisions, monitor the consequences of those decisions, and close the loop between observation and action,” the researchers argue.
We’re not there yet, obviously.
The models could articulate the right strategy, diagnose when something was broken, and identify the cause of their losses, but then failed to verify their code actually implemented what they planned, failed to notice when execution diverged from intent, and failed to act on their own findings.
GLM-5 wrote three separate self-critique documents during its run. Each one correctly identified that its hardcoded 25% draw rate and overestimation of home advantage were destroying its returns. At one point, with its bankroll around £44,200, it noted that its predicted 40% home win rate was only hitting 30% in reality. It never changed the code. It kept betting the same way until the money was gone.
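Catching that kind of drift does not require anything exotic. A back-of-the-envelope calibration check of the sort GLM-5 described but never acted on might look like this; the 5% threshold and function name are invented for illustration, while the 40%-versus-30% figures come from the run:

```python
def calibration_gap(predicted_probs, outcomes):
    """Compare the average predicted probability of an event with how often it happened."""
    predicted = sum(predicted_probs) / len(predicted_probs)
    observed = sum(outcomes) / len(outcomes)   # outcomes: 1 if the event occurred, else 0
    return predicted - observed

# GLM-5's own numbers: roughly 40% predicted home wins vs roughly 30% observed
home_win_probs = [0.40] * 100
home_win_results = [1] * 30 + [0] * 70
gap = calibration_gap(home_win_probs, home_win_results)
if abs(gap) > 0.05:
    # This is the step GLM-5 never took: refit or shrink the home-advantage term
    print(f"Model is over-predicting by {gap:.0%}; recalibrate before betting again.")
```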
Kimi K2.5 did something arguably more impressive and more tragic. It wrote a mathematically correct fractional Kelly staking function—the right formula, properly structured. Then it never called it. A formatting bug caused the model to send a broken bash command roughly 50 times in a row. Its reasoning noted the problem. It then sent the identical broken command again. An accidental £114,000 bet—98% of its remaining bankroll—on a Burnley versus Luton match finished the job.
GPT-5.4 was the most methodical. It spent 160 tool calls building models before placing a single bet, then calculated that its log-loss (0.974) was barely worse than the market’s (0.971) and concluded it had no edge. It spent the rest of the season placing penny bets to preserve capital. Sound reasoning.
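Log-loss is the standard scoring rule here: the average negative log of the probability a forecaster assigned to what actually happened, so lower is better. Below is a quick sketch of the comparison GPT-5.4 ran; the probabilities are invented, and only the 0.974 and 0.971 figures come from the paper:

```python
import math

def log_loss(probs_of_actual_outcome):
    """Average negative log probability assigned to the outcomes that actually occurred."""
    return -sum(math.log(p) for p in probs_of_actual_outcome) / len(probs_of_actual_outcome)

# Illustrative only: probabilities each forecaster gave to the result that happened
model_probs  = [0.42, 0.31, 0.55, 0.28]   # the agent's own model
market_probs = [0.44, 0.33, 0.52, 0.30]   # implied by bookmaker odds

print(log_loss(model_probs), log_loss(market_probs))
# If the model's log-loss (0.974) is no better than the market's (0.971),
# there is no edge to exploit, and Kelly says the correct stake is zero.
```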
OpenAI’s model lost 13.6% on average. One seed alone cost roughly $2,012 to run.
Ross Taylor, General Reasoning’s CEO and former Meta AI researcher, told the Financial Times that most AI benchmarks operate in “very static environments” that bear little resemblance to the real world. “There’s a lot of excitement about AI automation, but there haven’t been many attempts to evaluate AI in long-term, real-world environments,” he said.
The General Reasoning team didn’t immediately respond to a request for comment from Decrypt.
To measure strategy quality beyond raw returns, the researchers built a 44-point sophistication rubric with quantitative betting fund experts—covering feature development, stake sizing, non-stationarity handling, and execution. Claude Opus 4.6 scored highest at 32.6%. Less than a third of available points. On the best model.
Higher sophistication scores significantly predicted lower bankruptcy rates (p = 0.008) and correlated with better overall returns. The models are not failing because the market is unbeatable. They are failing because they are not using what they have.
This fits a pattern. Research published last year found AI models develop something resembling gambling addiction when told to maximize rewards—going bankrupt up to 48% of the time in simulated slot machine tests. A separate real-money crypto trading competition found the same reliability problems over extended periods.
The best-performing model averaged a final bankroll of £89,035—a net loss of £10,965 on a normalized £100,000 starting stake. Gradient boosting, fractional Kelly staking, months of Premier League football, state-of-the-art performance… all just to get rekt.