Why I shelved the LLM trader

The setup

The pitch is simple enough that I felt dumb not building it. LLMs read fast. Markets price news. So a model with the right context should be catching mispricings before they close. I built it. The loop pulled news, filings, and macro through worldmonitor-mcp, handed everything to Gemini with the current position book, let the model propose actions, ran them through the same risk controller as the rest of trading-algo, and logged everything.

It traded. Nothing blew up. The PnL also didn't clear any of the basic checks I run on every other adapter, which is the polite way to say it lost money slowly.

What the papers show

Once it became clear the in-house thing wasn't working, I went looking in the literature for evidence that it should have been. Three problems came up instead.

First, the naive benchmark studies. The 2023–2025 papers that replace a momentum or buy-and-hold signal generator with an LLM mostly cluster around "small positive in narrow windows" or "indistinguishable from random." When the window extends or the universe widens, the positives collapse. The alpha claims that look real fail a Deflated Sharpe test, which discounts a Sharpe ratio for the number of variants tried and for non-normal returns. The ones that pass are run on windows the model has already seen.
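
For anyone who hasn't met it: the Deflated Sharpe Ratio (Bailey and López de Prado) asks how likely it is that an observed Sharpe beats the best Sharpe you'd expect from luck alone across N tried variants. A minimal sketch, standard library only. The function names are mine, returns are per-period (not annualized), and sharpe_variance (the variance of Sharpe ratios across the trials) is an input you have to estimate yourself.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials: int, sharpe_variance: float) -> float:
    """Expected best per-period Sharpe among n_trials strategies with no skill."""
    if n_trials < 2:
        return 0.0  # a single trial needs no multiple-testing discount
    nd = NormalDist()
    z1 = nd.inv_cdf(1.0 - 1.0 / n_trials)
    z2 = nd.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)


def deflated_sharpe(returns, n_trials: int, sharpe_variance: float) -> float:
    """Probability that the true Sharpe exceeds the luck-only benchmark."""
    t = len(returns)
    if t < 2:
        raise ValueError("need at least two return observations")
    mean = sum(returns) / t
    var = sum((r - mean) ** 2 for r in returns) / t
    if var == 0:
        raise ValueError("returns have zero variance")
    std = math.sqrt(var)
    sr = mean / std  # per-period Sharpe, risk-free rate assumed zero
    skew = sum((r - mean) ** 3 for r in returns) / t / std ** 3
    kurt = sum((r - mean) ** 4 for r in returns) / t / std ** 4  # normal = 3
    sr0 = expected_max_sharpe(n_trials, sharpe_variance)
    # Standard error term adjusts for skewness and fat tails.
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr0) * math.sqrt(t - 1) / denom
    return NormalDist().cdf(z)
```

A value near 1 says the Sharpe survives the multiple-testing discount. The same return series scored against 100 trials instead of 1 can only come out lower, which is the whole point.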

Second, memorization. A 2026 model has read about the 2022 prices it's being asked to "trade" against. Performance on 2022 data is contaminated with post-event knowledge in a way that's nearly impossible to fully scrub. The proper test is a live forward window with strict information cutoffs, run long enough for the Deflated Sharpe to converge. I haven't seen a public study that does it properly.
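
What a strict information cutoff means mechanically, as I'd sketch it: the model only sees documents published at or before the decision time, and no decision is allowed at or before the model's own training cutoff. Document, visible_documents, and model_cutoff are hypothetical names for illustration, not anything from trading-algo.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Document:
    published_at: datetime
    text: str


def visible_documents(docs, decision_time: datetime, model_cutoff: datetime):
    """Return only the documents the model may see for this decision."""
    # A decision inside the training window is contaminated by construction:
    # the model may already have read about what happened next.
    if decision_time <= model_cutoff:
        raise ValueError("decision time is not after the model's training cutoff")
    # Nothing published after the decision time may leak into the context.
    return [d for d in docs if d.published_at <= decision_time]


docs = [
    Document(datetime(2025, 6, 1, tzinfo=timezone.utc), "earnings preview"),
    Document(datetime(2025, 6, 3, tzinfo=timezone.utc), "earnings result"),
]
cutoff = datetime(2025, 1, 1, tzinfo=timezone.utc)
decision = datetime(2025, 6, 2, tzinfo=timezone.utc)
context = visible_documents(docs, decision, cutoff)  # only the preview
```

The filter is the easy half. The hard half is knowing the model's real cutoff, which is why a backtest over past data can't fully substitute for a forward window.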

Third, the structural problem. Even if an LLM compresses public information well, "public information compressed faster" is only an edge if you can act on it before everyone else running the same model. Citadel and Jane Street, with colocated models and microsecond network latency, might have something here. At retail API latency, where the round-trip to a hosted model is measured in seconds, you don't.

The most defensible paper I read makes a narrower claim: LLMs can label features — headline sentiment, regime classification — that then feed a traditional quant model. Fine. I buy that. But that's feature engineering for a quant model. It isn't an LLM trader.

The call

I had no defensible distribution over how well the Gemini trader would work. Just competing narratives, an unfavorable base rate, and an in-sample track record that didn't look like alpha. Half-Kelly needs a probability of being right. I didn't have one, so I sized it to zero instead of sizing it small. The code stays in trading_algo/llm/, unwired from the production loop. Decision rationale is in docs/TRADITIONAL_VS_AI_TRADING_VERDICT.md and docs/HOW_LLMS_WERE_USED_IN_RESEARCH.md if anyone wants to argue.
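
The sizing logic fits in one function. This assumes the textbook binary-bet Kelly formula, f* = p - (1 - p) / b, rather than whatever trading-algo actually implements. half_kelly_fraction and its parameters are illustrative names.

```python
from typing import Optional


def half_kelly_fraction(p_win: Optional[float], payoff_ratio: float) -> float:
    """Half of the Kelly fraction for a binary bet, floored at zero.

    p_win: probability the bet wins, or None if no defensible estimate exists.
    payoff_ratio: amount won per unit staked (b in the Kelly formula).
    """
    if p_win is None:
        return 0.0  # no probability, no position
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be in [0, 1]")
    if payoff_ratio <= 0:
        raise ValueError("payoff_ratio must be positive")
    full_kelly = p_win - (1.0 - p_win) / payoff_ratio
    return max(0.0, 0.5 * full_kelly)  # negative edge also sizes to zero
```

With a 60% win probability at even odds this gives 10% of bankroll. With p_win=None it gives zero, which is the case I was actually in.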

What would change my mind

I'm trying not to be the guy who calls every line of work I haven't done a dead end. So:

A published live forward test with strict information cutoffs, 18+ months, Deflated Sharpe with confidence intervals — not a backtest, not from someone selling the result.
A structural account of where the edge lives. "In regime X, with information advantage Y, against counterparty Z." Not "the model is smart."
Evidence the edge survives retail-tier costs. 2bp slippage and per-share commissions delete a lot of paper alpha.
The LLM adding something measurable on top of a well-engineered quant model — incremental contribution above the same pipeline minus the LLM, significant and replicable. This is the version I'd actually bet on.
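
The cost criterion above is easy to make concrete. A rough sketch of the arithmetic, assuming every trade pays slippage plus commission on both entry and exit and that each round trip turns over the whole book. Only the 2 bp slippage comes from the text; the commission, share price, gross edge, and turnover are made-up illustration numbers.

```python
def commission_bp(commission_per_share: float, share_price: float) -> float:
    """Per-share commission expressed in basis points of trade value."""
    return commission_per_share / share_price * 1e4


def net_annual_return(
    gross_annual_return: float,
    round_trips_per_year: float,
    slippage_bp: float,
    commission_per_share: float,
    share_price: float,
) -> float:
    """Gross return minus the cost of trading in and out all year."""
    one_way_bp = slippage_bp + commission_bp(commission_per_share, share_price)
    round_trip_cost = 2.0 * one_way_bp / 1e4  # entry plus exit, as a fraction
    return gross_annual_return - round_trips_per_year * round_trip_cost


# Hypothetical: 3% gross edge, 100 round trips a year,
# 2 bp slippage, $0.005 per share on a $50 stock (1 bp).
net = net_annual_return(0.03, 100, 2.0, 0.005, 50.0)
```

Here each round trip costs 6 bp, so 100 of them cost 6% a year and the 3% gross edge nets out to about -3%.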

What stayed

The LLM still runs the operational loop in trading-algo. Claude Code pulls data through MCP, runs the CLIs, reasons about state, proposes actions. Strategy comes from the 25 quant adapters and ATLAS. The LLM runs the loop; it doesn't pick the trades. Which is roughly where the conservative papers land too: orchestration and feature labels, fine; alpha, no.

mahimn · essay 01 · v1 · apr 2026 · corrections welcome at mahimn.patel.k@gmail.com