The metric is part of the model - a lesson from a trading side-project

For a few months I've been working on a small algorithmic trading project on the side — partly to learn ML on real (and brutally noisy) data, partly because finance is one of the places where measurement and model meet in interesting ways. The setup is unremarkable: a universe of 16 sector and broad-market ETFs (SPY, QQQ, DIA, the XL- sector series, IWM, IYY), per-ticker models that produce daily forecasts, and a rules-based execution layer with stop-losses on top.

A failing strategy that wasn't

The verdict, for the first few rounds of testing, was equally unremarkable. The strategy was failing.

The headline number I had been staring at was CAGR — compound annual growth rate, the standard "how much money did this make per year" metric. On CAGR, the strategy lost to buy-and-hold on every single ticker. Zero wins out of sixteen. In trading research, when your strategy can't beat the dumbest possible baseline (just hold the thing), you stop, write a post-mortem, and move on.

I was working on the post-mortem.

The accidental noticing

I want to be honest about how unglamorous the turning point was. I was looking at a results table with maybe ten columns — ticker, strategy CAGR, buy-and-hold CAGR, drawdowns, win rate, a few other things — and I noticed that the strategy's maximum drawdown column was much smaller than the buy-and-hold column. Not by 10%. By a factor of three or four on some tickers.

DIA, for instance: the strategy gave up some upside (8.1% CAGR vs 11.6% for buy-and-hold), but its worst drawdown was −6.1% versus −36.7% for buy-and-hold. The strategy was earning ~70% of DIA's return while taking ~17% of its worst-case pain. For a first-pass result, that felt like a number worth a second look.
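The two percentages above are plain ratios of the quoted DIA figures. A minimal check, using only the numbers in the paragraph (the variable names are just for illustration):

```python
# Figures quoted above for DIA: CAGR in percent per year,
# maximum drawdown in percent peak-to-trough.
strategy_cagr, strategy_max_dd = 8.1, -6.1
bh_cagr, bh_max_dd = 11.6, -36.7

# Share of buy-and-hold's return that the strategy kept.
return_capture = strategy_cagr / bh_cagr
# Share of buy-and-hold's worst drawdown that the strategy suffered.
drawdown_share = strategy_max_dd / bh_max_dd

print(f"return capture: {return_capture:.0%}")   # 70%
print(f"drawdown share: {drawdown_share:.0%}")   # 17%
```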

I said something out loud, more to myself than to the screen:

"if the gap to buy-and-hold has closed AND we lowered the losses — we have the same upside but with lower risk — this is an alpha!"

That sentence rearranged the entire project.

Why I almost missed it (and why the AI did too)

I'm going to be specific here, because I think it's the most useful part of this story for other people working with LLMs.

I'd been working on this project with Claude — my $100-a-month research assistant — doing a lot of the table-wrangling and analysis alongside me. When I asked it to help evaluate the strategy, it sorted the table on CAGR, summarised on CAGR, and recommended next experiments aimed at improving CAGR. It was, to be fair, doing exactly what I'd implicitly asked for. The column I sorted by became the column it optimised.

This is not a failure of the model so much as a property of the loop: when you and your tool are both anchored on the same metric, neither of you is in a position to notice that the metric is wrong. The reframe didn't come from cleverness; it came from a side channel, namely me happening to read across the row instead of down the CAGR column.

If there's a generalisable lesson here for AI-assisted work, it's something like: the assistant amplifies whatever frame you start with. It will not, on its own, propose that you change the frame. That's still your job — at least with current models.

What changed when we switched the metric

The metric we switched to is Calmar. The name is a contraction of California Managed Accounts Reports, the firm and newsletter of Terry W. Young, who introduced the ratio in 1991 in the trade journal Futures. It's a simple ratio: CAGR ÷ max drawdown, with the drawdown taken as a positive magnitude. It asks "how many percentage points of return are you earning per percentage point of worst-case pain?" It's decades-old, textbook stuff.
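As a rough sketch of the definition, here is one way to compute Calmar from an equity curve in plain Python. It is not the project's actual code: the function names are hypothetical, the window is whatever the curve covers (the conventional definition uses a trailing 36 months), and the toy curve is made up purely to exercise the functions.

```python
def cagr(equity, periods_per_year=252):
    """Compound annual growth rate of an equity curve sampled at a fixed frequency."""
    years = (len(equity) - 1) / periods_per_year
    return (equity[-1] / equity[0]) ** (1 / years) - 1

def max_drawdown(equity):
    """Worst peak-to-trough decline, as a negative fraction (e.g. -0.25)."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)              # running high-water mark
        worst = min(worst, value / peak - 1)
    return worst

def calmar(equity, periods_per_year=252):
    """CAGR divided by the magnitude of the maximum drawdown."""
    dd = abs(max_drawdown(equity))
    if dd == 0:
        # The ratio is undefined without a drawdown; this sketch returns infinity.
        return float("inf")
    return cagr(equity, periods_per_year) / dd

# Toy curve with one observation per year (illustrative values only).
curve = [100, 120, 90, 110, 121]
# CAGR is about 4.9%, max drawdown is -25%, so Calmar is about 0.195.
print(f"Calmar: {calmar(curve, periods_per_year=1):.3f}")
```

Taking the absolute value of the drawdown keeps the ratio positive for a profitable strategy, so "higher is better" holds regardless of how the drawdown column is signed.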

Same trades. Same predictions. New column. New verdict:

View	Strategy wins (out of 16)	Best ticker	Best ratio vs buy-and-hold
CAGR	0	n/a	n/a
Calmar	12	DIA	4.16×

Five tickers showed Calmar ratios of 2× or more vs buy-and-hold (DIA, XLRE, XLU, SPY, XLY). At the portfolio level, the first cut looked even better: daily Sharpe around 1.27 out-of-sample versus 0.78 for buy-and-hold, and Calmar at 1.01 versus 0.39.

I want to be careful with those numbers — as it turned out, more careful than I first was. The test window includes the Covid crash of early 2020, where stop-losses look heroic, so some of the Calmar advantage is a single-event artifact rather than structure. And there was a subtler problem I hadn't caught yet, which I'll come back to. But even after the caveats, the order-of-magnitude conclusion held: this was a different strategy than I thought I had.
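For readers who want to reproduce a Sharpe figure of this kind, here is a minimal sketch. It assumes "daily Sharpe" means a Sharpe ratio computed from daily returns and annualised with the usual square-root-of-252 factor, with a zero risk-free rate by default; the post does not spell out either choice, so treat both as assumptions. The function name and sample returns are hypothetical.

```python
import math
import statistics

def annualised_sharpe(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Mean excess return over its sample standard deviation, annualised."""
    excess = [r - risk_free_daily for r in daily_returns]
    mean = statistics.fmean(excess)
    stdev = statistics.stdev(excess)  # sample standard deviation (n - 1)
    return mean / stdev * math.sqrt(periods_per_year)

# Made-up daily returns, only to show the call.
sample = [0.010, -0.005, 0.002, 0.004, -0.003]
print(f"annualised Sharpe: {annualised_sharpe(sample):.2f}")
```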

The investor table — same strategy, different products

Once we had the Calmar view, an obvious question came up: how can the same trades look this different on two metrics? The answer is that the trades don't look different at all; they are the same trades. We're the ones looking differently. Different investors are optimising for different things, and the trades meet each one differently.

Investor profile	Preferred metric	Verdict on this strategy
Aggressive growth (long horizon, can stomach drawdowns)	CAGR	Skip it — just buy and hold
Risk-managed institution (drawdown-constrained mandate)	Calmar	Real alpha (~2.6× buy-and-hold's Calmar at portfolio level, on the first cut)
Retiree / capital preservation	Sortino / Calmar	Real alpha — the downside protection is the product
Levered hedge fund	Calmar + ability to lever	Real alpha — drawdown is a fraction of buy-and-hold's, so you could lever it up several times over to match the market's risk and beat its return
Long-only retail	Sharpe / Calmar	Real improvement

A 25-year-old indexing into retirement should still just buy SPY. A 65-year-old who can't afford a 35% drawdown five years before retirement is looking at a completely different proposition from the same trade sequence. A hedge fund that can lever is looking at yet a third one.
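The leverage case can be made concrete with the DIA figures quoted earlier. This is a first-order sketch only: it assumes returns and drawdowns both scale linearly with leverage and, by default, that borrowing is free. Neither holds in practice, since financing costs, volatility drag and margin calls all eat into the result. The function names and the `borrow_rate` parameter are hypothetical.

```python
def matched_drawdown_leverage(strategy_max_dd, benchmark_max_dd):
    """Leverage that scales the strategy's worst drawdown up to the benchmark's."""
    return abs(benchmark_max_dd) / abs(strategy_max_dd)

def naive_levered_return(cagr, leverage, borrow_rate=0.0):
    """First-order levered return: scale the return, pay interest on the borrowed part."""
    return leverage * cagr - (leverage - 1) * borrow_rate

# DIA figures from earlier in the post, as fractions.
strategy_cagr, strategy_max_dd = 0.081, -0.061
bh_cagr, bh_max_dd = 0.116, -0.367

k = matched_drawdown_leverage(strategy_max_dd, bh_max_dd)
levered = naive_levered_return(strategy_cagr, k)

print(f"leverage to match drawdown: {k:.1f}x")                      # 6.0x
print(f"naive levered return vs buy-and-hold: {levered / bh_cagr:.1f}x")  # 4.2x
```

With free borrowing, the return multiple at matched drawdown is just the ratio of the two Calmar values, which is why this lands near the 4.16× in the table above; the small gap is consistent with the inputs here being rounded. Passing a non-zero `borrow_rate` shows how quickly financing costs shrink the edge.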

Nothing here is original. But living through it — having the same numbers swing from "throw this out" to "this is interesting" because the persona at the other end changed — was clarifying in a way that reading it in a textbook wouldn't have been.

The number I was staring at was also wrong

A few weeks after the reframe, I ran a look-ahead audit on the entry rule — the kind of check that's boring right up until it isn't. It found a leak: the threshold that decided when to trade had been computed using a quantile over the whole history, including days that, at decision time, hadn't happened yet. Small, easy to miss, and quietly flattering the results.
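To illustrate the kind of leak described above, here is a minimal sketch contrasting a whole-history quantile threshold with a point-in-time one. It is not the project's code: the function names, the `min_history` warm-up, and the toy signal are all hypothetical, and it assumes the decision on day i may only use days strictly before i.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list (q between 0 and 1)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def leaky_thresholds(signal, q):
    """Leaky: one threshold from the whole history, applied to every day."""
    t = quantile(signal, q)
    return [t] * len(signal)

def point_in_time_thresholds(signal, q, min_history=3):
    """Threshold for day i uses only signal[:i], i.e. days strictly before i."""
    out = []
    for i in range(len(signal)):
        past = signal[:i]
        # None marks warm-up days with too little history to set a threshold.
        out.append(quantile(past, q) if len(past) >= min_history else None)
    return out

signal = [0.1, 0.2, 0.15, 0.3, 0.9, 0.25]
altered = signal[:-1] + [5.0]  # same past, different final day

# The leaky threshold on day 0 moves when the final day changes ...
print(leaky_thresholds(signal, 0.5)[0], leaky_thresholds(altered, 0.5)[0])
# ... while the point-in-time thresholds are unaffected.
print(point_in_time_thresholds(signal, 0.5) == point_in_time_thresholds(altered, 0.5))
```

A useful property to test for in an audit is exactly the one shown at the end: changing a later observation must never change an earlier day's threshold.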

Fixing it took a chunk out of the headline, exactly as I'd feared in the caveats — portfolio Sharpe from ~1.27 to ~1.12, Calmar from ~1.01 to ~0.70, worst drawdown from −6.4% to −9.9%. Still ahead of buy-and-hold on the risk-adjusted metrics, but a soberer story; the per-ticker "12 of 16" table now needs re-running under the fixed rule before I'd stand behind it.
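To put the before-and-after figures side by side, the sketch below divides each portfolio-level number by its buy-and-hold counterpart. It assumes the buy-and-hold figures from the first cut (Sharpe 0.78, Calmar 0.39) are unchanged by the fix, which seems reasonable because the fix only touches the strategy's entry rule, but the post does not state it.

```python
# Portfolio-level strategy figures quoted in the post: (before fix, after fix).
strategy = {"sharpe": (1.27, 1.12), "calmar": (1.01, 0.70)}
# Buy-and-hold figures from the first cut, assumed unchanged by the fix.
buy_and_hold = {"sharpe": 0.78, "calmar": 0.39}

edge = {}
for metric, (before, after) in strategy.items():
    base = buy_and_hold[metric]
    edge[metric] = (before / base, after / base)
    print(f"{metric}: {before / base:.2f}x -> {after / base:.2f}x buy-and-hold")
# sharpe: 1.63x -> 1.44x buy-and-hold
# calmar: 2.59x -> 1.79x buy-and-hold
```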

The part that matters: the metric lesson didn't move. Under CAGR the strategy still loses; under Calmar it still wins. I'd anchored on a number twice in one project — first on the wrong metric, then on an inflated value of the right one — and both times the fix came from reading against the grain, not from optimising harder.

What I'm taking from this beyond the trading project

A few things, in roughly the order they hit me:

1. The metric is part of the model. I had been thinking about it as: model → predictions → strategy → results, with metrics as a thin reporting layer at the end. But the metric defines the loss function of the whole pipeline, including the parts I'm doing in my head. Choosing CAGR as the headline silently told me to optimise for return; choosing Calmar would have silently told me to optimise for return-per-unit-of-pain. Same code, different project.

2. "Failure" is metric-relative. A strategy that loses on CAGR but wins on Calmar isn't a failed strategy. It's a strategy whose product that particular column was never asked to evaluate. Whether it's useful depends on whether anybody wants what it actually does, not on whether it improves the metric you happen to be staring at.

3. AI assistants inherit the frame, they don't question it. This is going to keep biting me on other projects. The countermeasure I'm trying is to explicitly ask "what is this not good at?" and "what metric would make this look good?" as separate prompts, late in an analysis. Once. Just to break the anchor.

4. The throwaway sentence often does more than the structured analysis. The reframe came from one off-hand thought, not from a planned experiment. I don't have a great way to systematise that. But I'm trying to notice when something feels like an inconsistency in a table and write the inconsistency down before continuing.

A note on where I am with this

I'm not a quant. I'm a software engineer with a strong interest in finance, ML, and security, and this project is mostly an excuse to practice rigorous-ish thinking on data that won't let me get away with sloppy work — markets punish overfitting fast, which makes them a good gym.

There are still a dozen things to test before any of this hardens: AR(1)-matched placebos to check whether the model is doing real work or just exploiting autocorrelation, sub-period robustness to see how much of the Calmar advantage is concentrated in the 2020-Q1 Covid window, transaction-cost sensitivity past 6 bps, and re-running that per-ticker table under the corrected entry rule. Some of these will probably take further chunks out of the result — the look-ahead audit above already did. That's fine — the lesson about the yardstick is the part I expect to keep regardless.
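One way an AR(1)-matched placebo could be set up is sketched below: fit the mean, lag-1 coefficient and residual volatility of a return series, then generate synthetic series with the same parameters and run the unchanged strategy on them. Everything here is an assumption for illustration, including the function names and the simple least-squares fit, and the `real` series is itself simulated as a stand-in because no actual return data appears in this post.

```python
import random
import statistics

def fit_ar1(returns):
    """Estimate mean, lag-1 coefficient and residual stdev of an AR(1) process."""
    mu = statistics.fmean(returns)
    dev = [r - mu for r in returns]
    # Least-squares slope of today's deviation on yesterday's.
    num = sum(a * b for a, b in zip(dev[1:], dev[:-1]))
    den = sum(d * d for d in dev[:-1])
    phi = num / den
    resid = [dev[t] - phi * dev[t - 1] for t in range(1, len(dev))]
    sigma = statistics.pstdev(resid)
    return mu, phi, sigma

def simulate_ar1(mu, phi, sigma, n, rng):
    """Draw one synthetic return path with the given AR(1) parameters."""
    path = []
    prev = 0.0  # start the deviation at zero rather than a stationary draw
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, sigma)
        path.append(mu + prev)
    return path

rng = random.Random(0)  # fixed seed so the sketch is reproducible
real = simulate_ar1(0.0004, 0.1, 0.01, 2000, rng)  # stand-in for real returns
mu, phi, sigma = fit_ar1(real)
placebos = [simulate_ar1(mu, phi, sigma, len(real), rng) for _ in range(20)]
print(f"fitted phi={phi:.3f}, sigma={sigma:.4f}, placebos={len(placebos)}")
```

If the strategy scores about as well on the placebos as on the real series, the edge is probably coming from autocorrelation alone rather than from anything the model has learned.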
