Feature Evidence vs. Backtest Evidence

Understand why a promising feature is not the same thing as a strategy that stays net-positive once costs are applied in a historical replay.

A strong feature and a strong backtest are related, but they are not interchangeable.

The short version

Feature evidence asks: does this feature carry information that is empirically associated with future returns?
Backtest evidence asks: if we turn that idea into rules, sizing, and costs, does it still survive as a strategy?

That distinction matters because many weak products collapse both into a single story.

What feature evidence is trying to prove

Feature evidence is about whether the underlying idea carries information before execution rules are layered on top.

At its simplest, the evidence layer is asking:

IC = corr(signal_t, return_{t+h})

If that correlation is positive and stable out of sample, the feature may carry a genuine empirical association with forward returns.
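The IC formula above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: it assumes a Pearson correlation, equal-length lists, and that returns[t] is the return realized at step t, so that returns[t + h] is the forward return paired with signal[t]. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def information_coefficient(signal, returns, horizon=1):
    """IC = corr(signal_t, return_{t+h}) for a forecast horizon h >= 1."""
    if len(signal) != len(returns):
        raise ValueError("signal and returns must have the same length")
    if horizon < 1 or len(signal) - horizon < 2:
        raise ValueError("need at least two aligned observations")
    # Drop the last h signals (no forward return yet) and the first h returns.
    return pearson(signal[:-horizon], returns[horizon:])


# Toy data in which the signal lines up perfectly with the next return.
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
returns = [0.0, 0.1, 0.2, 0.3, 0.4]
print(information_coefficient(signal, returns, horizon=1))  # close to 1.0
```

In practice the value that matters is the one computed on data the feature was not fitted on, which is what the out-of-sample qualifier refers to.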

Examples:

out-of-sample correlation between the feature and future return
fold stability across validation windows
whether the feature is degrading over time
whether the feature still behaves acceptably across regimes

In the current workspace surface, the customer-facing evidence block focuses on:

OOS IC
IC IR
Positive folds
decay or warning context

You can think about the stability part as:

IC IR = mean(IC across folds) / stddev(IC across folds)

That is why a feature can have a positive average IC but still fail to inspire confidence if its fold-to-fold behavior is too erratic.
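A small sketch makes the point concrete. It assumes the sample standard deviation for the denominator (the formula above does not say which variant is used), and the two lists of per-fold ICs are invented for illustration.

```python
from statistics import mean, stdev


def ic_ir(fold_ics):
    """IC IR = mean(IC across folds) / stddev(IC across folds)."""
    if len(fold_ics) < 2:
        raise ValueError("need at least two folds")
    spread = stdev(fold_ics)  # sample standard deviation (an assumption)
    if spread == 0:
        raise ValueError("IC IR is undefined when every fold has the same IC")
    return mean(fold_ics) / spread


def positive_folds(fold_ics):
    """Number of validation folds with a strictly positive IC."""
    return sum(1 for ic in fold_ics if ic > 0)


# Hypothetical per-fold ICs. Both lists average 0.04.
steady = [0.04, 0.05, 0.03, 0.04, 0.04]
erratic = [0.20, -0.12, 0.15, -0.10, 0.07]

for name, ics in (("steady", steady), ("erratic", erratic)):
    print(name, round(mean(ics), 3), round(ic_ir(ics), 2), positive_folds(ics))
```

Both features have the same average IC, but the steady one has a far higher IC IR and is positive in every fold, while the erratic one is positive in only three of five.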

What a backtest is trying to prove

A backtest is about implementation quality, not just the raw association between a feature and forward returns.

It asks what happens when the system has to trade the idea with:

entries and exits
position sizing
execution cost assumptions
replayed market conditions
a specific historical window

In simplified form, a backtest is asking something closer to:

Strategy PnL ~= gross excess - fees - slippage - implementation drag

[Figure: Excess before and after implementation costs. Why backtests can look weaker than feature evidence.]

A real strategy has to survive execution costs, not just carry forward-associated excess on paper.

Gross excess: +12.4 bps
Fees: -2.1 bps
Slippage: -2.9 bps
Latency: -1.4 bps
Net excess after costs: +6.0 bps

Simplified relationship:

net excess = gross excess - fees - slippage - implementation drag
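The figure's numbers can be checked directly against this relationship. The sketch below treats the latency line as the implementation drag term, which is an assumption about how the figure maps onto the formula; the function name is hypothetical.

```python
def net_excess_bps(gross_bps, fees_bps, slippage_bps, drag_bps):
    """net excess = gross excess - fees - slippage - implementation drag."""
    return gross_bps - fees_bps - slippage_bps - drag_bps


# Values from the figure, in basis points. Costs are passed as positive amounts.
net = net_excess_bps(gross_bps=12.4, fees_bps=2.1, slippage_bps=2.9, drag_bps=1.4)
print(f"{net:.1f} bps")  # 6.0 bps
```

In this example roughly half of the gross excess is consumed by costs before any strategy-level output is measured.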

That is why a backtest produces strategy outputs such as:

PnL
drawdown
trade count
period behavior

Why a feature can look good while a backtest looks weak

This is a normal and important failure mode.

Common reasons include:

the feature is real, but too weak after costs
the implementation trades too often
the sizing is too aggressive
the strategy only works in one regime and collapses elsewhere
the execution assumptions are too optimistic
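Two of the reasons above, a real but weak feature and an implementation that trades too often, come down to the same arithmetic. The sketch below assumes a fixed gross excess per period and a flat cost per trade. Both numbers and the function names are hypothetical, and real costs are rarely this linear.

```python
def period_net_bps(gross_period_bps, trades_per_period, cost_per_trade_bps):
    """Net excess for one period after paying a flat cost on every trade."""
    return gross_period_bps - trades_per_period * cost_per_trade_bps


def break_even_trades(gross_period_bps, cost_per_trade_bps):
    """Trade count at which costs consume the entire gross excess."""
    return gross_period_bps / cost_per_trade_bps


GROSS_BPS = 12.0  # hypothetical gross excess per period
COST_BPS = 2.0    # hypothetical all-in cost per trade

for trades in (3, 6, 10):
    print(trades, period_net_bps(GROSS_BPS, trades, COST_BPS))
```

With these inputs the same feature is net-positive at 3 trades per period, flat at 6, and net-negative at 10, even though the underlying evidence has not changed.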

Why a backtest can look good while the feature evidence is weak

This is also possible, and it is dangerous.

Common reasons include:

too much parameter freedom
lucky period selection
overfitting to one market phase
a strategy rule set that looks net-positive after costs without strong underlying evidence
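The first two reasons can be illustrated with a simulation in which no variant has any real edge. This is a sketch under stated assumptions: daily returns are drawn as independent Gaussian noise, each "parameter set" is just another noise series, and the score is the mean daily return divided by its standard deviation. All names are hypothetical.

```python
import random
from statistics import mean, median, stdev


def noise_backtest(rng, n_days=250):
    """Score (mean / stdev of daily returns) for a strategy with zero true edge."""
    daily = [rng.gauss(0.0, 1.0) for _ in range(n_days)]
    return mean(daily) / stdev(daily)


def best_of_n(n_variants, seed=0):
    """Run many zero-edge variants; return the best score and the median score."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    scores = [noise_backtest(rng) for _ in range(n_variants)]
    return max(scores), median(scores)


best, typical = best_of_n(200)
print(f"best of 200 variants: {best:.3f}, median variant: {typical:.3f}")
```

The best variant looks clearly better than the typical one purely because it was selected after the fact. That is the sense in which a good-looking backtest without independent feature evidence should be read as fragile.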

A good-looking backtest without strong evidence should be treated as fragile. A strong feature without a convincing backtest should be treated as incomplete.

How Statly separates the two

The product should keep these layers honest:

Layer	Main question
Feature evidence	Does the candidate carry information that is empirically associated with future returns?
Backtest	Does the implementation's excess over the benchmark survive historical replay?
Paper	Does the idea still behave coherently under current market conditions?
Live	Does the deployment remain trustworthy when real capital is exposed?

The research surface should never pretend that one layer fully replaces the others.