Feature Evidence vs. Backtest Evidence

Understand why a promising feature is not the same thing as a strategy that stays net-positive after costs in historical replay.

A strong feature and a strong backtest are related, but they are not interchangeable.

The short version

  • Feature evidence asks: does this feature carry information that is empirically associated with future returns?
  • Backtest evidence asks: if we turn that idea into rules, sizing, and costs, does it still survive as a strategy?

That distinction matters because many weak products collapse both into a single story.

What feature evidence is trying to prove

Feature evidence is about whether the underlying idea carries information before execution rules are layered on top.

At its simplest, the evidence layer is asking:

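IC = corr(signal_t, return_{t+h})

Here signal_t is the feature value at time t and return_{t+h} is the return over the following h periods.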

If that correlation is positive and stable out of sample, the feature may carry a real forward association with returns.

Examples of feature evidence:

  • out-of-sample correlation between the feature and future return
  • fold stability across validation windows
  • whether the feature is degrading over time
  • whether the feature still behaves acceptably across regimes
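
The out-of-sample correlation in the first bullet can be sketched in a few lines of Python. This is a minimal illustration rather than Statly's implementation: the function names `pearson` and `information_coefficient`, the toy price and signal series, the use of simple returns, and the choice of Pearson rather than rank correlation are all assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(vx * vy)

def information_coefficient(signal, prices, horizon):
    """IC = corr(signal_t, return_{t+h}), using the simple return from t to t+h."""
    fwd_returns = [prices[t + horizon] / prices[t] - 1.0
                   for t in range(len(prices) - horizon)]
    # The last `horizon` signal values have no forward return yet, so drop them.
    aligned_signal = signal[:len(fwd_returns)]
    return pearson(aligned_signal, fwd_returns)

# Hypothetical toy data, contrived so the signal lines up with the next move.
prices = [100.0, 101.0, 100.5, 102.0, 101.0, 103.0, 102.5]
signal = [0.8, -0.3, 1.1, -0.6, 1.4, -0.2, 0.5]
ic = information_coefficient(signal, prices, horizon=1)
print(f"IC at horizon 1: {ic:.3f}")
```

For the number to count as out-of-sample evidence, the signal would have to be built only from data available before each evaluation window. The sketch does not enforce that.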

In the current workspace surface, the customer-facing evidence block focuses on:

  • OOS IC
  • IC IR
  • Positive folds
  • decay or warning context

You can think about the stability part as:

IC IR = mean(IC across folds) / stddev(IC across folds)

That is why a feature can have a positive average IC but still fail to inspire confidence if its fold-to-fold behavior is too erratic.
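
A small sketch makes the point concrete. The two fold series below are hypothetical and were chosen to share the same average IC. The helper names `ic_ir` and `positive_fold_share`, and the use of the sample standard deviation, are assumptions for illustration, not a description of how the workspace computes these numbers.

```python
import statistics

def ic_ir(fold_ics):
    """IC IR = mean(IC across folds) / stddev(IC across folds)."""
    spread = statistics.stdev(fold_ics)  # sample standard deviation (an assumption)
    if spread == 0:
        raise ValueError("IC IR is undefined when every fold has the same IC")
    return statistics.mean(fold_ics) / spread

def positive_fold_share(fold_ics):
    """Fraction of validation folds whose IC is strictly positive."""
    return sum(ic > 0 for ic in fold_ics) / len(fold_ics)

# Hypothetical per-fold ICs with the same average (0.04) but different stability.
steady = [0.04, 0.05, 0.03, 0.04, 0.04]
erratic = [0.15, -0.08, 0.12, -0.10, 0.11]

for name, folds in (("steady", steady), ("erratic", erratic)):
    print(name,
          f"mean IC={statistics.mean(folds):.3f}",
          f"IC IR={ic_ir(folds):.2f}",
          f"positive folds={positive_fold_share(folds):.0%}")
```

Both series average the same IC, yet the erratic one has a much lower IC IR and fewer positive folds. That is the situation the stability metrics are meant to expose.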

What a backtest is trying to prove

A backtest is about implementation quality, not just the raw forward association.

It asks what happens when the system has to trade the idea with:

  • entries and exits
  • position sizing
  • execution cost assumptions
  • replayed market conditions
  • a specific historical window

In simplified form, a backtest is asking something closer to:

Strategy PnL ~= gross excess - fees - slippage - implementation drag

Figure: Excess before and after implementation costs (why backtests can look weaker than feature evidence). A real strategy has to survive execution costs, not just carry forward-associated excess on paper.

  • Gross excess: +12.4 bps
  • Fees: -2.1 bps
  • Slippage: -2.9 bps
  • Latency: -1.4 bps
  • Net excess after costs: +6.0 bps

Simplified relationship:

net excess = gross excess - fees - slippage - implementation drag
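
The arithmetic behind that relationship can be checked directly. The sketch below plugs in the illustrative figures from this page (12.4 bps gross, 2.1 bps fees, 2.9 bps slippage, 1.4 bps latency). Treating latency as the implementation-drag term is an assumption, and `net_excess_bps` is a hypothetical helper name.

```python
def net_excess_bps(gross_bps, fees_bps, slippage_bps, drag_bps):
    """net excess = gross excess - fees - slippage - implementation drag (all in bps)."""
    return gross_bps - fees_bps - slippage_bps - drag_bps

gross = 12.4
net = net_excess_bps(gross_bps=gross, fees_bps=2.1, slippage_bps=2.9, drag_bps=1.4)
cost_share = (gross - net) / gross  # fraction of the gross excess consumed by costs

print(f"net excess: {net:+.1f} bps")           # net excess: +6.0 bps
print(f"costs consume {cost_share:.0%} of gross excess")
```

In this example, costs absorb roughly half of the gross excess. A feature with a smaller gross figure would have little or nothing left.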

That is why a backtest produces strategy outputs such as:

  • PnL
  • drawdown
  • trade count
  • period behavior
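
As a rough sketch of where the first three outputs come from, the snippet below summarizes a list of per-trade PnL values. The function name `summarize_backtest`, the flat list input, and the toy numbers are assumptions for illustration. A real backtest engine would also track timestamps in order to report period behavior.

```python
def summarize_backtest(trade_pnls):
    """Summarize per-trade PnL values (one entry per closed trade, in trade order)."""
    equity = 0.0        # cumulative PnL so far
    peak = 0.0          # highest cumulative PnL seen so far
    max_drawdown = 0.0  # largest peak-to-trough drop in cumulative PnL
    for pnl in trade_pnls:
        equity += pnl
        peak = max(peak, equity)
        max_drawdown = max(max_drawdown, peak - equity)
    return {
        "pnl": equity,
        "max_drawdown": max_drawdown,
        "trade_count": len(trade_pnls),
    }

# Hypothetical net PnL per trade, after costs.
trades = [5.0, -2.0, -4.0, 6.0, -1.0, 3.0]
summary = summarize_backtest(trades)
print(summary)
```

Note that two strategies with the same final PnL can have very different drawdowns, which is why the outputs are reported together instead of as a single number.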

Why a feature can look good while a backtest looks weak

This is a normal and important failure mode.

Common reasons include:

  • the feature is real, but too weak after costs
  • the implementation trades too often
  • the sizing is too aggressive
  • the strategy only works in one regime and collapses elsewhere
  • the execution assumptions are too optimistic
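
The first two reasons interact, and a toy calculation shows how. The sketch assumes, purely for illustration, that the total gross excess is fixed and that every trade pays the same all-in cost. Both function names and all numbers are hypothetical.

```python
def net_after_turnover(gross_total_bps, n_trades, cost_per_trade_bps):
    """Net excess when every trade pays the same all-in cost (a simplifying assumption)."""
    return gross_total_bps - n_trades * cost_per_trade_bps

def break_even_trades(gross_total_bps, cost_per_trade_bps):
    """Trade count at which costs consume the entire gross excess."""
    return gross_total_bps / cost_per_trade_bps

# Same hypothetical gross excess (600 bps), traded with two different turnover levels.
patient = net_after_turnover(600.0, n_trades=50, cost_per_trade_bps=5.0)
hyperactive = net_after_turnover(600.0, n_trades=150, cost_per_trade_bps=5.0)
limit = break_even_trades(600.0, 5.0)

print(f"50 trades:  {patient:+.0f} bps")
print(f"150 trades: {hyperactive:+.0f} bps")
print(f"break-even at {limit:.0f} trades")
```

Under these assumptions the feature is unchanged between the two runs. Only the implementation differs, and that alone moves the result from positive to negative.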

Why a backtest can look good while the feature evidence is weak

This is also possible, and it is dangerous.

Common reasons include:

  • too much parameter freedom
  • lucky period selection
  • overfitting to one market phase
  • a strategy rule set that looks net-positive after costs without strong underlying evidence
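
The first two reasons can be reproduced with pure noise. The sketch below generates many hypothetical "strategies" whose daily returns have zero true edge, picks the one with the best in-sample average, and then looks at the same strategy on fresh data. The function name, the parameter values, and the Gaussian return model are assumptions for illustration only.

```python
import random
import statistics

def best_of_noise(n_strategies=200, n_days=250, seed=7):
    """Pick the best in-sample performer among strategies that have no real edge."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    best_in, best_out = None, None
    for _ in range(n_strategies):
        # Daily returns drawn with a true mean of zero, in arbitrary units.
        in_sample = [rng.gauss(0.0, 1.0) for _ in range(n_days)]
        out_sample = [rng.gauss(0.0, 1.0) for _ in range(n_days)]
        mean_in = statistics.fmean(in_sample)
        if best_in is None or mean_in > best_in:
            # Keep the winner's out-of-sample mean for comparison.
            best_in, best_out = mean_in, statistics.fmean(out_sample)
    return best_in, best_out

best_in, best_out = best_of_noise()
print(f"winner in-sample mean:     {best_in:+.3f}")
print(f"winner out-of-sample mean: {best_out:+.3f}")
```

The winner's in-sample average is positive almost by construction, because it is the maximum of many draws. Its out-of-sample average is just another draw centered on zero. A backtest chosen from many parameter settings or date ranges is exposed to the same selection effect.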

A good-looking backtest without strong evidence should be treated as fragile. A strong feature without a convincing backtest should be treated as incomplete.

How Statly separates the two

The product should keep these layers honest:

| Layer | Main question |
| --- | --- |
| Feature evidence | Does the candidate appear to carry information that is empirically associated with future returns? |
| Backtest | Does the implementation's excess over the benchmark survive historical replay? |
| Paper | Does the idea still behave coherently under current market conditions? |
| Live | Does the deployment remain trustworthy when real capital is exposed? |

The research surface should never pretend that one layer fully replaces the others.