Metrics And Evidence

Review the evidence layer customers should see by default and what belongs behind a more advanced layer.

Evidence should build trust, not turn raw research metrics into unexplained marketing headlines.

Default evidence layer

The default customer-facing layer answers:

Question                       What to show
How strong is the evidence?    Evidence strength indicator
When was it last validated?    Recency markers
What are the key risks?        Clear warning language
What should I do next?         Backtest, paper, or live suggested next step

Regime evidence heatmap

Cell values are regime IC with an evidence label. The legend scale runs from negative to strong.

Regimes (columns):

  • Trend: persistent directional flow
  • Range: two-sided rotation
  • Vol crush: compressed realized vol
  • Funding stress: carry dislocation and unwind

                         Trend           Range           Vol crush       Funding stress
Funding carry            +0.18 strong    +0.09 mixed     -0.05 weak      +0.24 strong
Breakout continuation    +0.24 strong    -0.02 fragile   -0.08 weak      +0.10 mixed
Mean reversion           -0.06 weak      +0.21 strong    +0.14 strong    -0.02 fragile
Basis compression        +0.05 mixed     +0.08 mixed     -0.01 fragile   +0.19 strong

Row notes:

  • Funding carry: best when funding dislocation stays persistent
  • Breakout continuation: needs directional follow-through to hold excess
  • Mean reversion: improves when ranges hold and liquidity is balanced
  • Basis compression: strongest when cross-venue stress starts to normalize

Core metric definitions

These are the core metrics the customer should be able to understand without needing an internal research runbook.

OOS IC

OOS IC means the average out-of-sample information coefficient across validation folds.

In the current research pipeline, the surfaced oos_mean_ic is the average out-of-sample linear correlation between the feature value and the future return:

IC = corr(feature_t, return_{t+h})

The research engine also tracks rank IC internally, but the default customer-facing card currently leads with OOS IC.

In practical terms, this asks a simple question: when the feature is high or low, does the future return tend to move with it in a consistent direction?
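
As a rough illustration of the definition above, the following Python sketch computes a per-fold Pearson correlation and averages it across folds. The function names `pearson_corr` and `oos_mean_ic` and the input layout are assumptions made for this example; they are not the research pipeline's actual code.

```python
from math import sqrt
from statistics import mean


def pearson_corr(xs, ys):
    """Plain Pearson linear correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def oos_mean_ic(folds):
    """Average the out-of-sample IC over validation folds.

    `folds` is a list of (feature_values, forward_returns) pairs, where
    forward_returns[i] is the return realized h steps after feature_values[i].
    This layout is hypothetical and chosen only for the example.
    """
    return mean([pearson_corr(feature, fwd) for feature, fwd in folds])
```

A fold where the feature and the forward return move together contributes an IC near +1, a fold where they move in opposite directions contributes an IC near -1, and the surfaced value is the plain average of those per-fold numbers.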

IC IR

IC IR means the mean fold IC divided by the standard deviation of fold IC:

IC IR = mean(IC across folds) / stddev(IC across folds)

This is not a return metric. It is a stability metric. A higher IC IR means the feature's association with forward returns is less erratic across validation folds.
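
The ratio can be sketched in a few lines of Python. This example assumes the sample standard deviation (`statistics.stdev`); the source does not say whether the pipeline uses the sample or population form, and the fold values below are invented purely for illustration.

```python
from statistics import mean, stdev


def ic_ir(fold_ics):
    """Mean fold IC divided by the standard deviation of fold IC.

    Assumption: sample standard deviation (n - 1 denominator).
    """
    if len(fold_ics) < 2:
        raise ValueError("need at least two folds to measure dispersion")
    spread = stdev(fold_ics)
    if spread == 0:
        raise ValueError("IC IR is undefined when every fold has the same IC")
    return mean(fold_ics) / spread


# Hypothetical fold ICs, not taken from any real validation run.
steady_folds = [0.10, 0.14, 0.12, 0.09, 0.13, 0.11]
erratic_folds = [0.25, -0.10, 0.15, -0.05, 0.20, -0.03]

print(ic_ir(steady_folds) > ic_ir(erratic_folds))  # True
```

The steady series scores higher because its fold ICs cluster tightly around their mean, while the erratic series has a positive mean but swings between positive and negative folds.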

Fold stability example (validation folds F1 to F6)

               Stable feature                   Fragile feature
Description    positive IC across most folds    headline excess, weak fold consistency
Mean IC        +0.15                            +0.08
IC IR          2.6                              0.6

Positive folds

Positive folds is the share of validation folds where out-of-sample IC remained positive.

Positive Fold Share = (# folds where IC > 0) / (# total folds)

This matters because one strong period is not enough. A useful feature should survive more than one split.
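
A minimal Python sketch of the share calculation follows. The function name `positive_fold_share` and the sample fold values are illustrative assumptions. Following the formula above, a fold with an IC of exactly zero does not count as positive.

```python
def positive_fold_share(fold_ics):
    """Fraction of validation folds whose out-of-sample IC is strictly positive."""
    if not fold_ics:
        raise ValueError("need at least one fold")
    positive = sum(1 for ic in fold_ics if ic > 0)
    return positive / len(fold_ics)


# Hypothetical fold ICs: three positive, three negative.
print(positive_fold_share([0.25, -0.10, 0.15, -0.05, 0.20, -0.03]))  # 0.5
```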

Decay status

Decay status answers whether recent live or rolling observations are weakening relative to the stored research baseline.

In the broader research stack, decay monitoring can look at:

  • rolling IC
  • rolling Sharpe
  • turnover drift
  • shortfall drift
  • live-vs-backtest IC gap
  • regime-specific IC weakness

The workspace should summarize the result, not dump the raw operator policy thresholds by default.
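
One way to picture "summarize, don't dump thresholds" is a function that keeps the policy internal and returns only a label. The Python sketch below is hypothetical throughout: the `DecayPolicy` class, its threshold values, and the labels `stable`, `weakening`, and `decayed` are all invented for illustration and do not describe the actual operator policy. It also looks only at rolling IC relative to the stored baseline, which is one of the several inputs listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecayPolicy:
    # Hypothetical thresholds: fraction of the baseline IC still retained.
    weakening_below: float = 0.75
    decayed_below: float = 0.40


def decay_status(baseline_ic, rolling_ic, policy=DecayPolicy()):
    """Return a customer-facing label instead of the raw thresholds."""
    if baseline_ic <= 0:
        raise ValueError("this sketch assumes a positive research baseline IC")
    retained = rolling_ic / baseline_ic
    if retained < policy.decayed_below:
        return "decayed"
    if retained < policy.weakening_below:
        return "weakening"
    return "stable"


print(decay_status(baseline_ic=0.15, rolling_ic=0.09))  # weakening
```

The caller sees only the label. The thresholds stay in the policy object, where an operator can change them without altering what the workspace displays.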

Advanced evidence layer

More advanced users may want deeper evidence, but it needs context. Advanced metrics should never appear as raw brag numbers without explanation.

Examples of advanced evidence:

  • Sample size
  • Holdout or out-of-sample markers
  • Positive fold share
  • Turnover and cost notes
  • Rank IC and rank-based evidence
  • Adjusted p-value and screening-survival context

Multiple testing and adjusted p-values

Feature screening is where multiple-comparison bias starts to matter.

The current screening path already uses a public, code-backed correction choice:

  • correction method: bonferroni_ic_pvalue.v1
  • significance rule: adjusted_p_value <= 0.05
  • directional requirement: directional_oos_mean_ic > 0

At a high level, the correction behaves like:

adjusted_p_value = min(1, raw_p_value * number_of_tests)

In plain language: a candidate should not survive only because many features were tested and one happened to look good by chance.
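
The correction and the two survival rules listed above can be sketched in Python as follows. The function names `bonferroni_adjust` and `survives_screening` are assumptions for this example, and the sketch follows only the high-level formula given here; it is not the `bonferroni_ic_pvalue.v1` implementation.

```python
def bonferroni_adjust(raw_p_value, number_of_tests):
    """Bonferroni-style adjustment: scale the raw p-value by the test count, capped at 1."""
    if not 0.0 <= raw_p_value <= 1.0:
        raise ValueError("raw_p_value must lie in [0, 1]")
    if number_of_tests < 1:
        raise ValueError("number_of_tests must be at least 1")
    return min(1.0, raw_p_value * number_of_tests)


def survives_screening(raw_p_value, number_of_tests, directional_oos_mean_ic, alpha=0.05):
    """Apply both rules: adjusted p-value within alpha and a positive directional IC."""
    adjusted = bonferroni_adjust(raw_p_value, number_of_tests)
    return adjusted <= alpha and directional_oos_mean_ic > 0


# With 40 features screened, a raw p-value of 0.01 adjusts to 0.4 and fails,
# while a raw p-value of 0.001 adjusts to about 0.04 and passes.
print(survives_screening(0.01, 40, directional_oos_mean_ic=0.10))   # False
print(survives_screening(0.001, 40, directional_oos_mean_ic=0.10))  # True
```

A raw p-value that looks convincing in isolation can fail once it is scaled by the number of features screened, which is the behavior the plain-language description asks for.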

The docs should explain this methodology. The default customer card does not need to lead with adjusted_p_value.

What not to do

Do not lead with a single raw metric as if it were self-explanatory. Raw IC, isolated fold statistics, or isolated fitted weights without context are more likely to confuse than to build trust.

What the customer should take away

The customer should be able to answer:

  1. How much evidence exists
  2. How recent that evidence is
  3. What the main uncertainty is
  4. Whether the safest next step is another backtest, paper observation, or a live attempt with warnings