Metrics And Evidence

This page covers which evidence customers should see by default and which evidence belongs behind a more advanced layer.

Evidence should build trust, not turn raw research metrics into unexplained marketing headlines.

Default evidence layer

The default customer-facing layer answers:

Question	What to show
How strong is the evidence?	Evidence strength indicator
When was it last validated?	Recency markers
What are the key risks?	Clear warning language
What should I do next?	Suggested next step: backtest, paper, or live

Regime evidence heatmap

Regime	Description
Trend	persistent directional flow
Range	two-sided rotation
Vol crush	compressed realized vol
Funding stress	carry dislocation and unwind

Candidate	Note	Trend	Range	Vol crush	Funding stress
Funding carry	best when funding dislocation stays persistent	+0.18 (strong)	+0.09 (mixed)	-0.05 (weak)	+0.24 (strong)
Breakout continuation	needs directional follow-through to hold excess	+0.24 (strong)	-0.02 (fragile)	-0.08 (weak)	+0.10 (mixed)
Mean reversion	improves when ranges hold and liquidity is balanced	-0.06 (weak)	+0.21 (strong)	+0.14 (strong)	-0.02 (fragile)
Basis compression	strongest when cross-venue stress starts to normalize	+0.05 (mixed)	+0.08 (mixed)	-0.01 (fragile)	+0.19 (strong)

Cell values are regime IC. The heatmap scale runs from negative to strong.

Core metric definitions

These are the core metrics the customer should be able to understand without needing an internal research runbook.

OOS IC

OOS IC means the average out-of-sample information coefficient across validation folds.

In the current research pipeline, the surfaced oos_mean_ic is the average out-of-sample linear correlation between the feature value and the future return:

IC = corr(feature_t, return_{t+h})

The research engine also tracks rank IC internally, but the default customer-facing card currently leads with OOS IC.

In practical terms, this asks a simple question: when the feature is high or low, does the future return tend to move with it in a consistent direction?
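The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the research pipeline's code: the function names `pearson_corr` and `oos_mean_ic` and the toy fold data are assumptions made for the example, and it takes the plain (Pearson) correlation per fold and averages the results.

```python
from math import sqrt
from statistics import mean


def pearson_corr(xs, ys):
    """Linear (Pearson) correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def oos_mean_ic(folds):
    """Average per-fold IC.

    `folds` is a list of (feature_values, forward_returns) pairs, one pair per
    out-of-sample validation fold. The forward returns are assumed to be
    already aligned to the feature at horizon h.
    """
    return mean(pearson_corr(feature, fwd_ret) for feature, fwd_ret in folds)


# Hypothetical toy data: the feature tracks the forward return in fold 1
# and moves against it in fold 2.
toy_folds = [
    ([1.0, 2.0, 3.0, 4.0], [0.01, 0.02, 0.03, 0.04]),
    ([1.0, 2.0, 3.0, 4.0], [0.04, 0.03, 0.02, 0.01]),
]
print(oos_mean_ic(toy_folds))
```

The toy folds cancel each other out, which is the point of averaging across folds: a feature that only works in one split does not earn a high OOS IC.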

IC IR

IC IR means the mean fold IC divided by the standard deviation of fold IC:

IC IR = mean(IC across folds) / stddev(IC across folds)

This is a stability metric, not a return metric. A higher IC IR means the feature's relationship with forward returns is less erratic across validation folds.
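As a rough sketch of the ratio, assuming the per-fold IC values are already available: the function name `ic_ir` and the fold values below are hypothetical, and the choice of the sample standard deviation (rather than the population version) is an assumption, since the formula above does not say which one the pipeline uses.

```python
from statistics import mean, stdev


def ic_ir(fold_ics):
    """Mean fold IC divided by the standard deviation of fold IC."""
    if len(fold_ics) < 2:
        raise ValueError("need at least two folds to measure dispersion")
    spread = stdev(fold_ics)  # sample standard deviation (assumption)
    if spread == 0:
        raise ValueError("IC IR is undefined when every fold has the same IC")
    return mean(fold_ics) / spread


# Hypothetical per-fold ICs, invented for illustration only.
steady_folds = [0.10, 0.20, 0.15, 0.08, 0.22]   # consistently positive
erratic_folds = [0.30, -0.05, 0.02, 0.10, 0.03]  # one strong fold carries the mean

print(round(ic_ir(steady_folds), 2), round(ic_ir(erratic_folds), 2))
```

Both lists have a positive mean IC, but the erratic one gets much of its mean from a single fold, so its ratio is far lower.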

Fold stability example

Feature	Profile	Mean IC	IC IR
Stable feature	positive IC across most folds	+0.15	2.6
Fragile feature	headline excess, weak fold consistency	+0.08	0.6

Positive folds

Positive folds is the share of validation folds where out-of-sample IC remained positive.

Positive Fold Share = (# folds where IC > 0) / (# total folds)

This matters because one strong period is not enough. A useful feature should survive more than one split.
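The share is a simple count, sketched here under the assumption that per-fold IC values are already computed. The name `positive_fold_share` and the sample values are illustrative; a fold with IC exactly equal to zero is not counted as positive, following the strict `IC > 0` in the formula.

```python
def positive_fold_share(fold_ics):
    """Fraction of validation folds whose out-of-sample IC is strictly positive."""
    if not fold_ics:
        raise ValueError("need at least one fold")
    positive = sum(1 for ic in fold_ics if ic > 0)
    return positive / len(fold_ics)


# Hypothetical fold ICs: four of five folds are positive.
share = positive_fold_share([0.30, -0.05, 0.02, 0.10, 0.03])
print(f"{share:.0%}")
```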

Decay status

Decay status answers whether recent live or rolling observations are weakening relative to the stored research baseline.

In the broader research stack, decay monitoring can look at:

rolling IC
rolling Sharpe
turnover drift
shortfall drift
live-vs-backtest IC gap
regime-specific IC weakness

The workspace should summarize the result, not dump the raw operator policy thresholds by default.
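One way to picture "summarize the result" is a function that reduces rolling IC observations to a single label. The sketch below covers only the rolling IC input from the list above. The label names, the `warn_ratio` threshold, and the function name `decay_status` are hypothetical placeholders, not the operator policy values.

```python
from statistics import mean


def decay_status(baseline_ic, rolling_ics, warn_ratio=0.5):
    """Summarize recent rolling IC against the stored research baseline.

    All thresholds and labels here are illustrative assumptions:
    - "decayed"   : recent average IC is zero or negative
    - "weakening" : recent average IC is below warn_ratio * baseline
    - "stable"    : otherwise
    """
    if baseline_ic <= 0:
        raise ValueError("this ratio-based sketch expects a positive baseline IC")
    if not rolling_ics:
        raise ValueError("need at least one rolling observation")
    recent = mean(rolling_ics)
    if recent <= 0:
        return "decayed"
    if recent < warn_ratio * baseline_ic:
        return "weakening"
    return "stable"


# Hypothetical observations against a stored baseline IC of 0.15.
print(decay_status(0.15, [0.14, 0.16, 0.13]))
print(decay_status(0.15, [0.05, 0.06, 0.04]))
print(decay_status(0.15, [-0.02, 0.01, -0.03]))
```

The customer sees only the returned label; the ratio and the raw rolling values stay in the advanced or operator view.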

Advanced evidence layer

More advanced users may want deeper evidence, but it needs context. Advanced metrics should never appear as raw brag numbers without explanation.

Examples of advanced evidence:

Sample size
Holdout or out-of-sample markers
Positive fold share
Turnover and cost notes
Rank IC and rank-based evidence
Adjusted p-value and screening-survival context

Multiple testing and adjusted p-values

Feature screening is where multiple-comparison bias starts to matter.

The current screening path already uses a public, code-backed correction choice:

correction method: bonferroni_ic_pvalue.v1
significance rule: adjusted_p_value <= 0.05
directional requirement: directional_oos_mean_ic > 0

At a high level, the correction behaves like:

adjusted_p_value = min(1, raw_p_value * number_of_tests)

In plain language: a candidate should not survive only because many features were tested and one happened to look good by chance.
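The rule above can be sketched as follows. The function names and example numbers are assumptions for illustration; only the formula, the 0.05 cutoff, and the positive directional IC requirement come from the screening description above.

```python
def bonferroni_adjust(raw_p_value, number_of_tests):
    """Bonferroni-style adjustment: scale the raw p-value by the test count, capped at 1."""
    if number_of_tests < 1:
        raise ValueError("number_of_tests must be at least 1")
    return min(1.0, raw_p_value * number_of_tests)


def survives_screening(raw_p_value, number_of_tests, directional_oos_mean_ic, alpha=0.05):
    """Apply both conditions: adjusted p-value at or below alpha, and positive directional IC."""
    adjusted = bonferroni_adjust(raw_p_value, number_of_tests)
    return adjusted <= alpha and directional_oos_mean_ic > 0


# Hypothetical candidate with raw p = 0.004 and directional OOS mean IC = 0.12.
print(survives_screening(0.004, 10, 0.12))  # tested alongside 10 features
print(survives_screening(0.004, 50, 0.12))  # tested alongside 50 features
```

The same raw p-value passes when 10 features were tested and fails when 50 were, which is the multiple-comparison effect the correction is meant to capture.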

The docs should explain this methodology. The default customer card does not need to lead with adjusted_p_value.

What not to do

Do not lead with a single raw metric as if it were self-explanatory. Raw IC, isolated fold statistics, or isolated fitted weights shown without context are more likely to confuse than to build trust.

What the customer should take away

The customer should be able to answer:

How much evidence exists
How recent that evidence is
What the main uncertainty is
Whether the safest next action is another backtest, paper observation, or a live attempt with a warning
