Metrics And Evidence
Review which evidence customers should see by default and which evidence belongs behind a more advanced layer.
Evidence should build trust, not turn raw research metrics into unexplained marketing headlines.
Default evidence layer
The default customer-facing layer answers:
| Question | What to show |
|---|---|
| How strong is the evidence? | Evidence strength indicator |
| When was it last validated? | Recency markers |
| What are the key risks? | Clear warning language |
| What should I do next? | Suggested next step: backtest, paper, or live |
Core metric definitions
These are the core metrics the customer should be able to understand without needing an internal research runbook.
OOS IC
OOS IC means the average out-of-sample information coefficient across validation folds.
In the current research pipeline, the surfaced oos_mean_ic is the average out-of-sample linear correlation between the feature value and the future return:
    IC = corr(feature_t, return_{t+h})

The research engine also tracks rank IC internally, but the default customer-facing card currently leads with OOS IC.
In practical terms, this asks a simple question: when the feature is high or low, does the future return tend to move with it in a consistent direction?
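As a rough illustration of the definition above, the sketch below computes a Pearson correlation per fold and averages the results. The function names `pearson_corr` and `oos_mean_ic` and the fold layout (a pair of feature values and forward returns) are assumptions made for this example, not the research pipeline's actual interface.

```python
import math


def pearson_corr(xs, ys):
    """Linear (Pearson) correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def oos_mean_ic(folds):
    """Average per-fold IC over out-of-sample folds.

    Each fold is a hypothetical (feature_values, forward_returns) pair, where
    forward_returns[i] is the return h periods after feature_values[i].
    Returns the mean IC together with the list of per-fold ICs.
    """
    if not folds:
        raise ValueError("need at least one fold")
    fold_ics = [pearson_corr(feature, fwd_ret) for feature, fwd_ret in folds]
    return sum(fold_ics) / len(fold_ics), fold_ics


# Illustrative data only: two small out-of-sample folds.
example_folds = [
    ([1.0, 2.0, 3.0, 4.0], [0.01, 0.02, 0.03, 0.04]),
    ([1.0, 2.0, 3.0, 4.0], [0.02, 0.01, 0.04, 0.03]),
]
mean_ic, per_fold_ic = oos_mean_ic(example_folds)
```

A positive `mean_ic` here means that, on average across folds, higher feature values were followed by higher returns.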
IC IR
IC IR means the mean fold IC divided by the standard deviation of fold IC:
    IC IR = mean(IC across folds) / stddev(IC across folds)

This is a stability metric, not a return metric. A higher IC IR means the feature's relationship with forward returns is less erratic across validation folds.
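A minimal sketch of the ratio, assuming the per-fold IC values are already available as a list. The function name `ic_ir` is hypothetical, and the use of the sample standard deviation is an assumption, since the source does not say whether the sample or population form is used.

```python
import statistics


def ic_ir(fold_ics):
    """Mean fold IC divided by the standard deviation of fold IC."""
    if len(fold_ics) < 2:
        raise ValueError("need at least two folds to measure dispersion")
    # Assumption: sample standard deviation (n - 1 in the denominator).
    spread = statistics.stdev(fold_ics)
    if spread == 0:
        raise ValueError("IC IR is undefined when every fold has the same IC")
    return statistics.fmean(fold_ics) / spread


# Two hypothetical features with the same mean IC (0.04) but different stability.
steady = ic_ir([0.03, 0.04, 0.05])
erratic = ic_ir([-0.06, 0.04, 0.14])
```

Both lists have the same mean fold IC, so only the dispersion differs. The steadier feature ends up with the higher IC IR.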
Positive folds
Positive folds is the share of validation folds where out-of-sample IC remained positive.
    Positive Fold Share = (# folds where IC > 0) / (# total folds)

This matters because one strong period is not enough. A useful feature should survive more than one split.
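The share is a simple count ratio. The sketch below assumes per-fold IC values are available as a list; the function name `positive_fold_share` is illustrative only.

```python
def positive_fold_share(fold_ics):
    """Fraction of validation folds whose out-of-sample IC is strictly positive."""
    if not fold_ics:
        raise ValueError("need at least one fold")
    positive = sum(1 for ic in fold_ics if ic > 0)
    return positive / len(fold_ics)


# Three of four hypothetical folds stay positive.
share = positive_fold_share([0.03, -0.01, 0.02, 0.05])
```

A fold with an IC of exactly zero is not counted as positive here, which matches the strict `IC > 0` condition in the formula.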
Decay status
Decay status answers whether recent live or rolling observations are weakening relative to the stored research baseline.
In the broader research stack, decay monitoring can look at:
- rolling IC
- rolling Sharpe
- turnover drift
- shortfall drift
- live-vs-backtest IC gap
- regime-specific IC weakness
The workspace should summarize the result, not dump the raw operator policy thresholds by default.
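One way to picture summarizing rather than dumping thresholds is a function that compares recent rolling IC with the stored baseline and returns only a short label. Everything in this sketch is hypothetical: the function name, the labels, the single `weakening_ratio` threshold, and the choice to look at rolling IC alone rather than the full list of signals above.

```python
def decay_status(baseline_ic, rolling_ics, weakening_ratio=0.5):
    """Summarize rolling IC against a research baseline as a short label.

    Hypothetical policy for illustration only:
      - "decayed"   if the recent average IC is zero or negative
      - "weakening" if it fell below weakening_ratio * baseline_ic
      - "stable"    otherwise
    """
    if baseline_ic <= 0:
        raise ValueError("this sketch assumes a positive baseline IC")
    if not rolling_ics:
        raise ValueError("need at least one rolling observation")
    recent_ic = sum(rolling_ics) / len(rolling_ics)
    if recent_ic <= 0:
        return "decayed"
    if recent_ic < weakening_ratio * baseline_ic:
        return "weakening"
    return "stable"


# The customer-facing surface would show only the label, not the threshold.
status = decay_status(baseline_ic=0.05, rolling_ics=[0.02, 0.01, 0.03])
```

The threshold stays inside the function as operator policy, while the workspace receives a single word it can pair with plain-language warning text.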
Advanced evidence layer
More advanced users may want deeper evidence, but it needs context. Advanced metrics should never appear as raw brag numbers without explanation.
Examples of advanced evidence:
- Sample size
- Holdout or out-of-sample markers
- Positive fold share
- Turnover and cost notes
- Rank IC and rank-based evidence
- Adjusted p-value and screening-survival context
Multiple testing and adjusted p-values
Feature screening is where multiple-comparison bias starts to matter.
The current screening path already uses a public, code-backed correction choice:
- correction method: bonferroni_ic_pvalue.v1
- significance rule: adjusted_p_value <= 0.05
- directional requirement: directional_oos_mean_ic > 0
At a high level, the correction behaves like:
    adjusted_p_value = min(1, raw_p_value * number_of_tests)

In plain language: a candidate should not survive only because many features were tested and one happened to look good by chance.
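The correction and the two screening rules can be sketched together as follows. The function names `bonferroni_adjust` and `survives_screening` are hypothetical, and the code illustrates the high-level behavior described here rather than the actual `bonferroni_ic_pvalue.v1` implementation.

```python
def bonferroni_adjust(raw_p_value, number_of_tests):
    """Bonferroni-style adjustment: scale the raw p-value by the test count, capped at 1."""
    if not 0.0 <= raw_p_value <= 1.0:
        raise ValueError("raw_p_value must be between 0 and 1")
    if number_of_tests < 1:
        raise ValueError("number_of_tests must be at least 1")
    return min(1.0, raw_p_value * number_of_tests)


def survives_screening(raw_p_value, number_of_tests, directional_oos_mean_ic, alpha=0.05):
    """Apply the significance rule and the directional requirement together."""
    adjusted_p_value = bonferroni_adjust(raw_p_value, number_of_tests)
    return adjusted_p_value <= alpha and directional_oos_mean_ic > 0


# The same raw p-value survives when 4 features were tested, but not when 40 were.
few_tests = survives_screening(0.01, 4, directional_oos_mean_ic=0.03)
many_tests = survives_screening(0.01, 40, directional_oos_mean_ic=0.03)
```

With 4 tests the adjusted p-value is 0.04 and the candidate passes. With 40 tests it rises to 0.4 and the candidate is rejected, even though the raw evidence is identical.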
The docs should explain this methodology. The default customer card does not need to lead with adjusted_p_value.
What not to do
Do not lead with a single raw metric as if it were self-explanatory. Raw IC, isolated fold statistics, or isolated fitted weights without context are more likely to confuse than to build trust.
What the customer should take away
The customer should be able to answer:
- How much evidence exists
- How recent that evidence is
- What the main uncertainty is
- Whether the safer next action is another backtest, paper observation, or a live attempt with warnings