Data Provenance And Hygiene
See what kinds of market data feed the research system, how they are framed, and what the public trust claim should and should not be.
Reliable research starts before the first metric is calculated.
If the underlying market data is vague, stale, or treated carelessly, no amount of analysis or backtesting can repair the trust problem later.
The public trust claim
The customer-facing claim should be conservative:
- research is built on explicit market data families rather than opaque "black-box AI outputs"
- candidate evidence is tied to identifiable inputs and timestamps
- point-in-time discipline matters more than headline performance
- freshness and warning posture are part of the trust framework
That is enough to build credibility without pretending every operational detail needs to be public.
Data families the research stack works with
The research system works with concrete market inputs such as:
| Family | What it represents |
|---|---|
| Trades | Executed trade flow |
| Best bid / offer | Top-of-book liquidity and spread |
| Order book depth (L2) | Multi-level depth, imbalance, and microstructure pressure |
| Funding | Perpetual funding rates |
| Open interest | Market-wide open positioning |
| Mark price | Exchange mark price |
| Index price | Cross-venue reference price |
| Liquidations | Forced liquidation events |
| Oracle price | External reference feeds |
These are useful because they map to real market mechanisms. The product should never imply that the feature engine is creating value from unexplained magic inputs.
That includes book-aware data. Statly research does not stop at top-of-book alone. It also works with L2 / order book depth, which makes it possible to reason about:
- depth imbalance
- spread quality versus available liquidity
- order book pressure
- execution sensitivity in thinner conditions
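As a rough illustration of the first two items, the sketch below computes a normalized depth imbalance and a top-of-book spread from an L2 snapshot. The function names, the `(price, size)` level layout, and the default of five levels are assumptions made for this example; this is one common way to define imbalance, not a description of Statly's feature engine.

```python
from typing import Sequence, Tuple

# Hypothetical layout: each book level is a (price, size) pair,
# with bids sorted best-first (descending) and asks best-first (ascending).
Level = Tuple[float, float]


def depth_imbalance(bids: Sequence[Level], asks: Sequence[Level], levels: int = 5) -> float:
    """Normalized imbalance in [-1, 1] over the top `levels` of each side.

    Positive values mean more resting size on the bid side.
    """
    bid_size = sum(size for _, size in bids[:levels])
    ask_size = sum(size for _, size in asks[:levels])
    total = bid_size + ask_size
    if total == 0:
        return 0.0  # empty book: no pressure can be inferred
    return (bid_size - ask_size) / total


def spread_bps(bids: Sequence[Level], asks: Sequence[Level]) -> float:
    """Top-of-book spread in basis points of the mid price."""
    best_bid, best_ask = bids[0][0], asks[0][0]
    mid = (best_bid + best_ask) / 2
    return (best_ask - best_bid) / mid * 10_000
```

Reading the two numbers together is what separates spread quality from available liquidity: a tight spread with very little size behind it is a different condition from a tight spread backed by deep levels.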
Vendor-backed and native collection
Some lanes are sourced from external vendors. Others are collected natively.
The important customer-facing point is not which vendor is used for every lane. It is that provenance is tracked at all:
- data families are explicit
- research runs are associated with manifests and versioned inputs
- candidate posture can reflect stale or incomplete evidence instead of hiding it
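One way to picture a manifest with versioned inputs is a small immutable record per research run. The sketch below is illustrative only: `InputRef`, `RunManifest`, their fields, and the fingerprinting scheme are hypothetical names chosen for this example, not the actual manifest format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class InputRef:
    family: str   # e.g. "trades", "funding", "l2_depth"
    source: str   # "vendor" or "native"
    version: str  # hypothetical label for the dataset snapshot used
    as_of: str    # ISO-8601 timestamp of the latest observation included


@dataclass(frozen=True)
class RunManifest:
    run_id: str
    inputs: Tuple[InputRef, ...]

    def fingerprint(self) -> str:
        """Stable hash of the manifest; changes whenever any input changes."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```

Because the fingerprint depends on every input's family, source, version, and timestamp, two runs with the same fingerprint can be said to rest on the same declared inputs, whether those inputs are vendor-backed or natively collected.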
That same provenance framework applies to both headline-friendly data families such as trades and funding, and more microstructure-heavy inputs such as L2 depth.
The product should teach that provenance is a first-class concern. It does not need to publish fine-grained or vendor-contract-level operational detail to do that.
How data should be treated before it becomes evidence
Market data should not go directly from ingestion to customer-facing evidence.
There are at least four questions the system must answer first:
- Was this data available at the time?
- Is it complete enough to support the claim being made?
- Does it look structurally plausible?
- Is it recent enough to deserve current trust?
That is why evidence posture should be downstream of data hygiene, not independent from it.
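The four questions above can be read as a gate that sits between ingestion and any customer-facing posture. The following is a minimal sketch under stated assumptions: the `HygieneChecks` record and the posture labels `"withheld"`, `"caution"`, and `"current"` are hypothetical, chosen only to show posture being derived from hygiene rather than set independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HygieneChecks:
    available_at_time: bool       # Was this data available at the time?
    complete_enough: bool         # Is it complete enough for the claim?
    structurally_plausible: bool  # Does it look structurally plausible?
    recent_enough: bool           # Is it recent enough for current trust?


def evidence_posture(checks: HygieneChecks) -> str:
    """Derive a (hypothetical) posture label from the hygiene checks."""
    # The first three checks bear on whether the original claim holds at all.
    claim_is_sound = (
        checks.available_at_time
        and checks.complete_enough
        and checks.structurally_plausible
    )
    if not claim_is_sound:
        return "withheld"
    # Recency only affects how much trust the claim deserves today.
    if not checks.recent_enough:
        return "caution"
    return "current"
```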
Point-in-time discipline
The research section should keep repeating one simple rule:
evidence_t must use only information available at time t
This is the foundation for avoiding look-ahead bias.
A feature should only be judged on what could have been known when the decision would actually have been made. Anything else turns research into hindsight storytelling.
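A minimal sketch of that rule, assuming each observation carries both the time the event occurred and the time it became available to the system. The `Observation` type and `visible_at` helper are hypothetical names for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Observation:
    event_time: datetime      # when the market event occurred
    available_time: datetime  # when the system could actually have seen it
    value: float


def visible_at(observations: Iterable[Observation], t: datetime) -> List[Observation]:
    """Keep only observations that were already available at decision time t."""
    # Filtering on available_time, not event_time, is what prevents
    # late-arriving or revised data from leaking into evidence at time t.
    return [o for o in observations if o.available_time <= t]
```

The design choice worth noting is the second timestamp: an event that happened before `t` but was only delivered after `t` must still be excluded, otherwise the evidence quietly uses information that was not knowable at the time.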
Missing data and stale data
The public docs should be explicit that missing data and stale data are not interchangeable.
- Missing data means the input stream is incomplete for the period or lane being used.
- Stale data means the evidence may once have been valid, but its recency has degraded relative to current market conditions.
Those should not be hidden behind a single generic badge. They imply different risks:
- missing data threatens the integrity of the original claim
- stale data threatens how much current trust the claim deserves
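To show why the two conditions deserve separate flags, the sketch below classifies an input lane using a coverage ratio for missingness and an age check for staleness. The `DataIssue` flag, the `classify` helper, and both default thresholds are illustrative assumptions, not published policy.

```python
from datetime import datetime, timedelta
from enum import Flag, auto


class DataIssue(Flag):
    NONE = 0
    MISSING = auto()  # the stream is incomplete for the period being used
    STALE = auto()    # the latest observation is too old for current trust


def classify(
    expected_points: int,
    observed_points: int,
    last_observed: datetime,
    now: datetime,
    min_coverage: float = 0.95,              # hypothetical threshold
    max_age: timedelta = timedelta(hours=1),  # hypothetical threshold
) -> DataIssue:
    """Return independent MISSING / STALE flags for one input lane."""
    issues = DataIssue.NONE
    coverage = observed_points / expected_points if expected_points else 0.0
    if coverage < min_coverage:
        issues |= DataIssue.MISSING
    if now - last_observed > max_age:
        issues |= DataIssue.STALE
    return issues
```

Using a flag type rather than a single status lets an input be missing, stale, both, or neither, so the two risks are never collapsed into one generic badge.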
Outliers and structural plausibility
Customers do not need an exchange-by-exchange outlier rulebook. They do need confidence that the system distinguishes between:
- real market stress
- broken prints
- feed anomalies
- partial coverage
The honest public claim is:
- the research stack takes structural plausibility seriously
- suspicious or degraded inputs should reduce confidence, not silently flow through as if they were clean
- the same hygiene standard applies whether the input is a trade print, a funding observation, or a depth snapshot
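As a hedged illustration of "reduce confidence, not silently flow through", the sketch below scores a single price observation against a reference price and marks it suspect when it deviates too far. The `score_print` function, the 5% deviation limit, and the 0.5 penalty are hypothetical values invented for this example. A deviation check alone cannot tell real market stress from a broken print; it only ensures the observation is flagged rather than treated as clean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredPrint:
    price: float
    confidence: float  # 1.0 = clean, lower = degraded
    suspect: bool


def score_print(
    price: float,
    reference: float,
    max_deviation: float = 0.05,  # hypothetical limit: 5% from reference
    penalty: float = 0.5,         # hypothetical confidence for suspect prints
) -> ScoredPrint:
    """Flag implausible prints and lower their confidence instead of dropping them."""
    if price <= 0 or reference <= 0:
        # Non-positive prices are structurally impossible for these inputs.
        return ScoredPrint(price, 0.0, True)
    deviation = abs(price - reference) / reference
    if deviation > max_deviation:
        return ScoredPrint(price, penalty, True)
    return ScoredPrint(price, 1.0, False)
```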
What the docs should not do yet is publish unsupported lane-by-lane policy detail if that policy is not ready to defend publicly.
Freshness is part of the evidence
Freshness is not cosmetic metadata.
It is part of whether a customer should trust a candidate today.
That is why the workspace should continue to make room for:
- recency markers
- decay warnings
- validation age
- safer suggested next steps
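The markers above could be derived from a single validation timestamp. The sketch below is one possible shape, assuming hypothetical 7-day and 30-day tiers; the `freshness_marker` function, its status labels, and the suggested next steps are illustrative rather than actual product behavior.

```python
from datetime import datetime, timedelta


def freshness_marker(
    validated_at: datetime,
    now: datetime,
    warn_after: timedelta = timedelta(days=7),    # hypothetical tier
    decay_after: timedelta = timedelta(days=30),  # hypothetical tier
) -> dict:
    """Summarize validation age, a decay status, and a suggested next step."""
    age = now - validated_at
    if age > decay_after:
        status = "decayed"
        next_step = "re-validate before relying on this candidate"
    elif age > warn_after:
        status = "aging"
        next_step = "review recent market conditions"
    else:
        status = "fresh"
        next_step = "none"
    return {
        "validation_age_days": age.days,
        "status": status,
        "suggested_next_step": next_step,
    }
```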
In practical terms, the product should communicate something like:
strong historical evidence - stale current confidence = caution, not certainty
What should remain internal for now
There is a healthy boundary between methodological transparency and operator runbooks.
It is reasonable to keep these out of the public docs for now:
- lane-by-lane outlier thresholds
- feed-specific incident handling
- vendor-specific operational playbooks
- exact internal governance thresholds
Those belong in internal admin and runtime documentation unless and until they are ready for public scrutiny.
What the customer should take away
The customer should leave this page believing five things:
- Statly research is grounded in real, identifiable market data families.
- Provenance and recency are part of the trust framework.
- The system tries to protect against hindsight and dirty inputs.
- The stack includes both top-of-book and L2 depth-aware market structure, not just headline price series.
- Public methodology is open where it builds trust, and bounded where raw operator detail would be noise or over-claim.