Data Provenance And Hygiene

See what kinds of market data feed the research system, how they are framed, and what the public trust claim should and should not be.

Reliable research starts before the first metric is calculated.

If the underlying market data is vague, stale, or treated carelessly, no amount of analysis or backtesting can repair the trust problem later.

The public trust claim

The customer-facing claim should be conservative:

research is built on explicit market data families rather than opaque "black-box AI outputs"
candidate evidence is tied to identifiable inputs and timestamps
point-in-time discipline matters more than headline performance
freshness and warning posture are part of the trust framework

That is enough to build credibility without pretending every operational detail needs to be public.

Data families the research stack works with

The research system works with concrete market inputs such as:

Family	What it represents
Trades	Executed trade flow
Best bid / offer	Top-of-book liquidity and spread
Order book depth (L2)	Multi-level depth, imbalance, and microstructure pressure
Funding	Perpetual funding rates
Open interest	Market-wide open positioning
Mark price	Exchange mark price
Index price	Cross-venue reference price
Liquidations	Forced liquidation events
Oracle price	External reference feeds

These are useful because they map to real market mechanisms. The product should never imply that the feature engine is creating value from unexplained magic inputs.

That includes book-aware data. Statly research does not stop at top-of-book alone. It also works with L2 / order book depth, which makes it possible to reason about:

depth imbalance
spread quality versus available liquidity
order book pressure
execution sensitivity in thinner conditions
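As a rough illustration of what book-aware reasoning can look like, the sketch below computes a depth imbalance and a top-of-book spread from an L2 snapshot. The `Level` type, the function names, and the default of five levels are assumptions made for this example, not the feature definitions used in the research stack.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Level:
    """One price level of an L2 snapshot (hypothetical shape)."""
    price: float
    size: float


def depth_imbalance(bids: Sequence[Level], asks: Sequence[Level], levels: int = 5) -> float:
    """Signed imbalance in [-1, 1] over the first `levels` levels per side.

    Positive means more resting size on the bid side.
    """
    bid_size = sum(level.size for level in bids[:levels])
    ask_size = sum(level.size for level in asks[:levels])
    total = bid_size + ask_size
    if total == 0:
        return 0.0  # empty book: report neutral rather than divide by zero
    return (bid_size - ask_size) / total


def spread_bps(bids: Sequence[Level], asks: Sequence[Level]) -> float:
    """Top-of-book spread in basis points of the mid price."""
    if not bids or not asks:
        raise ValueError("both sides of the book are required")
    best_bid, best_ask = bids[0].price, asks[0].price
    mid = (best_bid + best_ask) / 2
    return (best_ask - best_bid) / mid * 1e4
```

Reading the two numbers together is the point: a tight spread with very little size behind it is a different execution environment from the same spread backed by deep liquidity.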

Vendor-backed and native collection

Some lanes are sourced from external vendors. Others are collected natively.

The important customer-facing point is not which vendor is used for every lane. It is that provenance is tracked at all:

data families are explicit
research runs are associated with manifests and versioned inputs
candidate posture can reflect stale or incomplete evidence instead of hiding it

That same provenance framework applies to both headline-friendly data families such as trades and funding, and more microstructure-heavy inputs such as L2 depth.
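One way to picture "manifests and versioned inputs" is a small record per input plus a deterministic identifier for the set. This is a minimal sketch under assumptions: `InputRef`, its fields, and `manifest_id` are hypothetical names, and the real manifest format is not described on this page.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass(frozen=True, order=True)
class InputRef:
    """Reference to one versioned input used by a research run (hypothetical)."""
    family: str    # e.g. "trades", "funding", "l2_depth"
    source: str    # "vendor" or "native"
    version: str   # version label of the dataset snapshot
    start_ts: str  # ISO 8601 start of the covered period
    end_ts: str    # ISO 8601 end of the covered period


def manifest_id(inputs: Iterable[InputRef]) -> str:
    """Stable short identifier for a set of inputs.

    Inputs are sorted first, so the same set always produces the same id
    regardless of the order in which it was assembled.
    """
    payload = json.dumps([asdict(ref) for ref in sorted(inputs)], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Because the identifier depends on every field, changing the version of a single input yields a different manifest, which is what lets a candidate be traced back to exactly the data it was built on.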

The product should teach that provenance is a first-class concern. It does not need to publish vendor-contract-level or other fine-grained operational detail to do that.


How data should be treated before it becomes evidence

Market data should not go directly from ingestion to customer-facing evidence.

There are at least four questions the system must answer first:

Was this data available at the time?
Is it complete enough to support the claim being made?
Does it look structurally plausible?
Is it recent enough to deserve current trust?

That is why evidence posture should be downstream of data hygiene, not independent from it.
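The four questions can be read as a gate that sits between ingestion and evidence. The sketch below is illustrative only: `HygieneReport`, the posture labels, and the rule for which failures block a claim outright are assumptions chosen for the example, not documented product policy.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class HygieneReport:
    """Answers to the four pre-evidence questions (hypothetical shape)."""
    available_at_time: bool  # was the data available at the time?
    complete: bool           # complete enough to support the claim?
    plausible: bool          # structurally plausible?
    fresh: bool              # recent enough to deserve current trust?

    def failures(self) -> List[str]:
        """Names of the checks that did not pass, in declaration order."""
        return [name for name, ok in asdict(self).items() if not ok]


def evidence_posture(report: HygieneReport) -> str:
    """Derive posture from hygiene, never the other way round.

    Assumed rule for this sketch: a point-in-time or completeness failure
    undermines the original claim, so it blocks; a plausibility or freshness
    failure reduces confidence, so it degrades.
    """
    failed = report.failures()
    if not failed:
        return "clean"
    if "available_at_time" in failed or "complete" in failed:
        return "blocked"
    return "degraded"
```

The design choice worth noting is that posture is computed from the report. There is no code path in which evidence receives a posture without the four checks having been answered.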

Point-in-time discipline

The research section should keep repeating one simple rule:

evidence_t must use only information available at time t

This is the foundation for avoiding look-ahead bias.

A feature should only be judged on what could have been known when the decision would actually have been made. Anything else turns research into hindsight storytelling.
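A minimal sketch of the rule, assuming each observation carries both the time the event happened and the time it became available to the system. The `Observation` type and `as_of` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class Observation:
    event_time: datetime      # when the market event occurred
    available_time: datetime  # when the system could first have known about it
    value: float


def as_of(observations: Iterable[Observation], t: datetime) -> List[Observation]:
    """Keep only observations that were knowable at time t.

    Filtering on available_time, not event_time, is what prevents
    late-arriving or revised data from leaking into past decisions.
    """
    return [obs for obs in observations if obs.available_time <= t]


# Example: a print that happened before t but arrived after t is excluded.
t = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
on_time = Observation(t - timedelta(minutes=5), t - timedelta(minutes=4), 1.0)
late_arrival = Observation(t - timedelta(minutes=1), t + timedelta(minutes=2), 2.0)
visible = as_of([on_time, late_arrival], t)
```

Here `visible` contains only `on_time`. The late arrival has an event time before `t`, which is exactly the case where filtering on event time alone would introduce look-ahead bias.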

Missing data and stale data

The public docs should be explicit that missing data and stale data are not interchangeable.

Missing data means the input stream is incomplete for the period or lane being used.
Stale data means the evidence may once have been valid, but its recency has degraded relative to current market conditions.

Those should not be hidden behind a single generic badge. They imply different risks:

missing data threatens the integrity of the original claim
stale data threatens how much current trust the claim deserves
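The distinction can be made concrete by computing the two conditions separately and never folding them into one flag. In this sketch the function names and the gap and age thresholds passed in by the caller are assumptions for illustration, not published policy.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def coverage_gaps(
    timestamps: Iterable[datetime],
    start: datetime,
    end: datetime,
    max_gap: timedelta,
) -> List[Tuple[datetime, datetime]]:
    """Missing data: stretches inside [start, end] with no observations.

    Returns (gap_start, gap_end) pairs wherever consecutive points,
    including the window edges, are more than max_gap apart.
    """
    inside = sorted(ts for ts in timestamps if start <= ts <= end)
    points = [start, *inside, end]
    return [(a, b) for a, b in zip(points, points[1:]) if b - a > max_gap]


def is_stale(last_validated: datetime, now: datetime, max_age: timedelta) -> bool:
    """Stale data: the evidence was valid once, but its validation is too old."""
    return now - last_validated > max_age
```

A candidate can then carry both results side by side, for example a non-empty gap list with a recent validation, or a complete series that has simply not been revalidated for too long.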

Outliers and structural plausibility

Customers do not need an exchange-by-exchange outlier rulebook. They do need confidence that the system distinguishes between:

real market stress
broken prints
feed anomalies
partial coverage

The honest public claim is:

the research stack takes structural plausibility seriously
suspicious or degraded inputs should reduce confidence, not silently flow through as if they were clean
the same hygiene standard applies whether the input is a trade print, a funding observation, or a depth snapshot

What the docs should not do yet is publish lane-by-lane policy detail that is not ready to be defended publicly.

Freshness is part of the evidence

Freshness is not cosmetic metadata.

It is part of whether a customer should trust a candidate today.

That is why the workspace should continue to make room for:

recency markers
decay warnings
validation age
safer suggested next steps

In practical terms, the product should communicate something like:

strong historical evidence + stale current confidence = caution, not certainty
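One possible way to express that idea numerically is to discount a historical score by validation age. The exponential half-life model, the 30-day default, and the 0.5 cutoff below are all assumptions made for this sketch. The page does not specify how freshness is actually scored.

```python
def decayed_confidence(
    historical_score: float,
    validation_age_days: float,
    half_life_days: float = 30.0,  # hypothetical default
) -> float:
    """Discount a historical score by how long ago it was last validated.

    The score halves every half_life_days (an assumed decay model).
    """
    if validation_age_days < 0:
        raise ValueError("validation age cannot be negative")
    if half_life_days <= 0:
        raise ValueError("half-life must be positive")
    return historical_score * 0.5 ** (validation_age_days / half_life_days)


def posture(
    historical_score: float,
    validation_age_days: float,
    caution_below: float = 0.5,  # hypothetical cutoff
) -> str:
    """Return "caution" when age has eroded the score below the cutoff."""
    current = decayed_confidence(historical_score, validation_age_days)
    return "caution" if current < caution_below else "current"
```

Under these assumed parameters, a candidate with a historical score of 0.9 that was validated today keeps its standing, while the same candidate validated 30 days ago drops to 0.45 and is shown with a caution posture rather than presented as certain.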

What should remain internal for now

There is a healthy boundary between methodological transparency and operator runbooks.

It is reasonable to keep these out of the public docs for now:

lane-by-lane outlier thresholds
feed-specific incident handling
vendor-specific operational playbooks
exact internal governance thresholds

Those belong in internal admin and runtime documentation unless and until they are ready for public scrutiny.

What the customer should take away

The customer should leave this page believing five things:

Statly research is grounded in real, identifiable market data families.
Provenance and recency are part of the trust framework.
The system tries to protect against hindsight and dirty inputs.
The stack includes both top-of-book and L2 depth-aware market structure, not just headline price series.
Public methodology is open where it builds trust, and bounded where raw operator detail would be noise or over-claim.
