Data QA workflows

Quality-assurance metrics

Each dataset has QA metrics associated with it to verify that the data meets the minimum expected quality.

E.g., for 1-minute OHLCV data, some possible QA metrics are:
- Missing bars for a given timestamp
- Missing/NaN OHLCV values within an individual bar
- Data points with OHLC data and volume = 0
- Data points where the OHLCV values are not in the correct relationship
  - E.g., H is lower than O or C, or L is higher than O or C
- Outlier data points
  - E.g., a data point is more than N standard deviations from the running mean
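
The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the actual QA flow: the bar layout (a dict per timestamp), the function name `compute_qa_metrics`, the metric names, and the trailing-window interpretation of "running mean" are all assumptions made for the example.

```python
import math
from datetime import datetime, timedelta
from statistics import mean, pstdev
from typing import Dict, List, Optional

# Hypothetical bar layout: one dict per 1-minute bar.
Bar = Dict[str, Optional[float]]
FIELDS = ("open", "high", "low", "close", "volume")


def compute_qa_metrics(
    bars: Dict[datetime, Bar],
    start: datetime,
    end: datetime,
    n_std: float = 3.0,
    window: int = 5,
) -> Dict[str, List[datetime]]:
    """Return, for each QA metric, the timestamps in [start, end] that fail it."""
    report: Dict[str, List[datetime]] = {
        "missing_bar": [], "nan_value": [], "zero_volume": [],
        "bad_ohlc": [], "outlier": [],
    }
    closes = []  # (timestamp, close) of bars with complete values
    ts = start
    while ts <= end:
        bar = bars.get(ts)
        if bar is None:
            report["missing_bar"].append(ts)
        elif any(bar.get(f) is None or math.isnan(bar[f]) for f in FIELDS):
            report["nan_value"].append(ts)
        else:
            if bar["volume"] == 0:
                report["zero_volume"].append(ts)
            body_high = max(bar["open"], bar["close"])
            body_low = min(bar["open"], bar["close"])
            # H must be >= both O and C; L must be <= both O and C.
            if bar["high"] < body_high or bar["low"] > body_low:
                report["bad_ohlc"].append(ts)
            closes.append((ts, bar["close"]))
        ts += timedelta(minutes=1)
    # Outliers: compare each close with the mean / std of the previous
    # `window` closes (assumed meaning of "running mean").
    for i in range(window, len(closes)):
        history = [c for _, c in closes[i - window:i]]
        mu, sigma = mean(history), pstdev(history)
        ts_i, close = closes[i]
        if sigma > 0 and abs(close - mu) > n_std * sigma:
            report["outlier"].append(ts_i)
    return report
```

Because the function only takes a mapping of bars and a time interval, the same code can be applied to a bulk historical download or to the latest window of real-time data.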

The code for the QA flow is the same for bulk (i.e., historical) and periodic (i.e., real-time) data.

Bulk data single-dataset QA metrics

The QA flow can be run to compute the quality of the historical data. This is typically done as a one-off operation right after the historical data is downloaded in bulk, and it touches only one dataset, namely the one that was just downloaded.

Periodic QA metrics

Every N minutes of downloading real-time data, the QA flow is run to generate statistics about the quality of the data. If the data quality is low, the system sends a notification.

Cross-datasets QA metrics

There are QA workflows that compare different datasets that are related to each other, e.g.:
- The same data (e.g., 1-minute OHLCV for spot BTC_USDT from the Binance exchange) downloaded from different providers (e.g., Binance directly and a third-party provider)
- A REST API that returns data for a given period vs a websocket that streams the same data
- A historical dump of the data from a third-party provider vs the data from the exchange real-time stream
- NASDAQ streaming data vs TAQ data disseminated once the market is closed

Historical vs real-time QA flow

Every period $T_{dl,hist}$, a QA flow is run where the real-time data is compared to the historical data to ensure that the historical view of the data matches the real-time one.

This is necessary but not sufficient to guarantee that the bulk historical data can be reliably used as a proxy for the real-time data as-of. In fact, this is simply a self-consistency check: we have no guarantee that the data source collected the historical data correctly.
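
A comparison of this kind can be sketched as a reconciliation of two views keyed by timestamp. The sketch below is illustrative only: the function name `compare_datasets`, the use of ISO timestamp strings as keys, tuples of floats as rows, and the `tolerance` parameter are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Row = Tuple[float, ...]  # e.g., (open, high, low, close, volume)


def compare_datasets(
    historical: Dict[str, Row],
    real_time: Dict[str, Row],
    tolerance: float = 1e-9,
) -> Dict[str, List[str]]:
    """Report timestamps that are missing on one side or whose values differ."""
    hist_keys, rt_keys = set(historical), set(real_time)
    mismatched = sorted(
        ts
        for ts in hist_keys & rt_keys
        if len(historical[ts]) != len(real_time[ts])
        or any(
            abs(h - r) > tolerance
            for h, r in zip(historical[ts], real_time[ts])
        )
    )
    return {
        "missing_in_real_time": sorted(hist_keys - rt_keys),
        "missing_in_historical": sorted(rt_keys - hist_keys),
        "mismatched": mismatched,
    }
```

An empty report only shows that the two views agree with each other. It says nothing about whether both are wrong in the same way, which is why the check is a self-consistency check.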

Data QA workflow naming scheme

A QA workflow has a name that represents its characteristics in the format:

{qa_type}.{dataset_signature}

i.e.,

production_qa.{download_mode}.{downloading_entity}.{action_tag}.{data_format}.{data_type}.{asset_type}.{universe}.{vendor}.{exchange}.{version\[-snapshot\]}.{asset}.{extension}

where:

- qa_type: the type of the QA flow, e.g.,
  - production_qa: perform a QA flow on historical and real-time data. The interface should be an IM client, which makes it possible to run QA on both historical and real-time data
  - research_analysis: perform a free-form analysis of the data. This can then be the basis for a QA analysis
  - compare_historical_real_time: compare historical and real-time data coming from the same source of data
  - compare_historical_cross_comparison: compare historical data from two different data sources

The same rules apply as in the downloader and derived dataset naming schemes.

research_cross_comparison.periodic.airflow.downloaded_1sec_1min.all.bid_ask.futures.all.ccxt_cryptochassis.all.v1_0_0
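
A name in this format can be split back into its attributes mechanically. The following is a minimal sketch, assuming that no attribute value contains a `.` and that trailing attributes (such as `asset` and `extension` in the example above) may be omitted. The function name `parse_qa_name` and the constant `QA_NAME_FIELDS` are hypothetical.

```python
from typing import Dict

# Field order taken from the name format described above.
QA_NAME_FIELDS = (
    "qa_type", "download_mode", "downloading_entity", "action_tag",
    "data_format", "data_type", "asset_type", "universe", "vendor",
    "exchange", "version", "asset", "extension",
)


def parse_qa_name(name: str) -> Dict[str, str]:
    """Map each dot-separated value of a QA workflow name to its field."""
    values = name.split(".")
    if len(values) > len(QA_NAME_FIELDS):
        raise ValueError(f"Too many fields in QA workflow name: {name}")
    # Trailing fields that are not present in the name are left out.
    return dict(zip(QA_NAME_FIELDS, values))


parsed = parse_qa_name(
    "research_cross_comparison.periodic.airflow.downloaded_1sec_1min"
    ".all.bid_ask.futures.all.ccxt_cryptochassis.all.v1_0_0"
)
print(parsed["qa_type"], parsed["vendor"], parsed["version"])
```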

Since a cross-comparison involves two (or more) datasets, we use a short notation that merges the attributes that differ.

E.g., a comparison between the datasets

periodic.1minute.postgres.ohlcv.futures.1minute.ccxt.binance
periodic.1day.postgres.ohlcv.futures.1minute.ccxt.binance

is called:

compare_qa.periodic.1minute-1day.postgres.ohlcv.futures.1minute.ccxt.binance

since the only difference is in the frequency of the data sampling.

It is also possible to use a long format: {dataset_signature1}-vs-{dataset_signature2}.
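
Both notations can be derived from the two dataset signatures. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the signatures are assumed to have the same number of dot-separated attributes, and the long format is assumed to be prefixed with the QA type in the same way as the short one.

```python
def short_comparison_name(qa_type: str, signature1: str, signature2: str) -> str:
    """Merge two dataset signatures, joining the attributes that differ with '-'."""
    fields1, fields2 = signature1.split("."), signature2.split(".")
    if len(fields1) != len(fields2):
        raise ValueError("Signatures must have the same number of attributes")
    merged = [
        f1 if f1 == f2 else f"{f1}-{f2}"
        for f1, f2 in zip(fields1, fields2)
    ]
    return ".".join([qa_type] + merged)


def long_comparison_name(qa_type: str, signature1: str, signature2: str) -> str:
    """Keep both signatures in full, separated by '-vs-'."""
    return f"{qa_type}.{signature1}-vs-{signature2}"


sig1 = "periodic.1minute.postgres.ohlcv.futures.1minute.ccxt.binance"
sig2 = "periodic.1day.postgres.ohlcv.futures.1minute.ccxt.binance"
print(short_comparison_name("compare_qa", sig1, sig2))
print(long_comparison_name("compare_qa", sig1, sig2))
```

The short form stays readable when only one or two attributes differ. The long form is unambiguous when many attributes differ or when the signatures have different lengths.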

E.g., a set of downloads and the QA workflows that depend on them:

| Name                  | Dataset Signature                | Description                                                          | Frequency                 | Dashboard | Data Location | Active? |
| --------------------- | -------------------------------- | -------------------------------------------------------------------- | ------------------------- | --------- | ------------- | ------- |
| hist_dl1              | Historical download              | All of the past day data                                             | Once a day at 0:00:00 UTC | -         | s3://...      | Yes     |
| rt_dl1                | Real-time download               | -                                                                    | Every minute              | -         | s3://...      | Yes     |
| rt_dl1.qa1            | Real-time QA check               | Check QA metrics for dl1                                             | Every 5 minutes           | -         | s3://...      | Yes     |
| hist_dl1.rt_dl1.check | Check of historical vs real-time | Check consistency between historical and real-time CCXT binance data | Once a day at 0:15:00 UTC | -         | -             | -       |
| rt_dl2                | Real-time download               | - vendor=CryptoChassis<br>- exchange=Binance<br>- data type=bid/ask  | Every minute              | -         | s3://...      | Yes     |
| rt_dl2.qa2            | Real-time QA check               | Check QA metrics for dl2                                             | Every 5 minutes           | -         | s3://...      | Yes     |
| rt_dl1_dl2.check      | Cross-data QA check              | Compare data from rt_dl1 and rt_dl2                                  | Every 5 minutes           | -         | -             | -       |