Data client stack

As said in other documents, the data is downloaded and saved by DataPull with minimal or no transformation. Once the data is downloaded, it needs to be retrieved for processing in a common format (e.g., DataPull format).

We use a two-layer approach to handle the complexity of reading and serving the data to clients.

flowchart
  Vendor Data --> ImClient --> MarketData --> User

ImClient
Is data vendor and dataset specific
Adapt data from the vendor data to a standard internal MarketData format
Handle all the peculiarities in format and semantic of a specific vendor data
All timestamps are UTC
Asset ids are handled as strings
MarketData
Is independent of the data vendor
Implement behaviors that are orthogonal to vendors, such as:
- Streaming/real-time or batch/historical
- Time-stitching of streaming/batch data, i.e., merge multiple data sources giving a single and homogeneous view of the data
- E.g., the data from the last day comes from a real-time source while the data before that can come from an historical source. The data served by MarketData is a continuous snapshot of the data
- Replaying, i.e., serialize the data to disk and read it back, implementing as-of-time semantic based on knowledge time
- This behavior is orthogonal to streaming/batch and stitching, i.e., one can replay any MarketData, including an already replayed one
Data is accessed based on intervals [start_timestamp, end_timestamp] using different open/close semantics, but always preventing future peeking
Support real-time behaviors, such as knowledge time, wall clock time, and blocking behaviors (e.g., "is the last data available?")
Handle desired timezone for timestamps
Asset ids are handled as ints

Interfaces

Both ImClient and MarketData have an output format that is enforced by the base abstract class and the derived classes together
ImClient and MarketData have 3 interfaces each:
An external "input" format for a class
Format of the data as input to a class derived from MarketData/ImClient
An internal "input" format
It's the format that derived classes need to adhere so that the base class can do its job, i.e., apply common transformations to all classes
An external "output" format
It's the MarketData/ImClient format, which is fixed

Transformations

The chain of transformations of the data from Vendor to User are as follow:

mermaid flowchart Vendor --> DerivedImClient --> AbstractImClient --> DerivedMarketData --> AbstractMarketData --> User

Classes derived from ImClient
The transformations are vendor-specific
Only derived classes ImClient know what is exact semantic of the vendor-data
Whatever is needed to transform the vendor data into the internal format accepted by base ImClient
Abstract class ImClient
The transformations are fixed
Implemented by ImClient._apply_im_normalization()
Class derived from MarketData
The transformations are specific to the MarketData derived class
MarketData
The transformations are fixed

Output format of `ImClient`

The data in output of a class derived from ImClient is normalized so that:
The index:
Represents the knowledge time
Is the end of the sampling interval
Is called timestamp
Is a tz-aware timestamp in UTC
The data:
(optional) Is re-sampled on a 1 minute grid and filled with NaN values
Is sorted by index and full_symbol
Is guaranteed to have no duplicates
Belongs to intervals like [a, b]
Has a full_symbol column with a string representing the canonical name of the instrument
An example of data in output from an ImClient is: full_symbol close volume timestamp 2021-07-26 13:42:00+00:00 binance:BTC_USDT 47063.51 29.403690 2021-07-26 13:43:00+00:00 binance:BTC_USDT 46946.30 58.246946 2021-07-26 13:44:00+00:00 binance:BTC_USDT 46895.39 81.264098
TODO(gp): We are planning to use an ImClient data format closer to MarketData by using start_time, end_time, and knowledge_time since these can be inferred only from the vendor data semantic

Transformations by classes derived from `MarketData`

Classes derived from MarketData do whatever they need to do in _get_data() to get the data, but always pass back data that:
Is indexed with a progressive index
Has asset, start_time, end_time, knowledge_time
start_time, end_time, knowledge_time are timezone aware
E.g., asset_id start_time end_time close volume idx 0 17085 2021-07-26 13:41:00+00:00 2021-07-26 13:42:00+00:00 148.8600 400176 1 17085 2021-07-26 13:30:00+00:00 2021-07-26 13:31:00+00:00 148.5300 1407725 2 17085 2021-07-26 13:31:00+00:00 2021-07-26 13:32:00+00:00 148.0999 473869

Transformations by abstract class `MarketData`

The transformations are done inside get_data_for_interval(), during normalization, and are:
Indexing by end_time
Converting end_time, start_time, knowledge_time to the desired timezone
Sorting by end_time and asset_id
Applying column remaps

Output format of `MarketData`

The abstract base class MarketData normalizes the data by:
Sorting by the columns that correspond to end_time and asset_id
Indexing by the column that corresponds to end_time, so that it is suitable to DataFlow computation
E.g., asset_id start_time close volume end_time 2021-07-20 09:31:00-04:00 17085 2021-07-20 09:30:00-04:00 143.990 1524506 2021-07-20 09:32:00-04:00 17085 2021-07-20 09:31:00-04:00 143.310 586654 2021-07-20 09:33:00-04:00 17085 2021-07-20 09:32:00-04:00 143.535 667639

Asset ids format

`ImClient` asset ids

ImClient uses assets encoded as full_symbols strings
E.g., binance::BTC_UTC
There is a vendor-specific mapping:
From full_symbols to corresponding data
From asset_ids (ints) to full_symbols (strings)
If the asset_ids -> full_symbols mapping is provided by the vendor, then we reuse it
Otherwise, we build a mapping hashing full_symbols strings into numbers

`MarketData` asset ids

MarketData and everything downstream uses asset_ids that are encoded as ints
This is because we want to use ints and not strings in dataframe

Handling of `asset_ids`

Different implementations of ImClient backing a MarketData are possible, e.g.:
The caller needs to specify the requested asset_ids
In this case the universe is provided by MarketData when calling the data access methods
The reading backend is initialized with the desired universe of assets and then MarketData just uses or subsets that universe
For these reasons, assets are selected at 3 different points:
MarketData allows to specify or subset the assets through asset_ids through the constructor
ImClient backends specify the assets returned
E.g., a concrete implementation backed by a DB can stream the data for its entire available universe
Certain class methods allow querying data for a specific asset or subset of assets
For each stage, a value of None means no filtering

Data

Handling of filtering by time

Clients of MarketData might want to query data by:
Using different interval types, namely [a, b), [a, b], (a, b], (a, b)
Filtering on either the start_ts or end_ts
For this reason, this class supports all these different ways of providing data
ImClient has a fixed semantic of the interval \[a, b\]
MarketData adapts the fixed semantic to multiple ones

Handling timezone

ImClient always uses UTC as output
MarketData adapts UTC to the desired timezone, as requested by the client

Data client stack

Interfaces

Transformations

Output format of ImClient

Transformations by classes derived from MarketData

Transformations by abstract class MarketData

Output format of MarketData