Data client stack
As described in other documents, DataPull downloads and saves the data with
minimal or no transformation. Once the data is downloaded, it needs to be
retrieved for processing in a common format (e.g., DataPull format).
We use a two-layer approach to handle the complexity of reading and serving the data to clients.
flowchart
Vendor Data --> ImClient --> MarketData --> User
- ImClient
  - Is data vendor and dataset specific
  - Adapts the vendor data to a standard internal MarketData format
  - Handles all the peculiarities in format and semantics of a specific vendor's data
  - All timestamps are UTC
  - Asset ids are handled as strings
- MarketData
  - Is independent of the data vendor
  - Implements behaviors that are orthogonal to vendors, such as:
    - Streaming/real-time or batch/historical
    - Time-stitching of streaming/batch data, i.e., merging multiple data sources to give a single, homogeneous view of the data
      - E.g., the data from the last day comes from a real-time source while the data before that comes from a historical source. The data served by MarketData is a continuous snapshot of the data
    - Replaying, i.e., serializing the data to disk and reading it back, implementing as-of-time semantics based on knowledge time
      - This behavior is orthogonal to streaming/batch and stitching, i.e., one can replay any MarketData, including an already replayed one
  - Data is accessed based on intervals [start_timestamp, end_timestamp] using different open/close semantics, but always preventing future peeking
  - Supports real-time behaviors, such as knowledge time, wall clock time, and blocking behaviors (e.g., "is the last data available?")
  - Handles the desired timezone for timestamps
  - Asset ids are handled as ints
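The stitching and as-of-time behaviors described above can be illustrated with a minimal sketch. It is not the MarketData implementation: the names `stitch`, `as_of`, and `make_row`, the list-of-dicts row representation, and the one-second knowledge delay are all assumptions made for this example.

```python
import datetime as dt
from typing import Any, Dict, List

# Hypothetical row representation: one dict per bar.
Row = Dict[str, Any]


def stitch(historical: List[Row], real_time: List[Row], cutoff: dt.datetime) -> List[Row]:
    """Merge two sources into one view: historical up to `cutoff`, real-time after."""
    old = [row for row in historical if row["end_time"] <= cutoff]
    new = [row for row in real_time if row["end_time"] > cutoff]
    return sorted(old + new, key=lambda row: (row["end_time"], row["asset_id"]))


def as_of(rows: List[Row], wall_clock_time: dt.datetime) -> List[Row]:
    """Keep only rows already known at `wall_clock_time` (no future peeking)."""
    return [row for row in rows if row["knowledge_time"] <= wall_clock_time]


def make_row(asset_id: int, minute: int, close: float) -> Row:
    """Build a 1-minute bar that becomes known one second after it ends (assumed)."""
    end_time = dt.datetime(2021, 7, 26, 13, minute, tzinfo=dt.timezone.utc)
    return {
        "asset_id": asset_id,
        "end_time": end_time,
        "knowledge_time": end_time + dt.timedelta(seconds=1),
        "close": close,
    }


# The two sources overlap on the 13:42 bar; the cutoff decides which one serves it.
historical = [make_row(17085, 41, 148.86), make_row(17085, 42, 148.53)]
real_time = [make_row(17085, 42, 148.53), make_row(17085, 43, 148.10)]
cutoff = dt.datetime(2021, 7, 26, 13, 42, tzinfo=dt.timezone.utc)
stitched = stitch(historical, real_time, cutoff)

# At 13:42:30 the 13:43 bar is not known yet, so it must not be served.
wall_clock_time = dt.datetime(2021, 7, 26, 13, 42, 30, tzinfo=dt.timezone.utc)
visible = as_of(stitched, wall_clock_time)
```

Filtering on knowledge time rather than on `end_time` is what makes a replayed run see the same data that a live run would have seen at the same wall clock time.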
Interfaces
- Both ImClient and MarketData have an output format that is enforced by the base abstract class and the derived classes together
- ImClient and MarketData have 3 interfaces each:
  - An external "input" format for a class
    - Format of the data as input to a class derived from MarketData/ImClient
  - An internal "input" format
    - It's the format that derived classes need to adhere to so that the base class can do its job, i.e., apply common transformations to all classes
  - An external "output" format
    - It's the MarketData/ImClient format, which is fixed
Transformations
- The chain of transformations of the data from Vendor to User is as follows:
flowchart
Vendor --> DerivedImClient --> AbstractImClient --> DerivedMarketData --> AbstractMarketData --> User
- Classes derived from ImClient
  - The transformations are vendor-specific
  - Only classes derived from ImClient know the exact semantics of the vendor data
  - They do whatever is needed to transform the vendor data into the internal format accepted by the base ImClient
- Abstract class ImClient
  - The transformations are fixed
  - Implemented by ImClient._apply_im_normalization()
- Classes derived from MarketData
  - The transformations are specific to the MarketData derived class
- Abstract class MarketData
  - The transformations are fixed
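The split between vendor-specific and fixed transformations follows a template-method shape, which the sketch below illustrates for the ImClient layer. Only the name `_apply_im_normalization()` comes from this document. The classes `AbstractImClient` and `ExampleVendorImClient`, the methods `read_data()` and `_read_data()`, the vendor field names, and the body of the normalization are assumptions made for illustration.

```python
import abc
import datetime as dt
from typing import Any, Dict, List

Row = Dict[str, Any]


class AbstractImClient(abc.ABC):
    """Base class: owns the fixed, vendor-independent normalization."""

    def read_data(self, full_symbols: List[str]) -> List[Row]:
        # Vendor-specific step first, then the fixed step.
        rows = self._read_data(full_symbols)
        return self._apply_im_normalization(rows)

    @abc.abstractmethod
    def _read_data(self, full_symbols: List[str]) -> List[Row]:
        """Return rows in the internal format: UTC `timestamp` and `full_symbol`."""

    def _apply_im_normalization(self, rows: List[Row]) -> List[Row]:
        # Drop duplicated (timestamp, full_symbol) pairs, keeping the first one.
        unique: Dict[Any, Row] = {}
        for row in rows:
            unique.setdefault((row["timestamp"], row["full_symbol"]), row)
        # Sort by timestamp and then by full_symbol.
        return [unique[key] for key in sorted(unique)]


class ExampleVendorImClient(AbstractImClient):
    """Hypothetical vendor that reports epoch seconds and its own field names."""

    _VENDOR_ROWS = [
        {"sym": "BTC_USDT", "exch": "binance", "ts": 1627306980, "c": 46946.30},
        {"sym": "BTC_USDT", "exch": "binance", "ts": 1627306920, "c": 47063.51},
        {"sym": "BTC_USDT", "exch": "binance", "ts": 1627306920, "c": 47063.51},
    ]

    def _read_data(self, full_symbols: List[str]) -> List[Row]:
        rows = []
        for raw in self._VENDOR_ROWS:
            full_symbol = f"{raw['exch']}::{raw['sym']}"
            if full_symbol not in full_symbols:
                continue
            # Only the derived class knows that `ts` is in epoch seconds.
            timestamp = dt.datetime.fromtimestamp(raw["ts"], tz=dt.timezone.utc)
            rows.append({"timestamp": timestamp, "full_symbol": full_symbol, "close": raw["c"]})
        return rows


data = ExampleVendorImClient().read_data(["binance::BTC_USDT"])
```

Because the base class calls the derived class and then applies its own step, every client returns data with the same guarantees regardless of how the vendor encodes it.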
Output format of ImClient
- The data in output of a class derived from ImClient is normalized so that:
  - The index:
    - Represents the knowledge time
    - Is the end of the sampling interval
    - Is called timestamp
    - Is a tz-aware timestamp in UTC
  - The data:
    - (optional) Is re-sampled on a 1 minute grid and filled with NaN values
    - Is sorted by index and full_symbol
    - Is guaranteed to have no duplicates
    - Belongs to intervals like [a, b]
    - Has a full_symbol column with a string representing the canonical name of the instrument
- An example of data in output from an ImClient is:

                                      full_symbol     close     volume
      timestamp
      2021-07-26 13:42:00+00:00  binance:BTC_USDT  47063.51  29.403690
      2021-07-26 13:43:00+00:00  binance:BTC_USDT  46946.30  58.246946
      2021-07-26 13:44:00+00:00  binance:BTC_USDT  46895.39  81.264098

- TODO(gp): We are planning to use an ImClient data format closer to MarketData by using start_time, end_time, and knowledge_time since these can be inferred only from the vendor data semantics
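The guarantees on the ImClient output can be expressed as a small checker. This is a sketch under assumptions: the function `check_im_output` is hypothetical, and rows are modeled as dicts with a `timestamp` key standing in for the index.

```python
import datetime as dt
from typing import Any, Dict, List


def check_im_output(rows: List[Dict[str, Any]]) -> None:
    """Raise ValueError if `rows` violates the ImClient output guarantees."""
    keys = []
    for row in rows:
        timestamp = row["timestamp"]
        # The index must be tz-aware and in UTC.
        if timestamp.tzinfo is None or timestamp.utcoffset() != dt.timedelta(0):
            raise ValueError(f"timestamp is not tz-aware UTC: {timestamp!r}")
        # Asset ids are strings at this layer.
        if not isinstance(row["full_symbol"], str):
            raise ValueError(f"full_symbol is not a string: {row['full_symbol']!r}")
        keys.append((timestamp, row["full_symbol"]))
    # Sorted by index and then by full_symbol.
    if keys != sorted(keys):
        raise ValueError("rows are not sorted by (timestamp, full_symbol)")
    # No duplicated (timestamp, full_symbol) pairs.
    if len(set(keys)) != len(keys):
        raise ValueError("rows contain duplicates")


utc = dt.timezone.utc
rows = [
    {"timestamp": dt.datetime(2021, 7, 26, 13, 42, tzinfo=utc),
     "full_symbol": "binance:BTC_USDT", "close": 47063.51, "volume": 29.403690},
    {"timestamp": dt.datetime(2021, 7, 26, 13, 43, tzinfo=utc),
     "full_symbol": "binance:BTC_USDT", "close": 46946.30, "volume": 58.246946},
    {"timestamp": dt.datetime(2021, 7, 26, 13, 44, tzinfo=utc),
     "full_symbol": "binance:BTC_USDT", "close": 46895.39, "volume": 81.264098},
]
check_im_output(rows)  # Passes silently on the example data above.
```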
Transformations by classes derived from MarketData
- Classes derived from MarketData do whatever they need to do in _get_data() to get the data, but always pass back data that:
  - Is indexed with a progressive index
  - Has asset, start_time, end_time, knowledge_time
  - start_time, end_time, knowledge_time are timezone aware
- E.g.,

           asset_id                 start_time                   end_time     close   volume
      idx
      0       17085  2021-07-26 13:41:00+00:00  2021-07-26 13:42:00+00:00  148.8600   400176
      1       17085  2021-07-26 13:30:00+00:00  2021-07-26 13:31:00+00:00  148.5300  1407725
      2       17085  2021-07-26 13:31:00+00:00  2021-07-26 13:32:00+00:00  148.0999   473869
Transformations by abstract class MarketData
- The transformations are done inside get_data_for_interval(), during normalization, and are:
  - Indexing by end_time
  - Converting end_time, start_time, knowledge_time to the desired timezone
  - Sorting by end_time and asset_id
  - Applying column remaps
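The four steps above can be sketched on plain Python rows. This is an illustration under assumptions, not the body of `get_data_for_interval()`: the function `normalize`, its parameters, the `(end_time, row)` pairs standing in for a dataframe index, and the fixed `-04:00` offset standing in for a real timezone are all hypothetical.

```python
import datetime as dt
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]
TIME_COLUMNS = ("start_time", "end_time", "knowledge_time")


def normalize(
    rows: List[Row],
    tz: dt.tzinfo,
    column_remap: Optional[Dict[str, str]] = None,
) -> List[Tuple[dt.datetime, Row]]:
    """Return (end_time, row) pairs sorted by end_time and asset_id."""
    column_remap = column_remap or {}
    converted = []
    for row in rows:
        new_row = dict(row)
        # Convert the time columns to the desired timezone.
        for col in TIME_COLUMNS:
            if col in new_row:
                new_row[col] = new_row[col].astimezone(tz)
        converted.append(new_row)
    # Sort by end_time and asset_id.
    converted.sort(key=lambda r: (r["end_time"], r["asset_id"]))
    result = []
    for row in converted:
        # Index by end_time: pull it out of the row and use it as the key.
        end_time = row.pop("end_time")
        # Apply column remaps last, so the steps above use canonical names.
        result.append((end_time, {column_remap.get(k, k): v for k, v in row.items()}))
    return result


utc = dt.timezone.utc
eastern = dt.timezone(dt.timedelta(hours=-4))  # Fixed offset, for illustration only.
rows = [
    {"asset_id": 17085, "start_time": dt.datetime(2021, 7, 26, 13, 41, tzinfo=utc),
     "end_time": dt.datetime(2021, 7, 26, 13, 42, tzinfo=utc), "close": 148.86},
    {"asset_id": 17085, "start_time": dt.datetime(2021, 7, 26, 13, 30, tzinfo=utc),
     "end_time": dt.datetime(2021, 7, 26, 13, 31, tzinfo=utc), "close": 148.53},
    {"asset_id": 17085, "start_time": dt.datetime(2021, 7, 26, 13, 31, tzinfo=utc),
     "end_time": dt.datetime(2021, 7, 26, 13, 32, tzinfo=utc), "close": 148.0999},
]
normalized = normalize(rows, eastern, column_remap={"close": "price"})
```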
Output format of MarketData
- The abstract base class MarketData normalizes the data by:
  - Sorting by the columns that correspond to end_time and asset_id
  - Indexing by the column that corresponds to end_time, so that it is suitable for DataFlow computation
- E.g.,

                                 asset_id                 start_time    close   volume
      end_time
      2021-07-20 09:31:00-04:00     17085  2021-07-20 09:30:00-04:00  143.990  1524506
      2021-07-20 09:32:00-04:00     17085  2021-07-20 09:31:00-04:00  143.310   586654
      2021-07-20 09:33:00-04:00     17085  2021-07-20 09:32:00-04:00  143.535   667639
Asset ids format
ImClient asset ids
- ImClient uses assets encoded as full_symbols strings
  - E.g., binance::BTC_USDT
- There is a vendor-specific mapping:
  - From full_symbols to corresponding data
  - From asset_ids (ints) to full_symbols (strings)
- If the asset_ids -> full_symbols mapping is provided by the vendor, then we reuse it
- Otherwise, we build a mapping by hashing full_symbols strings into numbers
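When the vendor provides no mapping, a hash-based one could look like the sketch below. The document does not specify the hash function or the id width, so SHA-256 truncated to 10 decimal digits is an assumption, as are the names `full_symbol_to_asset_id` and `build_asset_id_mapping`.

```python
import hashlib
from typing import Dict, Iterable


def full_symbol_to_asset_id(full_symbol: str, num_digits: int = 10) -> int:
    """Hash a full_symbol string into a stable int (assumed scheme)."""
    digest = hashlib.sha256(full_symbol.encode("utf-8")).hexdigest()
    return int(digest, 16) % (10 ** num_digits)


def build_asset_id_mapping(full_symbols: Iterable[str]) -> Dict[int, str]:
    """Build the asset_id -> full_symbol mapping, failing loudly on collisions."""
    mapping: Dict[int, str] = {}
    for full_symbol in sorted(set(full_symbols)):
        asset_id = full_symbol_to_asset_id(full_symbol)
        if asset_id in mapping:
            raise ValueError(
                f"Hash collision between {mapping[asset_id]!r} and {full_symbol!r}"
            )
        mapping[asset_id] = full_symbol
    return mapping


mapping = build_asset_id_mapping(["binance::BTC_USDT", "binance::ETH_USDT"])
```

A cryptographic hash is used here instead of Python's built-in `hash()` because the latter is randomized per process for strings, so the ids would not be stable across runs.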
MarketData asset ids
- MarketData and everything downstream use asset_ids that are encoded as ints
  - This is because we want to use ints and not strings in dataframes
Handling of asset_ids
- Different implementations of ImClient backing a MarketData are possible, e.g.:
  - The caller needs to specify the requested asset_ids
    - In this case the universe is provided by MarketData when calling the data access methods
  - The reading backend is initialized with the desired universe of assets and then MarketData just uses or subsets that universe
- For these reasons, assets are selected at 3 different points:
  - MarketData allows specifying or subsetting the assets through asset_ids in the constructor
  - ImClient backends specify the assets returned
    - E.g., a concrete implementation backed by a DB can stream the data for its entire available universe
  - Certain class methods allow querying data for a specific asset or subset of assets
- For each stage, a value of None means no filtering
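The three selection points compose naturally if each stage treats None as "pass everything through". The helper `select_assets` and the asset ids other than 17085 are hypothetical, chosen only to illustrate the convention.

```python
from typing import Iterable, List, Optional


def select_assets(universe: List[int], asset_ids: Optional[Iterable[int]]) -> List[int]:
    """Subset `universe` to `asset_ids`; None means no filtering."""
    if asset_ids is None:
        return list(universe)
    requested = set(asset_ids)
    # Preserve the order of the universe.
    return [asset_id for asset_id in universe if asset_id in requested]


# Stage 1: the backend serves its entire available universe.
backend_universe = [17085, 13684, 10971]
# Stage 2: the constructor does not restrict the assets.
constructor_ids = select_assets(backend_universe, None)
# Stage 3: a method call asks for a single asset.
method_ids = select_assets(constructor_ids, [17085])
```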
Data
Handling of filtering by time
- Clients of MarketData might want to query data by:
  - Using different interval types, namely [a, b), [a, b], (a, b], (a, b)
  - Filtering on either the start_ts or end_ts
- For this reason, MarketData supports all these different ways of providing data
- ImClient has fixed semantics for the interval, i.e., [a, b]
- MarketData adapts the fixed semantics to multiple ones
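The four interval types and the choice of timestamp column can be captured by one filter with two flags. The function `filter_by_interval` and its parameter names are assumptions made for this sketch, not the MarketData interface.

```python
import datetime as dt
import operator
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def filter_by_interval(
    rows: List[Row],
    start_ts: Optional[dt.datetime],
    end_ts: Optional[dt.datetime],
    ts_col_name: str = "end_ts",
    left_close: bool = True,
    right_close: bool = False,
) -> List[Row]:
    """Keep rows whose `ts_col_name` falls in the requested interval."""
    # Pick the comparison for each bound: closed includes the endpoint.
    lower_ok = operator.ge if left_close else operator.gt
    upper_ok = operator.le if right_close else operator.lt
    selected = []
    for row in rows:
        ts = row[ts_col_name]
        # A bound of None means the interval is unbounded on that side.
        if start_ts is not None and not lower_ok(ts, start_ts):
            continue
        if end_ts is not None and not upper_ok(ts, end_ts):
            continue
        selected.append(row)
    return selected


def bar(minute: int) -> Row:
    """Build a 1-minute bar ending at 13:`minute` UTC."""
    end_ts = dt.datetime(2021, 7, 26, 13, minute, tzinfo=dt.timezone.utc)
    return {"start_ts": end_ts - dt.timedelta(minutes=1), "end_ts": end_ts}


bars = [bar(30), bar(31), bar(32)]
a = dt.datetime(2021, 7, 26, 13, 31, tzinfo=dt.timezone.utc)
b = dt.datetime(2021, 7, 26, 13, 32, tzinfo=dt.timezone.utc)
closed = filter_by_interval(bars, a, b, left_close=True, right_close=True)       # [a, b]
half_open = filter_by_interval(bars, a, b, left_close=True, right_close=False)   # [a, b)
on_start = filter_by_interval(bars, a, b, ts_col_name="start_ts", right_close=True)
```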
Handling timezone
- ImClient always uses UTC as output
- MarketData adapts UTC to the desired timezone, as requested by the client