DataFlow
TODO(gp): Add an example of df with forecasts explaining the timing
Different views of System components
- Different implementations of a component. A DataFlow component is described in terms of an interface and can have several implementations at different levels of detail (see the sketch after this list).
- Reference implementation. A reference implementation is a vendor-agnostic implementation of a component (e.g., `DataFrameImClient`, `DataFrameBroker`).
- Vendor implementation. A vendor implementation is a vendor-specific implementation of a component (e.g., `CcxtImClient`, `CcxtBroker`).
- Mocked implementation. A mocked implementation is a simulated version of a vendor-specific component (e.g., a `DataFrameCcxtBroker`). A mocked component may or may not have the same timing semantics as the real component (e.g., an asynchronous or reactive implementation).
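A minimal sketch of how these views of one component can coexist, using a hypothetical `Broker` interface; the class names mirror the examples above, but the methods and their signatures are assumptions, not the actual API.

```python
import abc
from typing import Any, Dict, List


class AbstractBroker(abc.ABC):
    """Interface shared by all `Broker` implementations."""

    @abc.abstractmethod
    def submit_order(self, order: Dict[str, Any]) -> None:
        ...


class DataFrameBroker(AbstractBroker):
    """Reference implementation: vendor-agnostic, keeps orders in memory."""

    def __init__(self) -> None:
        self._orders: List[Dict[str, Any]] = []

    def submit_order(self, order: Dict[str, Any]) -> None:
        self._orders.append(order)


class CcxtBroker(AbstractBroker):
    """Vendor implementation: forwards orders to the exchange through CCXT."""

    def submit_order(self, order: Dict[str, Any]) -> None:
        raise NotImplementedError("Calls the vendor API here.")


class DataFrameCcxtBroker(DataFrameBroker):
    """Mocked implementation: simulates the CCXT broker, e.g., for testing."""
```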
Architecture
In this section we summarize the responsibilities and the high-level invariants of each component of a `System`.

A `System` is represented in terms of a `Config` (sketched below):

- Each piece of a `Config` refers to and configures a specific part of the `System`
- Each component should be completely configured in terms of a `Config`
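A minimal sketch of what such a config might look like; the component names and parameters below are illustrative assumptions, not the project's actual schema.

```python
# Hypothetical nested config: each top-level key configures one System component.
system_config = {
    "market_data": {
        "vendor": "ccxt",
        "universe": ["BTC_USDT", "ETH_USDT"],
    },
    "dag_builder": {
        "resampling_freq": "5T",
    },
    "portfolio": {
        "initial_cash": 1_000_000,
    },
}
```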
Component invariants
All data in a component should be indexed by its knowledge time, i.e., the current time at which the data became available to that component.

Each component has a way to know (see the sketch below):

- The current time (e.g., the real-time machine time or the simulated one)
- The timestamp of the current data bar it is working on
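A sketch of the idea, assuming a hypothetical `get_wall_clock_time()` callable: incoming data is indexed by the time at which the component saw it, whether the clock is real or simulated.

```python
import pandas as pd


def get_wall_clock_time() -> pd.Timestamp:
    # In real time this is the machine time; in simulation it would come from a
    # simulated clock instead.
    return pd.Timestamp.now(tz="UTC")


# Index incoming rows by knowledge time, i.e., when this component received them.
knowledge_time = get_wall_clock_time()
data = pd.DataFrame({"close": [100.0], "volume": [5.0]}, index=[knowledge_time])
```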
Each component (see the sketch below):

- Should print its state so that one can inspect exactly how it has been initialized
- Can be serialized to and deserialized from disk
- Can be mocked for simulation
- Should save data in a directory as it executes, to make the system observable
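A toy component illustrating these invariants; the method names are hypothetical, not the project's actual interface.

```python
import json
import os
from typing import Any, Dict


class ExampleComponent:
    """Toy component sketching the invariants above."""

    def __init__(self, config: Dict[str, Any], log_dir: str) -> None:
        self._config = config
        self._log_dir = log_dir

    def __repr__(self) -> str:
        # Print the state so one can inspect how the component was initialized.
        return f"{self.__class__.__name__}(config={self._config})"

    def to_json(self) -> str:
        # Serialize the state so the component can be restored from disk.
        return json.dumps(self._config)

    def save_state(self, tag: str) -> None:
        # Save data as the component executes, to make the system observable.
        os.makedirs(self._log_dir, exist_ok=True)
        with open(os.path.join(self._log_dir, f"{tag}.json"), "w") as f:
            f.write(self.to_json())
```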
Models are described in terms of DAGs using the DataFlow framework.

Misc:

- Models read data from historical or real-time data sets, typically without mixing the two styles
- Raw data is typically stored in an S3 bucket, either in the same format in which it arrives or in Parquet format (see the example below)
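For example, reading raw data saved as Parquet back from S3 could look like this (the bucket and path are placeholders; `pandas` with `pyarrow` and `s3fs` installed is assumed):

```python
import pandas as pd

# Placeholder S3 path for raw data stored in Parquet format.
raw_data = pd.read_parquet("s3://example-bucket/raw/ccxt/BTC_USDT.parquet")
print(raw_data.head())
```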
DataFlow computing
Resampling VWAP (besides potential errors): this approach hardcodes the VWAP formula and mixes it with the resampling functions.

```python
# `converted_data` is a DataFrame with "close" and "volume" columns indexed by
# timestamp; `resampling_freq` is a pandas frequency string (e.g., "5T").
# VWAP over each bucket = sum(close * volume) / sum(volume).
vwap_approach_2 = (
    (converted_data["close"] * converted_data["volume"])
    .resample(resampling_freq)
    .sum()
    / converted_data["volume"].resample(resampling_freq).sum()
)
vwap_approach_2.head(3)
```
- TODO(gp): Explain this piece of code
Template configs
- Are incomplete configs, with some "mandatory" parameters unspecified but clearly identified with a `cconfig.DUMMY` value (see the sketch after this list)
- Have reasonable defaults for the specified parameters
  - This facilitates config extension (e.g., if we add additional parameters / flexibility in the future, then we should not have to regenerate old configs)
- Leave dummy parameters for frequently-varying fields, such as `ticker`
- Should be completable and be completed before use
- Should be associated with a `DagBuilder`
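A sketch of a template config and its completion; the import path and the parameter names are assumptions, while `cconfig.DUMMY` is the marker mentioned above.

```python
# The import path for `cconfig` is an assumption.
import core.config as cconfig

# Template config sketched as a plain dict: reasonable defaults are filled in,
# while frequently-varying fields are left as DUMMY.
template_config = {
    "resampling_freq": "5T",      # reasonable default
    "ticker": cconfig.DUMMY,      # mandatory, must be completed before use
}

# The client completes the template before using it.
config = dict(template_config)
config["ticker"] = "BTC_USDT"
```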
DagBuilder. It is an object that builds a DAG and has a `get_config_template()` and a `get_dag()` method to keep the config and the `Dag` in sync.
The client (see the sketch below):

- Calls `get_config_template()` to receive the template config
- Fills / modifies the config
- Uses the final config to call `get_dag(config)` and get a fully built DAG

A `DagBuilder` can be passed to other objects instead of a `Dag` when the template config is fully specified and thus the `Dag` can be constructed from it.
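A sketch of this workflow; only `get_config_template()` and `get_dag()` come from the description above, everything else (the builder class, the config keys, the returned object) is a hypothetical stand-in.

```python
class MyDagBuilder:
    """Hypothetical DagBuilder following the interface described above."""

    def get_config_template(self) -> dict:
        # Template config with a placeholder for the field the client must fill.
        return {"resampling_freq": "5T", "ticker": None}

    def get_dag(self, config: dict) -> str:
        # Build the DAG described by `config` (details elided; a string stands in).
        assert config["ticker"] is not None, "Complete the config before building."
        return f"Dag({config['resampling_freq']}, {config['ticker']})"


dag_builder = MyDagBuilder()
config = dag_builder.get_config_template()  # 1) Receive the template config.
config["ticker"] = "BTC_USDT"               # 2) Fill / modify the config.
dag = dag_builder.get_dag(config)           # 3) Get a fully built DAG.
```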
DagRunner. It is an object that runs a `Dag`. Different implementations of a `DagRunner` run a `Dag` on data in different ways (see the sketch after this list), e.g.:

- `FitPredictDagRunner`: implements two methods, `fit` / `predict`, used when we want to learn on in-sample data and predict on out-of-sample data
- `RollingFitPredictDagRunner`: allows fitting and predicting on some data using a rolling pattern
- `IncrementalDagRunner`: allows running one step at a time, like in real-time
- `RealTimeDagRunner`: allows running using nodes that have real-time semantics
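A toy fit/predict runner to illustrate the first pattern; the class and its methods are simplified stand-ins, not the actual `FitPredictDagRunner` API.

```python
import pandas as pd


class ToyFitPredictDagRunner:
    """Toy stand-in: fit on in-sample data, predict on out-of-sample data."""

    def __init__(self, dag: str, data: pd.DataFrame) -> None:
        self._dag = dag
        self._data = data

    def fit(self, end: str) -> None:
        # Learn on the in-sample slice of the data, up to and including `end`.
        in_sample = self._data.loc[:end]
        print(f"Fitting {self._dag} on {len(in_sample)} in-sample rows")

    def predict(self, start: str) -> pd.DataFrame:
        # Predict on the out-of-sample slice of the data, from `start` onward.
        out_of_sample = self._data.loc[start:]
        print(f"Predicting with {self._dag} on {len(out_of_sample)} rows")
        return out_of_sample


data = pd.DataFrame(
    {"close": range(10)},
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)
runner = ToyFitPredictDagRunner(dag="my_dag", data=data)
runner.fit(end="2024-01-05")
predictions = runner.predict(start="2024-01-06")
```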