
# Dataset onboarding checklist

We follow a standard flow when onboarding a new dataset.

Create a GH issue and paste the following checklist into the issue description whenever a new dataset is requested. The structure is pre-formatted as a Markdown checklist.

## Preparation and exploratory analysis

(From `docs/datapull/all.dataset_onboarding_checklist.reference.md`)

- [ ] Decide on the timeline
  - E.g., is this a high-priority dataset or a nice-to-have?
- [ ] Decide on the course of action
  - E.g., do we download only historical bulk data and/or also prepare a real-time downloader?
- [ ] Review existing code
  - Is there any downloader that is similar to the new one in terms of interface, frequency, etc.?
  - What existing code can be generalized to accomplish the task at hand?
  - What needs to be implemented from scratch?
- [ ] Create an exploratory notebook that includes:
  - A description of the data type, if this is the first time downloading this data type
  - Example code to obtain a snippet of historical/real-time data (see the first sketch after this list)
  - If we are interested in historical data, e.g.:
    - How far in the past do we need the data to go?
    - How far back does the data source go?
- [ ] Create example code to obtain data in real time (see the second sketch after this list)
  - Are there any issues with the real-time data?
    - E.g., throttling, API issues, unreliability
- [ ] Perform initial QA on the data sample (see the third sketch after this list), e.g.:
  - Compute statistics on missing data and outliers
  - Do the real-time and historical data match at first sight in terms of schema and content?
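
A minimal sketch of the historical-data snippet, assuming the vendor is reachable through `ccxt`; the exchange, symbol, and probing logic below are illustrative, not this repo's actual downloader:

```python
import ccxt
import pandas as pd

exchange = ccxt.binance()
since_ms = exchange.parse8601("2024-01-01T00:00:00Z")
# Fetch a small snippet of 1-minute OHLCV bars to inspect schema and coverage.
raw = exchange.fetch_ohlcv("BTC/USDT", timeframe="1m", since=since_ms, limit=500)
df = pd.DataFrame(
    raw, columns=["timestamp", "open", "high", "low", "close", "volume"]
)
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
print(df.head())
# How far back does the source go? Probe by requesting data far in the past.
earliest = exchange.fetch_ohlcv("BTC/USDT", timeframe="1d", since=0, limit=1)
print("Earliest bar:", pd.to_datetime(earliest[0][0], unit="ms", utc=True))
```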
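
A minimal sketch of the real-time example, again assuming a `ccxt`-style REST interface; in practice a websocket feed may be preferable for continuous data:

```python
import time

import ccxt

exchange = ccxt.binance()
for _ in range(5):
    # Top-of-book snapshot; watch for throttling (HTTP 429) and gaps here.
    book = exchange.fetch_order_book("BTC/USDT", limit=5)
    best_bid, best_ask = book["bids"][0], book["asks"][0]
    print(book["timestamp"], "bid:", best_bid, "ask:", best_ask)
    # Respect the exchange's advertised rate limit between requests.
    time.sleep(exchange.rateLimit / 1000)
```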
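
A minimal sketch of the initial QA pass over the sample, assuming it landed in the `df` DataFrame from the first sketch:

```python
import pandas as pd

# Missing data: NaN share per column and gaps in the expected 1-minute grid.
print(df.isna().mean())
expected = pd.date_range(df["timestamp"].min(), df["timestamp"].max(), freq="1min")
print("missing bars:", len(expected.difference(df["timestamp"])))
# Outliers: flag bars whose close is more than 5 sigma from the rolling mean.
zscore = (df["close"] - df["close"].rolling(60).mean()) / df["close"].rolling(60).std()
print(df.loc[zscore.abs() > 5])
```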

## Implement historical downloader

- [ ] Decide on the name of the dataset according to `dataset_schema` conventions
- [ ] Implement the historical downloader
  - TODO(Juraj): Add a pointer to examples and docs
- [ ] Test the flow by downloading a snippet of data locally in the test stage
  - Apply QA to confirm the data is being downloaded correctly
- [ ] Perform a bulk download for historical datasets (see the sketch after this list)
  - Manually, i.e., by executing a script, if the history is short or the volume of data is low
  - Via an Airflow DAG if the volume of data is too large to download manually
    - E.g., `im_v2/airflow/dags/test.download_bulk_data_fargate_example_guide.py`
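
A minimal sketch of the manual (script-based) bulk download, chunked by day; `download_one_day` is a hypothetical stand-in for the repo's actual downloader entry point:

```python
import datetime


def download_one_day(date: datetime.date) -> None:
    # Hypothetical placeholder: invoke the historical downloader for a single
    # day and write the result to the test bucket.
    ...


start = datetime.date(2023, 1, 1)
end = datetime.date(2024, 1, 1)
day = start
while day < end:
    # Chunking by day keeps each request small and makes retries cheap.
    download_one_day(day)
    day += datetime.timedelta(days=1)
```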

## Automated (AKA scheduled) downloader

- [ ] Set up automatic download of data in pre-production:
  - Since pre-prod runs code from the master branch (updated automatically twice a day), make sure to merge any PRs related to the dataset onboarding first
  - For historical datasets:
    - To provide a single S3 location for accessing the entire dataset, move the bulk history from the test bucket to the pre-prod bucket (the source and destination paths should be identical)
    - Add a daily Airflow task that downloads the previous day's data and appends it to the existing bulk dataset (see the sketch after this list)
  - For real-time datasets:
    - Add a real-time Airflow task that downloads data continuously 24/7
- [ ] For some real-time datasets, add an archival flow so the storage is not overwhelmed
  - Consult with the team leader on whether it is needed for a particular dataset
  - An example Airflow DAG is `preprod.europe.postgres_data_archival_to_s3.py`
- [ ] Add an entry to the Monster dataset matrix
- [ ] Once the download is enabled in production, update the `Master_raw_data_gallery`
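
A minimal sketch of the daily append task mentioned above, using a plain `BashOperator`; the `dag_id`, schedule, and script path are illustrative, not the repo's actual DAG:

```python
import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="preprod.daily_bulk_append",  # hypothetical name
    schedule_interval="0 2 * * *",  # daily, shortly after midnight UTC
    start_date=datetime.datetime(2024, 1, 1),
    catchup=False,
) as dag:
    BashOperator(
        task_id="download_previous_day",
        # {{ ds }} is the Airflow logical date; the script path is a placeholder.
        bash_command="python download_historical.py --date {{ ds }}",
    )
```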

## Quality Assurance

### 1. Check for Existing QA DAGs

- [ ] Verify whether a similar QA DAG is already running
- [ ] Check for existing QA DAGs (e.g., bid_ask/OHLCV, cross QA comparing real-time OHLCV with historical data)
- [ ] Action: if the new QA is just a change in the universe or vendor, append a new task to the existing running DAGs. Reference: [Link to Relevant Section]

### 2. Create a New QA DAG (if necessary)

#### 2.1. Create and Test QA Notebook

- [ ] Develop a notebook to test the QA process (see the sketch below)
- [ ] Test over a small period to ensure it functions as expected
- [ ] Tip: use a small dataset or a limited time frame for quick testing
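
A minimal sketch of a cross-QA check the notebook could run, assuming the real-time and historical frames share the OHLCV schema from the exploratory sketches above:

```python
import pandas as pd


def cross_qa(rt_df: pd.DataFrame, hist_df: pd.DataFrame) -> pd.DataFrame:
    # Align the two sources on timestamp and report rows where they disagree.
    merged = rt_df.merge(hist_df, on="timestamp", suffixes=("_rt", "_hist"))
    diff = (merged["close_rt"] - merged["close_hist"]).abs()
    # Tolerate tiny numerical differences; flag anything larger.
    return merged.loc[diff > 1e-9]
```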

#### 2.2. Run QA Notebook via Invoke Command

- [ ] Execute the QA notebook using the invoke command to validate functionality
- [ ] Example: Invoke Command Example

#### 2.3. Create a New DAG File

- [ ] Create a new DAG file once the QA process is validated
- [ ] Follow the standard procedure for DAG creation. Reference: DAG Creation Tutorial

Last review: GP on 2024-04-20