# Dataset onboarding checklist
We follow a standard flow when onboarding a new dataset.
Whenever a request for a new dataset is made, create a GitHub issue and paste the following checklist (from docs/datapull/all.dataset_onboarding_checklist.reference.md) into the issue description. The structure is pre-formatted as a markdown checklist.

## Preparation and exploratory analysis
- [ ] Decide on the timeline
  - E.g., is this a high-priority dataset or a nice-to-have?
- [ ] Decide on the course of action
  - E.g., do we download only historical bulk data and/or also prepare a real-time downloader?
- [ ] Review existing code
  - Is there any downloader that is similar to the new one in terms of interface, frequency, etc.?
  - What existing code can be generalized to accomplish the task at hand?
  - What needs to be implemented from scratch?
- [ ] Create an exploratory notebook that includes:
  - A description of the data type, if this is the first time downloading a certain data type
  - Example code to obtain a snippet of historical/real-time data
  - If we are interested in historical data, e.g.,
    - How far back do we need the data to go?
    - How far back does the data source go?
- [ ] Create example code to obtain data in real time
  - Is there any issue with the real-time data?
    - E.g., throttling, API issues, unreliability
- [ ] Perform initial QA on the data sample (see the sketch after this list), e.g.,
  - Compute some statistics on missing data and outliers
  - Do real-time and historical data match at first sight in terms of schema and content?
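A minimal sketch of the kind of initial QA the notebook can run on a downloaded sample; the file path, the `timestamp` / `close` columns, and the 1-minute bar frequency are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample snippet; path and schema are placeholders.
df = pd.read_parquet("sample.ohlcv.parquet")
# Fraction of missing values per column.
print(df.isna().mean().sort_values(ascending=False))
# Gaps in the time index, assuming 1-minute bars.
idx = pd.DatetimeIndex(df["timestamp"]).sort_values()
expected = pd.date_range(idx.min(), idx.max(), freq="1min")
print("missing bars:", len(expected.difference(idx)))
# Crude outlier screen: 1-bar returns beyond 5 standard deviations.
rets = df["close"].pct_change()
print(df[(rets - rets.mean()).abs() > 5 * rets.std()])
```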
## Implement historical downloader
- [ ] Decide on the name of the dataset according to the `dataset_schema` conventions
- [ ] Implement the code to perform the historical download
  - TODO(Juraj): Add a pointer to examples and docs
- [ ] Test the flow by downloading a snippet of data locally in the test stage
  - Apply QA to confirm the data is being downloaded correctly
- [ ] Perform a bulk download for historical datasets
  - Manually, i.e., by executing a script, if the history is short or the volume of data is low
  - Via an Airflow DAG if the volume of data is too large to download manually (see the sketch after this list)
    - E.g., im_v2/airflow/dags/test.download_bulk_data_fargate_example_guide.py
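A minimal, hypothetical sketch of what such a bulk-download DAG can look like; the DAG id, downloader script path, and flags are assumptions, and the actual example lives in the DAG file referenced above:

```python
import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="test.download_bulk_data_example",
    # Bulk backfills are triggered manually, not on a schedule.
    schedule_interval=None,
    start_date=datetime.datetime(2024, 1, 1),
    catchup=False,
) as dag:
    download = BashOperator(
        task_id="download_bulk_data",
        # Hypothetical downloader script and flags, for illustration only.
        bash_command=(
            "python im_v2/common/data/extract/download_bulk.py"
            " --start_timestamp '{{ params.start }}'"
            " --end_timestamp '{{ params.end }}'"
        ),
        params={"start": "2020-01-01", "end": "2024-01-01"},
    )
```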
## Automated (AKA scheduled) downloader
- [ ] Set up automatic download of data in pre-production:
  - Since pre-prod runs with code from the master branch (updated twice a day automatically), make sure to merge any PRs related to the dataset onboarding first
  - For historical datasets:
    - To provide a single S3 location to access the entire dataset, move the bulk history from the test bucket to the pre-prod bucket (source and destination paths should be identical)
    - Add a daily download Airflow task to get data from the previous day and append it to the existing bulk dataset
  - For real-time datasets:
    - Add a real-time download Airflow task to get data continuously 24/7
- [ ] For some real-time datasets, an archival flow needs to be added in order not to overwhelm the storage (see the sketch after this list)
  - Consult with the team leader on whether it's needed for a particular dataset
  - An example Airflow DAG is preprod.europe.postgres_data_archival_to_s3.py
- [ ] Add an entry to the Monster dataset matrix
- [ ] Once the download is enabled in production, update the Master_raw_data_gallery
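A minimal sketch of the archival idea behind a DAG like preprod.europe.postgres_data_archival_to_s3.py: dump rows older than a cutoff to Parquet on S3, then delete them from the source table. The table name, connection string, and S3 path are assumptions for illustration:

```python
import datetime

import pandas as pd
import sqlalchemy

# Placeholder connection; use the real pre-prod credentials in practice.
engine = sqlalchemy.create_engine("postgresql://user:password@host/db")
cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=30)
# Pull the rows to archive (hypothetical table name).
old = pd.read_sql(
    sqlalchemy.text("SELECT * FROM ccxt_bid_ask_raw WHERE timestamp < :cutoff"),
    engine,
    params={"cutoff": cutoff},
)
# Write the archived rows to S3 as Parquet (requires s3fs; path is illustrative).
old.to_parquet(f"s3://preprod-bucket/archive/ccxt_bid_ask_raw/{cutoff:%Y%m%d}.parquet")
# Only after a successful upload, delete the archived rows from Postgres.
with engine.begin() as conn:
    conn.execute(
        sqlalchemy.text("DELETE FROM ccxt_bid_ask_raw WHERE timestamp < :cutoff"),
        {"cutoff": cutoff},
    )
```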
## Quality Assurance
1. Check for Existing QA DAGs
- [ ] Verify whether a similar QA DAG is already running.
- [ ] Check for existing QA DAGs (e.g., bid_ask/OHLCV, cross QA for OHLCV comparing real-time with historical data).
- [ ] Action: If the new QA is just a change in the universe or vendor, append a new task to the existing running DAGs (see the sketch after this list). Reference: [Link to Relevant Section].
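A minimal, hypothetical sketch of the cross-QA idea, comparing real-time against historical OHLCV bars over an overlapping window; the join keys and column names are assumptions:

```python
import pandas as pd


def cross_qa_ohlcv(rt_df: pd.DataFrame, hist_df: pd.DataFrame) -> pd.DataFrame:
    """Return the bars where real-time and historical data disagree."""
    cols = ["open", "high", "low", "close", "volume"]
    merged = rt_df.merge(
        hist_df, on=["timestamp", "currency_pair"], suffixes=("_rt", "_hist")
    )
    mismatch = pd.Series(False, index=merged.index)
    for col in cols:
        mismatch |= merged[f"{col}_rt"] != merged[f"{col}_hist"]
    return merged[mismatch]
```

An empty result over the overlap window suggests the two flows agree on schema and content.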
2. Create a New QA DAG (if necessary)
2.1. Create and Test QA Notebook
- [ ] Develop a notebook to test the QA process.
- [ ] Test over a small period to ensure it functions as expected.
- [ ] Tip: Use a small dataset or limited time frame for quick testing.
2.2. Run QA Notebook via Invoke Command
- [ ] Execute the QA notebook using the invoke command to validate functionality.
- [ ] Example: [Invoke Command Example].
2.3. Create a New DAG File
- [ ] Create a new DAG file after the QA process is validated (see the sketch below).
- [ ] Follow the standard procedure for DAG creation. Reference: DAG Creation Tutorial.
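A minimal, hypothetical sketch of a daily QA DAG that runs the validated QA notebook via the invoke command; the DAG id, invoke task name, and flags are assumptions, so follow the DAG creation tutorial for the actual procedure:

```python
import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="preprod.daily_qa_example",
    # Run once a day, after the daily download completes.
    schedule_interval="@daily",
    start_date=datetime.datetime(2024, 1, 1),
    catchup=False,
) as dag:
    run_qa = BashOperator(
        task_id="run_qa_notebook",
        # Hypothetical invoke task and flags; {{ ds }} is the DAG run date.
        bash_command=(
            "invoke run_qa_notebook"
            " --dataset-signature example_dataset"
            " --run-date {{ ds }}"
        ),
    )
```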
Last review: GP on 2024-04-20