
Running dbt CI/CD at scale

Introduction

In this article, I'll describe a CI/CD setup for dbt Core. This setup has been deployed at a medium-sized company that manages more than 1,000 daily model refreshes. In a growing dbt project, a CI/CD system is crucial for streamlining testing and deployment.

While dbt Cloud offers a hosted solution, it limits how much you can customise and raises data privacy concerns compared to running your own data infrastructure. At my workplace, handing customer data to a third party raises eyebrows, and the limited customisation led us to decide to run our own dbt operations. We found that GitLab CI, a custom CI setup and Airflow together meet all our requirements across the teams.

dbt CI Process

Continuous integration (CI) in dbt is an approach to managing and validating changes to your dbt models. Instead of waiting for a scheduled release stage, CI ensures that changes are tested as soon as they are available. The process starts automatically on a central server when a git branch is pushed, setting off a comprehensive testing sequence to catch any potential bugs. Once all tests pass, the changes are ready to be merged and deployed to the production environment. This streamlined workflow not only accelerates the development cycle but also makes production deployments more reliable and efficient.

One significant challenge in implementing CI for a dbt project is the data-centric nature of dbt tests. As the project grows, the increasing volume of data progressively slows down the data warehouse, resulting in frequent timeouts in the CI setup. This slowdown not only affects the efficiency of the testing process but also undermines the overall effectiveness of continuous integration by causing delays and potential disruptions.

Slim CI

Slim CI is one of the key features dbt Cloud offers, designed to improve the efficiency and speed of CI by building and testing only what changed. With a little wiring, it's not difficult to achieve the same with dbt Core: keep the artifacts (in particular manifest.json) from a previous production run, then use state comparison to select only the modified models and defer everything else to production, for example:

    dbt run --select state:modified+ --defer --state ./<dbt-artifact-path>
    dbt test --select state:modified+ --defer --state ./<dbt-artifact-path>
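The directory passed to --state needs to contain the manifest.json from a previous production run. Below is a minimal sketch of how that file gets there, assuming the production job publishes its artifacts to object storage after every successful run; the bucket name and layout are made up for illustration:

    # After a successful production run (e.g. an Airflow task), publish the
    # artifacts so that dev machines and CI jobs can defer to that run.
    gsutil cp target/manifest.json gs://company-dbt-artifacts/green/manifest.json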

Fully Integrated dbt Workflow

[Diagram: the fully integrated dbt workflow across the dev, CI and production environments]

To put everything together, the diagram above roughly resembles what we deployed. It maintains two different sets of state files, one for the integration environment and one for production.

Dev Environment

  1. Developers use deferred, slim local runs to build and test only the modified models and tests (sketched after this list).
  2. The target state file is pulled from the production environment (the Green box in the diagram). Typically, the state file is stored in an object storage service such as S3 or GCS.
  3. Production data and the production schema are used for local tests. This ensures your code is tested against the latest schema rather than some other environment: since the code will eventually be deployed to production, it is a test against the real environment, whereas using staging data or a staging schema risks schema mismatches.
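To make this concrete, here is a minimal sketch of the local development loop under the assumptions above; the bucket path is hypothetical, and a dev target in profiles.yml is assumed to point at the developer's own schema:

    # Fetch the state file published by the last successful production run.
    mkdir -p prod-state
    gsutil cp gs://company-dbt-artifacts/green/manifest.json prod-state/manifest.json

    # Build and test only the models changed on this branch. --defer resolves
    # references to unmodified models against production, so nothing else
    # needs to be rebuilt locally.
    dbt run --select state:modified+ --defer --state prod-state --target dev
    dbt test --select state:modified+ --defer --state prod-state --target dev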

CI Environment

  1. The CI environment replicates the dev setup: it also runs deferred and builds only the modified models.
  2. To make the CI process robust, it first checks that your branch is rebased on the latest master branch. This guarantees that your branch is compatible not only with the current production environment but also with other branches that have been merged but not yet deployed.
  3. The next step is to create a virtual data warehouse dedicated to this branch, so that changes made there affect neither production nor other dbt committers. This can be achieved with a shallow clone of the tables into a new namespace in the data warehouse; in BigQuery, you can simply use a separate dataset.
  4. We also clone the tables that are both modified and incremental into the CI environment. This avoids rebuilding incremental models from scratch: once the incremental tables have been cloned, dbt detects the existing tables and only applies the incremental changes, which further improves efficiency.
  5. After all tests pass, the CI system publishes a master state file to cloud storage. This file serves as the reference for the production environment, ensuring consistency and reliability. (These steps are sketched below.)
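Here is a rough sketch of steps 2 to 4, assuming BigQuery with the bq CLI on a GitLab runner. The dataset and bucket names are hypothetical, model names are assumed to match table names, and CI_COMMIT_REF_SLUG is GitLab CI's built-in branch identifier:

    # Step 2: refuse to run unless the branch already contains the latest master.
    git fetch origin master
    git merge-base --is-ancestor origin/master HEAD \
      || { echo "Branch is not rebased on the latest master"; exit 1; }

    # Step 3: one isolated dataset per branch (BigQuery dataset names cannot
    # contain dashes, so the branch slug is normalised to underscores).
    CI_DATASET="dbt_ci_${CI_COMMIT_REF_SLUG//-/_}"
    bq mk --dataset "${CI_DATASET}" || true   # ignore "already exists" errors

    # Step 4: shallow-clone the modified incremental models from production
    # (the hypothetical "analytics" dataset) so dbt only applies the
    # incremental delta instead of rebuilding from scratch.
    mkdir -p prod-state
    gsutil cp gs://company-dbt-artifacts/green/manifest.json prod-state/manifest.json
    for model in $(dbt ls --select state:modified,config.materialized:incremental \
                          --state prod-state --resource-type model --output name); do
      bq query --use_legacy_sql=false \
        "CREATE TABLE ${CI_DATASET}.${model} CLONE analytics.${model}"
    done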
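And a sketch of steps 1 and 5: run only the modified models against the branch dataset, deferring everything else to production, then publish the resulting artifacts as the master state file. The ci target in profiles.yml is assumed to point at the per-branch dataset created above, and the bucket layout is again hypothetical:

    # Step 1: build and test only what changed, inside the per-branch dataset.
    dbt run --select state:modified+ --defer --state prod-state --target ci
    dbt test --select state:modified+ --defer --state prod-state --target ci

    # Step 5: once the pipeline passes on master, publish the new master state
    # file for the production deployment to pick up.
    gsutil cp target/manifest.json gs://company-dbt-artifacts/master/manifest.json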

Production Environment

  1. In the production environment, the process begins by importing the latest master state file, which dictates the specific models and associated dbt code to be executed. This ensures that only the validated changes are applied.
  2. It then runs the usual CI process, except against a new environment; let's call it production staging.
  3. Once all dbt runs and tests have completed successfully in production staging, a full shallow clone is performed from staging into the production environment. This guarantees that the latest validated data and configurations are transferred into live production in a single promotion step, rather than leaving some tables updated while others are still being written.
  4. Once the production tables have been updated, the system publishes a new production state file (referred to as the "Green file") to cloud storage. This Green file becomes the authoritative reference point for subsequent development and CI processes: going forward, the dev and CI environments use it as the baseline against which modified models are identified. (Steps 3 and 4 are sketched after this list.)
  5. To further enhance operational clarity and transparency, you may consider additional steps such as refreshing dashboards and metrics, or notifying the relevant teams that the data refresh has been completed. These actions keep stakeholders informed and let downstream processes proceed with up-to-date information.
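As an illustration of steps 3 and 4, here is a rough sketch of the promotion from production staging into production, shown for a single hypothetical table (orders) in hypothetical datasets (analytics_staging and analytics); in practice you would loop over the modified models. It assumes BigQuery's table clone DDL (CREATE OR REPLACE TABLE ... CLONE) and the same hypothetical artifact bucket as before:

    # Step 3: promote the validated table from staging to production with a
    # shallow clone, so production is swapped over rather than rebuilt in place.
    bq query --use_legacy_sql=false \
      "CREATE OR REPLACE TABLE analytics.orders CLONE analytics_staging.orders"

    # Step 4: publish the new Green state file that dev and CI defer against.
    gsutil cp target/manifest.json gs://company-dbt-artifacts/green/manifest.json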

Discussions

With some customisation, you as a data engineer can establish a robust dbt CI/CD system that rivals commercial offerings. Ours is fully integrated into our data infrastructure, offering enhanced extensibility while ensuring that data privacy remains fully protected.

However, like any system, there are inherent limitations and challenges to consider. These include handling very large datasets, managing complex dependencies across projects, and keeping the development and production environments in sync. Despite these challenges, the tailored approach allows greater flexibility and control over the data pipeline, ultimately supporting more efficient and secure data operations.