What is dbt (data build tool) and when should you use it?
Khuyen Tran
Data Engineering
TLDR
Discover the Hidden Benefits and Drawbacks of dbt.
Motivation
If your organization is looking to create a data-driven product, you should consider having efficient data pipelines to:
Stay competitive:
Quick access to data and its analysis through efficient data pipelines accelerates decision-making and keeps you ahead of your competition.
Reduce costs:
The time and effort required to collect and transform data can be significantly reduced with efficient data pipelines, which can reduce costs and enable employees to focus on higher-level tasks that require human intelligence.
Image by Author
One tool that has gained popularity in recent years for managing data pipelines is dbt (data build tool).
While dbt can provide significant benefits for managing and modeling data, it may not be the best tool for every situation. In this article, we’ll explore the use cases of dbt to determine whether dbt is a good fit for your organization.
Feel free to play and fork the source code of this article here:
https://github.com/khuyentran1401/dbt-demoWhat is dbt?
dbt is an open-source tool for transforming data in a data warehouse. With dbt, users can write SQL queries that transform data and create repeatable workflows that can be easily tested and automated.
Image by Author
When Should You Consider dbt
You should consider using dbt when:
You have a data warehouse:
dbt is an effective tool for organizing, transforming, and testing data in a data warehouse environment.
Your data pipeline is complex:
dbt’s modular approach can help break down complex pipelines into smaller pieces, making it easier to debug, test, and make changes to your pipeline.
Your data changes frequently:
dbt’s snapshot allows you to track changes in data over time.
Transparency is essential to your organization:
dbt can automatically generate documentation for your pipeline. It also provides insights into the data transformation process through data lineage.
Data quality is essential to your organization:
dbt can test your models to help you catch errors early and ensure that your data is accurate.
Before diving into some helpful features of dbt, we need to first set up the dbt environment. This will allow you to easily follow along with the discussion.
Setup dbt
Install dbt
This article uses BigQuery as a data warehouse for dbt, to install dbt’s adapter for BigQuery, type:
1
pip install dbt-bigquery
Create a project
To initialize a dbt project called
dbt_demo
, type the following:
1
dbt init dbt_demo
The following is the project directory.
Image by Author
Modularity
dbt’s modular approach allows you to break down a complex pipeline into smaller pieces.
Create a new SQL file in the
models
directory, named
models/trips_per_property.sql
with the following code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
-- models/trips_per_property.sql
WITH stations AS (
SELECT
station_id, council_district,
CASE
WHEN property_type IN ('parkland', 'sidewalk', 'nonmetered_parking') THEN 'free_parking'
ELSE property_type
END AS property_type,
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_stations`
WHERE
property_type IN (
'parkland',
'sidewalk',
'nonmetered_parking',
'paid_parking'
)
),
trips AS (
SELECT
start_station_id
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
start_station_id is NOT NULL
)
SELECT
stations.property_type,
COUNT(*) AS trips,
FROM
trips
JOIN stations ON trips.start_station_id = stations.station_id
GROUP BY
stations.property_type
Since this model contains multiple data transformations, it can be difficult to maintain and test.
We could separate the model into different models by doing the following:
Create two files
stg_stations.sql
and
stg_trips.sql
under the
models
directory.
Replace the existing
stations
CTE and
trips
CTE with the following SQL code:
Image by Author
In the
stg_trips.sql
file, write the SQL code from the
trips
CTE in the original model.
1
2
3
4
5
-- models/stg_trips.sql
SELECT start_station_id
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_station_id is NOT NULL
In the
stg_stations.sql
file, write the SQL code from the
stations
CTE in the original model.
1
2
3
4
5
6
7
8
9
10
11
12
-- models/stg_stations.sql
SELECT
station_id, council_district,
CASE
WHEN property_type IN ('parkland', 'sidewalk', 'nonmetered_parking') THEN 'free_parking'
ELSE property_type
END AS property_type,
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_stations`
WHERE
property_type IN ('parkland', 'sidewalk', 'nonmetered_parking', 'paid_parking')
By breaking down the model into multiple models, you can now reuse the data models in other models:
Image by Author
Your team members can also work on the different parts of the project simultaneously.
Image by Author
Because each module is self-contained, it can be easily tested and validated.
Image by Author
Code Reusability
Using macros in dbt simplifies the process of reusing SQL code fragments across various models.
To demonstrate, we will create a macro called
get_end_time
in the
macros/get_end_time.sql
file. This micro gets an end time based on a start time and duration.
1
2
3
4
# macros/get_end_time.sql
{% macro get_end_time(start_time, duration_minutes) %}
(SELECT TIMESTAMP_ADD({{ start_time }}, INTERVAL {{ duration_minutes }} MINUTE) AS end_time)
{% endmacro %}
Think of the
get_end_time
macro as a Python function that can be used repeatedly to generate a value based on the inputs provided.
Image by Author
Another model can now use this micro with specific values for
start_time
and
duration_minutes
:
1
2
3
4
5
6
7
8
9
SELECT
start_station_id,
start_time,
end_station_id,
{{ get_end_time('start_time', 'duration_minutes') }} AS end_time
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
start_station_id IS NOT NULL AND end_station_id IS NOT NULL
Testing
With dbt, you can easily verify the correctness of your models. For instance, to ensure the
stg_stations
model is correct, you may want to check that:
The
station_id
column is non-null and has unique values.
The
property_type
column only has the values ‘free_parking’ or ‘paid_parking’.
To test the
stg_stations
model, create a new YAML file in the
models
directory, named
models/schema.yml
with the following content.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
version: 2
models:
- name: stg_stations
columns:
- name: station_id
tests:
# Should not contain null values.
- not_null
# Should be unique
- unique
- name: property_type
tests:
# Should be either 'free_parking' or 'paid_parking'.
- accepted_values:
values: ['free_parking', 'paid_parking']
Then run
dbt test
to run all of the tests.
By catching potential errors early in the transformation process, tests can reduce the risk of errors in downstream applications and analyses.
Image by Author
In addition, tests provide a way to track changes in the data over time and to ensure that the transformation logic remains correct as the underlying data changes.
Image by Author
Documentation
With dbt, you can easily document your models and share the documentation with your team.
To create documentation for your models, follow these steps:
Add descriptions to your
models/schema.yml
file.
Image by Author
Run
dbt docs generate
to generate the documentation for your project.
Run
dbt docs serve
command to launch the documentation.
Access the documentation by navigating to
http://localhost:8080in your web browser.
Image by Author
Documenting your data models helps to ensure that everyone who works with the models can comprehend their functionality and purpose.
You can also examine data lineage in the documentation by clicking “View lineage graph.”
Image by Author
Track changes in data
With dbt, you can track data changes over time through snapshots.
To see why this is useful, imagine you have a
subscriber_type
table where the
subscriber_type
field is frequently overwritten as subscribers modify their membership status.
If a user changes their membership from “trial” to “premium”, the record of when the user had the “trial” membership will be lost.
Image by Author
Losing this record can be problematic if we want to use it for other evaluations such as calculating the average time taken by a user to change their subscription.
dbt can snapshot these changes to help you understand how values in a row change over time. Here’s an example of a snapshot table for the previous example:
Image by Author
When You Should Not Use dbt
While dbt can provide significant benefits for managing and modeling data, you should not use dbt when:
You don’t have a data warehouse:
dbt is designed to work specifically with data warehouses, such as Snowflake, Redshift, or BigQuery. If you don’t have a data warehouse, then dbt may not be the right tool for you.
You want a one-stop shop for ETL:
While dbt is an excellent tool for transforming data, it does not offer a complete end-to-end solution for ETL (Extract, Transform, Load) tasks. Other tools are needed for tasks such as data extraction, data cleansing, and data loading.
You want to visualize your data:
dbt is not a data visualization tool. If you want to create visualizations, you will need to use a separate tool or library such as Tableau, Looker, or PowerBI.
You want advanced feature engineering beyond the limitations of SQL grammar:
As SQL is dbt's primary language, there are some limitations when it comes to feature engineering. If you need to perform advanced feature engineering tasks that go beyond the capabilities of SQL, you may need to use other tools or languages such as Python or R.
Image by Author
Conclusion
In this article, we cover some helpful features of dbt and learn when to and not to use dbt. I hope this article gives you the knowledge needed to decide whether dbt is the right solution for your data pipelines.
Link to original article:
https://towardsdatascience.com/what-is-dbt-data-build-tool-and-when-should-you-use-it-9dc1566d960dStart building for free
No need for a credit card to get started.
Trying out Mage to build ranking models won’t cost a cent.
No need for a credit card to get started. Trying out Mage to build ranking models won’t cost a cent.