Enforcing DBT Dependencies in Snowflake
This is a repost of the original blog post on inside.tails.com. If you like dbt, and like dogs and live in the London area, then maybe you'll also like working at tails, and we'd love to hear from you. We're always looking for smart data people to join the team, so head to tails.com/careers and get in touch.
tl;dr If you use dbt and Snowflake, we made a dbt package you can use to make sure that you keep your dependencies tidy and understandable. Check it out here: https://github.com/tailsdotcom/dbt-snowflake-dependency-tests
We use dbt to version control our analytical logic, it's a great tool and has really step changed the way we keep track of how that logic changes over time and enabled an ever growing team to contribute to it. By their own description:
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analysing it. dbt allows anyone comfortable with SQL to own that workflow.
In short it allows us to define all the transformations we use for day to day analytics in one place as SQL files and deploy that to our data warehouse, Snowflake. The logic can then be edited by anyone on the team without having to worry about the underlying implementation or about things like connections, database clients or scheduling.
We have just over 400 models in our dbt deployment, and that means we have started to put in place some automated ways of keeping our repository clean.
Snowflake is a newer tool for us, and it's been a great replacement for Redshift which we used to use. One downside however is that in Redshift we had an easy way of keeping the dependencies of models in check (using the pg_depend
table which it inherits from its postgres past). In Snowflake that has been much harder.
Recently however Snowflake released the GET_OBJECT_REFERENCES
command which looked like it might provide a way of doing this. The challenge with this is that it has to be called individually for each view that you want to check dependencies for, and there's no way of calling it in bulk. We've been scratching our heads at tails for a way of getting around this problem for a few months, until we realised that the answer was already in front of us...
dbt Magic
dbt uses the Jinja2 templating library under the hood, and allows us to do all kinds of fun with loops and variables. Using this library we can get the name of a given view, use that to provide the arguments to GET_OBJECT_REFERENCES
, use the results in another query which has also been supplied with a list of regular expressions, and use the Snowflake query engine to do all the hard work.
We can use the dbt post-hook
functionality to install this function to be run after every single model in a schema, and once we've got a list of the offending dependencies (the ones which don't match our list of regular expressions), then we can use some of the dbt macros to raise exceptions to alert them to the user:
...
Completed with 1 errors:
Database Error in model my_problem_model (models/base/my_problem_model.sql)
Invalid Dependencides for view. Invalid dependencies: PROBLEM_DB.PROBLEM_SCHEMA. Allowed patterns are: ['foo_db\\..*', 'bar_db\\..*']
compiled SQL at target/compiled/my_project/base/my_problem_model.sql
This gives us back some of the functionality we lost in transitioning over from Redshift to Snowflake and helps us keep our dependencies in line.
So plain sailing then?
One difference in how GET_OBJECT_REFERENCES
is implemented to how pg_depend
was implemented is how far the dependencies go.
- For
pg_depend
the only dependencies listed are the ones directly referenced in the view itself. - For
GET_OBJECT_REFERENCES
, all the dependencies are listed, even ones not directly referenced. This means ifview_a
is defined asselect * from source_table
andview_b
is defined asselect * from view_a
then if we callGET_OBJECT_REFERENCES
onview_b
we will getview_a
andsource_table
even though it's not referenced directly in the definition ofview_b
.
This means that we have to be a bit more clever with how we express our dependency rules. In the past we used to say that our base
schema was the only one that could depend on non-dbt models and that our core
schema could only depend on base
. But now that dependencies are transparent, we will see non-dbt models in the GET_OBJECT_REFERENCES
for our core
schema, if only because core
depends on base
and base
depends on non-dbt models.
Our way around this problem for now, is just to put the rules on the base
schema and say that it cannot depend on the core
schema, rather than mandating the dependency in reverse. It doesn't achieve quite the same end, but it gets us closer and allows us to keep our repository clean with less mental effort.
Got ideas?
If you've got ideas on how we can improve this tool, we're really keen to have contributions and suggestions in the GitHub repo directly. We use this tool every day, and so improvements to it help us and every other team using it. https://github.com/tailsdotcom/dbt-snowflake-dependency-tests