This is a repost of the original blog post on inside.tails.com. If you like dbt, and like dogs and live in the London area, then maybe you'll also like working at tails, and we'd love to hear from you. We're always looking for smart data people to join the team, so head to tails.com/careers and get in touch.
tl;dr If you use dbt and Snowflake, we made a dbt package you can use to make sure that you keep your dependencies tidy and understandable. Check it out here: https://github.com/tailsdotcom/dbt-snowflake-dependency-tests
We use dbt to version control our analytical logic. It's a great tool that has step-changed the way we track how that logic evolves over time, and it has enabled an ever-growing team to contribute to it. In dbt's own words:
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analysing it. dbt allows anyone comfortable with SQL to own that workflow.
In short, it lets us define all the transformations we use for day-to-day analytics in one place, as SQL files, and deploy them to our data warehouse, Snowflake. Anyone on the team can then edit the logic without worrying about the underlying implementation, or about things like connections, database clients or scheduling.
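For context, a dbt model is just a SQL file with a little Jinja in it. A minimal one (the model and column names here are illustrative, not from our codebase) might look like:

```sql
-- models/base/orders_cleaned.sql (hypothetical model name)
-- dbt compiles the Jinja, resolves {{ ref(...) }} to the upstream
-- relation, and deploys the result as a view or table in Snowflake.
select
    order_id,
    customer_id,
    created_at
from {{ ref('raw_orders') }}
where created_at is not null
```

Because `ref()` is resolved at compile time, dbt also knows the dependency graph between models — the problem discussed below is about dependencies it *can't* see, such as references to raw tables outside dbt.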
We have just over 400 models in our dbt deployment, which means we've started putting automated checks in place to keep the repository clean.
Snowflake is a newer tool for us, and it's been a great replacement for Redshift, which we used previously. One downside, however, is that Redshift gave us an easy way of keeping the dependencies of models in check (using the pg_depend table, which it inherits from its Postgres past). In Snowflake that has been much harder.
Recently, however, Snowflake released the GET_OBJECT_REFERENCES function, which looked like it might provide a way of doing this. The challenge is that it has to be called individually for each view you want to check dependencies for, and there's no way of calling it in bulk. We'd been scratching our heads at tails for a few months over how to get around this problem, until we realised the answer was already in front of us...
dbt uses the Jinja2 templating library under the hood, which lets us do all kinds of fun things with loops and variables. Using it, we can take the name of a given view, pass it as the argument to GET_OBJECT_REFERENCES, feed the results into another query that has also been supplied with a list of regular expressions, and let the Snowflake query engine do all the hard work.
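A rough sketch of what the generated check might compile to for a single view. The output column names of GET_OBJECT_REFERENCES and the exact query shape are assumptions here; the real implementation lives in the package:

```sql
-- Hypothetical compiled output of the dependency check for one view.
-- GET_OBJECT_REFERENCES is called per view; its results are then
-- filtered against the allowed regular expressions using RLIKE.
with deps as (
    select
        referenced_database_name || '.' || referenced_schema_name as dependency
    from table(get_object_references(
        database_name => 'ANALYTICS',
        schema_name   => 'BASE',
        object_name   => 'MY_VIEW'
    ))
)
-- Any rows returned are "offending" dependencies: objects referenced
-- by the view that match none of the allowed patterns.
select dependency
from deps
where not (dependency rlike 'FOO_DB\\..*')
  and not (dependency rlike 'BAR_DB\\..*')
```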
We can use dbt's post-hook functionality to run this check after every single model in a schema, and once we've got a list of the offending dependencies (the ones that don't match our list of regular expressions), we can use dbt's built-in macros to raise an exception and alert the user:
... Completed with 1 errors:
Database Error in model my_problem_model (models/base/my_problem_model.sql)
  Invalid Dependencies for view. Invalid dependencies: PROBLEM_DB.PROBLEM_SCHEMA.
  Allowed patterns are: ['foo_db\\..*', 'bar_db\\..*']
  compiled SQL at target/compiled/my_project/base/my_problem_model.sql
This gives us back some of the functionality we lost in transitioning over from Redshift to Snowflake and helps us keep our dependencies in line.
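Wiring the check in is then a small amount of dbt config per model (or per folder, via dbt_project.yml). The macro name and arguments below are illustrative — the package README documents the real interface:

```sql
-- Hypothetical: attach the dependency check as a post-hook on a model.
-- dbt runs the hook after the model builds, and the macro raises an
-- exception if any dependency matches none of the allowed patterns.
{{ config(
    post_hook="{{ dependency_check(allowed_patterns=['foo_db\\..*', 'bar_db\\..*']) }}"
) }}

select * from {{ ref('upstream_model') }}
```

Putting the hook in dbt_project.yml rather than in each model means new models in a schema are covered automatically, which is what makes the check effective at scale.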
So plain sailing then?
One difference between how GET_OBJECT_REFERENCES and pg_depend are implemented is how far the dependencies go. In pg_depend, the only dependencies listed are the ones directly referenced in the view itself. In GET_OBJECT_REFERENCES, all dependencies are listed, even ones not referenced directly. This means that if view_a is defined as select * from source_table, and view_b is defined as select * from view_a, then calling GET_OBJECT_REFERENCES on view_b will return source_table, even though it's not referenced directly in the definition of view_b.
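Spelled out in SQL (with the caveat that the exact output columns of GET_OBJECT_REFERENCES may differ from what's shown in the comments):

```sql
create view view_a as select * from source_table;
create view view_b as select * from view_a;

-- Direct (pg_depend-style) dependency of view_b: view_a only.
-- GET_OBJECT_REFERENCES on view_b also surfaces source_table,
-- because it reports the full transitive chain.
select *
from table(get_object_references(
    database_name => 'MY_DB',
    schema_name   => 'MY_SCHEMA',
    object_name   => 'VIEW_B'
));
```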
This means we have to be a bit more clever about how we express our dependency rules. In the past we used to say that our base schema was the only one that could depend on non-dbt models, and that our core schema could only depend on base. But now that dependencies are reported transitively, we will see non-dbt models in the GET_OBJECT_REFERENCES output for our core schema, if only because core depends on base, which in turn depends on non-dbt models.
Our way around this problem, for now, is to put the rules on the base schema and say that it cannot depend on the core schema, rather than mandating the dependency in reverse. It doesn't achieve quite the same end, but it gets us closer and lets us keep our repository clean with less mental effort.
If you've got ideas on how to improve this tool, we'd really welcome contributions and suggestions directly in the GitHub repo. We use this tool every day, so improvements to it help us and every other team using it. https://github.com/tailsdotcom/dbt-snowflake-dependency-tests