SQL in 2019
NB: This is an adapted version of part of the documentation for sqlfluff. It has been adapted to stand alone, but does still refer to the tool.
SQL has been around for a long time, as a language for communicating with databases, as a communication protocol. More recently with the rise of data as a business function, or a domain in it's own right SQL has also become an invaluable tool for defining the structure of data and analysis - not just as a one off but as a form of infrastructure as code.
As analytics transitions from a profession of people doing one-offs,and moves to building stable and reusable pieces of analytics, more and more principles from software engineering are moving in the analytics space. One of the best articulations of this is written in the viewpoint section of the docs for the open-source tool dbt.
Two of the principles mentioned in that article are quality assurance and modularity.
Quality assurance
The primary aim of sqlfluff as a project is in service of that first aim of quality assurance. With larger and larger teams maintaining large bodies of SQL code, it becomes more and more important that the code is not just valid but also easily comprehensible by other users of the same codebase. One way to ensure readability is to enforce a consistent style, and the tools used to do this are called linters.
Some linters which are well known in the software community are flake8 and jslint
Sqlfluff aims to fill this space for SQL.
Modularity
SQL itself doesn't lend itself well to modularity, so to introduce some flexibility and re-usability it is often templated. Typically this is done in the wild in one of the following ways:
- Using the limited inbuilt templating abilities of a programming language directly. For example in python this would be using the format string syntax:
Which would evaluate to"SELECT {foo} FROM {tbl}".format(foo="bar", tbl="mytable")
SELECT bar FROM mytable
- Using a dedicated templating library such as jinja2. This allows a lot more flexibility, more powerful expressions and macros. Often tools like dbt or apache airflow which allow templated sql to be used directly, will utilise a library like jinja2 under the hood, rather than using something bespoke to that tool.
All of these templating tools are great for modularity but they also mean that the SQL files themselves are no longer valid SQL code, because they now contain these configured *placeholder* values, intended to improve modularity.
Sqlfluff allows limited templating using both of the methods outlined above, to allow you to still lint those SQL files as part of your CI/CD pipeline (which is great 🙌), rather than waiting until you're in production(which is bad 🤦, and maybe too late).
To learn more about how sqlfluff uses templating, check out the documentation on readthedocs.