SQLFluff @ Coalesce 2020

Transcript

Welcome everyone to my talk today on SQL fluff. I realized there's a lovely theme going on today, around, how we all collaborate on things together. In particular, I'm going to draw on some of the things that John mentioned, earlier in his talk, particularly around catching quality. First, how to help teams be on the front foot about things, and especially on what Caitlin was talking about around responsibility and creativity, about how to help teams work together and feel like it's not someone else's job to do things. That it's, everyone's job, that the quality and creativity is high.

But, before I dive into that, I realized there's a little bit of context I need to cover here.and we're going to zoom out of SQL. We're going to zoom out of dbt and just talk about programming languages for a second. And that the second word here is language. I think we often treat SQL as a way of talking to machines. It's a way of telling the database to do what you wanted to do, but language is used for so much more than just writing. You're expecting someone to read it. If it's a language and in some ways the reader in these cases is not just a machine: it's another human. Now that might be another human on your team who's trying to understand what you did. It might also be future you, and I'm sure all of us have had the experience of coming back to SQL you've written some time ago and kicking yourself because past-you has made your current life a lot harder in making it much harder to understand what you're doing. Languages form as a way of communication and a store of knowledge. And so many people have talked over the last few days about how that their dbt repository has become not just a store of SQL code that we want to reuse in different places. This is a store of knowledge about the definitions and things which underlie all of the work that your team does. Having that in a way that it's legible and readable andeasily understandable feels like a pretty important thing.

Zooming slightly in from the big picture for a second, I'm going to go somewhere that isn't SQL to start with. I want to spend a little bit of time on the cousin that I'm sure all the dbt crowd know a lot about, I'm going to talk a little bit about Python. Here we have two pieces of Python code, which are functionally identical. If you want to go through and actually match the symbols up, both of them are exactly the same, they both execute, they both run, they both work. One of them is really, really hard to read. And the other one is really, really easy to read. The computer does not care which of these you use. However, what we just said about programming languages, the difference between one of these and the other of these is about how the humans are going to react when they come across this. If you have a dbt project, when you've got more than one person working on it and, and their memory isn't perfect, someone else is going to need to read the code and build on it and work out what's going on.

Python has got it pretty easy in this way. Python was developed in the late eighties, but released in 1991, and there's a tool which several people might be aware of called flake8, which was builton top of a whole load of other linting efforts that have gone on. flake8 is a combination of tools that allows you to look at a patch of code and work out how this should be formatted. flake8 would say things like, Oh, you've got too many spaces in this place,  your function isn't in there in the right way, your indentation doesn't match up, you're referring to something that you never use. All of these things are things that the computer doesn't actually care about, but make a difference for the way that you write code and give you a way of measuring whether you're writing this in a way that's easy to work with more recently.

One of the more exciting things that's happened in this world as a tool called Python Black based around Henry Ford's quote "you can have a model T in any color you like, as long as it's black" and black is a lot more radical here in saying that actually you should devolve all of your beliefs about how you should format your code to the machine,let it format it, let it format it consistently for anyone who is using it, and in a same way as anyone else. This is quite radical. For someone who comes to it early, it feels invasive. It feels like someone's telling you how you should format your code, but actually once you get over the hurdle, it, means you think about it less. And the people I talked to who are most passionate about this stuff will say things like it makes the the style and formatting of your code transparent. You don't see it anymore. You don't see indentation, you don't see capitalization, you don't see leading or trailing commas. But what you do see is just the logic of what's going on in there, and that's actually what you want to communicate. You don't want yourself or people around you or the other people you work with to be spending time thinking about style. You want them to be thinking about how they can do the best things with the numbers available to help people make decisions around them.

So let's get to the meat of the matter and start talking about SQL. SQL is older than Python. The original spec for SQL was released in 1974 in an academic article talking about the way that we should work with databases and there were lots and lots and lots of different kinds of SQL. To say that SQL is one language is pretty inaccurate these days. There's obviously the original SQL. There's also ANSI SQL. There's multiple different versions of ANSI SQL. And for anyone who's tried migrating a dbt project from Redshift or Snowflake or from Snowflake to BigQuery or BigQuery to Redshift you'll know that those versions of SQL and not the same. They're not entirely different, they kind of look the same, but there's enough differences that it makes it difficult. And especially if anybody's worked with databases like Postgres or MySQL, even just things like quoting a different, it's also true for BigQuery, to add on top of that excellent tools like dbt, and Airflow, add templated SQL on top of that. This is where the code that your team and that you will be collaborating on isn't actually SQL, it's something that will eventually turn into SQL once other things have been added to it. So that's the right-hand side of the slide. The left hand side of this is because a lot of these different database backends have grown out different analytical communities. They've worked in isolation and they've come up with norms and ways of doing things that are completely understandable given where they've come from, but they haven't all settled in the same place. And as people come together to collaborate on SQL more and more these days,we're finding people from different backgrounds starting to work together and the norms from those different communities and the cultures that there's different communities are clashing. Some of those cultures prefer uppercase because that was necessary. Some of those cultures preferred lowercase because it was necessary. We have leading and trailing commas. We have aliases, we have no aliases. We have qualified statements. We have unqualified. We have indentation, which I'll come back to you later. We have what order should the columns in your statement be? Should you use white space, not white space. I could go on and on and on, but there are so many different ways that we can write this, and in some ways it doesn't matter which one we pick.What matters is that as a community, that we have a common set of rules so that we can actually talkabout what's what's meaningful.

This is a beautiful example. So I was putting this talk together and I was talking to Claire. And this is a screenshot from one of Claire's first commits at Fishtown,where she'd asked Tristan for review. Just take a moment to read this. And what I want you to think about is how does Claire come out of this feeling and how does Tristan come out of this feeling?

So Claire, as a new employee at Fishtown has asked the CEO for feedback and he said, this looks great, but by the way, can you fix the indentation. Claire's come out and is going like, Oh, I feel so bad and Tristan is going to be torn on this example, right? He's he knows the importance of consistent style and knows the importance of having norms here and knows that if we're going to get better, someone has to give us feedback. But at the same time, he knows that receiving this feedback is going to feel really awkward and both parties know that this is an important thing to do, and yet both leave the interaction feeling icky at best after this.

But what other options do we have? Right. Either we don't review style code becomes inconsistent and say, it's inconsistent just because we have more people collaborate on it. And that's going to mean the efficiency in working with it goes down, understanding suffers, and ultimately we all end up sad. Okay.We've got option two. And this is, this is kind of where style review is happening. This is the norm. Humans review it. And that means you're trying to review both logic and you're trying to review style. That means that exactly the screenshot that I showed you before happens. People don't engage with reviews because they're icky and unpleasant, which means we don't review style, which means code becomes inconsistent and inefficiency and understanding suffer. And we all end up sad. Thinking about the example I gave you about Python before Python has solved this right. But SQL hasn't had a solution to this really what we need as a wayof reviewing style by machine. That means the machine is giving you the feedback. You don't get mad at the machine because the machine, the machine isn't giving you style feedback, because it just wants to make you look bad. The machine is making sure that your code is in a way that it is going to be efficient for someone else to review. Drawing on what Caitlin was talking around earlier, it makes style everybody's problem, rather putting something together and expecting someone else to catch it. It's a way of helping everybody on your team take responsibility for their own, formatting.and then it also makes human reviewers focus on the logic, not the style because the style has already been updated.That gives you a couple of things. Firstly, it helps quality improve, right? And we all know why that's important and I'm not going to go into that again, but it makes reviews more enjoyable for both sides. As a person who is asking for a review, you get much more substantive feedback, you get much more useful feedback and you come out of it feeling. Not just you come out of feeling more fulfilled because someone's actually giving you something helpful, likewise, it's much more important for the reviewer as well, because they're more efficient and they get to give you substantive feedback, which feels like a much more productive thing to be doing rather than focusing on the style, which is much easier to see. And step three is always profit. So everyone's happy.

That's all the context. Looking into this several years ago (and actually, one of the, the Fishtown team posted a meetuptalk that I did just over a year ago at the London dbt meets up and was talking about exactly this problem), and that there was not anautomated tool for doing this with SQL. Coming out of that as part of a hack day at tails.com where I work, I tried to wrestle with this problem and try to work out how we do this. If anybody's got curiosity about how that went, ask in the chat, but it was the initial was not fun. But we came out with, SQLFluff, which is now a tool that has over 40 contributors. We have over a thousand downloads a day and it's a way of limiting SQL automatically, including all of the bits around SQL, that go with it.

So what does it look like? On the very basic of this, it's a command line tool where you can lint SQL. So if you want to lint a file, you can say SQL fluff, lint the file and you'll get back whether it has passed all of the rules, whether it's failed. In this case, you can see it's flagged at three things. That there's some operators that don't have spaces around them and the surrounding white space hanging around at the end of this. We could also lint folders, most of us work with more than one SQL file, and it will go through and lint all of your files. This works in a nice way as well around CI pipelines. The exit code of this is non-zero if it fails, that means if you have it in a pipeline of steps that needs happen. If this fails, the pipeline will fail and you can put this in your CI checks. Which leads to an interesting aside about actually using this. So this is, an actual screenshot of our current CI pipelines at tails around our dbt project, where I made a commit and got caught by my own Linter here for having inconsistent capitalization. Now at the time, I think I said something loud and quite rude at the time, but this is actually excellent. I've just had really brutal, clear feedback from an entity, about the quality of my SQL saying it is not up to scratch for the project. I don't feel bad about any human in doing that because the machine gave me the feedback. And particularly in my case, I had a big part in putting the machine together, that's giving me this feedback. So I don't have an icky review experience with anyone else on my team. And also we can see that this is an action and it works.

What happened when we first turned this on is we found out that we had 66,000 lines of SQL across 600 models, all which was inconsistent. And the prospect of bringing that back under control is daunting to say the least. So we took a little bit of inspiration from the Python Black project. You told us about flake8 was a linter. It helps understand where the problems are, but that's only half of the challenge here. The black side of this is to bring in automated fixing, and for a lot of the, at least the early rules that we put in place with SQLfluff, they're actually not that hard to fix, but things like capitalization, spacing, trailing white space, a relatively easy for a linter to fix.

And so we did it. So here's a brief demo. You can see two queries at the bottom one, which has got white space all over the place, one of the bottom, right. Which doesn't, and you can say SQL fix, give it a file name. It will say, look, here are the issues that I've found, would you like to fix them? And if you press yes, we'll go through and actually fix them. This works not just on an single files, but also works on folders, so if you are trying to clear up a huge dbt repository, which might have lots of inconsistency all over the place that makes it so much easier as a practical tool for us to use.

But we can go a little bit deeper. So what I've just covered there is the very basic use of SQLfluff as you might use it. But like I said before, a lot of us already have norms withinour organizations about how we do SQL styling, SQL linting,what our appropriate norms are. And while yes, we, as a SQL team have views on how this stuff should be done. We also recognize that organizations have made decisions about this already. And if you've already got a project, which is limited in a particular way, who are we to tell you to do it different? So we can configure a particular rules to be turned onparticular rules to be turned off. We also have to dialect here.

So to enable some of the more interesting rules that are in the background here, we actually had to build a SQL parser and SQL parser that understood what the SQL was. And that means we have to wrestle with the issue of dialects. Dialects are in there. So an example for us here, this is very simple projects.I'm saying it's a Snowflake dialect.We want to be linting rule number three and rule number six. And we want to be able to add in some dummy variables here, some of the jinja codes that we can link sensibly around it. And you can see that the linting rules that we brought up on the right have respected that. We've only got errors here for rule number three, rule number six. We know exactly where they are and the fixing works around the templated fields so that we can automatically fix some of this stuff. So you can see here, we fixed some of the indentation around that rank function. We've moved the templated field back and add some of the other things that right.

Going even deeper. I mentioned we had to build a parser behind this, and so one of the things we can pull out to that is the parsed structure? What is the parsed structure in a way that is common between different dialects? So regardless of what SQL dialect you use, there are common elements about language, which we can pull out. One of the things that's in one of the pre-releases coming out shortly is an ability to programmatically interrogate your SQL files to understand; what tables are you referencing, what columns are you referencing in those tables. It's still very early days and I don't want to say it's all perfect or that it's all done, but there were some really interesting use cases we could come up here, around how we deal with dependencies, between different parts of our projects. If you believe that in your staging, that only some schemas within your project should be able to depend on your staging area. Maybe that's something that we can enforce here in the future using using custom rules.

So 'm going to stop bringing us back up to the surface. Now, a bit of a recap. We set out trying to do a couple of things. And while none of these are a hundred percent perfect, we have got a certain amount of the way down the verge on these things. We have automated linting against the common stock guide, we have time-saving fixing of some of the common linting issues. And I say some of the issues here. SQL fluff doesn't aspire to be able to fix everything. There are some things about your SQL, which if the machine went to try and fix it, it would at best guess and at worst, just be wrong. And there are some things that we want to, enforce, which you actually have to go back and rethink your code, because it's not just about style. It's about what your intent is and was that clear? It's CI/CD ready. Niall from the team at tails, did a great office hours talkabout this couple of weeks ago, talking about how we use this in our CI pipelines. It's dbt and templating ready (mostly). If you're using a lot of dbt packages, I encourage you to look at the pre-release version rather than the current live stable version. And it is configurable to your organization. So you can have arguments about trailing leading commas.

But what does this all mean? Just the obvious benefit of doing this right? You can stem the tide and write better code. All of us work with lots of SQL all the time and just slowly helping this get better over time is a way of stopping it getting worse. And just by stopping it getting worse it's automatically going to get slowly better at a time. What I said about leading and trailing commas though, is very interesting point and that it also focuses the conversation here so previously, and I'm sure several of us have been in this situation, if you ever were in an argument, and they tend to be arguments about capitalization or indentation or comma placement, the outcome of the argument would be one person felt like they'd won, one person felt like they'd lost, or both people felt like they'd lost, but nothing ever came out of it, that was written, or decided they were never meaningful conversations was. Now we have a way of having the outcome of one of those conversations about style be a configuration file that we can version control. If we can say on the 30th of December, Alan changed the config file to go from trailing commas to leading commas, and we know why we changed it. We can put a pull request around that. We can discuss it. We could focus the conversation about style in a productive way, rather than it being about arguments around style. Thirdly, and this one I think is really easy to overlook new peoplecoming into your organization.When new people coming to the industry will see this as the norm. All of us have grown up around SQL that didn't have any standardization around it. And that means we think about it. We think about, Oh, you should identify this here. Should I do that? That really, we want our colleagues to be focusing on how to make the best up the numbers that are available to them, how to do the best of the data available to them. And, and we want them to see, they want, we want them to see through style rather than training them to come up with opinionated views on comments. And finally code review can be more enjoyable, which I don't need to tell everyone why that's, that's a good thing.

But there's more, and I briefly want everybody here to consider not just your organization, but to consider the industry right. SQL as it is so much of what we do has been. Behind closed doors, it's viewed by businesses is that secret sauce. And, you know, we can't collaborate on that. Can't do this, we can't do that. And I think we've all seen the growth of open source tools over the last couple of years, SQL fluff is one of them. dbt is also one of them. I've got hacktoberfest t-shirt on, if anybody is involved in that as well. And it's encouraging the community of sharing around analytics and data, and to share we need a common language.We talked about why we need a common language, but unless we, as an analytics community, start to communicate in the same way, it's going to be really hard for us to start using each other's or using each other's work. And a tiny example of this is around dbt packages. Imagine if you saw that the dbt package that's, that's produced around Zendesk, for example, tool that many people use, you look to that and the reason that you couldn't understand what it was doing was because it didn't follow your style guides. And it was a bit hard to understand. It just adds that little bit of extra friction to using it. And then when you bring it in, it makes queries in your database, the aren't styled the same way as everything else. And it just adds this extra layer of friction. Just imagine what it would be is if, as a whole community, we had a consistent way of writing SQL, what would that do for the way that not just. You work within your team, but the way that you hire and the way that you train the way that you could be confident when you're bringing someone else in, from a different organization, that they are already going to be onboard with the kind of style that you use and the way that enables us all to collaborate more on SQL projects together.

What next? First that this, this is while this started as my pet project, this is so much more than that now. I have put a shout-out to all of the people at the top as well. I think all the people I've probably missed someone. And if I have missed you, I'm really sorry. but there are so many people who've been parts of this project to get it to where it is today. And so many of them have come from the dbt community. I guess to the, to the Fishtown team here who are on the call, that SQLFluff wouldn't exist without the dbt community. And, thank you.

What next? So there are, there are depending on how you want to use SQLFluff. I can recommend two different releases. Now there's currently a current stable live release. It's what we use live on our dbt repository at tails. And you can get it by just going to pip install SQLFluff. It's currently the most stable version out there. If however, you are a) Interested in the new stuff.b) don't want to have to upgrade in a couple of weeks when we actually release the new version or c) if you have a project which uses macros significantly, or has some very big, complicated SQL files in it, or uses dbt packages extensivel: .I would highly recommend checking out the pre-release. The most recent one, I think is 0.4.0a3 out there at the moment. You can get it by saying pip install sqlfluff --pre to allow it to include pre releases. And this includes a lot of changes, but the notable ones here are around being able to use the dbt templating engine itself to compile your SQL before linting it, that means that, if you're using some of the more exotic macros things, don't break. It also means we have source mapping, so if you're using a lot of macros, the references to where the linting errors are in your code will map back to the source file, not to the compiled file, which would previously what would happen. It's also significantly faster. There are a selection of things that you've done in here, which are related to the source mapping together ones that make it a lot faster, particularly on large code bases.

And lastly its building the community. We have quite a few teams using it already, but it's a relatively small community using SQLFluff right now. The purpose of today's talk is to share it with you, the community. We think it's a great tool. We think people should use it. I think it has meaningful implications for, for how we all write SQL and collaborates on things, but without the community, it's somewhat useless. So come and find us!

github.com/sqlfluff