SQL allows you to do many things, but encapsulating logic isn't one of them…unless you understand the magic of Common Table Expressions (CTEs).
Recently, I've been doing a lot of data analysis. A lot of data analysis. I'm reviewing legacy data on a system that's been in production for over 20 years, and I’m trying to do forensic analysis on it. One of my projects is implementing some enhanced logic, but before it can be put into service, the stakeholders have to approve the changes, and the only way that happens is if I can provide detailed before and after numbers for the auditors. That means finding a way to apply ad hoc complex filters quickly and effectively in order to separate high- and low-impact effects.
They Be Stealin' My Bucket!
Sorry for the gratuitous use of an Internet meme; it happens to be one of my son's favorites. If you want to find it, just Google LOLRUS. But I digress; I just wanted a fun way to bring the word bucket into play, because that's the focus of this article. One of the most common filtering tasks I have to do during this analysis is to break up the data I'm summarizing into buckets by date. This wouldn't be too difficult except that the buckets are somewhat arbitrary, and they also change based on the subset of the data I'm looking at. Older data can be broken down by years, while newer data has to be broken down by months or even weeks. And the really tricky bit is that I don't know what the buckets need to be until I actually start breaking the data down.
And here's the best part: this is legacy data, which means the dates are stored as decimal values, not true dates. And we all know just how easy it is to perform date arithmetic with decimal numbers in SQL, right? Let's take a peek at the sort of calculation I have to do and see if you recognize the issue as one you've run into yourself. Let's say that, for the purposes of this example, I need to summarize production quantities into buckets. The calculations are a bit complex, including production quantities and values, but in order to keep the example manageable, I'll use only a single data point, the production quantity. So basically, I need to go through the file PRODHIST and summarize the quantity produced (PHQTY). Summarization needs to be by manufacturing site (PHSITE) as well as into buckets by date.
The date field is receipt date, PHRCDT, an eight-digit field. I'm going to give you the first, simplest breakdown: by one year, two years, and more than two years. We'll run off of July 1 of this year. Here's the first pass:
select PHSITE,
  sum(case when PHRCDT < 20150701
    then PHQTY else 0 end) YR2PLUS,
  sum(case when PHRCDT >= 20150701 and PHRCDT < 20160701
    then PHQTY else 0 end) YR2,
  sum(case when PHRCDT >= 20160701 and PHRCDT < 20170701
    then PHQTY else 0 end) YR1,
  sum(case when PHRCDT >= 20170701
    then PHQTY else 0 end) FUTURE
from PRODHIST group by PHSITE
What this will do is sum the PHQTY by PHSITE into four buckets.
Site   YR2PLUS         YR2            YR1            FUTURE
EAST   10,100,330.45   1,554,612.75   1,653,212.70   140,214.21
(...)
It's pretty simple. And it's relatively easy when you have hardcoded cutoffs. But what if you need to change the bucketing, perhaps to something a little more granular? As an example, let's look at the FUTURE bucket. We included it only on a whim because we know there can be no receipts later than today, but remember this is legacy data and you never know what you might find. So now the auditors would like all future data broken down by months, something like this:
select PHSITE, PHCLAS,
  sum(case when PHRCDT >= 20170701 and PHRCDT < 20170801
    then PHQTY else 0 end) MO1,
  sum(case when PHRCDT >= 20170801 and PHRCDT < 20170901
    then PHQTY else 0 end) MO2,
  sum(case when PHRCDT >= 20170901 and PHRCDT < 20171001
    then PHQTY else 0 end) MO3,
  sum(case when PHRCDT >= 20171001
    then PHQTY else 0 end) MO4PLUS
from PRODHIST group by PHSITE, PHCLAS
This wasn't terribly hard, because the number of columns was the same and, again, I had hardcoded dates. All that was needed was a few modifications to the existing buckets and an extra comparison on the first one to skip older data. Once again, pretty simple. But now let's do a thought exercise. Say the auditors love this, but they want to run it regularly based on the current date. Ah, now things get interesting.
Date Arithmetic in SQL
This shouldn't be too hard, right? Basically, I just want something like this:
sum(case when PHRCDT >= CURRENT_DATE
    and PHRCDT < CURRENT_DATE + 1 MONTH
  then PHQTY else 0 end) MO1,
I have to repeat this for each bucket, but I think that would be acceptable if it weren't for the pesky fact that PHRCDT isn't a date field. It's a legacy standard eight-digit decimal field, and that's not going to work very well. Now, I could update the statement to dynamically convert one of the operands or the other so that they are compatible. Back in the olden days that might be something like this:
sum(case
  when date(substr(digits(PHRCDT),1,4) || '-' ||
            substr(digits(PHRCDT),5,2) || '-' ||
            substr(digits(PHRCDT),7,2)) >= CURRENT_DATE
   and date(substr(digits(PHRCDT),1,4) || '-' ||
            substr(digits(PHRCDT),5,2) || '-' ||
            substr(digits(PHRCDT),7,2)) < CURRENT_DATE + 1 MONTH
  then PHQTY else 0 end) MO1,
Just imagine that repeated for every bucket. Suddenly a simple bucketing explodes into dozens of lines of code that are easy to break with a simple transposition. I've actually seen code like this, and I'll wager you have, too. The other option is to convert the date to CCYYMMDD, something I talked about in a previous article on date formatting. It works, but it isn't much cleaner:
sum(case
  when PHRCDT >=
       decimal(to_char(CURRENT_TIMESTAMP, 'YYYYMMDD'),8,0)
   and PHRCDT <
       decimal(to_char(CURRENT_TIMESTAMP + 1 MONTH, 'YYYYMMDD'),8,0)
  then PHQTY else 0 end) MO1,
There are other ways to convert dates to decimal, but they're no easier to understand, and I still need to repeat whatever I do for every bucket. Twice, in fact. And if the auditors need me to summarize another field (say, the count of receipts), I then have to duplicate that whole mess again. The readability is way down, the chance for a typo is way up, and all in all this is just going in a bad direction.
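Just so you can see what I mean, here's a sketch of one of those alternatives (my own illustration, not code from the actual analysis): building the CCYYMMDD number arithmetically with the YEAR, MONTH, and DAY scalar functions instead of going through a character string. It avoids TO_CHAR, but I wouldn't call it any clearer.
-- Same MO1 bucket, converting the current date to CCYYMMDD arithmetically
sum(case
  when PHRCDT >= year(CURRENT_DATE) * 10000
       + month(CURRENT_DATE) * 100 + day(CURRENT_DATE)
   and PHRCDT < year(CURRENT_DATE + 1 MONTH) * 10000
       + month(CURRENT_DATE + 1 MONTH) * 100 + day(CURRENT_DATE + 1 MONTH)
  then PHQTY else 0 end) MO1,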
Common Table Expressions to the Rescue!
And now to the solution, our good friend the Common Table Expression, or CTE! With the CTE, I can do all the bucketing calculations up front and then use those results for my buckets, no matter how many I need. I can easily change the bucketing in one place.
First, let's look at the modified code.
-- Define bucket cutoff dates
with t1(LOWEST, B1CUT, B2CUT, B3CUT) as (values (
  decimal(to_char(CURRENT_TIMESTAMP,'YYYYMMDD'),8,0),
  decimal(to_char(CURRENT_TIMESTAMP + 1 MONTHS,'YYYYMMDD'),8,0),
  decimal(to_char(CURRENT_TIMESTAMP + 2 MONTHS,'YYYYMMDD'),8,0),
  decimal(to_char(CURRENT_TIMESTAMP + 3 MONTHS,'YYYYMMDD'),8,0)))
-- Summarize production into buckets
select PHSITE, PHCLAS,
  sum(case when PHRCDT >= LOWEST and PHRCDT < B1CUT
    then PHQTY else 0 end) BUCKET1,
  sum(case when PHRCDT >= B1CUT and PHRCDT < B2CUT
    then PHQTY else 0 end) BUCKET2,
  sum(case when PHRCDT >= B2CUT and PHRCDT < B3CUT
    then PHQTY else 0 end) BUCKET3,
  sum(case when PHRCDT >= B3CUT
    then PHQTY else 0 end) BUCKET4
from PRODHIST cross join T1 group by PHSITE, PHCLAS
The definition of the CTE T1 allows us to create as many buckets as we need and assign names to all of them. In this case, the lowest summarized date is today's date, and subsequent cutoffs are today plus one month, plus two months, and plus three months. That particular code is in its own section.
In the next, separate section of the SQL statement, I use those values to summarize my data. I've changed the names of the resulting sums to the more generic BUCKET1–BUCKET4, which would make it pretty easy to extend the code. Also, you might notice the use of the rather rare CROSS JOIN clause. It's rare because it joins every record in the first table with every record in the second table, which isn't something you normally want. But it makes perfect sense here, because the T1 CTE has only one record, and I want that record available for every record in the PRODHIST file.
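Remember the worry about summarizing another field, such as the count of receipts? With the cutoffs named in T1, that's just one more column per bucket, and none of the date logic gets repeated. Here's a sketch (CNT1 is my own illustrative name, not part of the original statement):
-- Count of receipts in the first bucket, reusing the same named cutoffs
sum(case when PHRCDT >= LOWEST and PHRCDT < B1CUT
  then 1 else 0 end) CNT1,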
So now, whenever the auditors want a slightly different bucketing of the results, I just change the definition of T1. If I'm particularly lazy (I mean efficient!), I write the statement once with some arbitrarily large number of buckets, say 10. If the auditors then ask for only 5 buckets, I just repeat the last cutoff for buckets 6–10 and don't include those columns in the report I hand back to them (something like the sketch below). Done!
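Here's an abbreviated version of that trick, my own variation on the T1 definition above, with five cutoffs rather than a full ten to keep it readable. The first three cutoffs are the ones actually requested; the rest just repeat the three-month value, so the extra buckets in the middle come out to zero and the final open-ended bucket still catches everything beyond three months, exactly as BUCKET4 did before.
-- Cutoffs defined once; unused ones simply repeat the last real value
with t1(LOWEST, B1CUT, B2CUT, B3CUT, B4CUT, B5CUT) as (values (
  decimal(to_char(CURRENT_TIMESTAMP,'YYYYMMDD'),8,0),
  decimal(to_char(CURRENT_TIMESTAMP + 1 MONTHS,'YYYYMMDD'),8,0),
  decimal(to_char(CURRENT_TIMESTAMP + 2 MONTHS,'YYYYMMDD'),8,0),
  decimal(to_char(CURRENT_TIMESTAMP + 3 MONTHS,'YYYYMMDD'),8,0),
  -- only three months out were requested, so the remaining cutoffs repeat
  decimal(to_char(CURRENT_TIMESTAMP + 3 MONTHS,'YYYYMMDD'),8,0),
  decimal(to_char(CURRENT_TIMESTAMP + 3 MONTHS,'YYYYMMDD'),8,0)))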
So, I hope you get a feel for how you can use CTEs to encapsulate complex logic in your SQL statements. This will reduce duplication and make your code less error-prone and more future-proof. Have fun with CTEs!