SQL has many features to group and summarize, but you need a little know-how to find the detail under the summaries.
If you know me, you know I love Common Table Expressions (or CTEs). In today's article, we're going to take advantage of CTEs and aggregate functions to learn how to find data that matches group conditions. Group conditions come up in many situations, whether it's looking for duplicate items, finding completed orders, or identifying mismatches on denormalized data. I'll show you how to use those group conditions to retrieve the underlying data.
Why Is This an Issue?
Think of it this way: in order to find group conditions, you group your data rows by key values. You then use aggregate functions such as COUNT and SUM to find the group conditions. In so doing, you by definition lose any other values in those rows. Let's give a simple example and the traditional solution:
select CUSTNO from CUSTMAST
group by CUSTNO having COUNT(*) > 1
order by CUSTNO
Not much to it, but it really underscores the issue. If I give the results of this query to users in order to help ascertain the problem, they're going to immediately want to at the very least know the name of the customer that's duplicated. (And yes, I know that you can avoid this sort of thing with unique keys, but not all systems have their data models set up that completely, and not all group conditions lend themselves to database constraints.) In the past, we've been able to easily enough give them at least one of the names this way:
select CUSTNO, max(CUSTNAME) from CUSTMAST
group by CUSTNO having COUNT(*) > 1
order by CUSTNO
Ah, very clever! We use another aggregate function, MAX, in order to at least select one of the customer names. Quick quiz question: what happens if you don't use MAX and instead just try to include the CUSTNAME field in the list? Answer: you get a "Column or expression in select list not valid" error! But you knew that already. If you think about it, though, the technique above is pretty limited. It only gets you one value, and if you want to add more fields, you have to do MAX on all of them, which means you might be getting values from different rows. All in all, not an optimal solution. Enter the CTE:
with CustDups as
(select CUSTNO from CUSTMAST
group by CUSTNO having COUNT(*) > 1)
select * from CUSTMAST
where CUSTNO in (select CUSTNO from CustDups)
order by CUSTNO, CUSTNAME
There we go! Now I'll show all columns of every row that has a duplicate. I can, of course, limit the rows for the user by specifying a list of fields rather than the asterisk in my SELECT list.
Other Syntaxes for Other Uses
I could have done this with a JOIN as well:
with CustDups as
(select CUSTNO from CUSTMAST
group by CUSTNO having COUNT(*) > 1)
select * from CUSTMAST join CustDups using (CUSTNO)
order by CUSTNO, CUSTNAME
It's a little more concise, but has its pros and cons. The biggest benefit of the JOIN approach makes itself felt when you're doing more analytical types of inquiries:
with OrderSum as
(select ORDNO, SUM(QTYORD) Ordered, SUM(QTYRCV) Received
from ORDDTL group by ORDNO)
select * from OrderSum join ORDDTL using (ORDNO)
where Received >= Ordered
order by ORDNO, ORDLIN
This is a pretty simple case: I'm showing the detail lines for every order where the total received quantity is greater than or equal to the total ordered quantity. Note that this may or may not be a particularly useful query, since an over-received line might cancel out an under-received one. Rather than check an aggregate condition, you might want to test each line individually. For example, you might want to show all the lines for every order that has at least one over-received line. That would be cool!
You could do that with this query:
with OverOrders as
(select ORDNO from ORDTL where QTYRCV > QTYORD
group by ORDNO)
select * from ORDDTL
where (ORDNO) in (select ORDNO from OverOrders)
order by ORDNO, ORDLIN
Slick, right? Although I know some of you might be saying we could replace the SELECT..GROUP BY with a simple SELECT DISTINCT, like this (I'm only going to show the CTE definition, which is the part that defines the temporary table OverOrders):
with OverOrders as
(select distinct ORDNO from ORDTL where QTYRCV > QTYORD)
You'd be right. Just using the SELECT DISTINCT would allow me to reduce a little of the complexity. I love DISTINCT; it allows me to do quite a few things, one of which I'll show you in a second. But the first syntax allows you a little finer-grained control. Let's imagine, for example, you needed to know situations where more than one line was over-received. You could do that with just a simple extension of our first query:
with OverOrders as
(select ORDNO from ORDTL where QTYRCV > QTYORD
group by ORDNO having COUNT(*) > 1)
Like before, the clause above is just the definition of the CTE that identifies the orders that meet the criteria, not the subsequent SELECT that brings in the detail data. You can see that using the GROUP BY syntax allows you to use the COUNT aggregate function and would, by extension, allow you to use any aggregate function, such as SUM or MAX. It's a very powerful technique.
A Last Challenging Problem
OK, maybe it's not all that challenging, but it took me quite a while to figure it out. Now that I know the secret, though, I find myself using it all the time. The generic situation is simple: give me a list of detail for groups of records where some field in the row has more than one value within that group. The generic description is a little vague, so let me give you a more concrete example: give me a list of all the PO lines for items that we buy from multiple vendors. Yes, there may be a good reason to buy from multiple vendors, but there could also be an opportunity for savings by using a single source. So how do we do that? Simple:
with ItemVendor as
(select distinct POITEM, POVEND from PODTL),
MultiVend as
(select POITEM from ItemVendor
group by POITEM having COUNT(*) > 1)
select * from PODTL
where POITEM in (select POITEM from MultiVendor)
This is a multi-step process. First, I use DISTINCT to create a list of item/vendor combinations. Then, I use GROUP BY to narrow that initial list to only those items purchased from more than one vendor. Why do I have to do this? Because I can't just get a list of all items with more than one line; the lines could all be from the same vendor. I first have to coalesce all the lines for each item/vendor and then use those to identify the items that truly are purchased from multiple vendors. Finally, I take that list and use it to select all the detail lines from the original detail table. At the end, we have a list of all the PO lines for every item that we purchase from different vendors. And because we're going back to the underlying detail, the query includes things like date and price. That's the sort of analytics that can really help you make business decisions—and without having to go to an expensive BI solution!
This should give you some ideas on how using aggregation can get you the data you need for your business. Happy data-diving!
LATEST COMMENTS
MC Press Online