This time I’m just writing to tell you a story about the WITH clause, because it is really powerful and vastly underutilized.
You can read the basics about the WITH clause in my previous article about subqueries.
So, this is the story about how it helped a couple of guys get some desired results in an efficient way.
One of the guys was working on a database-related task a few days ago, and asked the other guy for his opinion about how to get some results he needed for his task.
He had a query that returned a list of accounts that had more than 4 rows in a transactions table. From this description, the query was something like this:
SELECT Account
FROM transactions
GROUP BY Account
HAVING COUNT(Account) > 4;
But now he needed to list those transactions that were counted in the previous query, so he was thinking about using the above select as a subquery, but wasn’t sure how to do it.
So, the first thing that was suggested was something like this:
SELECT *
FROM transactions t
WHERE EXISTS
(
SELECT Account
FROM transactions
GROUP BY Account
HAVING COUNT(Account) > 4
AND account = t.account
);
NOTE: This query didn’t really need to be correlated. Something like this would have been more efficient:
SELECT *
FROM transactions
WHERE account IN
(
SELECT Account
FROM transactions
GROUP BY Account
HAVING COUNT(Account) > 4
);
But it turned out that, in reality, the grouped query included some conditions, and was really something like this:
SELECT Account
FROM transactions
WHERE code <> 'something'
AND DATE > 'some date'
AND account LIKE 'something%'
GROUP BY Account
HAVING COUNT(Account) > 4;
And thus, the suggested query was returning more rows than expected: the conditions were applied only inside the subquery, so the main query still listed every transaction of the qualifying accounts, including the ones the conditions should have excluded. So, the second guy’s first reaction was: “Ah, OK, just add the same conditions to the main query and it should work”, but it was starting to get a little ugly.
He tested it, but it turned out to be too inefficient, so he had to cancel it before it returned any results. The table had millions of rows, and the columns referenced in the conditions were not indexed.
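Neither the “more rows than expected” symptom nor the repeated-conditions fix is spelled out above, so here is a minimal sketch of both on a made-up miniature table. The columns code and tx_date, the sample rows, and the literals 'VOID', '2024-01-01' and 'ACME%' are assumptions for illustration that stand in for the real placeholders; dates are stored as ISO strings only to keep the script portable.

```sql
-- Hypothetical miniature of the transactions table (names and data are assumptions).
CREATE TABLE transactions (
    id      INTEGER PRIMARY KEY,
    account VARCHAR(20),
    code    VARCHAR(10),
    tx_date VARCHAR(10)   -- ISO date stored as text, for portability only
);

-- ACME-1: five rows that pass the filters, plus one VOID row that does not.
INSERT INTO transactions (id, account, code, tx_date) VALUES (1, 'ACME-1', 'SALE', '2024-02-01');
INSERT INTO transactions (id, account, code, tx_date) VALUES (2, 'ACME-1', 'SALE', '2024-02-02');
INSERT INTO transactions (id, account, code, tx_date) VALUES (3, 'ACME-1', 'SALE', '2024-02-03');
INSERT INTO transactions (id, account, code, tx_date) VALUES (4, 'ACME-1', 'SALE', '2024-02-04');
INSERT INTO transactions (id, account, code, tx_date) VALUES (5, 'ACME-1', 'SALE', '2024-02-05');
INSERT INTO transactions (id, account, code, tx_date) VALUES (6, 'ACME-1', 'VOID', '2024-02-06');
-- ACME-2: only two rows, so it never reaches the "more than 4" threshold.
INSERT INTO transactions (id, account, code, tx_date) VALUES (7, 'ACME-2', 'SALE', '2024-02-07');
INSERT INTO transactions (id, account, code, tx_date) VALUES (8, 'ACME-2', 'SALE', '2024-02-08');

-- Conditions only in the subquery: the outer query is unfiltered, so the
-- VOID row of ACME-1 comes back too (ids 1 to 6).
SELECT t.id
FROM transactions t
WHERE EXISTS
(
    SELECT account
    FROM transactions
    WHERE code <> 'VOID'
    AND tx_date > '2024-01-01'
    AND account LIKE 'ACME%'
    GROUP BY account
    HAVING COUNT(account) > 4
    AND account = t.account
)
ORDER BY t.id;

-- Same conditions repeated in the main query: only the counted rows remain
-- (ids 1 to 5), but the table is now filtered in two places.
SELECT t.id
FROM transactions t
WHERE t.code <> 'VOID'
AND t.tx_date > '2024-01-01'
AND t.account LIKE 'ACME%'
AND EXISTS
(
    SELECT account
    FROM transactions
    WHERE code <> 'VOID'
    AND tx_date > '2024-01-01'
    AND account LIKE 'ACME%'
    GROUP BY account
    HAVING COUNT(account) > 4
    AND account = t.account
)
ORDER BY t.id;
```

The second query returns the right rows, but every condition now appears twice, which is the “a little ugly” part.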
So, they thought about subquery factoring to filter the table just once instead of doing it in the main query and the subquery, and they came up with something like this:
WITH temp AS
(
SELECT *
FROM transactions
WHERE code <> 'something'
AND DATE > 'some date'
AND account LIKE 'something%'
)
SELECT *
FROM temp t
WHERE EXISTS
(
SELECT Account
FROM temp
GROUP BY Account
HAVING COUNT(Account) > 4
AND Account = t.Account
);
This one performed the full table scan only once in the WITH clause, and then listed and grouped the filtered results in the main query.
This version gave results a little faster, but still took several seconds to return.
The results were correct, but the first guy wasn’t sure he understood how the EXISTS condition was affecting the results of the main query. The second guy started to explain how it worked, and then it dawned on him: this query was still very inefficient because the subquery in the EXISTS condition was a correlated one, so the grouping and counting were being performed once for each row in the potential result set of the main query!
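One way to picture what the correlated EXISTS means is to run its subquery by hand for a single outer row, replacing t.Account with that row’s account. The sketch below does this on a made-up miniature table; the columns code and tx_date, the data and the literal values are all assumptions for illustration. Whether a given database really re-executes the subquery for every row depends on its optimizer, so this shows only the logical meaning.

```sql
-- Hypothetical miniature of the transactions table (names and data are assumptions).
CREATE TABLE transactions (
    id      INTEGER PRIMARY KEY,
    account VARCHAR(20),
    code    VARCHAR(10),
    tx_date VARCHAR(10)   -- ISO date stored as text, for portability only
);

INSERT INTO transactions (id, account, code, tx_date) VALUES (1, 'ACME-1', 'SALE', '2024-02-01');
INSERT INTO transactions (id, account, code, tx_date) VALUES (2, 'ACME-1', 'SALE', '2024-02-02');
INSERT INTO transactions (id, account, code, tx_date) VALUES (3, 'ACME-1', 'SALE', '2024-02-03');
INSERT INTO transactions (id, account, code, tx_date) VALUES (4, 'ACME-1', 'SALE', '2024-02-04');
INSERT INTO transactions (id, account, code, tx_date) VALUES (5, 'ACME-1', 'SALE', '2024-02-05');
INSERT INTO transactions (id, account, code, tx_date) VALUES (6, 'ACME-2', 'SALE', '2024-02-06');

-- Outer row belongs to ACME-1: the subquery yields one row,
-- so EXISTS is true and the outer row is kept.
WITH temp AS
(
    SELECT *
    FROM transactions
    WHERE code <> 'VOID'
    AND tx_date > '2024-01-01'
    AND account LIKE 'ACME%'
)
SELECT account
FROM temp
GROUP BY account
HAVING COUNT(account) > 4
AND account = 'ACME-1';

-- Outer row belongs to ACME-2: the subquery yields no rows,
-- so EXISTS is false and the outer row is dropped.
WITH temp AS
(
    SELECT *
    FROM transactions
    WHERE code <> 'VOID'
    AND tx_date > '2024-01-01'
    AND account LIKE 'ACME%'
)
SELECT account
FROM temp
GROUP BY account
HAVING COUNT(account) > 4
AND account = 'ACME-2';
```

Logically, the same grouping and counting is repeated for every candidate row, even though the answer only changes from one account to the next.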
They were still underutilizing the power of the WITH clause.
After they realized that, this is the query they ended up using:
WITH temp AS
(
SELECT *
FROM transactions
WHERE code <> 'something'
AND DATE > 'some date'
AND account LIKE 'something%'
)
, temp2 AS
(
SELECT Account
FROM temp
GROUP BY Account
HAVING COUNT(Account) > 4
)
SELECT *
FROM temp t1
JOIN temp2 t2
ON t1.Account = t2.Account;
So, as you can see, the first subquery filters the table, which is done only once. The second subquery groups and counts the already filtered rows obtained from the first one, which is done only once too. The main query then simply joins those two temporary result sets.
This version returned its results immediately.
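To see what each stage contributes, the same query shape can be tried on a made-up miniature table. Everything concrete here (the columns code and tx_date, the sample rows, the literal values) is an assumption for illustration, not the real data.

```sql
-- Hypothetical miniature of the transactions table (names and data are assumptions).
CREATE TABLE transactions (
    id      INTEGER PRIMARY KEY,
    account VARCHAR(20),
    code    VARCHAR(10),
    tx_date VARCHAR(10)   -- ISO date stored as text, for portability only
);

INSERT INTO transactions (id, account, code, tx_date) VALUES (1, 'ACME-1', 'SALE', '2024-02-01');
INSERT INTO transactions (id, account, code, tx_date) VALUES (2, 'ACME-1', 'SALE', '2024-02-02');
INSERT INTO transactions (id, account, code, tx_date) VALUES (3, 'ACME-1', 'SALE', '2024-02-03');
INSERT INTO transactions (id, account, code, tx_date) VALUES (4, 'ACME-1', 'SALE', '2024-02-04');
INSERT INTO transactions (id, account, code, tx_date) VALUES (5, 'ACME-1', 'SALE', '2024-02-05');
INSERT INTO transactions (id, account, code, tx_date) VALUES (6, 'ACME-1', 'VOID', '2024-02-06');
INSERT INTO transactions (id, account, code, tx_date) VALUES (7, 'ACME-2', 'SALE', '2024-02-07');
INSERT INTO transactions (id, account, code, tx_date) VALUES (8, 'ACME-2', 'SALE', '2024-02-08');

-- Stage 1 (temp):  the filtered rows, ids 1 to 5, 7 and 8 (the VOID row is gone).
-- Stage 2 (temp2): accounts with more than 4 filtered rows, here only ACME-1.
-- Main query:      temp joined to temp2, which leaves ids 1 to 5.
WITH temp AS
(
    SELECT *
    FROM transactions
    WHERE code <> 'VOID'
    AND tx_date > '2024-01-01'
    AND account LIKE 'ACME%'
)
, temp2 AS
(
    SELECT account
    FROM temp
    GROUP BY account
    HAVING COUNT(account) > 4
)
SELECT t1.id, t1.account, t1.code, t1.tx_date
FROM temp t1
JOIN temp2 t2
ON t1.account = t2.account
ORDER BY t1.id;
```

Listing the columns of t1 explicitly, instead of using SELECT *, also avoids getting the account column twice in the output.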
Sometimes one tends to do things using the first approach that comes to mind, even though it is often not the most efficient way to do it. Fortunately, sometimes performance is so unacceptable that it forces us to try some other method.
And in case you are wondering, yes, using analytic functions makes this task actually simpler and even more efficient, but I thought this story was a good way to show you how the WITH clause can help optimize queries in some situations:
SELECT *
FROM
(
SELECT t.*, COUNT(*) OVER (PARTITION BY t.account) account_count
FROM transactions t
WHERE code <> 'something'
AND DATE > 'some date'
AND account LIKE 'something%'
)
WHERE account_count > 4;
Which for readability purposes could be written this way:
WITH trans AS
(
SELECT t.*, COUNT(*) OVER (PARTITION BY t.account) account_count
FROM transactions t
WHERE code <> 'something'
AND DATE > 'some date'
AND account LIKE 'something%'
)
SELECT *
FROM trans
WHERE account_count > 4;
In what other ways have you used the WITH clause for query optimization?
Share your wisdom in the discussion section below!