Subqueries are a very powerfull feature of SQL, but I have seen them being misused or underused many times, and that is why I decided to write about them today.
For starters, a subquery is a SELECT statement that is included or nested within another SQL statement, which can be another SELECT or an INSERT, UPDATE or DELETE. Subqueries are always enclosed within parentheses.
In most cases, subqueries are executed prior to the execution of the main statement where they are nested in, with the exception of correlated subqueries, which are covered some paragraphs below.
The result set produced by a subquery is only kept in memory during the life of the main statement, that is, when the main statement’s execution finishes, the subquery’s result set is discarded.
Here is a very simple example:
SELECT * FROM table_A WHERE column_X IN ( SELECT column_X FROM table_B );
In this example the subquery is what you see in lines 5 and 6, so the SELECT from table_B is run first, then the query from table_A is executed using the result set from table_B as the filter criteria, and when execution finishes, the result set from the subquery is discarded.
One of the ways subqueries can be categorized is based on whether they reference columns from the main statement or not. When they do reference columns from the outer statement they are called correlated subqueries, and when they don’t reference anything outside themselves they are called non-correlated subqueries. Let’s review these types with a little more detail.
As mentioned earlier, they reference one or more columns from the outer or main statement (which can be a SELECT, UPDATE or DELETE). So, since they are related to the main query, they cannot be executed alone prior to executing the main statement. Instead, they are executed once for each candidate row that can possibly be returned or affected by the main statement.
Here is a simple example of a correlated subquery:
SELECT * FROM customers c WHERE 100 < ( SELECT COUNT ( * ) FROM orders WHERE customer_id = c.customer_id );
As you can see, "c.customer_id" appears in the subquery's WHERE clause, where "c" is the alias for the customers table in the main query.
But the WHERE clause is not the only place where you can find correlated subqueries. Check this UPDATE example, where the subquery is in the SET clause:
UPDATE customers c SET orders = ( SELECT COUNT( * ) FROM orders WHERE customer_id = c.customer_id ) WHERE customer_type = 1;
Or this query where it is in the SELECT list:
SELECT customer_id, name, ( SELECT COUNT( * ) FROM orders WHERE customer_id = c.customer_id ) orders_count FROM customers c;
The problem with correlated subqueries, is that since they are evaluated once per each candidate row from the main statement, they can severely affect performance, and I have seen people misusing them quite a few times.
So, to get the same results than the above query, I could use this one, which doesn't need any subquery:
SELECT customer_id, name, COUNT( * ) orders_count FROM customers c JOIN orders o USING (customer_id) GROUP BY customer_id, name;
In general, which one do you think would tend to perform better? Share your thoughts in the discussion section at the end of the article 😉
Well, fortunately the query optimizer is getting smarter in every new version of the database, and in many situations it is smart enough to see that a query with a correlated subquery can be transformed into a different and more efficient query, so it ends up running probably a join query or some other type of statement that is more efficient than the one we were telling it to run. However, this happens many times, but not always.
So, correlated subqueries can be very useful, but always make sure that using them is the best way to approach a problem.
As you might have guessed, these are the ones that are independent of the main statement, such as the first example included in the article.
Non-correlated subqueries are executed once before the execution of the main statement, and can be included in all places where a correlated subquery can be included, plus some other places. For example, here is an INSERT statement, which uses a non-correlated subquery:
INSERT INTO departments ( department_id, department_name, division_id ) VALUES ( 100, 'DEVELOPMENT', ( SELECT division_id FROM divisions WHERE division_name = 'INFORMATION TECHNOLOGY' ) );
And they can even appear in the HAVING clause of a SELECT statement.
Scalar Subqueries (Single Row, Single Column)
Subqueries that return only one row and only one column are sometimes called "scalar" subqueries, because they return a single value.
The INSERT from the above example includes a scalar subquery, which was used to get the id for the IT division. Scalar subqueries are the only type of subquery that can be included in the VALUES clause of an INSERT statement.
Other types, based on the number of rows and columns returned
There are other combinations of number of columns/rows returned by the subquery, which have some importance, because those numbers affect the places where a determined subquery can be used.
For example, when you use the IN operator in the WHERE clause of a query, you can use one that returns several rows, regardless of the number of columns it returns. For example, if you are filtering by one column, you can do something like this:
SELECT * FROM customers WHERE customer_id IN ( SELECT customer_id FROM orders WHERE amount > 1000 );
But you can also do something like this, to use the IN operator to compare more than one column:
SELECT * FROM customers WHERE (customer_id,last_order) IN ( SELECT customer_id,placement_date FROM orders WHERE amount > 1000 );
And you could also do something similar in the SET clause of an UPDATE statement, but in this case the subquery must return only one row (can be correlated or non-correlated):
UPDATE customers c SET (sales, last_order) = ( SELECT COUNT( * ), MAX(placement_date) FROM orders WHERE customer_id = c.customer_id ) WHERE customer_id BETWEEN 1 AND 10000;
On a similar note, when you use the ANY or ALL special keywords with comparison operators, you can have a subquery that returns several rows, but only one column.
So, in general, the number of rows and columns returned by a subuquery determines whether it can be used in different situations or places of a main statement, and you shouldn't need to try to memorize those restrictions, because, in general, they just make sense.
An inline view is simply a subquery that appears in the FROM clause of a SELECT statement. They can be used in other parts of the main query as you would with a table or view (reference its columns, join it to other data sources, etc) . Inline views are always non-correlated subqueries.
The concept is very simple, but its use is very powerful.
Here is an example of an inline view:
SELECT customer_id, c.name, COUNT( * ) orders_count FROM customers c JOIN ( SELECT * FROM orders WHERE amount < 100 ) o USING (customer_id) WHERE o.placement_date > '01-JAN-2000' GROUP BY customer_id, c.name;
The example shows an inline view from a single table, but you can have a complete SELECT statement there, with joins, functions, grouping, etc.
The only additional consideration is that you have to provide an alias for the inline view, which is also pretty obvious, because otherwise you wouldn't have a way to reference it from other parts of the query.
Subquery Factoring (The WITH Clause)
The WITH clause, or subquery factoring clause as it's also called, was introduced in version 9 of the database, and what it does is that it allows you to declare subqueries (can be more than one) that are treated in a similar way than inline views, but can be used in more than one place in the same main query (without having to write it more than once, and without needing to be executed more than once).
So, the example used in the inline views section can be rewritten using the WITH clause, this way:
WITH small_orders AS ( SELECT * FROM orders WHERE amount < 100 ) SELECT customer_id, c.name, COUNT( * ) orders_count FROM customers c JOIN small_orders o USING (customer_id) WHERE o.placement_date > '01-JAN-2000' GROUP BY customer_id, c.name;
Here is an example that shows how you can declare more than one subquery using the WITH clause. The example is equivalent to the above query, and doesn't make sense to be written like this. My only goal here is to show you how to do it.
WITH small_orders AS ( SELECT * FROM orders WHERE amount > 100 ), c AS ( SELECT * FROM customers ) SELECT customer_id, c.name, COUNT( * ) orders_count FROM c JOIN small_orders o USING (customer_id) WHERE o.placement_date > '01-JAN-2000' GROUP BY customer_id, c.name;
For an example of how it can help you build more efficient queries, take a look at this article.
So, as you can see, subqueries can be included in all DML statements, and they can appear in many different places depending on their type and the kind of results they return. They can be very powerful and are sometimes underutilized, but can also be misused making SQL statements inefficient.
Here I only covered the basics, so that you have an idea of what is possible, but there can be restrictions, features, and other interesting uses for subqueries that were not covered here, and you might want to research and/or test on your own.
When writing code for applications or scripts that are going to be run more than once, it is always a good idea to test and profile different approaches, to see what the better option is, taking performance and code readability in consideration.
Have something to say, suggest, or ask? Share your wisdom below!