This is the written version of a lesson from my SQL Course, in which I introduce students to the awesome and very commonly used SQL GROUP BY clause.
If you want to follow along, here is the script you can use to create the employee table used in these examples: TablesCreation-script.
In a previous lesson you learned about aggregate functions, and you saw that they return only one row, and that the function was applied to all of the rows affected by the query.
That is pretty cool, but in reality, most of the times you will not use them this way, and that is why in this lesson I’m going to show you how to use them in conjunction with the GROUP BY clause, which is even cooler.
So, what is the SQL GROUP BY Clause?
Well, the idea is very simple. The GROUP BY Clause groups the result set into groups, and most of the times it is used with aggregate functions. When that is the case, the aggregate function is applied to each group, and the results include one row per group.
As always, an example is better than words, so, here we go.
This example is similar to one we used some lessons ago. I want to count, or get the number of, rows that exist in my employee table, but this time, I don’t want to know the total number of rows, but the number of rows per department.
I’m going to write the query, and then we will analyze it together.
SELECT department_id, COUNT(*) FROM employee GROUP BY department_id;
In line 1 I have the aggregate function, which in this case is COUNT, and in line 3 I have the GROUP BY clause, which is composed of the GROUP and BY keywords and a list of expressions, which in this case contains only one column.
In natural language, I’m telling the database something like this: “Retrieve all of the rows in the employee table and group them by deparment_id, and then, apply the COUNT function to each group”.
Let’s see what it returns:
As you see, I get one row for each department id, and I get the COUNT of employees for every department, or in other words, all employees with department id 1 were included in the first group, all employees with deparment id 2 where included in the second group, and so on.
There is a rule you need to keep in mind when writing this kind of query:
Every expression that appears in the SELECT list that is not an aggregate function or an expression including an aggregate function, MUST appear in the GROUP BY clause.
In this case I have deparment id in the SELECT list, and I have it in the GROUP BY clause. If you don’t comply with this rule, you will get an error.
On the contrary, however, you can have expressions in the GROUP BY clause that are not in the SELECT list.
And although most of the times you will use the group by clause with aggregate functions, you are not forced to do it. For example, if I don’t include the COUNT function in the previous example, I get only the list of deparment id’s, and this works the same as if I had use this query:
SELECT DISTINCT department_id FROM employee;
Not the most common way to use it, but you have to know it.
Okay, I’m going to show you something else:
Here is a modified version of the query, where I added the bonus column in 3 places: The SELECT list, the GROUP BY clause, and the ORDER BY clause:
SELECT department_id,bonus,count(*) FROM employee GROUP BY department_id, bonus ORDER BY department_id,bonus;
Now, let’s look at the results:
Notice that I got more rows in the results this time. Why is that?
That’s because in this query we are grouping by two columns: department_id and bonus, which means that in the first group we are including all employees that have Department Id 1 and Bonus 100. If some rows have the same department, but the bonus column is different, then they are not included in that group. And that is why in the second row we have the other group for department id 1, but bonus 200.
Another thing that can be useful to know is that the order of the expressions in the select list is not important regarding the columns included in the GROUP BY clause. As long as all of them appear in the GROUP BY clause, then you will be fine.
And, of course, you can add a WHERE clause to a query that includes a GROUP BY clause. In that case, the GROUP BY clause should go after the WHERE clause.
And One more thing:
This is not very common, but you can nest aggregate functions.
Take a look at this query:
SELECT MAX(SUM(salary)), AVG(SUM(salary)) FROM employee GROUP BY department_id;
See? I have the SUM nested into the call to MAX, and into the call to AVG.
Now, look at the results:
I have a GROUP BY clause in line 3, so, why did I get only one row as a result?
That’s because what happened is that, first, the rows were grouped by department id and then the SUMs where applied to every group, and after those results were calculated, the MAX and AVG functions were applied to them, and at that point, they worked in similar way than when you apply aggregate functions without a GROUP BY clause.
I don’t want to confuse you, so I will try to clarify this:
Let’s run the query only with the inner functions.
SELECT SUM(salary), SUM(salary) FROM employee GROUP BY department_id;
Take a look at the results:
Now imagine that this is your table and now you want to get the max of the first column. What would you get?
How many rows should you get? You want the max of those rows, right? So, the MAX is only one number, or in other words, you would get one row.
16500 is the greatest value, right? So, that is what my original query should give me.
And as for the AVG, to get it I would need to sum all the values in the second column and divide it by 4, because I have 4 rows, right?
And how many values should I get? Only one: The average of all of those values.
So, when you nest aggregate functions, this is what happens, and that is why you get only one row even though you have a GROUP BY clause.
Okay, time for your next practice. To work on this challenge you need to download and run the script mentioned at the beginning of the article. Go ahead and download it if you haven’t.
This is the problem description:
Write a query to list the different bonuses from the employee table, along with the number of employees that earn that bonus, and the greatest salary for employees in that group.
Please include only employees who were born before 1995.
And this is the expected output:
Post your solution in the comments section below!