Analytic functions in Oracle SQL are very useful and are utilized by many people, but sometimes without completely understanding how they work or what data they operate on.
This article does not focus on a specific function, but on analytic functions in general.
I decided to write it because I have seen people being confused by the results they get when they make even a very small change in the analytic clause of a function call, which is the part that determines on which data the function will operate and how it will calculate the results, and also because several friends have asked me to write about them in the past.
So, if you are new to analytic functions, or have only used them sporadically without completely understanding how they work, you might find this article useful.
So, what are analytic functions?
Analytic functions are functions that calculate aggregate values based on a group of rows, so they are similar to aggregate functions, but the difference is that as opposed to aggregate functions, analytic functions return multiple rows for each group.
They are very useful for performing complex calculations on data, and although most of these operations can be performed using other features, such as joins, subqueries, user defined functions, etc, analytic functions usually allow you to write simpler and more efficient queries.
Take a look at this example:
SELECT e.*, COUNT( * ) OVER (PARTITION BY department_id) AS the_count FROM employee e ORDER BY department_id,id;
Can you tell me what this query does?
Well, it lists all of the records in my employee table, but besides the columns from the table, it includes a column that returns the count of employees in the same department of the employee you are looking at. I’m talking about the “the_count” column.
This query looks very simple, but it is pretty cool, because if I wanted to write a query that gave me the same results, WITHOUT using analytic functions, I would need to do something like this:
SELECT e.*, (SELECT COUNT( * ) FROM employee WHERE department_id = e.department_id ) AS the_count FROM employee e ORDER BY department_id, id;
I needed an additional correlated scalar subquery, so, not only does this query look a little more complex. It is also less efficient.
And this is actually one of the simplest things you can do with analytic functions. They are actually very powerful.
Let me show you another example: Let’s say I have to create a report that lists all employees with their salaries, along with a column that tells me the difference between their salary and the average salary of their department, and another column that tells me the difference between their salary and the average salary of whole company.
That would be as easy as this:
SELECT id,name,department_id,salary, salary-AVG(salary) OVER (PARTITION BY department_id) AS diff_dept_avg, salary-AVG(salary) OVER () AS diff_total_avg FROM employee ORDER BY department_id,id;
As you can see, with analytic functions I can have different levels of aggregation in the same query. In this case, in the first average, I’m essentially grouping by department_id, but in the second average, I’m including the whole table.
Let’s analyze the analytic average I used in this example:
As you remember, AVG is an aggregate function too, so what makes it be used as analytic function in this example? That OVER keyword that you see int lines 2 and 3 is what makes it an analytic function. Without the OVER keyword, it would be treated as a normal aggregate function, and would require a GROUP BY clause added to the query. So, every time you see the OVER keyword, you are looking at an analytic function.
The Analytic Clause:
The analytic clause is the part that comes after the OVER keyword, and it is always enclosed in parenthesis. In the first average the analytic clause includes “partition by department_id”, and in the second one, the analytic clause is empty, but as you see, the parenthesis are still there after the OVER keyword.
Now, the analytic clause can include 3 different things:
- A partition clause, as in the first average in our example.
- An order by clause, which we will use in another example.
- And a windowing clause, which we will also discuss in a moment.
So, what is the partition clause?
The partition clause works in a similar way than the GROUP BY clause. It “partitions” the result set into groups defined by the column or columns specified and applies the function to each of those groups. In the case of the first average, the partition clause divides the result set in groups by the department id. In the case of the second average, since we didn’t include a partition clause, the function is applied to the whole result set.
Now, let’s talk about the order by clause.
Did you read my article about writing TOP-N queries?
In one of the examples I used the ROW_NUMBER analytic function to create a top-5 query, and it was something like this:
WITH numbered AS ( SELECT employee_id,first_name,last_name,salary, ROW_NUMBER() OVER (ORDER BY salary DESC) ROW_NUMBER FROM employees ) SELECT * FROM numbered WHERE ROW_NUMBER <= 5;
In this case, the analytic clause doesn’t include a partition clause, which means that the function is applied to the whole result set, but it does include an order by clause.
As the name implies, this clause defines the order that will be used to apply the function. So, since the function assigns a row number to every record, if you order by salary in descending order, as in this example, the number 1 will be assigned to the employee with the highest salary, and the last number will be assigned to the employee with the lowest salary, but if you order in ascending order, the row number 1 will be assigned to the employee with the lowest salary.
So, as you can see, not all of the sub clauses of the analytic clause are useful for all of the functions. In the case of row_number and the ranking functions in general, an order by clause is mandatory.
Now, in these examples, I included only one column in the partition and order by clauses, but you can include more than one if you want to.
And now, let’s talk about the windowing clause!
The first thing I need to point out is that the windowing clause cannot be used with all analytic functions. You need to know that not all of the available functions support a windowing clause.
OK, so, the windowing clause allows you to define a “sliding window” on which the function will operate. We call “a window” to a group of rows on which the function will operate, so imagine that we have an actual window as in this image:
The function will operate on the rows that are visible through the window, so if I move the window, the function will operate on a different set of rows.
The window is different from the partition because the partition is static. Once you define a partition, it doesn’t change, but the window can be different or can move as we process different rows.
[wd_hustle id=”videocourseoffertutorialcopy” type=”embedded”]
Let’s see it with an example:
SELECT id, name, salary, department_id, SUM(salary) OVER (PARTITION BY department_id ORDER BY id ROWS UNBOUNDED PRECEDING) AS accumulated FROM employee ORDER BY department_id, id;
|10||IGOR KEYS OSBOURNE||6000||3||12750|
|13||LUKE GREEN JR||4750||4||16500|
Let’s review the results.
For the first row in my result set you can see that the accumulated column has the same value as the salary of that employee, but for the second row, it has the sum of the salary from the first row and the salary from the second row which is 4000 plus 2550, which is equal to 6550, and for the third row the accumulated total includes the salary from the first three rows.
So, those records that are included to calculate the function, in this case, the SUM, constitute the WINDOW, and as you can see, in this case, the window is not moving, but growing, because the beginning of the window remains at the beginning of the partition, but the end of the window is the current row.
Now, as you can see, the accumulated value resets when the department id changes. See? For “JAMES BOSH”, which is the first row of the partition for department id 2, the accumulated salary is again equal to the salary of that employee. Why? Because the window moves or grows only inside the partition, and in this case, we partitioned the result set by department id.
Now, let’s take a look at the windowing clause, to understand how we achieved this result.
First of all, you need to know that you can add a windowing clause ONLY if you added an ORDER BY clause, and as a matter of fact, when you add an ORDER BY clause, you automatically add a default windowing clause, which you can override by adding your own… windowing clause.
So, in this example, the windowing clause starts with the “ROWS” keyword , which is one of two possibilities. We can use ROWS or RANGE. When we use ROWS, it means that we will define the window in terms of physical numbers of rows. We will see an example of this in a moment. And when we use RANGE, it means that we will define the window in terms of a logical range of values, that is relative to the column or columns we ordered by.
It might sound a little confusing now, but it will be clearer if we use an example.
Take a look at this query:
SELECT id, name, salary, department_id, SUM(salary) OVER (PARTITION BY department_id ORDER BY id ROWS 1 PRECEDING) accumulated_rows FROM employee ORDER BY department_id, id;
|10||IGOR KEYS OSBOURNE||6000||3||8250|
|13||LUKE GREEN JR||4750||4||9500|
In this case, I’m telling it that I am defining my window in terms of numbers of rows, and I’m stating that my window starts 1 row before the current row (that is what the PRECEDING keyword means).
And if I don’t specify where the window ends, as in this case, it defaults to end with the current row.
So, in this example, the accumulated salary for each row is the sum of the salary from the previous row and the current one.
Now, let’s change this ROWS keyword, to RANGE, to see the difference:
SELECT id, name, salary, department_id, SUM(salary) OVER (PARTITION BY department_id ORDER BY id range 1 PRECEDING) accumulated_range FROM employee ORDER BY department_id, id;
|10||IGOR KEYS OSBOURNE||6000||3||6000|
|13||LUKE GREEN JR||4750||4||9500|
When I use RANGE, the number “1” that I put after that word, doesn’t mean 1 row. It means a difference of “1” from the column by which I’m ordering the partition, which in this case is the employee id. This means that if I’m looking at the employee with id 100, the window will include the previous row if the id in that row is 99 because I defined a range of 1.
In our example, the accumulated for John Smith is the same as his salary, because he is the first row in the partition. The accumulated for Justin Black is also his own salary, because the id from the previous row, which is the one for John, doesn’t fall inside the range of 1 ID that we defined. For the row of John to be included in the window, it would need to have an id of 4, because the id of Justin is 5, and we defined a range of 1. Does it make sense?
Take a look at the accumulated for Angie Crood. It is the sum of the salary of Justin and her salary. Why?, because the ID of Angie is 6 and the Id of Justin is 5, so it falls between the range of 1 ID that we specified.
Another RANGE example
Here is another example that might help understanding how RANGE works: If I apply an analytic function on a transactions table and I define the window using RANGE based on the transaction date, if the range is defined as 5 PRECEDING, the window will include all the transactions whose date is between 5 days before and the date of the current row, and that means that the window could include thousands of rows if there are a lot of transactions per day, or may include only the current row if that was the only transaction in that 6-day period. On the other hand, if I use ROWS 5 PRECEDING, the window will always be composed of the previous 5 rows, and the current one, regardless of their date (as long as they are included in the same partition).
There is one restriction that you need to be aware of regarding the use of RANGE in the windowing clause. You can use RANGE only if you order the partition by a column that is of a numeric or a date data type. If you order by a varchar2 and try to use a RANGE window, you will get an error. Why? If you think about it, it makes sense, because, you can define a range of numbers, or a range of dates, but how would you define a “range” of strings? :$
Okay, what else can we do with the windowing clause?
In this case, we were only defining the beginning of the window, but we can also specify the end of the window.
For example, I can do this:
SELECT id, name, salary, department_id, SUM(salary) OVER (PARTITION BY department_id ORDER BY id ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) accumulated_rows FROM employee ORDER BY department_id, id;
|10||IGOR KEYS OSBOURNE||6000||3||8250|
|13||LUKE GREEN JR||4750||4||9500|
In this case, for each row, the window starts with the previous row and ends in the next row, so for each row the accumulated salary is the sum of the previous employee’ salary, its own salary, and the salary of the next employee, using the ID as the ordering criteria.
Now, in one of the previous examples, I used “unbounded preceding” to define the start of the window. What does it mean?
It means that the window starts with the first row of the partition, and when you don’t specify the end of the partition, it defaults to the current row, so in that example, the window includes from the first row of the partition and ends with the current row. In a similar way, you can also use “unbounded following” to specify the end of the window.
And you might remember that I mentioned before that if you add an order by clause, you automatically added a default windowing clause if you don’t provide your own, and the default window that is used is one that starts at the beginning of the partition and ends with the current row, so, the following query:
SELECT id, name, salary, department_id, SUM(salary) OVER (PARTITION BY department_id ORDER BY id range BETWEEN unbounded preceding AND CURRENT ROW) AS accumulated FROM employee ORDER BY department_id, id;
Produces the same results as this one, in which the default windowing clause is used:
SELECT id, name, salary, department_id, SUM(salary) OVER (PARTITION BY department_id ORDER BY id) AS accumulated FROM employee ORDER BY department_id, id;
Okay, this is the end of the article. After reading this you should have a very basic understanding of how analytic functions work, and what the role of the different parts of the analytic clause is.
I will write about some specific analytic functions in future posts.
As always, please feel free to comment, make suggestions or ask for clarification of anything in the comments section below.