Tuesday, April 01, 2008

SQL Analytic


In my current consulting activity, I’m involved in a project relying on Oracle 10g for most of its data warehouse infrastructure. I thought it would be a nice opportunity for me to brush-up on my SQL skills.

The OLAP SQL functions (also referred to as Analytic Functions) come in very handy when dealing with data reporting or analytical needs. So this post will summarize the SQL analytic functions available in Oracle (some of them since 8i and other since 9i).

Although reports produced using these functions could very well be done using SQL or PL/SQL workarounds, these would be more clumsy and potentially less than optimal solution (depending on how well Oracle Optimize process these functions).


These functions can be roughly classified into these categories:

1. Ranking (show top 5 or bottom 5?),

2. Reporting Aggregation (compare values at different level of aggregation, e.g. region sales versus national, market share?),

3. Window Aggregation (moving average, cumulative sum?)

4. Row comparison (studying change or variation over sequential row, LAG/LEAD, or get first and last value of sorted groups with FIRST/LAST?)

It is important to note that these functions operate just before the ORDER BY clause of a query (which occurs last on the SQL process sequence order), or at the very last step when no such clause is present. Also important, these functions are not allowed within the WHERE, FROM and HAVING clauses, and are really meaningful within the result set of a query (i.e. in the SELECT clause) and more rarely in the ORDER BY clause.

The follow the following generic form:

function(<arguments>) OVER(<analytic clause>)

where

+function is the analytical function (e.g. AVG, CORR, COVAR_POP, COVAR_SAMP, COUNT, CUME_DIST, DENSE_RANK, FIRST, FIRST_VALUE, LAG, LAST, LAST_VALUE, LEAD, MAX, MIN, NTILE, PERCENT_RANK, PERCENTILE_CONT, PERCENTILE_DISC, RANK, RATIO_TO_REPORT, STDDEV, STDDEV_POP, STDDEV_SAMP, SUM, VAR_POP, VAR_SAMP, VARIANCE)

+arguments is optional and could include more than one (<argument1, argument2, argument3>)

+analytic clause can be of one the following: Query Partition clause (to break a result set into multiple partitions according to a criteria set expression); Order-By clause (for sorting data into each partition); and Windowing clause (to define a window of data).


1. Ranking functions

These functions role is to provide ordering info to records based on some given attribute. The basic three functions available (i.e. ROW_NUMBER, RANK and DENSE_RANK) only differ in the way they manage ordering ties. Here’s a very simple sample to illustrate the differences:


RANK and DENSE_RANK rules allow for the top-n query, which is not possible with ROW_NUMBER as it does not treat tie.


Note that we can also control the orderings of nulls value by specifying the self-explanatory clauses NULLS LAST or NULLS FIRST (default) after the “ORDER BY attr” clause.

Other functions can be classified within the Ranking category : PERCENT_RANK, CUME_DIST. They operate very similarly as the previous ones (e.x. PERCENT_RANK() OVER(ORDER BY attr), but compute the cumulative fraction typical of histograms.

In conclusion, the ranking functions offer a better alternative than relying on the well-known hack (creating a subselect virtual table with ORDER BY used in combination with the ROWNUM function) that can handle tie correctly.


2. Reporting Aggregation functions

Most common aggregation SQL functions (like SUM, COUNT, AVG, MAX, MIN, VARIANCE, etc) can be used as analytic functions as well, eliminating the constraint of having the aggregate value to be at the same level as the row set (mandatory when using GROUP BY). We can then have a global aggregated value presented with all data rows, as well as an aggregated value at a different partitioned window level all in a single query statement. Here’s a self-explanatory example with AVG:


Also belonging in this category is the RATIO_TO_REPORT which gives ratio of a value versus an aggregated comparator. Actually, this function is simply a shortcut version of : value / SUM(value) OVER() == RATIO_TO_REPORT (value) OVER().

We also include in this category the function ROLLUP and CUBE which both provide aggregated values at higher level (the difference being that CUBE will provide aggregate values throughout all combination of attributes) as illustrated by this simple example:


Notice the use of GROUPING function (which return 1 when aggregation was done over a specified attribute and 0 otherwise) to help differentiate between NULL values in the table and artificial NULL generated for reporting purposes.

Window Partitioning Operator

We often used in combination with the Ranking and Reporting aggregation functions, an operator known as the window partitioning operator. This operator role is to provide a way to separate data into grouping and perform the functions on these groupings. It relies on the use of PARTITION BY placed in the analytic clause. Here’s a simple example illustration the window partitioning used with the RANK() function:


We can see that the second RANK function now operates on a defined grouping (here defined by Grouping column) and not the overall set of rows.


3. Window Aggregation functions

These functions permit the calculation of moving average on a set of ordered data records. For example, to get a moving average using three data points window (one in the past, one current and one in future), we could use :

AVG(value) OVER(ORDER BY t ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

As found in the following sample:


Note that for missing values on the first and last record, the preceding and following records are considered NULL, and thus not used in the AVG function (see the first moving AVG value: 150 which is calculated from only 100 and 200).

This is a general rule applicable for treating NULL values when doing aggregation function, i.e. always ignored in the calculation.

The window aggregate function can also be used to calculate running or cumulative sum. This is possible using the ROWS UNBOUNDED PRECEDING and CURRENT ROW clauses for the window boundary definition. This gives:


As a final comment about this category of functions, let’s note that as in the case of ranking and report aggregation, it is possible to partition their calculation using the PARTITION BY clause.


4. Row Comparison

The row comparison functions are very useful for showing previous and subsequent records data on the same row as the current one within an ordered sequence of records. These functions are LAG and LEAD. These again are very simple to use as demonstrated in this small example:


The more generic format allow us to define offset and default values for NULL, i.e.: LAG(attr, offset, default_null).

Again, these functions can also be partitioned similarly as the previous ones. Anyone who has hacked a solution for this type of report by relying on self-join query can certainly appreciate the clearness and compactness of these functions.

Martin

No comments: