Monday, September 5, 2022

Oracle for Analytics - SQL Modeling

SQL modeling in Oracle provides a way to do calculations like a spreadsheet does. It leverages the data processing power of Oracle database engine and immediate access to the large data sets. With the calculation rules specified, SQL modeling is another sophisticated technique for data analysis and reporting.

Let's work on a few samples together to get an overview of this technique. I created a view by the name of SALES_VIEW, which has 5 columns, YEAR, SEGMENT, CATEGORY, SALES and PROFIT. All the values for each column are displayed in the table below.


All models in this exercise will be built on this view. 


The Basics


We start with a model that sums up Sales in the last 2 year by category and assign the values to 2020 respectively. 
Here comes the SQL script.

SELECT
    SEGMENT
   ,CATEGORY
   ,YEAR
   ,S
FROM SALES_VIEW_TBL
MODEL
    RETURN ALL ROWS
    PARTITION BY (SEGMENT)
    DIMENSION BY (CATEGORY, YEAR)
    MEASURES (SALES S)
    UNIQUE DIMENSION
    RULES UPSERT SEQUENTIAL ORDER
    (
        S['Appliance', 2020] = S[CV(CATEGORY), 2019] + S[CV(CATEGORY), 2018],
        S['Furniture', 2020]  = S[CV(CATEGORY), 2019] + S[CV(CATEGORY), 2018]
    )
    ORDER BY
    SEGMENT
   ,CATEGORY
   ,YEAR
;

Below is the result we got. The records with gray background are newly generated by the model.


First of all, all the columns appearing in the model are categorized into 3 groups, partition, dimension and measure columns. 

  • Partition columns. They divide the data set into logical blocks, in other words, define the boundaries for calculations. The concept is similar with partitions in the analytic functions.

  • Dimension columns. They are attributes describe the measures. A full combination of them can  identify one cell in a partition. 

  • Measure columns. These columns have numeric values and similar to facts in a multi-dimensional fact table.
In a model, you'll need to explicitly specify the 3 groups of columns.

Second, RULES block gives the instructions for how to compute the measures in the model. You can use SQL functions including analytic ones, not to mention arithmetic operators. 

Third, a measure is reference by the dimensions. You can reference one measure as well as multiple measures. For example, S['Appliance', 2020] indicates one cell. CV function returns the current value of a dimension used in the left side of a computation formula. 

Additionally, you can also specify some general calculation rules. Here are some examples.

RETURN ALL ROWS: This returns all selected including updated rows by the rules. RETURN UPDATED ROWS will return only those updated rows. 

UNIQUE DIMENSION:  This says the combination of partition and dimension columns uniquely identify one cell.

SEQUENTIAL ORDER: This means the rules will be executed in the written order in the clause.

 

Rules for SQL Modeling


As you might have noticed in the previous section, you can apply rules in the Model clause to cell calculations, row handling, execution order and etc.

For cell reference, here are some basic rules. Please be noted that  the dimensions are the key to position a cell.

Single Cell Reference - Constants are used for dimensions, for example, S['Appliance', 2020].

Multi-Cell Reference - Here are the two examples for the left side and right side respectively, S['Appliance', YEAR >= 2018], MAX(S)['Appliance', Year BETWEEN 1998 AND 2002].

CV Function - It provides a means to use the dimension values on the right side which are referenced on the left side as illustrated in this formula, S['Appliance', 2020] = S[CV(CATEGORY), 2019] + S[CV(CATEGORY), 2018].


For determining cell behavior, you can use UPDATE, UPSERT, and UPSERT ALL. Here are some highlighted points.

UPDATE - If the referenced cell in the left side exists, its measure is updated with the calculation result. No new cells will not be inserted.

UPSERT -  If the cell exists, its measure will be updated. If it doesn't exist, a new cell will be inserted.

UPSERT ALL -  Simply put, it enables the model rules on the left side to have UPSERT capability. For example, S[ANY, 2020] = S[CV(), 2019] + S[CV(), 2018].


There are also some other general rules, for example,

  • Measures are the only updatable columns
  • In terms of execution order, all other clauses, except for Select Distinct and Order By, are evaluated before the Model clause

  • You can't use an aggregate or analytic function in either Select list or Order By list when you have a Model clause in your query

  • You can't use subqueries in the rules except in FOR loop
  • And etc.



Using Aggregate/Analytic Functions in SQL Modeling


To use aggregate or analytic functions in a model clause is straightforward. Basically you just follow the syntax of those functions, no magic here. Let's take a look at a couple of examples.

An Example of Using an Aggregate Function


SELECT
    SEGMENT
   ,CATEGORY
   ,YEAR
   ,S
FROM SALES_VIEW
WHERE YEAR = 2019
MODEL
    DIMENSION BY (SEGMENT, CATEGORY, YEAR)
    MEASURES (SALES S)
    RULES UPSERT
    (
        S['Retail', 'Sub-total', 2019]          = SUM(S)[CV(SEGMENT), ANY, 2019],
        S['Wholesale', 'Sub-total', 2019]  = SUM(S)[CV(SEGMENT), ANY, 2019],
        S['Total', '', 2019]                          = SUM(S)[ANY, 'Sub-total', 2019]
    )
    ORDER BY
    SEGMENT
   ,YEAR
   ,CATEGORY
;


An Example of Using an Analytic Function

  
SELECT
     YEAR
    ,SEGMENT
   ,CATEGORY
   ,S
   ,CSUM
FROM SALES_VIEW
MODEL
    PARTITION BY ( YEAR )
    DIMENSION BY ( SEGMENT, CATEGORY )
    MEASURES (SALES S, 0 CSUM )
    RULES UPSERT
    (
        CSUM[ANY, ANY] = SUM(S) OVER ( PARTITION BY YEAR
                                                                         ORDER BY         YEAR, SEGMENT, CATEGORY 
                                                                         ROWS UNBOUNDED PRECEDING )
    )
    ORDER BY
     YEAR
    ,SEGMENT
   ,CATEGORY
;



Parallel Execution


When you have a number of processors in your database server, parallel computation is expected to be an efficient way to improve performance. Model clause computation can leverage it if you have PARTITION BY keys in the model. Accordingly data will be partitioned among the parallel slaves.


No comments:

Post a Comment

AWS - Build A Serverless Web App

 ‘Run your application without servers’. The idea presented by the cloud service providers is fascinating. Of course, an application runs on...