Friday, July 12, 2019

Using Right-Click Add Statistics with Oracle Analytics Cloud

Oracle Analytics Cloud supports advanced analytics functions with a single click of a mouse button, hiding all the underlying code and math required to support them. It offers capabilities like forecasting, outlier identification, cluster computation, and trend lines. Unlike writing a formula, this right-click capability helps users quickly enrich their visualizations with insights based on advanced analytics.

Advanced analytics capabilities can be added to charts in two different ways with a click of the mouse:

1) By right-clicking on the chart

2) By using the Chart Properties pane at the bottom left:


Forecast

Forecasting is the process of making predictions about the future based on past and present data, most commonly by analysis of trends. In this example, we have a report showing Sales by Month, and we forecast Sales by adding a forecast to the chart. The Sales line extends to show the forecast in a distinguishable color.




Various properties of the forecast can be controlled from the properties pane of the visualization. The number of periods to forecast can be chosen by overriding the default. Confidence areas can be edited or turned off by setting the prediction interval. Finally, the forecasting model used to calculate the forecast can be chosen from three options:


  • ARIMA (AutoRegressive Integrated Moving Average, default): assumes that overall past data is a reliable base to explain and project the future.
  • Seasonal ARIMA: if you find a regular pattern of changes that repeats over time periods, choose Seasonal ARIMA.
  • ETS (Exponential Triple Smoothing): often used in the analysis of time series data.

Choose the appropriate prediction interval to get an estimate of an interval in which a future observation will fall, with a certain probability, given what has already been observed.


The right-click option we describe here is the quick way to add a forecast to most of your analyses. If you wish to go deeper and persist the forecast so it can be reused in other visualizations, you can create a custom calculation using the same FORECAST function. It provides more control over the various options for the computation. Here is the syntax:

FORECAST(numeric_expr, series, output_column_name, options, runtime_binded_options)

numeric_expr : represents the metric to be forecasted.
series : is the time grain at which the forecast model is built, given as a list of dimension columns. This is the dimension grain you want to forecast on. If series is omitted, the forecast is built at the grain of the data query.
output_column_name : is the output column. The valid values are 

  • 'forecast' : the forecast value
  • 'low' : the low bound of the confidence interval
  • 'high' : the high bound of the confidence interval
  • 'predictionInterval' : the prediction interval used for this forecast

options : is a string list of name=value pairs, each separated by ';'. The values can be hardcoded directly, or they can refer to a column. In that case, put %1 ... %N for the value, and it will reference the column in the same position in runtime_binded_options.
runtime_binded_options : is an optional comma-separated list of runtime-bound columns or literal expressions.
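For instance, the number of periods can be supplied through runtime_binded_options instead of being hardcoded in the options string. This is a sketch following the %N convention above, using the same illustrative column names as the examples below:

```
FORECAST(
revenue,
(time_year, time_quarter),
'forecast',
'modelType=arima;numPeriods=%1;predictionInterval=70;',
6
)
```

Here %1 in the options string is resolved to the literal 6 supplied in runtime_binded_options.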

Example with ARIMA: here the model type used is ARIMA, the number of periods to forecast is 6, and the prediction interval is 70.

FORECAST(
revenue, 
(time_year, time_quarter), 
'forecast', 
'modelType=arima;numPeriods=6;predictionInterval=70;'
)

Example with ETS: here the model type used is ETS, the number of periods to forecast is 6, and the prediction interval is 70.

FORECAST(
revenue, 
(time_year, time_quarter), 
'forecast', 
'modelType=ETS;numPeriods=6;predictionInterval=70;'
)

Clusters

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group are more similar and closer to each other than to those in other groups. In this example, we have a scatter plot where Profit and Sales are plotted against each other, with each scatter dot representing a city. The colors of the dots represent the clusters they are grouped into.

Properties of a cluster calculation can be controlled from the properties pane, such as the number of clusters. The algorithm used for clustering can be chosen between K-means clustering and Hierarchical clustering.

a) K-means clustering: aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
b) Hierarchical clustering: a hierarchy of clusters is built using either an agglomerative (bottom-up) or a divisive (top-down) approach.
The number of clusters can be set by entering an appropriate value.



As with the other advanced analytics options, if you wish to dig deeper and have more control over how the cluster is calculated, there is an option to create a calculated column using the CLUSTER function. Here is the syntax:

CLUSTER((dimension_expr1 , ... dimension_exprN), (expr1, ... exprN), output_column_name, options, runtime_binded_options)

dimension_expr : represents a list of dimensions, e.g. (productID, companyID), identifying the individuals to be clustered.
expr : represents a list of attributes or measures that make up the 'space' (all the variables) used to cluster the selected dimension_expr.
output_column_name : is the output column. The valid values are 'clusterId', 'clusterName', 'clusterDescription', 'clusterSize', 'distanceFromCenter', 'centers'.
options : is a string list of name=value pairs, each separated by ';'. The values can be hardcoded directly, or they can refer to a column. In that case, put %1 ... %N for the value, and it will reference the column in the same position in runtime_binded_options.
runtime_binded_options : is an optional comma-separated list of runtime-bound columns or literal expressions.

Example with clusters: here the algorithm used is k-means, with 5 clusters and no random seed.

CLUSTER((product, company), (billed_quantity, revenue), 'clusterName', 'algorithm=k-means;numClusters=5;useRandomSeed=FALSE;enablePartitioning=TRUE')
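If you prefer the number of clusters to be supplied at runtime rather than hardcoded, the %N placeholders described above can be bound through runtime_binded_options. A sketch using the same illustrative columns:

```
CLUSTER((product, company), (billed_quantity, revenue), 'clusterName', 'algorithm=k-means;numClusters=%1;useRandomSeed=FALSE;enablePartitioning=TRUE', 5)
```

Here %1 is resolved to the literal 5, which could equally be a runtime-bound column.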

Outliers

Outliers are data records located furthest away from the average expectation of individual values, for example extreme values that deviate the most from other observations in the data. They may indicate variability in measurement, experimental errors, or a novelty. Outliers are added to the same report as the clusters and depicted as different shapes: scatter dots represented by circles are outliers, while those represented by squares are non-outliers.

The outliers calculation also offers the option to choose between the K-means and Hierarchical algorithms for the computation.

The right-click option is a quick way to add outliers to your analysis. If you wish to dig deeper and have more control over how the outliers are calculated, there is an option to create a calculated column using the OUTLIER function. Here is the syntax:

OUTLIER((dimension_expr1 , ... dimension_exprN), (expr1, ... exprN), output_column_name, options, [runtime_binded_options])

dimension_expr : represents a list of dimensions, e.g. (Product, Department).
expr : represents a list of dimension attributes or measures used to find outliers.
output_column_name : is the output column, e.g. 'isOutlier'.
options : is a string list of name=value pairs, each separated by ';'. The values can be hardcoded directly, or they can refer to a column. In that case, put %1 ... %N for the value, and it will reference the column in the same position in runtime_binded_options.
runtime_binded_options : is an optional comma-separated list of runtime-bound columns or literal expressions.

Example with outliers: here the algorithm used is mvoutlier.

OUTLIER((product, company), (billed_quantity, revenue), 'isOutlier', 'algorithm=mvoutlier')

Trendlines

Similar to adding a forecast for your metric, you can measure the trend of your metric by adding a trend line, which indicates the general course or tendency of the metric in question. A trendline is a straight line connecting a number of points on a graph. It is used to analyze the general direction of a group of values. In this example, we have Sales by Month, and the trendline shows that Sales is on the incline.

Configuration options of the trendline can be controlled in the options pane. The method for the trend line has three options:


  • Linear: A linear trend line is a best-fit straight line for linear data sets. Your data is linear if the pattern in its data points resembles a line. A linear trend line shows that your metric is increasing or decreasing at a steady rate.
  • Polynomial: A polynomial trendline involves an equation with several terms of the X value raised to various powers (order 2 = x squared, order 3 = x cubed, and so on). The result is a curved line that may better fit your data when it fluctuates up and down. It is useful, for example, for understanding gains and losses over a large data set. The order of the polynomial can be determined by the number of fluctuations in the data or by how many bends (hills and valleys) appear in the curve.
  • Exponential: An exponential trendline is a curved line that is most useful when data values rise or fall at increasingly higher rates. You cannot create an exponential trendline if your data contains zero or negative values.

Confidence Interval: A confidence interval is a range of values, derived from sample statistics, that is likely to contain the value of an unknown population parameter. Because of their random nature, it is unlikely that two samples from a given population will yield identical confidence intervals.

A trendline can also be computed in a custom calculation:

TRENDLINE(numeric_expr, ([series]) BY ([partitionBy]), model_type, result_type)

numeric_expr: represents the data to trend. This is the Y-axis. This is usually a measure column.
series: is the X-axis. It is a list of numeric or time dimension attribute columns.
partitionBy: is a list of dimension attribute columns that are in the view but not on the X-axis.
model_type: is one of the following ('LINEAR', 'EXPONENTIAL').
result_type: is one of the following ('VALUE', 'MODEL'). 'VALUE' will return all the regression Y values given X in the fit. 'MODEL' will return all the parameters in a JSON format string.

Example with Trendline: trend of revenue by product using linear regression.

TRENDLINE(revenue, (calendar_year, calendar_quarter, calendar_month) BY (product), 'LINEAR', 'VALUE')
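Using 'MODEL' as the result_type instead returns the fitted regression parameters as a JSON-format string rather than the fitted Y values. A sketch with the same illustrative columns:

```
TRENDLINE(revenue, (calendar_year, calendar_quarter, calendar_month) BY (product), 'LINEAR', 'MODEL')
```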


Reference Lines

Reference lines are vertical or horizontal lines in a graph, corresponding with user-defined values on the x-axis and y-axis respectively.
By default the reference line is of type Line, and the method used to compute the line can be chosen from various options like Average, Minimum, Maximum, etc. For example, in the airline industry, if passenger turnout is plotted with respect to time, the reference line helps users see whether the passenger turnout for a particular month is above or below average.
Here is an example where the reference line is added on a report where we have Sales by Month. The reference is plotted on the average value of the Sales.

The other method to draw a reference line is Band. A band provides two lines that define a range of reference. The function can be chosen between Custom and Standard Deviation, while the upper and lower limits of the band can each be chosen from Average, Maximum, Minimum, etc. For example, if we are analyzing sales by month and we have a reference band plotted from Average to Maximum, it is easy to find out in which months sales fell between the average and the maximum.

The function can also be set to Standard Deviation, with an option to select a value of 1, 2, or 3. If you choose 1, the band is drawn so that 68% of values lie within it; for 2, 95% of values lie within the band; and for 3, 99.7% of values lie within the band.


If you would like to see a live project that showcases these advanced capabilities, please visit our Sandbox environments here

Oracle Analytics Cloud offers a user-friendly way to leverage advanced analytics functions on a chart with a single mouse click. Advanced analytical functions like forecast, clusters, and outliers give business users a strong capability to gain better insights into their data.

Are you an Oracle Analytics customer or user?

We want to hear your story!

Please voice your experience and provide feedback with a quick product review for Oracle Analytics Cloud!
 

Thursday, July 4, 2019

Unleash your Database power within OAC using Evaluate function


Database systems like Oracle have tremendous analytical and data manipulation capabilities. These help in performing powerful transformations and gaining valuable insights out of massive amounts of data. Databases typically provide a wide range of inbuilt functions that can perform different kinds of deep analyses in a very scalable manner. Often, users are also able to define their own custom functions to meet specific query needs.

As users of Oracle Analytics (OA Cloud or Server), it is extremely valuable to be able to leverage these database functions directly from within the OA framework.

The Oracle Analytics (OA) Evaluate function helps in doing just that. It enables users to create powerful analyses by directly invoking database functions. Evaluate provides a syntax to pass the specified database function, with optional OA-referenced columns as parameters. It is particularly useful when users want to call a specialized database function to process a heavy calculation directly in the underlying database. Evaluate can be used to call functions from any underlying database: Oracle, Essbase cubes, DB2, MSSQL, Apache Hive, and so on. It is a scalar function, that is, one that takes one or more values but returns a single value for every row.

The following video guides you through the process of leveraging Evaluate with Oracle Analytics; the rest of the blog below also gives the details needed.


Syntax

The syntax of Evaluate wraps around the syntax of the underlying DB function: EVALUATE('db_function(%1...%N)' [AS data_type], column1, ..., columnN) where
  • db_function() is a valid SQL function available in the underlying database.
  • column1 to columnN are the optional input parameters to the DB function. %1 is substituted with the value of column1, %2 with column2, and so on, for as many arguments as needed by the DB function.
  • AS data_type is an optional parameter that specifies the datatype of the return value (an OA datatype). Specifying data_type is not meant for type casting. Use it only when the return type cannot be easily predicted from the input columns. If there is a need to cast the return value, add an explicit cast function.
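As a minimal illustration, assuming an Oracle database source exposing a text column CUST_CITY (as in the SH sample schema used later), the database's UPPER function could be invoked like this:

```
EVALUATE('UPPER(%1)', CUST_CITY)
```

%1 is replaced by CUST_CITY, and the database computes UPPER for every row of the result.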

Prerequisites

In order to use the Evaluate function in DV, the following two conditions need to be met:
  1. First, set the Evaluate Support Level parameter in the OAC system configuration. By default, Evaluate Support Level is set to 0, which prevents OAC from executing Evaluate functions. To enable it, go to System Settings on the OAC environment and set Evaluate Support Level to 1 or 2. Setting it to 1 means Evaluate is supported for users with the ManageRepositories permission. Setting it to 2 means Evaluate is supported for all users.
  2. Make sure Data Access is set to Live for the datasets that will be referenced by the Evaluate function. By setting it to Live, we ensure that the OAC server always sends the query to the underlying database and does not simply retrieve results from cache.


How to use Evaluate




Let’s consider the Customers table from the sample data SH schema in Oracle Database. 
This table contains information about Customer attributes like name, gender, address, phone number etc.  

Querying the Customers table in OA DataViz, customer phone numbers appear in mixed formats: some customer records have the phone number in the format xxx-xxx-xxxx, while some are in the format (xxx) xxx-xxxx. Let's say we want to change all the phone numbers to a common format, (xxx) xxx-xxxx.

Let's take the example of running a regex-based replacement rule to achieve this. The database we are using (Oracle DB) has a REGEXP_REPLACE function that allows us to do this; let us invoke it here to illustrate the Evaluate syntax.




To begin with, let's first go to our OA System Settings and make sure Evaluate Support Level is set to 1 or 2. This is done only once for the OA environment.

Next, let's create an OA DV dataset sourcing from the Customers table in the SH schema, making sure we set Data Access to Live for this dataset so we can use the Evaluate function.


In the Oracle Database, the SQL syntax of the REGEXP_REPLACE function we want to use to change the phone number formats is the following:

REGEXP_REPLACE(cust_main_phone_number, '([[:digit:]]{3})\-([[:digit:]]{3})\-([[:digit:]]{4})', '(\1) \2-\3')

To use this in DV, let's create a custom calculation directly from a project connected to the datasource and use the Evaluate function with the following syntax.

In the Evaluate expression, we have the syntax for REGEXP_REPLACE as the DB function to be called; %1 is the placeholder that will be substituted by the first column passed in, CUST_MAIN_PHONE_NUMBER (in blue). There is only one parameter in our example.
Notice that the single quote is the escape character: for every single quote used within the original SQL function, we precede it with another single quote. That's why you see doubled single quotes in several places in the Evaluate formula.
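Putting this together with the REGEXP_REPLACE call shown earlier, the Evaluate expression looks like the following (note the doubled single quotes around the regular expression and the replacement pattern):

```
EVALUATE('REGEXP_REPLACE(%1, ''([[:digit:]]{3})\-([[:digit:]]{3})\-([[:digit:]]{4})'', ''(\1) \2-\3'')', CUST_MAIN_PHONE_NUMBER)
```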

Bringing the new calculation into a visualization looks like this. Phone number format has changed by invoking the database REGEXP_REPLACE function. 


If we look at the underlying log with the physical SQL sent by OA to the database (use the Session Monitor screen to access this), we see that Evaluate has been removed and only the REGEXP_REPLACE function is sent to the underlying database.



The Evaluate function can also be used within an OA dataflow. The same REGEXP_REPLACE can be called using an Evaluate function in the Add Columns step of any OA dataflow. Evaluate will function properly if all columns passed as parameters come from source tables in the same underlying database.


EVALUATE vs EVALUATE_AGGR
An Evaluate formula returns the result of the database function as a column of type Attribute in OA. This result cannot be used as a metric or be broken down by other dimensions/attributes in the datasets.
Evaluate_Aggr is a variation of the Evaluate syntax for OA that allows you to invoke aggregate functions in the underlying database and return them as aggregated metrics in OA. This function is intended for aggregate functions with a GROUP BY clause.

Oracle Database, for instance, supports a wide range of aggregate functions. APPROX_COUNT_DISTINCT is one such function that returns the approximate number of distinct values of an expression. It is particularly useful on large amounts of data, where APPROX_COUNT_DISTINCT is significantly faster than an exact COUNT(DISTINCT) with negligible deviation from the exact result.
Let's see how to invoke this function from within OA using Evaluate_Aggr. We use the SH schema, which has Sales, Customers, and Products tables, and approximate the number of distinct customers who purchased the various product categories in our fact table. We could run a COUNT(DISTINCT cust_id) on the Sales table, grouped by product categories, but this could be time-consuming with large amounts of data.

Instead of this, we can invoke the APPROX_COUNT_DISTINCT function using Evaluate_Aggr in OAC. The OA DV calculation would look like the following.
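A sketch of the Evaluate_Aggr expression, assuming the dataset exposes the Sales table's CUST_ID column:

```
EVALUATE_AGGR('APPROX_COUNT_DISTINCT(%1)', CUST_ID)
```

The aggregation itself is pushed down to the database, which returns one approximate distinct count per group.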

The result of this calculation is a metric in OA because we are using the Evaluate_Aggr syntax. So in our DV or Answers visualization, we can simply add attributes to break down the result further. For example, we can bring in the PROD_CATEGORY column and show a bar chart. The bar chart below compares the result of Evaluate_Aggr(APPROX_COUNT_DISTINCT) with the result of an actual count distinct calculation. We can see that the approximate number of customers is very close to the actual values.


So both the Evaluate and Evaluate_Aggr OA functions offer a powerful way to leverage any database capabilities directly within the OA framework, whether on Cloud (OAC), Server (OAS), or even DV Desktop.

Are you an Oracle Analytics customer or user?

We want to hear your story!

Please voice your experience and provide feedback with a quick product review for Oracle Analytics Cloud!