Friday, September 27, 2019

Calculating Period to Period Growth in OAC Without RPD Modelling

Calculating period-to-period growth and growth rates is a common need in most analytics insights today. OAC provides a robust semantic-layer capability for this in the OAC Metadata Repository, where time hierarchies and rich time-series calculations can be defined and modeled. But in a self-service analytics context, we most often quickly upload a file-based dataset and don't take the time to model a metadata layer on it. We run visualizations directly on the data, and soon need to start defining calculations like time series, growth rates, etc. to extract deeper insights.

How do you calculate period-to-period growth with an Oracle DV datasource and no repository?

This blog highlights a simple technique to build any period-to-period calculation in OAC without building an RPD model. There are several ways to achieve this; this entry explains a direct way of running live calculations using a data-source join defined during preparation. Upcoming blog entries will detail alternative techniques for similar calculations.

This 6-minute video goes through the process of defining this; some illustrated steps are shown below as well.

Our data is an Excel dataset listing 10K orders over several years. Each order has a date and several attributes and metrics. One of the many attributes is Product Category (3 distinct values); another is Customer Segment (4 distinct values). We need to calculate Y/Y growth of monthly Sales by Customer Segment and Product Category.
The calculation must properly aggregate Sales by these attributes (Month, Segment, Category), retrieve the same value for the prior year, and then compute the growth rate. As we have no repository on our Excel upload, we have no semantic time hierarchy available, so we cannot leverage the dedicated time-series calculations in OAC.

To achieve this dynamic calculation in OAC DV, we can use a simple technique: duplicate the initial dataset in Oracle DV and join the two copies with a Period <---> (Period - 1) join. If we do this, the initial dataset still shows the actual period in the project, while the duplicated, joined dataset provides the data for the previous period at the right aggregation level.

First, let's start by duplicating the original dataset in DV. Note that this does not duplicate the source file or source table data, it only creates another entry in OAC pointing to the same source data.

Once the dataset is duplicated, we need to create a custom 'join-key' column in each dataset, which we will use to join the two. This happens in the data preparation tab of DV. In our case the column is a 'month-key' column: in the original dataset it is simply a month sequence computed as YEAR*100 + MONTH_NUMBER. For example, a date like 21-Jan-2012 yields a month key of 201201.

So we create this column in both datasets, but the trick is that in one of the datasets we add or subtract 1 from the year number (depending on which dataset). In my case, I will use the duplicated dataset to show year-ago data, so I create the column in that dataset with a '+1' on the year number. I could just as well have created a 'Year - 1' column in the other dataset.
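As a sketch, assuming the order date column is named "Order Date" (column names here are illustrative), the two calculated join-key columns could look like this:

```sql
-- Month key in the original dataset: e.g. 21-Jan-2012 -> 201201
YEAR("Order Date") * 100 + MONTH("Order Date")

-- Month key in the duplicated (year-ago) dataset: shift the year by +1
-- so that its 2012 rows join to the 2013 rows of the original dataset
(YEAR("Order Date") + 1) * 100 + MONTH("Order Date")
```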

That way, when we join both datasets via this month key, records from, say, year 2013 in the original dataset join to records from year 2012 in the year-ago dataset: 2013 = 2012 + 1. So now we join the datasets in DV data preparation:

Note that we join on the month-key column, but we must also join on every other column that our analysis may potentially aggregate on: Customer Segment and Product Category in our case. If we don't join on these, our reports by Customer Segment and Product Category will not give correct detailed results.

Once this is done, we just have to build visualizations on the joined datasets. One more important point: by default, visualizations use a join type (Data Blending icon, on each viz's properties tab) of All Rows joined with All Rows (a full outer join). In our case, we need to set this to 'Only Matching Rows' for the year-ago dataset. This configuration is needed for each viz we build using the two datasets:

With this, we can now build all sorts of visuals and calculations using the Yago Sales metric. Because we specifically joined on month key, Customer Segment, and Product Category, all these computations will aggregate correctly at the levels of Month and above (Quarter, Year), Customer Segment and above, and Product Category and above.
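For instance, assuming the Sales column from the duplicated dataset is exposed in the project as "Yago Sales", a Y/Y growth-rate calculation could be as simple as:

```sql
-- Year-over-year growth rate; aggregates correctly at Month level and above
("Sales" - "Yago Sales") / "Yago Sales"
```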
This technique can be used in a similar way in DV Data Flows, but there the datasets do not need to be duplicated. Also, the default join in Data Flows is not a full outer join, which simplifies the viz configuration process. An upcoming blog post will describe how to use this technique optimally in a Data Flow.

Friday, July 12, 2019

Using Right-Click Add Statistics with Oracle Analytics Cloud

Oracle Analytics Cloud supports advanced analytics functions with a single mouse click, hiding all the underlying code and math required to support them. It offers capabilities like forecasts, outlier identification, cluster computation, and trend lines. Unlike writing a formula, this right-click capability helps users quickly enrich their visualizations with advanced-analytics insights.

Advanced analytics capabilities can be added to charts in two different ways by the click of a mouse: 

1) By right-clicking on the chart 

2) By visiting the bottom-left Chart Properties pane:


Forecasting is the process of making predictions about the future based on past and present data, most commonly by analyzing trends. In this example, we have a report showing Sales by Month; let's forecast Sales by adding a forecast to the chart. The Sales line extends to show the forecast in a distinguishable color.

Various properties of the forecast can be controlled from the viz properties pane. The number of periods to forecast can be chosen by overriding the default. Confidence areas can be edited or turned off by setting the prediction interval. Finally, the forecasting model can be chosen from these 3 options:

  • ARIMA (Moving Average, default): assumes that overall past data is a reliable base to explain and project the future.
  • Seasonal ARIMA: choose this model if you find a regular pattern of changes that repeats over time periods.
  • ETS: Exponential Triple Smoothing, often used in the analysis of time-series data.

Choose the appropriate prediction interval to get an estimate of an interval in which a future observation will fall, with a certain probability, given what has already been observed.

The right-click option described here is the quick way to add a forecast to most of your analyses. If you wish to go deeper and persist the forecast for reuse in other visualizations, you can create a custom calculation using the same FORECAST function. It provides more control over the various computation options; here is the syntax:

FORECAST(numeric_expr, series, output_column_name, options, runtime_binded_options)

numeric_expr: represents the metric to be forecasted.
series: is the time grain at which the forecast model is built, given as a list of time dimension columns, e.g. (time_year, time_quarter). If series is omitted, the forecast is built at the grain of the data query.
output_column_name: is the requested output column. The valid values are:

  • 'forecast': the forecast value
  • 'low': the low bound of the confidence interval
  • 'high': the high bound of the confidence interval
  • 'predictionInterval': the prediction interval used for this forecast

options: is a string list of name=value pairs, each separated by ';'. The values can be hardcoded directly, or they can refer to a column; in that case, put %1 ... %N as the value and it will reference the column in the same position in runtime_binded_options.
runtime_binded_options: is an optional comma-separated list of runtime-bound columns or literal expressions.

Example with ARIMA: the model type is ARIMA, the number of periods to forecast is 6, and the prediction interval is 70. Assuming a revenue metric, the formula would look like:

FORECAST(revenue, (time_year, time_quarter), 'forecast', 'modelType=arima;numPeriods=6;predictionInterval=70')

Example with ETS: the model type is ETS, the number of periods to forecast is 6, and the prediction interval is 70:

FORECAST(revenue, (time_year, time_quarter), 'forecast', 'modelType=ETS;numPeriods=6;predictionInterval=70')


Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to objects in other groups. In this example, we have a scatter plot where Profit and Sales are plotted against each other, with each dot representing a city. The colors of the dots represent the clusters they are grouped into.

Properties of a cluster calculation can be controlled from the properties pane, such as the number of clusters. The clustering algorithm can be chosen between K-means clustering and Hierarchical clustering.

a) K-means clustering: aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
b) Hierarchical clustering: a hierarchy of clusters is built using either an agglomerative (bottom-up) or a divisive (top-down) approach.
The number of clusters can be set by entering an appropriate value.

As with the other advanced analytics options, if you wish to dig deeper and have more control over how the cluster is calculated, you can create a calculated column using the CLUSTER function. Here is the syntax:

CLUSTER((dimension_expr1 , ... dimension_exprN), (expr1, ... exprN), output_column_name, options, runtime_binded_options)

dimension_expr: represents a list of dimensions, e.g. (productID, companyID), identifying the individuals to be clustered.
expr: represents a list of attributes or measures that make up the 'space' (all the variables) used to cluster the selected dimensions.
output_column_name: is the requested output column. The valid values are 'clusterId', 'clusterName', 'clusterDescription', 'clusterSize', 'distanceFromCenter', 'centers'.
options: is a string list of name=value pairs, each separated by ';'. The values can be hardcoded directly, or they can refer to a column; in that case, put %1 ... %N as the value and it will reference the column in the same position in runtime_binded_options.
runtime_binded_options: is an optional comma-separated list of runtime-bound columns or literal expressions.

Example with Clusters: here the algorithm is k-means, with 5 clusters and no random seed. The number of clusters and the maximum iterations are bound at runtime as %1 and %2:

CLUSTER((product, company), (billed_quantity, revenue), 'clusterName', 'algorithm=k-means;numClusters=%1;maxIter=%2;useRandomSeed=FALSE;enablePartitioning=TRUE', 5, 10)


Outliers are data records located furthest from the average of individual values: extreme values that deviate the most from other observations in the data. They may indicate variability in measurement, experimental errors, or a novelty. In the same report where the clusters were added, outliers are added and depicted as different shapes: scatter dots represented by circles are outliers, while the ones represented by squares are non-outliers.

The Outliers calculation also offers the option to choose between the K-means and Hierarchical algorithms.

The right-click option is a quick way to add outliers to your analysis. If you wish to dig deeper and have more control over how outliers are calculated, you can create a calculated column using the OUTLIER function. Here is the syntax:

OUTLIER((dimension_expr1 , ... dimension_exprN), (expr1, ... exprN), output_column_name, options, [runtime_binded_options])

dimension_expr: represents a list of dimensions, e.g. (Product, Department).
expr: represents a list of dimension attributes or measures used to find the outliers.
options: is a string list of name=value pairs, each separated by ';'. The values can be hardcoded directly, or they can refer to a column; in that case, put %1 ... %N as the value and it will reference the column in the same position in runtime_binded_options.
runtime_binded_options: is an optional comma-separated list of runtime-bound columns or literal expressions.

Example with Outliers: here the algorithm used is mvoutlier.

OUTLIER((product, company), (billed_quantity, revenue), 'isOutlier', 'algorithm=mvoutlier')


Similar to adding a forecast for your metric, you can measure its trend by adding a trend line, which indicates the general course of the metric in question. A trend line is a straight line connecting a number of points on a graph, used to analyze the overall direction of a group of values. In this example, we have Sales by Month, and the trend line shows that Sales is climbing.

Configuration options for the trend line can be controlled in the options pane. The method for the trend line has 3 options:

  • Linear: a linear trend line is a best-fit straight line for linear data sets. Your data is linear if the pattern in its data points resembles a line. A linear trend line shows that your metric is increasing or decreasing at a steady rate.
  • Polynomial: a polynomial trend line fits an equation involving several powers of the X value (degree 2 = x squared, 3 = x cubed, ...). The result is a curved line that may better fit your data when it fluctuates up and down. It is useful, for example, for understanding gains and losses over a large data set. The order of the polynomial can be determined by the number of fluctuations in the data or by how many bends (hills and valleys) appear in the curve.
  • Exponential: an exponential trend line is a curved line that is most useful when data values rise or fall at increasingly higher rates. You cannot create an exponential trend line if your data contains zero or negative values.

Confidence Interval: a confidence interval is a range of values, derived from sample statistics, that is likely to contain the value of an unknown population parameter. Because of their random nature, it is unlikely that two samples from a given population will yield identical confidence intervals.

Trendline can also be calculated in a custom calculation :

TRENDLINE(numeric_expr, ([series]) BY ([partitionBy]), model_type, result_type)

numeric_expr: represents the data to trend. This is the Y-axis. This is usually a measure column.
series: is the X-axis. It is a list of numeric or time dimension attribute columns.
partitionBy: is a list of dimension attribute columns that are in the view but not on the X-axis.
model_type: is one of the following ('LINEAR', 'EXPONENTIAL').
result_type: is one of the following ('VALUE', 'MODEL'). 'VALUE' will return all the regression Y values given X in the fit. 'MODEL' will return all the parameters in a JSON format string.

Example with Trendline: trend of revenue by product using linear regression.

TRENDLINE(revenue, (calendar_year, calendar_quarter, calendar_month) BY (product), 'LINEAR', 'VALUE')

Reference Lines

Reference lines are vertical or horizontal lines in a graph, corresponding with user-defined values on the x-axis and y-axis respectively.
By default, the reference line is of type Line, and the method used to compute it can be chosen from options like Average, Minimum, Maximum, etc. For example, in the airline industry, if passenger turnout is plotted with respect to time, the reference line lets users see whether the turnout for a particular month is above or below average.
Here is an example where a reference line is added to a report showing Sales by Month; the line is plotted at the average value of Sales.

The other method of drawing a reference line is Band. A band provides two lines delimiting a reference range. The function can be chosen between Custom and Standard Deviation, while the upper and lower limits of the band can be set to Average, Maximum, Minimum, etc. For example, if we are analyzing Sales by Month with a reference band from Average to Maximum, it is easy to see in which months sales fell between the average and the maximum.

With the Standard Deviation function, you can select a value of 1, 2, or 3. If you choose 1, the band is drawn so that 68% of values lie within it; for 2, 95% of values; and for 3, 99.7% of values.

If you would like to see a live project showcasing these advanced capabilities, please visit our Sandbox environments here

Oracle Analytics Cloud offers a very user-friendly way to leverage advanced analytics functions on a chart with a single mouse click. Advanced analytical functions like Forecast, Cluster, and Outliers give business users a strong capability for better insights into their data.

Thursday, July 4, 2019

Unleash your Database power within OAC using Evaluate function

Database systems like Oracle have tremendous analytical and data manipulation capabilities. These help in performing powerful transformations and gaining valuable insights out of massive amounts of data. Databases typically provide a wide range of inbuilt functions that can perform different kinds of deep analyses in a very scalable manner. Often, users are also able to define their own custom functions to meet specific query needs.

As users of Oracle Analytics (OA Cloud or Server), it is extremely valuable to be able to leverage these DB functions directly from within the OA framework.

The Oracle Analytics (OA) Evaluate function helps in doing just that. It enables users to create powerful analyses by directly invoking database functions. Evaluate provides a syntax to pass the specified database function, with optional OA referenced columns as parameters. It is particularly useful when users want to call a specialized database function to process a heavy calculation directly in the underlying database. Evaluate can be used to call functions from any underlying database: Oracle, Essbase cubes, DB2, MSSQL, Apache Hive, and so on. It is a scalar function, that is, it takes one or more values as input but returns a single value for every row.

The following video guides you through the process of leveraging Evaluate with Oracle Analytics; the rest of the blog below gives the details as well.


The syntax of Evaluate wraps around the syntax of the underlying DB function: EVALUATE('db_function(%1...%N)' [AS data_type], column1, ..., columnN), where
  • db_function() is a valid SQL function available in the underlying database.
  • column1 to columnN are the optional input parameters to the DB function. %1 is substituted with the value of column1, %2 with column2, and so on, for as many arguments as the DB function needs.
  • AS data_type is an optional parameter that specifies the datatype of the return value (an OA datatype). Specifying data_type is not meant for type casting; use it only when the return type cannot easily be predicted from the input columns. If the return value needs to be cast, add an explicit CAST function.
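As a minimal illustration (the function call and column here are purely examples, assuming a CUSTOMERS dataset with a CUST_CITY column), invoking the database's LENGTH function would look like:

```sql
EVALUATE('LENGTH(%1)' AS INTEGER, "CUSTOMERS"."CUST_CITY")
```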


In order to be able to use the Evaluate function in DV, the following two conditions need to be met:
  1. First, set the Evaluate Support Level parameter in the OAC system configuration. By default, the Evaluate support level is set to 0, which prevents OAC from executing Evaluate functions. To enable it, go to System Settings on the OAC environment and set Evaluate Support Level to 1 or 2. Setting it to 1 means Evaluate is supported only for users with the ManageRepositories permission; setting it to 2 means Evaluate is supported for all users.
  2. Make sure Data Access is set to Live for the datasets that will be referenced by the Evaluate function. Setting it to Live ensures that the OAC server always sends the query to the underlying database and does not simply retrieve results from its cache.

How to use Evaluate

Let’s consider the Customers table from the sample data SH schema in Oracle Database. 
This table contains information about Customer attributes like name, gender, address, phone number etc.  

Querying the Customers table in OA DataViz, here is how customer phone numbers look, with mixed formats: some customer records have the phone number in the format xxx-xxx-xxxx while some are in the format (xxx) xxx-xxxx. Let's say we want to change all phone numbers to the common format (xxx) xxx-xxxx.

Let's take the example of running a regex-based replacement rule to achieve this. The database we are using (Oracle DB) has a REGEXP_REPLACE function that allows this; let's invoke it here to illustrate the Evaluate syntax.

To begin with, let's go to our OA System Settings and make sure Evaluate Support Level is set to 1 or 2. This is done only once per OA environment.

Next, let's create an OA DV dataset sourcing from the Customers table in the SH schema, and make sure we set Data Access to Live for this dataset in order to be able to use the Evaluate function.

In the Oracle Database, the SQL syntax of the REGEXP_REPLACE function we want to use to change the phone number formats would be the following:

REGEXP_REPLACE(cust_main_phone_number, '([[:digit:]]{3})\-([[:digit:]]{3})\-([[:digit:]]{4})', '(\1) \2-\3')

To use this in DV, let's create a custom calculation directly from a project connected to the datasource and use the Evaluate function with the following syntax.

In the Evaluate expression, REGEXP_REPLACE is the DB function being called, and %1 is the column value passed to the function, substituted by the first column, CUST_MAIN_PHONE_NUMBER (in blue). There is only one parameter in our example.
Notice that the single quote is also the escape character: every single quote used within the original SQL function must be preceded by another single quote. That's why you see doubled single quotes in several places in the Evaluate formula.
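Putting this together, the Evaluate expression for our example would be along these lines (the fully qualified column name is illustrative; note the doubled single quotes inside the wrapped function):

```sql
EVALUATE('REGEXP_REPLACE(%1, ''([[:digit:]]{3})\-([[:digit:]]{3})\-([[:digit:]]{4})'', ''(\1) \2-\3'')',
         "CUSTOMERS"."CUST_MAIN_PHONE_NUMBER")
```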

Bringing the new calculation into a visualization looks like this: the phone number format has been changed by invoking the database REGEXP_REPLACE function.

If we look at the underlying log of physical SQL sent by OA to the database (use the Session Monitor screen to access this), we see that Evaluate has been stripped and only the REGEXP_REPLACE function is sent to the underlying database.

The Evaluate function can also be used within an OA dataflow. The same REGEXP_REPLACE can be called through an Evaluate function in the Add Columns step of any OA dataflow. Evaluate will work properly as long as all its column parameters come from source tables in the same underlying database.

An Evaluate formula returns the result of the database function as a column of type Attribute in OA. This result cannot be used as a metric or be broken down by other dimensions/attributes in the datasets.
Evaluate_Aggr is a variation of the Evaluate syntax that allows invoking aggregate functions in the underlying database and returning them as aggregated metrics in OA. This function is intended for aggregate functions with a GROUP BY clause.

The Oracle database, for instance, supports a wide range of aggregate functions. APPROX_COUNT_DISTINCT is one such function, returning the approximate number of distinct values of an expression. It is particularly useful on large amounts of data, where APPROX_COUNT_DISTINCT is significantly faster than an exact COUNT(DISTINCT), with negligible deviation from the exact result.
Let's see how to invoke this function from within OA using Evaluate_Aggr. We use the SH schema, which has Sales, Customers, and Products tables, and approximate the number of distinct customers who purchased each product category in our fact table. We could run COUNT(DISTINCT cust_id) on the Sales table, grouped by product category, but this could be time-consuming on large data volumes.

Instead of this, we can invoke APPROX_COUNT_DISTINCT function using Evaluate_Aggr in OAC. The OA DV calculation would look like the following.
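Assuming the fact table exposes its customer key as CUST_ID, the calculation would be a sketch along these lines:

```sql
EVALUATE_AGGR('APPROX_COUNT_DISTINCT(%1)' AS INTEGER, "SALES"."CUST_ID")
```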

The result of this calculation is a metric in OA, because we used the Evaluate_Aggr syntax. So in our DV or Answers visualization, we can simply add attributes to break the result down further; for example, we can bring in the PROD_CATEGORY column and show a bar chart. The bar chart below compares the result of Evaluate_Aggr(APPROX_COUNT_DISTINCT) with an actual count-distinct calculation. We can see that the approximated number of customers is very close to the actual values.

So both the Evaluate and Evaluate_Aggr OA functions offer a powerful way to leverage any database capability directly within the OA framework, whether Cloud (OAC), Server (OAS), or even DV Desktop.

Thursday, June 13, 2019

How to configure Usage Tracking in OAC?

Oracle Analytics Cloud supports the accumulation of Usage Tracking (UT) statistics that help in monitoring system usage and performance.

These statistics can be used in a variety of ways, such as system or database optimization, aggregation strategies, or internal billing of users/departments based on the resources they consume. Usage Tracking is particularly useful for finding the user queries that create performance bottlenecks, based on query frequency and response time.
The Oracle BI Server tracks usage at a detailed query level. When Usage Tracking is enabled, the BI Server collects a data record for every query that is executed and writes them all to database tables. Both logical and physical SQL queries are tracked and logged in separate tables, along with various performance measures:

The following video gives a high-level overview of how to set up Usage Tracking on your OAC environment. The blog content below gives more details about this capability.

Prerequisites to enable Usage Tracking

The following conditions need to be met in order to enable Usage Tracking on your environment:
1) Usage Tracking requires a full RPD (Metadata Repository) to be enabled. It will track usage of any queries, even those outside the RPD, but it requires an RPD for its configuration. For example, UT cannot be enabled when only a TCM (Thin Client Modeler, i.e., the web-based admin tool) model is active on your OAC.

2) UT requires access to a database to write data back into tables, so a user login with CREATE TABLE privileges on the database schema will be needed during configuration.

Steps to configure Usage Tracking

UT configuration broadly involves the following two steps: 
1) Defining a database connection in the RPD: using the Admintool client (the RPD repository editor), define a database connection in the physical layer of your RPD. This connection should point to the database where the Usage Tracking tables will be created and maintained. This database can be anywhere in the Cloud, as long as it is accessible for write access by OAC.

2) System Settings in OAC: configure the Usage Tracking parameters, such as the connection pool name, the physical query table name, and the logical query table name.

Step 1: Define a database connection in the RPD

Open the RPD which is uploaded in OAC and create a new database in the physical layer. Provide an appropriate name (eg: UsageTracking) and choose the database type as Oracle 12c.

Under this database, create a new Connection Pool with an appropriate name (Eg: UTConnectionPool). Provide the connection details to the database where Usage Tracking is to be configured and login credentials to the schema. 
Note:- This database user needs to have create table privileges on the schema. 

Next, create a physical schema in the RPD with the same name as the Database schema to be used by UT (Eg:- UT_demo). 

Once these definitions are completed, the physical layer should look like this. 

Save the RPD and upload it onto OAC using the Replace Data Model option within the Service Console.

Step 2: System Configuration

On OAC, open the System settings screen from the Console Tab. 

Scroll down and modify the following properties.

1) Toggle on the Enable Usage Tracking button

2) Usage Tracking Connection Pool: 
Enter the connection pool name in the format : <Database>.<ConnectionPool>. 
Eg:- UsageTracking.UTConnectionPool

3) Usage Tracking Physical Query Logging Table: 
This is the table where details about the physical queries executed by the BI Server against the underlying database are tracked.
Enter the name in the format :  <Database>.<Schema>.<Table>
Eg:- UsageTracking.UT_demo.PhysicalQueries

4) Usage Tracking Logical Query Logging Table
This is the table where details about the logical SQLs executed by BI Server are tracked. Enter the name in the format : <Database>.<Schema>.<Table>
Eg:- UsageTracking.UT_demo.LogicalQueries

5) Usage Tracking Max Rows: specifies the number of rows after which the UT tables will be purged. The default value is 0, which means no purging. You can set it to any desired value, and the records will be purged once the threshold is reached.

After entering these details, restart Oracle BI Server. Once the server is restarted, Usage Tracking is enabled and the two UT tables are created in the database.

UT Tables Explained

Open SQL Developer, login to the UT database and observe that the 2 tables are created.

To generate some usage tracking data, log in to OAC and click around some DV projects, both XSA-based and subject-area-based. Also open the BI Classic home and open a few dashboards to generate some queries. Now observe that the tables for logical and physical queries are populated with usage tracking information.

Important Columns in Logical Queries table

END_TS: End time of the query execution
ERROR_TEXT: Error message if the query errored out
ID: Primary key
QUERY_TEXT: Actual query text
RESP_TIME_SEC: Total response time of the query in seconds
ROW_COUNT: Number of rows returned by the query
SAW_DASHBOARD: Dashboard from which the query was generated
SESSION_ID: Session ID of the user firing the query
START_TS: Start time of the query execution
SUBJECT_AREA_NAME: Subject area used for the query
USER_NAME: User ID that executed the query

Important Columns in Physical Queries table

ID: Primary key
LOGICAL_QUERY_ID: Foreign key from the logical queries table
TIME_SEC: Time taken by the query to complete
ROW_COUNT: Number of rows returned by the query
START_TS: Start time of the query
END_TS: End time of the query

The two tables can be joined using the join condition
   LogicalQueries.ID = PhysicalQueries.LOGICAL_QUERY_ID.
Note that not every logical query generates a corresponding physical query: some logical queries hit the cache and return results from it without generating any physical SQL.
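As an illustrative query (table and column names follow the configuration used above), the two logs could be joined to profile query volume and response time per user:

```sql
-- Query count and average response time per user
SELECT l.USER_NAME,
       COUNT(*)             AS logical_query_count,
       AVG(l.RESP_TIME_SEC) AS avg_response_sec
FROM   UT_DEMO.LogicalQueries l
LEFT JOIN UT_DEMO.PhysicalQueries p
       ON l.ID = p.LOGICAL_QUERY_ID  -- cached logical queries have no physical row
GROUP BY l.USER_NAME
ORDER BY avg_response_sec DESC;
```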

Analyzing UT Data

Once UT is enabled, system usage can be analyzed from DV. To do this, create a DV connection to the UT database, create datasets for the PhysicalQueries and LogicalQueries tables, and analyze them within a DV project.

The following is a sample analysis built on the UT tables, showing the number of sessions, the number of queries, the most frequently used subject areas, the most frequently used dashboards, etc.


Usage Tracking gives administrators a mechanism to keep track of the usage of the OAC system. These statistics can be leveraged to decide when to scale up, scale down, restrict access during certain time periods, pause/resume the system, and so on.

Thursday, May 2, 2019

To Bat first or Bowl first? Strategy for the upcoming ICC Cricket World Cup 2019

2019 is finally here and, along with it, the biggest extravaganza in the world of cricket - the ICC Cricket World Cup 2019!! The tournament is one of the world's most viewed sporting events and is considered the "flagship event of the international cricket calendar" by the International Cricket Council. This 12th edition of the Cricket World Cup is hosted by England and Wales.

Have you ever thought about the innumerable factors influencing a cricket match outcome? 

  1. The weather on the day of the match dictates whether it would end up in a high scoring or a low scoring affair. 
  2. The condition of the pitch can also play an important role in the outcome. 
  3. The toss... who doesn't dread losing the toss? 

While there are several factors influencing the outcome, one of the most difficult decisions that captains have to make is whether to bat first or bowl first. This is a crucial decision and a strongly debated topic. One key input to this decision is history: how teams fared batting first vs chasing, how this pattern varies across venues, which teams have done better with each approach, and so on.

Can analytics help in making this decision?

The following are simple OAC (Oracle Analytics Cloud) visualizations built on all the World Cup matches played from 1975 to 2015. Using OAC, it took only a few minutes to gain interesting insights on the strategy teams competing at the 2019 World Cup might adopt:

Eleven World Cups have been played so far, and England has hosted four of them. Australia won the World Cup 5 times; India and West Indies come a distant second with 2 wins each; Sri Lanka and Pakistan won it once each.

A Marker in Finalists' Strategy

Teams batting first won the final 63% of the time, while teams chasing won only 37% of the time. So if a team is skilled enough to reach the final, batting first nearly doubles its chances of holding that cup!!

Team wise Strategies that worked

Digging a bit deeper, this pattern varies by team: for Australia and India, batting first or second didn't seem to matter when they won the World Cup. West Indies, however, only won by batting first. So if West Indies makes it to the final at Lord's this time around, their bet should definitely be on batting first.

Team Strategy when England is hosting

Since England is the host nation of the upcoming event, let's shift our focus to only those World Cups played in England. Interestingly, when hosted in England, finals were won most of the time (75%) by teams batting first.
When it comes to semi-finals, it's the opposite: teams batting second have won most semi-finals!!
These numbers contrast sharply with semi-finals and finals hosted outside of England:
in all the other countries, teams batting first have won most finals and semi-finals.

Venue-wise Strategy

Could the venue also play a part in this pattern? Let's consider the premier venues of England where the majority of the games in the 2019 World Cup are to be played: Birmingham, Leeds and Lord's. Teams playing there have done equally well batting first or second.
But if you are playing in Manchester (where one of the 2019 semi-finals will be played), history suggests it's better to chase, while at The Oval (London), teams batting first have fared far better.

Let's look at the venues where teams have won with the highest average runs margin. Birmingham, Taunton and Chester-le-Street are the preferred grounds to bat first, where the winning margin is quite high.

Let's now look at those venues where teams bowling first have won with the highest wickets margin. Canterbury, Lord's and Leeds are the venues where chasing teams have fared well.

Looking at the overall batting-first win rate across premier venues in England, Nottingham and The Oval are the most preferred venues to bat first.

Team Strategy across World Cup games

Some individual teams historically fare better batting first vs chasing in World Cup matches, irrespective of venue and location. Australia, the top performer, has mostly won by batting first (60%). In contrast, arch-rivals New Zealand are far better chasers (60%)!!

In the early days of World Cups, or even other cricket matches for that matter, teams always preferred to bat first: bat first, score big and put the opposition under pressure. But looking at the trend from the 2007 World Cup onwards, the stats show there is not much of a difference between batting first and chasing. Does this indicate that teams are getting better at playing under pressure while chasing? Perhaps they are!


Most cricket enthusiasts would argue that batting first has been the best winning strategy so far, but the data does not always support that position: captains winning the toss would benefit from looking at historical stats to make an informed decision.
For the upcoming World Cup in England, insights from historical data suggest a few distinct approaches:

Semi Final 1 (in Manchester): the data shows that Manchester has been a ground where teams have dominated while chasing. It also shows that, historically in England, semi-finals were won by chasing. So the team winning the toss is more likely to win the match if it lets the opposition bat first.

Semi Final 2 (in Birmingham): although the venue doesn't particularly favor teams batting first or bowling first, the data suggests that Birmingham's big wins have come with large runs margins (i.e. by batting first). The team winning the toss is more likely to win the match if it bats first.

Finals (Lord's): historically, finals in England have been won by teams batting first. The team winning the toss improves its chances of lifting the World Cup by batting first.

May the best team win the World Cup!!