Oracle Underground BI & Dataviz: 2019

Merging 2 data sets into a single one is a common need for analysts manipulating diverse data sources. Oracle Analytics Cloud's Data Flow enables an friendly way to do this within the context of preparing data, just before building visualizations and deriving insights out of it.

This blog highlights the different scenarios in which union rows step of the data flow can be leveraged.

Prerequisites for using the Union Rows step in Data Flows

The Union-Rows step in Data Flows can merge 2 data sets into a single one, provided the two input sources have the same structure. It operates in a similar way to the SQL Union operation in the database, where result set of 2 queries are merged into a single output :

Both the data sets should have same number of columns,
The data types of the corresponding columns in data sets should match. For example column1 of data set 1 must have the same data type as column1 of data set 2 and so on.

In other words, you can easily merge datasets that are of similar format. For example sales data for different months are in different datasources, or HR data for various subsidiaries are in different files, etc.

So, for our example, lets consider 2 data sets with sales order information for 2 separate months, Jun-2019 and Jul-2019. This data has columns like Order Priority, Customer Name, Product Category, Product Name and Sales as a metric. There are some customers in these data sets who have purchased the same product across the two months.

Data Set - June 2019 (100 rows)

Data Set - July 2019 (100 rows)

We have 10 rows of customers who have purchased the same products across the two months. We will see how these 10 rows of duplicate data across the two months are handled in the various Union operations discussed below.

How to create a Union Row step in Data Flow?

Lets add both the data sets to an Oracle Analytics Data Flow, and further down in our Data Flow click on the "+" icon and choose Union Rows step. The Union Rows step gets enabled only when there is more than one data set added in the Data Flow.

As the step adds to the Flow, click on the suggested 'second' data node to complete the Union Step. that is just to indicate which two data-sets we want to union.
The details of this node presents various options in which an Union operation can be performed are presented along with a visual representation of the chosen option. The visual representation clearly specifies what would be the resulting data set when the options are chosen. The Options available are

All rows from input1 and input2 (Union All)
Unique rows from input1 and input2 (Union)
Rows common to input1 and input2 (Interesct)
Rows unique to input1 (Except)
Rows unique to input2 (Except)

Use Case 1: Merge both the data sets

Lets say we want to merge both sets into a single data set for reporting purposes and keep all the rows. In this scenario, all the rows from data set 1 has to be merged with all the rows from data set 2 (equivalent to a Union All operation in SQL). This operation will not eliminate any duplicate records.The visual representation of this operation is shown on the right of the screen, explaining how the resulting data set would look like.

So let's complete our Data Flow with the two input data sets, use the Unions Rows node with the first option selected and execute the flow.

When the data flow is executed, a Data Set is created with the merged output, and its total number of rows (which can be calculated using a formula like sum(1)) is 200. This shows that the union operation merged both the data sets without removing any duplicates.

Use Case 2: Merge without duplicates

Lets now say we want to merge both the data sets but only want to keep the distinct combinations of customers and products they purchased over the two months. In effect, we want to remove duplicates across the two data sets with these same combinations. This is similar to a Union operation in SQL. A Union operation merges the rows from both the data sets by eliminating the duplicates and retaining a single copy of the duplicate rows

In order to achieve this, lets trim down the data set by retaining only customer and product related columns. So we achieve this by using a 'Select Columns' node just before our Union Rows node, and we keep only the following columns : Customer Name, Product Category, Product Sub Category, Product Container and Product Name. Let's now perform a Union Rows on these two data-sets and select the second option which is 'Unique rows from Input 1 and Input 2.'

When the data flow is executed and creates the merged output, we notice that row count in resulting data set is 190. This shows that only one copy of the duplicate records (10) were retained in the output.

Use Case 3: Rows common from input1 and input2

Now lets say we want to find the list of customers who have ordered the same products twice, once in June and once in July. In this scenario, we have to find the common records with same customer and same product between the two data sets. This is similar to an intersect operation in SQL.

Let's define the data flow similar to the one described in Use Case 2 with just the customer and product related columns selected and apply a Union rows node with the third option 'Rows common to Input 1 and Input 2'.

When the data flow is executed and creates the merged output, we notice that row count in resulting data set is 10. Only one copy of the duplicate records were retained in the output. The detailed tabular reports shows the exact intersecting records between the data sets

Use Case 4 and 5 : Rows unique to input1 (Except)

Finally, lets say we are interested in those customers and products that were ordered only in June but were explicitly not ordered in July. Both the data sets have to be merged by retaining only the unique records in the first data set. This is equivalent to an except operation in SQL.

Let's again define the data flow and choose the option 'Rows unique to Input 1'

When the data flow is executed and creates the merged output, we notice that record count in resulting data set is 90, meaning only those records in the first data set except the records that are common to the second data set are saved in the output.

Note that if a user needs to merge more than two data sets, it can be done in multiple steps in the same Data Flow : first merge two data sets, then merge the output to the third data set and so on.

Union-Rows node allows a friendly way to merge two data sets in an Oracle Analytics Data Flow. It will work agnostic to the type of data source, as long as the respective structure of the data is identical.

Are you an Oracle Analytics customer or user?

We want to hear your story!

Please voice your experience and provide feedback with a quick product review for Oracle Analytics Cloud!

This blog shows a preview of an upcoming OAC 5.5 feature. OAC 5.5 is expected to be GA very soon by Oracle (Jan 2020), the feature will become available then.

In the 5.5 release of OAC (Oracle Analytics Cloud) creating custom Infographics visualizations is made simple through the use of custom dynamic canvas backgrounds and the new spacer visualization. Infographics created on OAC can then be used from any device (browser or mobile) and dynamically resize per the screen size.

In this example we are building a infographic showing stats about NFL fans attendance. We selected a custom picture of a stadium that we want to use as a custom background to our infographics.

1. Setting custom DV background with an image.

To configure that image as a custom background, we are using Canvas Properties menu by right-clicking on the Canvas tab itself and selecting the option. Once there, we are setting the value Cutom for Background property.

There, pick the image location, adjust transparency and make sure you set size parameters to Auto-Fit, this will ensure your DV sample scales regardless of your screen size and resolution.

All we have to do next is to start building our infographic by positioning visualizations on top of that picture.

2. Laying out visualizations

In our example, we will use a combination of vertical bar charts and horizontal bar charts, along with the tag cloud extension. As we build our visualizationsm we can use the spacer visualization (OAC5.5), to customize our layout exactly as we need it…

we can drag the spacer extension to create separations between different visualizations. It can be resized as needed, just like other vizs, flipped either horizontal or vertically and set background color or as an image. While we are organizing our vizs, the canvas properties keep to the default autofit configuration, which means the whole page will properly resize accoding to screen size of the consuming devices.

Once our visualizations content and layout completed we can customize additional display properties such as font color and size for all report components, i.e, title, values, x/y axis

Blog Collateral

Watch this short 3 minute video - Spacer Viz and Custom Background that comes OOB with OAC 5.5.

Are you an Oracle Analytics customer or user?

We want to hear your story!

Please voice your experience and provide feedback with a quick product review for Oracle Analytics Cloud!

Oracle Underground BI & Dataviz

Friday, December 13, 2019

Merge/Union rows in OAC Data Flows