Wednesday, October 10, 2018

Dataflows get better and smarter

Data flows in the latest version of Oracle Analytics Cloud are more powerful and useful than ever. Let's take a look.

Data flows in Oracle Analytics let users take one or more data sources, join them, and transform the data to produce a curated dataset ready for visualization and analysis. Data flows offer various built-in functions/nodes — such as adding new columns/calculations, removing columns, grouping, binning, training machine learning models, forecasting, and sentiment analysis — to transform and enrich the data.

The latest version of Oracle Analytics adds several new features to data flows, including Dataset Prompts, Branching, Incremental Data Processing, and output column metadata management. We will go through each of these new features in detail in this blog.

Dataset Prompts: 

The Dataset Prompts feature lets users choose the input or output datasets of a data flow on the fly, at the time the data flow is run. The Prompt option is useful when a user wants to reuse a complex data flow with another dataset, or to save the output dataset under a different name, without having to open and edit the flow. In other words, the Prompt option parametrizes the input and output datasets of a data flow. This option is disabled by default; users enable it by selecting the Prompt check box. Here are a few snapshots that show how to enable prompts for a data flow:


                               
The Name field takes the default dataset name as input. This should be the name of an actual dataset present in the instance.
The Prompt field takes the prompt text to be shown when the data flow is run.
The Prompt option is available for both the input and output datasets of a data flow. If the Prompt option is not selected, the data flow runs with the dataset selected while the data flow was created or edited.

This is how the prompt window looks when a data flow with prompts enabled is run:


To summarize, the Prompt option adds a great deal of flexibility to running data flows by letting users specify the input and output datasets on the fly, without editing the data flow.
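Conceptually, a prompted data flow behaves like a function whose dataset names are run-time parameters rather than values fixed at design time. Here is a minimal Python sketch of that idea (the file names and the `segment`/`sales` columns are hypothetical, and the CSV-based analogy is ours — OAC data flows are configured in the UI, not in code):

```python
import csv
from collections import defaultdict

def run_flow(input_path, output_path):
    # The input dataset is resolved at run time (the "prompt"),
    # not hard-wired into the flow definition.
    totals = defaultdict(float)
    with open(input_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["segment"]] += float(row["sales"])
    # The output dataset name is likewise supplied at run time,
    # so the same flow can write to differently named outputs.
    with open(output_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["segment", "total_sales"])
        for seg, total in sorted(totals.items()):
            w.writerow([seg, total])
    return dict(totals)

# Simulate one run with prompt values chosen at execution time.
with open("q1.csv", "w", newline="") as f:
    f.write("segment,sales\nA,10\nA,20\nB,5\n")
out = run_flow("q1.csv", "q1_summary.csv")
```

Rerunning `run_flow` with a different input or output path reuses the same transformation unchanged, which is exactly the flexibility the Prompt option provides.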

Here is a video that demonstrates the Dataset Prompts feature in data flows:



Branching:
The Branching option allows users to split the output of a node (except the Train ML Model node) in a data flow into two or more branches. Users can apply different transformations on different branches and save the outputs of the branches to different datasets. The end node of each branch is always a Save Data node. To add a branch node, click + and select the Branch node. This is how the branch node and its options look:

                   
The number of branches can be increased or decreased by entering a value or using the UI controls. Each branch can process a different subset of the data and return a distinct output. For example, in the snapshot below, three branches are added to the Sample Order Lines dataset.


The 1st branch computes Sales by Customer Segment and saves it in a dataset.
The 2nd branch computes Sales by Product Category and saves it in a second dataset.
The 3rd branch adds a month column and saves the entire result in a third dataset.

On running this data flow, three different datasets are created. Here is a snapshot of the output of these three branches:

                                                 

To summarize, the Branch option is useful when a user wants to apply different transformations to different subsets of data and save the results separately.
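The three-branch example above can be sketched in plain Python: one source feeds three independent transformations, each ending in its own saved output (our analogy of the mandatory Save Data end node). The column names and file names below are hypothetical stand-ins for the Sample Order Lines dataset:

```python
import csv
from collections import defaultdict

def save(path, fieldnames, rows):
    # Analogy of the Save Data node that ends every branch.
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)

# A tiny stand-in for the source dataset feeding all three branches.
rows = [
    {"segment": "Consumer", "category": "Office", "order_date": "2018-03-05", "sales": "120"},
    {"segment": "Corporate", "category": "Tech", "order_date": "2018-07-19", "sales": "300"},
]

# Branch 1: sales by customer segment.
by_segment = defaultdict(float)
for r in rows:
    by_segment[r["segment"]] += float(r["sales"])
save("sales_by_segment.csv", ["segment", "sales"],
     [{"segment": k, "sales": v} for k, v in by_segment.items()])

# Branch 2: sales by product category.
by_category = defaultdict(float)
for r in rows:
    by_category[r["category"]] += float(r["sales"])
save("sales_by_category.csv", ["category", "sales"],
     [{"category": k, "sales": v} for k, v in by_category.items()])

# Branch 3: add a month column and keep every row.
with_month = [dict(r, month=r["order_date"][5:7]) for r in rows]
save("orders_with_month.csv",
     ["segment", "category", "order_date", "sales", "month"], with_month)
```

Each branch reads the same upstream rows but writes its own dataset, just as each branch of a data flow ends in its own Save Data node.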

Here is a quick video tutorial that shows how the Branch option can be used:


Output Controls:
The Output Controls feature lets users decide how an output column of a data flow should be treated and saved (as an Attribute or a Measure) in the output dataset. For measure columns, a default aggregation can also be chosen. On adding a Save Data node to the data flow, users are given the option to decide how each output column should be treated: they can select Attribute or Measure from the drop-down list, and for measure columns they can select the default aggregation rule. Here is a quick snapshot that shows how users can change the treatment and default aggregation rule for columns:



To summarize, this feature gives users more control over how each output column is treated in the output dataset.
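Conceptually, this is per-column metadata attached at save time: each column is tagged as an attribute or a measure, and measures carry a default aggregation rule. A small Python sketch of that idea (the column names, the `treat_as`/`aggregation` keys, and the aggregation table are all hypothetical, not the product's internal representation):

```python
# Hypothetical per-column metadata, mirroring the Save Data options:
# attributes are grouped on; measures get a default aggregation rule.
output_columns = {
    "Customer Segment": {"treat_as": "attribute"},
    "Order ID":         {"treat_as": "attribute"},  # numeric-looking IDs stay attributes
    "Sales":            {"treat_as": "measure", "aggregation": "sum"},
    "Discount":         {"treat_as": "measure", "aggregation": "avg"},
}

AGG = {"sum": sum, "avg": lambda xs: sum(xs) / len(xs)}

def default_aggregate(column, values):
    # Apply a measure column's default aggregation rule.
    meta = output_columns[column]
    if meta["treat_as"] != "measure":
        raise ValueError(f"{column} is an attribute; it is grouped, not aggregated")
    return AGG[meta["aggregation"]](values)
```

Tagging a numeric column like Order ID as an attribute keeps downstream visualizations from summing it by mistake, which is the kind of control this feature provides.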
Here is a quick video tutorial that demonstrates this feature:


Incremental Data Processing:
The incremental data processing feature allows users to run a data flow only on the incremental data/rows that become available between batch runs. This makes resource usage more efficient, because the data flow runs only on new data rather than reprocessing data that has already been processed. This option is available only for datasets created from database connections, and it can be enabled for only a single input dataset within a data flow.

Enabling incremental processing for a dataset is a two-step process:
1) First, set the New Data Indicator column while creating the dataset from a database: in the configuration page, set the New Data Indicator field to one of the columns from the dataset. New data added to the database is identified based on this indicator column. Here is a quick snapshot which shows how to configure the new data indicator column:



In this case, the New Data Indicator column is set to the TIME_BILL_DT (date) column.

2) After adding the newly created dataset as an input to the data flow, select the "Add New Data Only" field to enable incremental processing for this dataset. Here is a quick snapshot that shows how this should be done:

                                             

Now this dataset is enabled for incremental processing, and any updates to the dataset/table are processed incrementally the next time the data flow is run.

The output of the data flow for the incremental data can either be appended to the existing output or replace it. Here is a quick snapshot which shows where to choose this option while saving the output dataset:

                                     
Now all the required parameters are set for incremental processing. When we save and run this data flow for the first time, it runs on the entire dataset; on subsequent runs, it processes only the changed (added or removed) data in the configured dataset.
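The mechanism can be pictured as a high-water mark on the indicator column: remember the largest indicator value processed so far, and on each run take only rows beyond it, appending them to the output. Here is a rough Python sketch of that pattern, reusing the TIME_BILL_DT column from the snapshot (the state file, file names, and append logic are our simplified analogy, not how OAC implements it):

```python
import csv
import os

STATE_FILE = "last_indicator.txt"  # persists the high-water mark between runs

def run_incremental(source_path, output_path, indicator="TIME_BILL_DT"):
    # Load the last processed indicator value, if any.
    last = None
    if os.path.exists(STATE_FILE):
        last = open(STATE_FILE).read().strip() or None
    with open(source_path, newline="") as f:
        rows = list(csv.DictReader(f))
    # First run processes everything; later runs only rows past the mark.
    new_rows = [r for r in rows if last is None or r[indicator] > last]
    if not new_rows:
        return 0
    write_header = not os.path.exists(output_path)
    with open(output_path, "a", newline="") as f:  # "add new data" = append
        w = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        if write_header:
            w.writeheader()
        w.writerows(new_rows)
    # Advance the mark (ISO dates compare correctly as strings).
    with open(STATE_FILE, "w") as f:
        f.write(max(r[indicator] for r in new_rows))
    return len(new_rows)

# Simulate two runs: the second sees only one newly billed row.
for p in (STATE_FILE, "billing_out.csv"):
    if os.path.exists(p):
        os.remove(p)
with open("billing.csv", "w", newline="") as f:
    f.write("TIME_BILL_DT,amount\n2018-09-01,100\n2018-09-15,200\n")
first = run_incremental("billing.csv", "billing_out.csv")
with open("billing.csv", "a", newline="") as f:
    f.write("2018-10-01,50\n")
second = run_incremental("billing.csv", "billing_out.csv")
```

The first run processes both existing rows; the second processes only the row billed after the stored mark, which is the efficiency gain incremental processing offers over re-running the full dataset.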

Here is a quick video tutorial that demonstrates the incremental data processing capabilities of data flows:


Are you an Oracle Analytics customer or user?

We want to hear your story!

Please voice your experience and provide feedback with a quick product review for Oracle Analytics Cloud!
 
