Wednesday, April 24, 2019

How to perform incremental data loads in Oracle Analytics Cloud

Oracle Analytics Cloud offers the ability to create data flows and perform incremental loads on a target table. A data flow can be set up to operate only on the incremental data that arrives in the source between its previous run and the current one.

In this blog, let's see how to perform an incremental load on a database table. This video gives a sense of the whole experience, and the blog below shares a few more details.


Prerequisites for performing incremental loads

1) Incremental loads can be performed only on data sets where both the source and the target tables are database tables.
2) If there are multiple data sets in a data flow, then only one of the sources can be set for incremental update.

Defining a source and new data identifier for the source

The first step is to have a source table and identify a column that can be used to detect new data in that table. In this example, I have a revenue fact table with the month key as the new data identifier.
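
If you want to mimic this setup outside of OAC, a minimal stand-in for such a source table can be built with plain Python and sqlite3. All names here (REVENUE_FACT, MONTH_KEY, and so on) are invented for illustration and are not OAC artifacts:

    import sqlite3

    # Hypothetical stand-in for the revenue fact table described above.
    # MONTH_KEY (YYYYMM) only ever grows as new data arrives, which is what
    # makes it usable as a "new data" identifier for incremental loads.
    conn = sqlite3.connect("demo.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS REVENUE_FACT (
            MONTH_KEY INTEGER NOT NULL,   -- e.g. 201801 .. 201812
            PRODUCT   TEXT,
            REVENUE   REAL
        )""")
    conn.executemany(
        "INSERT INTO REVENUE_FACT VALUES (?, ?, ?)",
        [(201800 + m, "Widget", 1000.0 * m) for m in range(1, 13)],
    )
    conn.commit()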


The next step is to create the Oracle Analytics data source pointing to this table. While creating the data source, make sure you mark the new data identifier column by clicking on the third node in the process flow. This is an important step, as this column defines how the system identifies new rows in the data set.

Define a data flow

Now that our data source is created, let's define a data flow by importing the revenue fact table from the source connection. The key here is to check the "Add new data only" box to ensure that the source table is marked for incremental load.


To make my data flow a bit more functionally representative, I will add a small business example: converting currency values. Let's bring in a spreadsheet that has exchange rates for every month and join it to the fact table on the month key column. Then let's add a new calculation to convert the revenue.
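
In plain SQL terms, this transformation amounts to a join on the month key followed by a multiplication. Continuing the sqlite3 sketch from above (the EXCHANGE_RATES table and the flat 0.89 rate are made up for the demo):

    import sqlite3

    conn = sqlite3.connect("demo.db")
    # Hypothetical exchange-rate table standing in for the spreadsheet:
    # one conversion rate per month, joined to the fact table on MONTH_KEY.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS EXCHANGE_RATES (
            MONTH_KEY INTEGER NOT NULL,
            RATE      REAL    NOT NULL
        )""")
    conn.executemany(
        "INSERT INTO EXCHANGE_RATES VALUES (?, ?)",
        [(201800 + m, 0.89) for m in range(1, 13)],  # flat demo rate
    )
    conn.commit()

    # The data flow calculation boils down to: converted = revenue * rate.
    rows = conn.execute("""
        SELECT f.MONTH_KEY, f.PRODUCT, f.REVENUE * r.RATE AS CONVERTED_REVENUE
        FROM REVENUE_FACT f
        JOIN EXCHANGE_RATES r ON r.MONTH_KEY = f.MONTH_KEY
    """).fetchall()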

Finally, let's select the "Save Data" step and specify a name for the resulting data set. Make sure you choose a database as the target connection and specify the table name where the result set needs to be saved. There are two options available in the "When Run" drop-down:
  1. "Replace existing data": the data flow truncates the target table and reloads all the records.
  2. "Add new data to existing data": the data flow keeps the existing records intact and loads only the new records from the source table. New records are identified by the column we defined on the source data set above.
Let's set the "When Run" option to "Add new data to existing data" and save the data flow. A rough sketch of both behaviors follows.
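
Oracle does not document the internal mechanics of these two modes, but their observable behavior can be sketched roughly in plain Python with sqlite3. One key assumption in this sketch, consistent with what we will observe below, is that "Add new data" compares the identifier column against a high-water mark remembered from the previous run, rather than against the current contents of the target table. The LOAD_WATERMARK table and every other name here are invented for illustration:

    import sqlite3

    conn = sqlite3.connect("demo.db")
    # Empty copy of the source table's structure, standing in for the target.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS REVENUE_TARGET AS "
        "SELECT * FROM REVENUE_FACT WHERE 0")
    # Invented bookkeeping: remembers the highest MONTH_KEY already loaded.
    conn.execute("CREATE TABLE IF NOT EXISTS LOAD_WATERMARK (LAST_KEY INTEGER)")

    def run_data_flow(when_run="add"):
        """Rough sketch of the two "When Run" behaviors."""
        if when_run == "replace":
            # "Replace existing data": truncate the target, reload everything.
            conn.execute("DELETE FROM REVENUE_TARGET")
            conn.execute("INSERT INTO REVENUE_TARGET SELECT * FROM REVENUE_FACT")
        else:
            # "Add new data to existing data": only source rows whose identifier
            # lies beyond the previous run's watermark are brought across.
            last = conn.execute("SELECT MAX(LAST_KEY) FROM LOAD_WATERMARK").fetchone()[0]
            conn.execute(
                "INSERT INTO REVENUE_TARGET "
                "SELECT * FROM REVENUE_FACT WHERE MONTH_KEY > ?",
                (last if last is not None else -1,))
        # Advance the watermark to the highest key now present in the source.
        conn.execute("DELETE FROM LOAD_WATERMARK")
        conn.execute("INSERT INTO LOAD_WATERMARK SELECT MAX(MONTH_KEY) FROM REVENUE_FACT")
        conn.commit()

    run_data_flow()  # first run: no watermark yet, so all 12 months are loaded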

Now, let's run the data flow for the first time. As it completes, we can see in the underlying database that the target table has been created. Since this was the first run of the data flow, all the records in the source table were inserted into the target table.

Now, for the purposes of this blog, let's delete one month of data (201812) from our newly created target table. After doing this, our source table still has its 12 months of data (January to December), but our target table has only 11 months; it is missing December. Notice that we did not change the data in our source table, so there are no new records there since the last run of the data flow.

So, when we run the data flow a second time, the target table is not incremented at all. The data flow was set to bring across new data only, and since there is no new data in the source, nothing changes in the target table. We can verify that the deleted month's data has not been reloaded into the target. If the data flow had been set to a full load, all the data would be back in the target.

Now, to complete the test, let's manually load our source table with 3 more months of data. This represents some incremental data. Then let's re-run the data flow once again. We find that the target table has been incremented with the 3 new months of data coming from the source table, but notice that the target table is still missing data for the month where the records were deleted.
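
For completeness, the same experiment can be replayed against the hypothetical sketch from earlier, and it shows the same behavior: rows deleted from the target are never restored, because the watermark is already past them, while genuinely new source rows are appended.

    # Continuing the sketch above: drop December from the target...
    conn.execute("DELETE FROM REVENUE_TARGET WHERE MONTH_KEY = 201812")
    run_data_flow()   # no source rows beyond the watermark -> target unchanged

    # ...then add three genuinely new months to the source and re-run.
    conn.executemany(
        "INSERT INTO REVENUE_FACT VALUES (?, ?, ?)",
        [(201901, "Widget", 500.0),
         (201902, "Widget", 600.0),
         (201903, "Widget", 700.0)],
    )
    run_data_flow()   # only 201901-201903 are appended; 201812 stays missing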

Just to confirm, for the purposes of this blog: if we go back to the data flow definition, set the "When Run" option to "Replace existing data", and run the data flow again, then all the data gets loaded into the target, including the deleted month's data.
 
Are you an Oracle Analytics customer or user?

We want to hear your story!

Please voice your experience and provide feedback with a quick product review for Oracle Analytics Cloud!
 
