Thursday, August 20, 2020

Mask or anonymize sensitive data in Oracle Analytics

Introduction

In the world of data driven decision making, analysts and managers should ensure appropriate privacy controls by restricting access to certain sensitive information in their data. Many times analysts work with sensitive information data like SSN (Social Security Numbers), CCN (Credit Card Numbers), Patient Names and so on. This data details should be masked before becoming available for further analysis and ability to easily mask data is of prime importance in an Analytics solution.

Oracle Analytics offers powerful data preparation capabilities and includes features that allows anonymizing or masking sensitive information. In this blog, let us explore different ways in which we can achieve this:

1. Using recommendation Feature
2. Using ObfuscateUpdate() Function
3. Using Regular expressions

1. Obfuscate using Recommendation feature

When a dataset is uploaded in Oracle Analytics, the system profiles the data as a first step. As part of this profiling process, the recommendation engine identifies columns that contain sensitive information and offers a recommendation to obfuscate it. Let's say you upload a dataset containing SSN data. The recommendation engine identifies this as sensitive information and offers a few recommendations that allow you to obfuscate or delete this data. 




When clicking on the recommendation ** Obfuscate ssn, a step is added to the Preparation Script and the data is obfuscated. 

This recommendation capability is currently leveraging a 'knowledge reference set' that is Oracle-Analytics-Vanilla list of recommendations. For example, it may not recognize social security numbers of every countries in the world. However very soon, the product will support uploading custom knowledge reference, which will enable administrators to tune the automatic recommendations that the product delivers.

2. Obfuscate using ObfuscateUpdate Function

There are instances where you may want to obfuscate data that has not been automatically identified by the recommendation engine for obfuscation. For instance, you may want to obfuscate a birthdate in your dataset. The recommendation engine will not identify this column for obfuscation. In order to manually apply obfuscation, you can use the ObfuscateUpdate function in data prep. 

This function has the signature ObfuscateUpdate(<column>,parameter). The Parameter can be 'ALL', 'FIRST_5_DIGITS' or 'FIRST_12_DIGITS'.

a.Obfuscate Parameter | ‘ALL’


In this dataset, column ‘birthdate’ is not automatically identified for obfuscation. We can still mask it by explicitly invoking the  ObfuscateUpdate function. 
Edit the colum and add a function ObfuscateUpdate(birthdate,’ALL’)


b. Obfuscate Parameter | ‘FIRST_5_DIGITS’ and ‘FIRST_12_DIGITS’

In addition to the ALL parameter, it also accepts ‘FIRST_5_DIGITS’ and ‘FIRST_12_DIGITS’ as options. These parameters are applied only on digits within the column and not to text. . Let's say we want to obfuscate the first 5 digits of the ccexpires column. The column data for example is 12/2015 and if you mask the first 5 digits, it changes from 12/2015  to ##/###5.

ObfuscateUpdate(ccexpires,'first_5_digits')

Similarly, if you want to obfuscate 12 digits you can set the parameter to TFIRS_12_DIGITS

Note that when you use any other parameters other than ALL, FIRST_5_DIGITS, FIRST_12_DIGITS, the function does not throw any error but returns a blank value. 

3. Using REGEX: More options to Obfuscate data

If we need to obfuscate specific number of digits(other than 5 and 12) , using a Regular Expression is an option. A regular expression (or regex) feature is available as part of "find and replace" feature of a column. Regex is a sophisticated feature with many use cases and is not limited to masking. You can read the following blog post (link) to understand regex feature more. 

Using REGEX | Mask all characters

Edit the column that you want to change and choose replace. 


Select 'Use regular expression' and use the following expression
  • In String to replace use “.”
  • In New String use “*”
All characters are replaced by *

Using REGEX | Mask all characters and show same number of characters in each row

In the above example, the number of characters in each row can guessed by counting number of *. If you do not want to show the length of the string, use the following expression

    • In String to replace use “^.*$”
    • In New String use “*****”

All the rows look the same “*****”

Using REGEX | Mask first n Characters

If you want to mask only first 4 characters, use the following expression

    • In String to replace use “^.{1,4}”. Replace 4 with any number you want.
    • In New String use “****”. Note: If you are replacing 4 characters use ****.

Using REGEX | Masking last n characters

    • In String to replace use “.{1,3}$”. Replace 3 with any number you want.
    • In New String use “***”. Note: If you are replacing 3 characters use ***.

Summary

Obfuscation feature is highly useful to all analysts who want to mask data quickly without resorting to complicated techniques. Oracle Analytics offers different mechanism to achieve this need.