Monday, June 8, 2020

How Regex can enhance Data Preparation?

Is simple find-and-replace not sufficient for your need in Oracle Analytics Data Preparation ? Regular Expression Replace functionality is bringing great new powerful abilities with pattern matching in Oracle Analytics Data Prepare. We will see what it has got to offer, in this blog.
First, take a quick look at the video which explains briefly about this feature.


Oracle Analytics now allows users to utilize the full potential of Regular Expressions aka Regex in pattern matching of column text in OAC Data Preparation or Data Set editor. Before we jump into the feature, let's recollect the basics of regular expressions. You can skip this part if you’re already familiar with Regular Expressions.

Regular Expressions in a nutshell

Regular Expressions is a powerful tool for matching patterns. A pattern can be a word or a specific character or sequence of characters, occurring once or multiple times. Regex can also help you find words occurring in a specific format, like email address. We will only address some of the useful Regex patterns in this blog entry, which will still help you kick-start Regex Replace in OAC Data Preparation.

For starters, let's see some of the commonly used Regex patterns :
Let's start with a pattern to match one or more digits. ‘\d+’ can be used to find the digits from 0-9 anywhere in the string. The backslash ‘\’ is used as escape character, ‘d’ is the shorthand code for digits and ‘+’ sign denotes one or more occurrence. Other useful patterns are:

  • w            Any alphanumeric character
  • W           Any non-alphanumeric character
  • d             Any digit
  • D            Any non-digit character
  • .              Any character
  • *             Zero or more repetitions
  • +             One or more repetitions
  • ?             Optional character

Character Classes or Character Sets
When you want to match only one out of several characters, Character classes or Character-sets can be used. For eg. to match an a or e, we can use [ae]. The pattern ‘gr[ae]y’ matches gray and grey but not Grey or graay.
It is also possible to specify a range with [] brackets. For eg. [0-9] matches a digit from 0-9, [a-z] matches a letter from a to z. Multiple character ranges can also be used. For Eg. [A-Z0-9] matches an uppercase alphabet followed by a digit. Remember, the patterns are case-sensitive!

Grouping Patterns
In Regex, group of characters can be defined for capturing and processing the data. Let's do this with an example. Consider the text ‘ABSD0028’ which has two parts – alphabets(ABSD) and digits(0028). If we have to write a pattern to split the text into two parts, we can do so by grouping the characters with parenthesis and each group can be captured using back-references. Here is our pattern.
([A-Z]+)(\d+)

Back-references
Now, having learnt about the grouping, we must know how useful are they in the real-world scenario. The main purpose of grouping in regular expressions is to use one or more groups in the output. Let’s consider the above example. We grouped the text into two parts. We can refer these parts using back-references and construct a new text. Naming of the back-reference parameters might vary depending upon the platform/programming language on which Regex engine works. In OAC, back-reference parameters are denoted by a ‘$’ sign followed by the order of occurrence of the group.  For eg. in the pattern ([A-Z]+)(\d+), ‘([A-Z]+)’ will be referred as $1 and ‘(\d+)’ will be referred as $2.

Alternation
Alternation is the ‘OR’ operator in regular expression. To match red or green or blue, use red|green|blue. To match ‘red color’ or ‘blue color’ use grouping, for eg. (red|blue) color.

Constructing a pattern
Now, let’s construct a pattern which we could use in Regex Replace. Consider extracting email handle from an email address. Email ids are in the following format: <some_text>@<org_name>.<com>. Let’s split this format into three parts – first part, the email handle, second part, the ‘@’ symbol and third part, the domain. The first part could include alphanumeric characters with special characters like ‘_’ and ‘.’. So, the pattern for this part would be: ‘[A-z0-9_\.]+’. We can use the second part as it is. And, the third part is going to be alphabets with ‘.’, the pattern would be ‘[a-z\.]+’. Now, if we combine the three parts, we get the following pattern:

([A-z0-9_\.]+)@([a-z\.]+)

So, what’s new with Replace in OAC Data Preparation?

Let’s see what’s new in Replace functionality. For the demo purpose, I have created a project with a data set called Customer Profile, which has customer details like JOB_LEVEL, Email address, Phone number, Occupation, Education etc.


Earlier, in Data Preparation, there was just plain search and replace feature in the context menu of the columns. There was no option to specify a partial or entire match. Now, there are three options available in Replace page as radio buttons.

       a.   Match partial values
       b.   Match entire values only
       c.   Use regular expression


When option a is selected, matches are found in both partial and entire values of the text. Let’s see how ‘Match partial values’ work in the column Occupation. In the Prepare tab of the project, right-click on the column Occupation and select Replace.


You can see the following values in Occupation: 


Enter ‘Manual’ in String to replace and ‘Worker’ in New String. Select ‘Match partial values’. You will notice that both ‘Manual’ and ‘Skilled Manual’ are modified to ‘Worker’ and ‘Skilled Worker’ respectively.


Now, select ‘Match entire values only’ with same input. You will notice that only ‘Manual’ is modified to ‘Worker’ and ‘Skilled Manual’ remains the same. This option comes in handy when the requirement is specific to replace the entire values alone.


We will explore option c, ‘Use regular expression’ with rest of the examples.

Use of Alternation or ‘|’
In the column ‘ENGLISH_EDUCATION’ let’s remove the words ‘Partial’ and ‘Degree’ from the values.


Enter ‘Partial|Degree’ in String to replace, leave New String as blank and select ‘Use regular expression’. Now you will see that ‘Partial’ and ‘Degree’ are removed from the values like ‘Partial High School’, ‘Partial College’ and ‘Graduate Degree’ correspondingly.


You can use the ‘|’ symbol as the ‘OR’ operator to include more than one value to match. As you might have noticed, this eliminates multiple steps when more than one matches need to be replaced with a single value.

Obfuscation, using literal matches
Obfuscation is a significant step in Data Preparation, as it improves the privacy and security of sensitive content. Popular candidates for obfuscation could be Social security numbers, phone numbers, account numbers etc. With powerful features of regular expressions, it becomes easy to obfuscate in OAC. Let’s see how to do that with an example. In our data set, let's obfuscate the column Phone. If you notice the values of this column, it is divided into three parts. The first three digits within parenthesis, next three digits separated with a dot (.) from the last four.


Let’s replace the digits with ‘x’ mark. Enter ‘\d+’ in String to replace and enter ‘xxx’ in New String (‘[0-9]+’ and ‘\d+’ can be used interchangeably). Select Use regular expression and you will see all the numbers are replaced with ‘xxx’. Various patterns can be used according to the requirement.


Extraction with Back-reference
Let’s see how to extract specific part of the text using regular expression. Consider the column Email_address from our data set. Let’s extract the email handle from the email address.


As you might recall the pattern we constructed in the Regex example for the email address – ([A-z0-9_\.]+)@([a-z\.]+) can be divided into three parts. We’re interested in the first part, which is the email handle. Enter the pattern ‘([A-z0-9_\.]+)@([a-z\.]+)’ in String to replace and enter $1 in the New String and select Use regular expression. Now, you will just see the first part of the email address in the column. You may rename the column name as email_handle.


So, once we group and divide a text according to our need, we can use them in the output as back-references. This might be helpful where the data has multiple columns combined into one. For eg. IL-Chicago, where State and City combined into one column. Columns with log entries, timestamps etc could be interesting candidates for extraction using back-references.

Enriching data with Regular Expressions
When we want to enrich the existing data in a highly customized format, we can use grouping and back-references together to make a powerful combination to transform data in one shot. Let’s enrich the column JOB_LEVEL by inserting some text in between.


Similar to the example we saw in Grouping Patterns, JOB_LEVEL has two parts – alphabets first and digits next. The pattern ‘([A-Z]+)([0-9]+)’ can be used to match and split into two parts. Let's insert the text ‘_000’ between two parts. Enter ‘([A-Z]+)([0-9]+)’ in String to replace, ‘$1_000$2’ in New String and select Use regular expression. You will see the JOB_LEVEL is transformed into an enriched column. This could be helpful when the data is expected to be in a specific format.


What we have seen here in this blog is, just the tip of the ice-berg. A lot more can be done with Regular Expression Replace and it can be a powerful combination in OAC Data Preparation.

Thanks for reading this blog!

No comments:

Post a Comment