Thursday, September 20, 2018

Dynamic Heatmaps in Oracle Analytics

As the Operations head of a chain of multi-brand retail stores, if you are trying to visualize and identify areas with the highest population density or the highest economic activity in order to open new stores, heatmaps are extremely useful. What are heatmaps? A heatmap is a graphical representation of data that uses color-coding to represent different values. In this blog we will discuss the new heatmap visualization on maps in the latest version of Oracle Analytics.

Heatmaps have various applications and can take different forms. For example, they can be used to visualize variance across, and correlation between, variables in a dataset; to give a visual representation of user clicks on a website; or to depict the intensity of data at geographical points on a map visualization. Here is a quick snapshot of a heatmap visualization in Oracle Analytics that shows sales by city:

How to create heatmap-based map visualizations in Oracle Analytics Cloud?
Heatmaps can be created from point geometry based map layers. After visualizing data using a point geometry based map layer, set the Layer type to Heatmap.


Heatmaps in Oracle Analytics can be density based or metric based. In a density based heatmap, color intensity at a point depicts the density, i.e., the number of points, in a particular region of the map visualization. Density heatmaps are generated by populating the Category (Geography) grammar field with the geographic column and leaving the Color grammar field empty. In a metric based heatmap, color intensity at a point is driven by the aggregated value of a metric in that region. To visualize a metric based heatmap, populate the Color grammar field with a metric.

Users are provided multiple options to customize how data is visualized on a heatmap: the radius of each point, the color spectrum (i.e., whether points/areas with the highest density or metric value are displayed in orange, blue, or any other custom color), layer transparency, interpolation, etc. Here is a quick snapshot showing all the options:

The Interpolation option is specific to heatmap visualizations and influences the way points are rendered. For example, if Interpolation is set to Maximum, points with the maximum value are highlighted more prominently; similarly, with Minimum, points with the lowest value are emphasized to draw attention. Here is a heatmap visualization with Interpolation set to Maximum:

In addition to Cumulative, Maximum, and Minimum, the other options for Interpolation style are Average and Average (Constant).

Depending on the aggregation rule set for a metric, Oracle Analytics chooses an interpolation style automatically. For example, for columns whose aggregation rule is Sum, Cumulative is chosen as the interpolation style; similarly, for metrics whose aggregation rule is Maximum, Minimum, or Average, interpolation is automatically set to Maximum, Minimum, or Average respectively. Users can change these auto-assigned interpolation styles later. Another important aspect to note is that a heatmap does not treat individual points as discrete, so brushing, selection, and tooltips do not work for points on a heatmap.
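The rule-to-style defaults described above amount to a simple lookup. As a purely illustrative sketch (these names are hypothetical, not OAC internals):

```python
# Hypothetical mapping of a metric's aggregation rule to the interpolation
# style OAC pre-selects, per the behavior described above.
AGGREGATION_TO_INTERPOLATION = {
    "Sum": "Cumulative",
    "Maximum": "Maximum",
    "Minimum": "Minimum",
    "Average": "Average",
}

def default_interpolation(aggregation_rule: str) -> str:
    """Return the interpolation style pre-selected for a given aggregation rule."""
    # fall back to Cumulative for rules not covered by the blog's examples
    return AGGREGATION_TO_INTERPOLATION.get(aggregation_rule, "Cumulative")
```

Users can still override the pre-selected style afterwards, as noted above.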

To summarize, Oracle Analytics lets users visualize their data on a map using a heatmap, with a great many options to influence the way the data is rendered. This video demonstrates the various features of heatmaps in Oracle Analytics:

Friday, April 27, 2018

Multi-Layered Maps in Oracle Analytics

Visualizing multiple layers of business data on a map and spatially interacting with them has been a forte of specialized GIS tools, but a challenge for many market-leading BI and data visualization vendors.
The recent release of Oracle Analytics pushes its map data visualization capabilities to new heights to address such spatial analysis needs. In this blog we will take a brief tour of the multi-layered mapping capabilities included in the recent release of Oracle Analytics Cloud & Data Visualization.

Multi-layered maps architecture allows users to visualize data using multiple map layers in a single map visualization. It does that by overlaying location data on layers represented by different types of geometries (points, lines, and polygons) or layers with specialized rendering of those geometries (heatmap or clusters). 

One example need for multi-layered maps: the operations manager of a railway company wants to analyze traffic by geography, which can be city, zip code, county, state, or rail line. He/she could use multi-layered maps to visualize traffic by city using a point geometry map layer; by zip code, county, or state using a polygon geometry map layer; and by line using a line geometry map layer. Here is a quick preview of how such a visualization looks in OAC:

This visualization shows the (fictitious) passenger traffic data of the Washington DC metro by station (point geometry), by line (line geometry), and by zip code (polygon geometry).

The following video walks through these multi-layer maps:

Adding Data Layers on a Map

To render a map visualization, simply select a geography column (and optionally its associated metrics) and choose the Map visualization. The Map viz in DV looks up the most suitable map layer and displays the data on a map. For example, select zip name and passenger count and display them as a Map.
To add more layers of business data to the map visualization, click the Menu option above the Category (Geography) grammar field and select Add Layer. This adds a new map layer to the map visualization. A new layer can also be added from the map properties pane.


This menu also lets users manage map layers, i.e., Hide, Delete, or Order a layer. We will discuss ordering map layers later in this blog. Any number of layers can be added and managed using these options.

Once a layer is added, drag and drop the geography column to be visualized into the Category (Geography) field. The map layer corresponding to that geography field is auto-selected and visualized on the map.

Once all the required map layers are added to the map visualization, users can view the list of layers by clicking the drop-down list next to the Menu option. We have three map layers for DC Metro in the example shown above: 1) DC Metro Lines (line geometries), 2) DC Metro Stations (point geometries), 3) DC Zip Code (polygon geometries).

In the latest OAC/DV, the following layer types are supported in a map visualization:
1) Point Geometry layers
2) Line Geometry layers
3) Polygon Geometry layers
4) Heat maps

In this blog we will look at features related to point, line, and polygon geometry layers using the DC Metro example above. We will discuss heatmaps in detail in a separate blog.

Layer Properties

The Data Layers tab under map properties lets users manage the properties of each map layer, such as the name of the layer, the map layer, the layer type, the layer's transparency, and whether to show the layer. The Layers icon highlighted in red, sitting right next to the Globe icon, is the properties tab for multi-layered maps. Map layers added to this map visualization are listed here, and clicking the + icon adds more map layers.

Each map layer has a set of properties: 

      1) Name: By default OAC names the layer after the column names used in that layer. Users can change it to any custom name they want.
      2) Map Layer: The map layer is chosen automatically based on the match between the contents of the column in the Category (Geography) field and the key columns of the custom map layer. Users can also change the map layer and manually assign another layer that is a better match. Here is a quick snapshot that shows how to change a map layer:

      3) Layer type: Depending on the type of geometries defined in a map layer, users can change the layer type, which changes the way the layer is rendered on the map. For example, map layers with polygon geometries can be displayed either as polygons or as points, and map layers with point geometries can be displayed as either points or a heatmap.

      4) Transparency: Lets users choose the degree of transparency of this layer so that other co-existing layers are visible through it.

      5) Show Layer: Toggling Show Layer on/off shows/hides the layer on the map visualization. Another way to toggle layer display is to click the map layer icon in the legend. Here is a snapshot that shows this behavior:

      6) Visualization Grammar by Layer type: The fields available in the visualization grammar differ by map layer type.
a.   Category (Geography): This field represents the geographic columns that are matched against the key columns of the custom map layer. It can take multiple columns, and not all of them have to match key columns in the layer. The values of all these columns are shown in the tooltip displayed when hovering over a geometry.
b.   Color: This field takes either metric or attribute columns and assigns either gradient or distinct colors depending on the type of column.
c.   Tooltip: This field takes metric columns and displays their values in the tooltip shown when those data points are hovered over in the map visualization.
d.   Filters: Use this field to apply filters to your dataset.
e.   Size: This field takes only metric columns and, in the case of point geometry, adjusts the size of the bubble in proportion to the metric value.

      7) Legends: Each map layer has its own legend, and all legends are displayed on the map visualization with appropriate icons to indicate which map layer each legend is associated with.

Users can display these legends at the Top, Bottom, Left, or Right, or choose None or Auto. Selecting a layer's icon in the legend toggles the display of the corresponding layer.

Layer Reordering and Display Controls

Layer Ordering: The 'Order Layer' option in the Menu lets users reorder a layer. Here is a quick snapshot showing the options available under Order Layer:


Users can bring a layer all the way to the front (Bring to Front) or move it one step up, just above the layer currently on top of it (Bring Forward). Similarly, users can send a layer one step down, beneath the layer currently below it (Send Backward), or send it all the way to the bottom (Send to Back).
Users can also visually reorder map layers with a simple hold-and-slide in the Properties pane; this reorders the map layers in the map visualization.


Layer Display Controls: In addition to the toggle options described above, the Hide Layer option in the Menu lets users hide a map layer. Functionally this is the same as toggling the Show Layer button in the layer properties tab or in the legend. Users can also delete a map layer using the Delete Layer option; this permanently removes the map layer.


Tuesday, March 27, 2018

Custom Model Scripts for Oracle Analytics

In this blog post we will discuss how to use custom models in OAC. We will walk through the process of developing Python scripts, compatible with OAC, to train and apply a model using your own machine learning algorithm.

At the time of this blog, Oracle Analytics Cloud (OAC) ships with more than 10 machine learning algorithms, which fall under supervised (classification, regression) and unsupervised (clustering) learning. The inbuilt algorithms in OAC include CART, Logistic Regression, KNN, K-Means, Linear Regression, Support Vector Machine, and Neural Networks (for an exhaustive list of inbuilt algorithms in OAC, please refer to our earlier blog post). These inbuilt algorithms cover the majority of real-world business use cases. However, sometimes users face cases/datasets where they need to use a different algorithm to achieve their goal.

In such cases OAC lets users write their own scripts to train and apply a model using the algorithm of their choice. We need to develop two scripts for working with custom models: the first script trains the model, and the second scores/applies the model just trained. Like other custom advanced analytics (Python or R) scripts, the Train and Apply model scripts need to be embedded in XML format.

The Oracle Analytics Library has an example of a custom Train/Apply model using Support Vector Regression (SVR). Please refer to that sample to learn the required XML structure before proceeding further. We'll use that sample to walk through these script parts.


1) Capturing Parameters:

    Data is typically prepared and pre-processed before it is used to train a model. Pre-processing
    involves filling missing values, converting categorical values to numerical values (if needed),
    and standardizing the data. The way this pre-processing is done can influence the accuracy of a
    model to a good extent. In OAC, users are provided with parameters in the Train Model UI to
    choose/influence the methods used for pre-processing, and through the same UI users are given a
    number of options to tune their model. All the parameters sent from this user interface need to
    be captured before we start processing the data. For example, in the Train Model script for SVR
    the following snippet of code reads the parameters:

         ## Read the optional parameters into variables
         target = args['target']
         max_null_value_percent = float(args['maximumNullValuePercent'])
         numerical_impute_method = args['numericalColumnsImputationMethod']

2) Data Pre-processing (optional):

     Before we train a model, the data needs to be cleansed and normalized if necessary to get
     better predictions. In some cases the training data may already be cleansed, processed, and
     ready for training; if not, users can define their own functions to perform the cleansing or
     use the inbuilt methods in OAC. The following blog discusses in detail how to use the inbuilt
     methods in OAC to cleanse/prepare the data for training a model.

3) Train/Create Model:
     Now we are ready to actually train the model. The Train Model process can be sub-divided into
     two steps: 1) split the data to set aside a portion for testing the model, and 2) train the
     model and capture its performance/accuracy details.

    Train-test split: It is a good strategy to keep aside a randomized portion of the training data
    for testing. This portion of the data is used to evaluate the model's performance in terms of
    accuracy. The amount of data used for training versus testing is controlled by a user parameter
    called split, and there is an inbuilt method for performing this split in a randomized fashion
    so as to avoid bias or class-imbalance problems. The following snippet of code performs this
    train-test split:

        # split the data into train and test sets
        train_X, test_X, train_y, test_y = train_test_split(features_df, target_col, test_size=test_size)

    Train model: Now we have datasets ready for training and testing the model. It's time to train
    the model using the inbuilt train method for that particular algorithm; fit() is the inbuilt
    training method for most algorithms implemented in Python. The following snippet of code does
    that for the SVR algorithm:

        # construct the model and fit it to the training data
        svr = SVR(kernel=kernel, gamma=0.001, C=10)
        SVR_Model =, train_y)

4) Save Model:

    The model created in the previous step needs to be saved/persisted so that it can be accessed
    during the Apply/Scoring phase.
   Save model as a pickle object: Created models are saved/stored as pickle objects, and they are
   re-accessed during the Apply model phase using a reference name. There are inbuilt methods in
   OAC to save the model as a pickle object. The following snippet of code does this:

        # Save the model as a pickle object and create a reference name for it.
        d = base64.b64encode(pickle.dumps(pickleobj)).decode('utf-8')
   In this case SVRRegression is the reference name for the model created. The pickle does not
   have to contain just the model: other information and objects can also be saved in the pickle
   file. For example, if you wish to save additional flags or standardizer indexes along with the
   model, you can create a dictionary object containing the model and the flags/indexers and save
   the entire dictionary as a pickle object.
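A minimal sketch of that dictionary-in-a-pickle idea (the keys and contents here are hypothetical; OAC's own save/load helpers wrap the same base64/pickle round trip):

```python
import base64
import pickle

# bundle the model with any extra artifacts in one dictionary
# (the "model" here is a stand-in dict; in practice it is the fitted estimator)
artifacts = {"model": {"kernel": "rbf", "C": 10}, "standardizer_means": [0.5, 1.2]}

# serialize: pickle bytes -> base64 text, safe to store as a string
encoded = base64.b64encode(pickle.dumps(artifacts)).decode("utf-8")

# later, in the Apply script: base64 text -> pickle bytes -> original dictionary
restored = pickle.loads(base64.b64decode(encoded.encode("utf-8")))
```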

5) Add Related Datasets (optional): Now that we have the model, let us see how well it performs.
   In the previous step we set aside some part of the data for testing the model. Using that test
   dataset, we calculate some accuracy metrics and store them in related datasets. This is an
   optional step: in cases where users are confident about the model's accuracy, they can skip it.
   However, if users wish to view the accuracy metrics or populate them in the Quality tab, they
   can use the inbuilt methods in OAC to create the related datasets. More information on how to
   add these related datasets can be found in this blog: How to create Related datasets

6) Populate Quality Tab (optional): In the model's Inspect pane there is a tab called Quality,
    which visualizes the model's accuracy details. Users can view the Quality tab, evaluate the
    model's accuracy, and decide whether to tune the model further or use it for prediction. Note
    that this step is also optional. If users wish to view the model quality details in the Quality
    tab, they can use the inbuilt functions in OAC; more details on the inbuilt functions that
    populate the Quality tab can be found in this blog: How to Populate Quality Tab.

Now the model, along with its related datasets, is prepared and saved in a pickle object. This marks the end of the Train Model phase. The script returns the model to the framework, and it is stored in dataset storage. If the Train Model script executes successfully, you should find the model under the Machine Learning > Models tab. For example, I created a model called SVR Donations using the SV Regression scripts uploaded in the analytics-library:



The Apply script should have the same name as the Train script except for the 'train' part, i.e., it should follow the nomenclature OAC.ML.Algo_Name.apply.xml. The Apply script accepts the model name, other data pre-processing parameters, and user parameters as input. Most of the pre-processing steps are the same as in the Train Model script:

1) Capturing Parameters
2) Data Pre-processing: The same inbuilt methods can be used for cleansing (filling missing values), encoding, and standardizing the scoring data.

After the data is cleansed and standardized it can be used for Prediction/Scoring. 

Load Model and Predict:
    Using the reference name we gave the model in the Train script, retrieve the model's pickle object and predict the results for the cleansed and standardized scoring data. The following code in the SVR Apply model script does that:

       ## Load the pickle object saved during the Train Model phase. It is stored as an element
       ## in a dictionary; fetch it using the reference name given.
       pickleobj = pickle.loads(base64.b64decode(bytes(, 'utf-8')))

       ## Predict values and wrap them in a dataframe.
       y_pred = pickleobj.predict(features_df)
       y_pred_df = pd.DataFrame(y_pred.reshape(-1, 1), columns=['PredictedValue'])

If the includeInputColumns option is set to True, the framework appends the predicted results to the input columns and returns the complete dataframe.

This concludes the process of developing scripts for Train and Apply of Custom models.

Related Blogs: Prepare data using inbuilt functions in OAC, How to add Related Datasets, How to Populate Quality Tab

How to Populate Quality Tab in ML Model Inspect page in Oracle Analytics Cloud

In this blog post we will discuss how to populate the Quality tab of a machine learning model's Inspect page in OAC.

Assessing the quality of a machine learning model is an important step in evaluating its performance. Metrics like RSE, RAE, residuals, adjusted R-squared, etc. help assess the quality of the model's predictions. If the error metrics are not satisfactory or do not meet the user's goals, he/she can tune the model further until the required level of accuracy is reached. So it is important to expose this quality information in an intuitive and comprehensive fashion so that users can take the next course of action as necessary. The Quality tab in the Inspect page of an ML model in Oracle Analytics Cloud aims to visualize and give complete information on model accuracy. Here is a snapshot of the Quality tab for a linear regression model that predicts bike rental count:

In this blog we will talk about how to populate this Quality tab for a custom model. The Quality tab is populated by adding the required related datasets in the Train Model script; more details on how to add a related dataset can be found in this blog. The Quality tab is auto-populated if the user adds the following related datasets:

For numeric prediction: If Residuals and Statistics datasets are added as related datasets in the Train Model script, the ML framework in OAC takes information from them and populates the Quality tab. The graph is populated using the Residuals dataset, and all the error metrics are populated using the Statistics dataset. The image above shows what the Quality tab of a numeric prediction model looks like. Here is sample code that adds the Statistics and Residuals related datasets:

        #residuals dataset
        residuals_mappings = None
        residuals_ds = ModelDataset("Residuals", residuals_df, residuals_mappings)

        # statistics dataset
        statistics_mappings = None
        statistics_ds = ModelDataset("Statistics", statistics_df, statistics_mappings)
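The residuals_df and statistics_df dataframes referenced above can be built with plain pandas/NumPy; the columns shown here are illustrative, not a mandated OAC schema:

```python
import numpy as np
import pandas as pd

# toy predictions on a held-out test set
y_test = np.array([10.0, 12.0, 9.0, 14.0])
y_pred = np.array([9.5, 12.5, 9.0, 13.0])

# Residuals dataset: one row per test record
residuals_df = pd.DataFrame({"Predicted": y_pred, "Residual": y_test - y_pred})

# Statistics dataset: one row per error metric
mae = np.mean(np.abs(y_test - y_pred))
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
statistics_df = pd.DataFrame({"Metric": ["MAE", "RMSE"], "Value": [mae, rmse]})
```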

For classification: For binary and multi-class classification models, the Confusion Matrix and Statistics related datasets are used to populate the Quality tab. Here is a sample snapshot of the Quality tab for classification models:

The Confusion Matrix related dataset populates the confusion matrix table seen in the image, and the Statistics related dataset populates the metrics. Here is sample code that adds the Confusion Matrix and Statistics related datasets:
            metrics['Statistics'] = stats
            metrics['Confusion Matrix'] = confMatrix
            self.add_datasets(metrics=metrics, model=model)
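The stats and confMatrix objects passed above can be computed with scikit-learn; the DataFrame layout here is illustrative, not the exact OAC schema:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# toy predictions from a binary classifier on test data
y_true = ["yes", "no", "yes", "no", "yes"]
y_pred = ["yes", "no", "no", "no", "yes"]

# confusion matrix table: rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred, labels=["yes", "no"])
confMatrix = pd.DataFrame(cm, index=["actual_yes", "actual_no"],
                          columns=["pred_yes", "pred_no"])

# statistics table with one row per metric
stats = pd.DataFrame({"Metric": ["Accuracy"],
                      "Value": [accuracy_score(y_true, y_pred)]})
```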

The Quality tab is updated automatically if the user changes the values of user or model tuning parameters and retrains the model.

Related Blogs: How to build Train/Apply Model Custom Scripts in OAC, How to create Related Datasets, How to use inbuilt methods in OAC to Prepare data for Training/Applying ML Model

Pre-Processing and Preparing Data for ML Predictions in OAC

In this blog post we will talk about how to use the inbuilt methods in OAC to cleanse and prepare the data used for training a machine learning model.

One of the important steps in Training a Machine Learning model is to cleanse and prepare the data that we are going to use to train the model. What exactly do we mean by "cleanse and prepare the data":

  "It is the process of detecting and correcting (or removing) corrupt or inaccurate records from a
    record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or
    irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data."

It is important to cleanse and prepare data because, if we train a model with missing column values or anomalies/outliers in the data (which can be garbage data or genuine outliers), the model's prediction accuracy can go awry. To solve this problem, the machine learning feature in Oracle Analytics Cloud provides inbuilt methods for data cleansing/preparation, which can be invoked in Train Model scripts.

    In this blog post we will learn how to cleanse/prepare training data in a custom Train Model script using the inbuilt methods in machine learning in Oracle Analytics Cloud. The data cleansing/preparation process can be broken down into three steps: data imputation (filling missing values), encoding (converting categorical to numerical values if necessary), and standardization (normalization). All the functions needed to perform these operations are implemented in a Python module called datasetutils within OAC. Users can develop their own functions or use the existing module. Here is a snapshot of the parameters accepted from the UI for each of these operations while training a model:


Here is a brief description of each of the data preparation functions:

 1) Data Imputation: Data imputation is the process of filling missing values. There are multiple
     inbuilt imputation methods, and users are given the option to choose the imputation method for
     both numerical (Mean, Median, Min, Max) and categorical (Most Frequent, Least Frequent)
     variables. The datasetutils Python module in OAC contains a function called fill_na() that
     performs data imputation. It accepts the imputation methods as parameters and returns a
     dataframe with data imputed for categorical and numerical columns. The following snippet shows
     a sample usage of the fill_na() function:

     # fill nan columns with mean/max/min/median values for numeric columns and with the
     # most frequent/least frequent value for categorical columns
     df = datasetutils.fill_na(df, max_null_percent=max_null_value_percent,
                 numerical_impute_method=numerical_impute_method,
                 categorical_impute_method=categorical_impute_method)

 2) Encoding: Encoding is the process of converting categorical variables to numerical values. This
      is usually required in cases where regression needs to be performed. There are two inbuilt
      encoding methods: Onehot and Indexer. The features_encoding() function in the datasetutils
      module performs encoding. The following snippet of code performs encoding:

      # encode categorical features
      data, input_features, categorical_mappings = datasetutils.features_encoding(
          df, target, encoding_method=encoding_method, type="regression")

  3) Standardization: Standardization is the process of normalizing the data to reduce the effect
      of skews introduced by outliers. The standardize_clean_data() function in the datasetutils
      module is the inbuilt OAC method for standardization. The following sample snippet performs
      standardization and returns a dataframe with normalized data:

      # Standardize the data so as to reduce the influence of outliers
      target_col = data[[target]]
      if standardization_flag:
            features_df = datasetutils.standardize_clean_data(data, input_features)
      else:
            features_df = data[input_features]
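Taken together, the three steps can be sketched end to end in plain pandas/scikit-learn (datasetutils is internal to OAC; this equivalent, with made-up column names, is purely illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy dataset with one numeric and one categorical column, both with gaps
df = pd.DataFrame({"age": [25.0, None, 40.0], "city": ["NY", "LA", None]})

# 1) Imputation: mean for the numeric column, most frequent value for the categorical one
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# 2) Encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["city"])

# 3) Standardization: rescale the numeric column to zero mean and unit variance
df[["age"]] = StandardScaler().fit_transform(df[["age"]])
```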

All these functions can be invoked directly from custom Train Model scripts by importing the datasetutils module at the beginning of the script.

Related Blogs: How to build Train/Apply custom model scripts in OAC, How to Create Related Datasets, How to Populate Quality Tab

How to Create Related Datasets for ML Models in Oracle Analytics Cloud

In this blog post we will discuss how to add related datasets for custom machine learning models in OAC.

After a machine learning model is created, it is important to evaluate how well it performs before we go ahead and apply it. Various accuracy metrics exist for this purpose: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Relative Absolute Error (RAE), residuals, etc. for numeric prediction/regression algorithms, and False Positive Rate (FPR), False Negative Rate, the confusion matrix, etc. for classification algorithms. The machine learning feature in Oracle Analytics Cloud has inbuilt methods to compute most of these accuracy metrics and store them in related datasets. Related datasets are the tables/datasets which contain information about the model, such as accuracy metrics and prediction rules. Our previous blog post, Understanding the Performance of Oracle DV Machine Learning models using the related datasets feature, covers related datasets in depth.
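The regression metrics named above follow directly from their definitions. For example, with toy values and NumPy only:

```python
import numpy as np

# toy actuals and predictions from a regression model
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 8.5])

# Mean Absolute Error: average magnitude of the errors
mae = np.mean(np.abs(y_true - y_pred))

# Root Mean Squared Error: penalizes large errors more heavily
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Relative Absolute Error: total absolute error relative to a
# predict-the-mean baseline (lower is better; < 1 beats the baseline)
rae = np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - y_true.mean()))
```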

     In this blog post we will talk about how to add such related datasets in custom Train Model code. There are inbuilt methods in Oracle Analytics Cloud to add related datasets. The user has to define the structure of these datasets, i.e., the columns they should contain, the data type of each column, and the aggregation rules for numeric columns. Once the required related datasets are added, they can be found under the Related tab in the model's Inspect pane:

Let us discuss in detail how to add Related datasets for a model:

The ModelDataset() class implemented in the Model module is a generic class that represents related datasets. Pass the name of the dataset you are creating as an argument to ModelDataset(), along with the dataframe and column mappings; this returns a related dataset. The generic Model class has an inbuilt method called add_output_dataset() which registers the passed dataset/dataframe as a related dataset for that model. The following lines of code show how to add a sample related dataset called "Predicted Results" using the df1 dataframe:

   df1=pd.DataFrame({target:y_test, \
                     "PredVal":y_pred1, \
                     "PredProb":[y_pred_prob1[i][list(clf1.classes_).index(y_pred1[i])] for i in range(len(y_pred1))]})

   df1_mappings = pd.DataFrame({
               'name':[target,'PredVal','PredProb','Target','Model Name'],
               'datatype':["varchar(100)","varchar(100)","double","varchar(100)","varchar(2000)"],
               'aggr_rule':["none","none","avg","none","none"]})

 model.add_output_dataset(ModelDataset("Predicted Results", df1, df1_mappings))

The mappings dataframe maps column names to their corresponding data types and aggregation rules. Some of these related datasets are used by the framework to populate the Quality tab in the model's Inspect page. More details on how to populate the Quality tab can be found in this blog: How to populate Quality tab.

Related Blogs: How to build Train/Apply custom model scripts in OAC, How to Populate Quality Tab, How to use inbuilt methods in OAC to Prepare data for Training/Applying ML Model

Friday, November 24, 2017

Measure effectiveness of your Marketing Campaign using Oracle DV ML

In this blog we will talk about the cumulative gains chart and lift chart created in Oracle Data Visualization for binary classification ML models, and how these charts are useful for evaluating the performance of a classification model.

What are Cumulative Gains & Lift charts and what are they used for?
Suppose a company wants to run a direct marketing campaign to get a response (such as a subscription or a purchase) from users. It plans to target around 10,000 users, of which only 1,000 are expected to respond, but it does not have the budget to reach all 10,000 customers. To minimize cost, the company wants to contact as few customers as possible while still reaching most of the customers who are likely to respond. The company can build ML models to predict which users are likely to respond and with what probability. Then the question becomes: which model should I choose? Which ML model is likely to find the most respondents from the smallest selection of the population? Cumulative gains and lift charts answer these questions.

Cumulative gains and lift charts measure the effectiveness of a binary classification predictive model, calculated as the ratio between the results obtained with and without the model. They are visual aids for measuring model performance and contain a lift curve and a baseline. The effectiveness of a model is measured by the area between the lift curve and the baseline: the greater the area, the better the model. One academic reference on how to construct these charts can be found here. Gains & lift charts are popular techniques in direct marketing.
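The construction can be sketched in pandas: sort the scored records by prediction confidence, then accumulate the share of actual responders at each population depth (column names here are illustrative):

```python
import pandas as pd

# toy scored records: actual outcome plus model confidence for "responds"
scored = pd.DataFrame({
    "actual":     [1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
    "confidence": [0.9, 0.8, 0.75, 0.7, 0.65, 0.5, 0.45, 0.4, 0.3, 0.2],
})

# sort by confidence so the "best" prospects come first
scored = scored.sort_values("confidence", ascending=False).reset_index(drop=True)

# cumulative gains: share of all responders captured at each population depth
scored["pct_population"] = (scored.index + 1) / len(scored)
scored["cumulative_gain"] = scored["actual"].cumsum() / scored["actual"].sum()

# lift: how many times better than a random sample at the same depth
scored["lift"] = scored["cumulative_gain"] / scored["pct_population"]
```

With these toy numbers, contacting the top 10% of prospects already captures a quarter of all responders, which is exactly what the lift value at that depth expresses.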

Sample Project for Cumulative Gains and Lift chart computation
The Oracle Analytics Store has an example project for this, built using the marketing campaign data of a bank. This is how the charts look:

Scenario: This marketing campaign aims to identify users who are likely to subscribe to one of the bank's financial services. The campaign targets close to 50,000 individuals, of whom only about 5,000, i.e., ~10%, are likely to subscribe to the service. The campaign data is split into training and testing data. Using the training data, we created a binary classification ML model using Naive Bayes to identify the likely subscribers along with a prediction confidence (note that the actual values, i.e., whether a customer actually subscribed or not, are also available in the dataset). Now we want to find out how good the model is at identifying the largest number of likely subscribers while selecting a relatively small portion of the campaign base (i.e., of the 50,000).

The ML model is applied to the test data to get the predicted value and prediction confidence for each record. This prediction data and the actual outcome data are used in a data flow to compute the cumulative gain and lift values.

How to interpret these charts and measure the effectiveness of a model:
The cumulative gains chart plots the cumulative percentage of actual subscribers (Cumulative Actuals) on the Y-axis against the total population (50,000) on the X-axis, in comparison with a random prediction (Gains Chart Baseline) and an ideal prediction (Gains Chart Ideal Model Line), in which all 5,000 likely subscribers are identified by selecting the first 5,000 customers sorted by the prediction confidence for Yes. What the Cumulative Actuals line says is that by the time we have covered 40% of the population we have already identified 80% of the subscribers, and by reaching close to 70% of the population we have 90% of the subscribers. When comparing one model with another using the cumulative gains chart, the model with the greater area between its Cumulative Actuals line and the baseline is more effective at identifying a larger portion of subscribers from a relatively smaller portion of the total population.

The lift chart depicts how much more likely we are to reach respondents than if we contacted a random sample of customers. For example, by contacting only 10% of customers based on the predictive model, we reach 3.20 times as many respondents as with no model.

Max Gain shows the point at which the difference between cumulative gains and the baseline is at its maximum. For the Naive Bayes model this occurs at a population percentage of 41%, with a maximum gain of 83.88%.

How to compare two models using cumulative gains and lift charts in Oracle DV:
To compare how well two ML models have performed, use the Lift Calculation data flow (included in the .dva project) as a template and plug the output of the Apply Model data flow in as the data source/input to the flow. Add the output dataset of the Lift Calculation flow to the same project and add its columns to the same charts shown above to compare. Note that the data flow expects the dataset to contain these columns: ID, ActualValue, PredictedValue, PredictionConfidence. This is how it looks when we compare two models using the same visualizations: