Introduction
Oracle Analytics enables data analysts to train Machine Learning (ML) models and score their datasets. It offers several algorithms for building different types of ML models, and each algorithm comes with a list of hyper-parameters that control the model training process. All of these parameters can be manually tuned to improve the overall model design and accuracy. Understanding these hyper-parameters is critical to quickly arriving at the most accurate predictive model.
Logistic Regression is one such ML algorithm within Oracle Analytics, used to perform binary classification. In this blog, let's understand the parameters used to create a Logistic Regression model.
Logistic Regression Hyper-Parameters
Following are the hyper-parameters available for Logistic Regression for Binary Classification. Let's understand what each of these means.
Target
The column whose values we are predicting.
Positive Class
This parameter allows us to specify the value of the Target column that is interpreted as the positive class. These values vary by dataset: they could be "Yes"/"No", "1"/"0", or some other pair of binary values.
Predict Value Threshold %
Logistic Regression predicts values from 0 to 1, which in turn are classified into one of the output classes. In other words, values closer to 0 are classified as the Negative Class and values closer to 1 as the Positive Class. Predict Value Threshold % allows us to specify the cutoff at which predicted values are classified into one of the two classes. For example, with a threshold of 50%, any value scoring above 0.5 is classified as the Positive Class and below 0.5 as the Negative Class.
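The classification step amounts to a simple comparison. A minimal Python sketch (the variable names are illustrative, not Oracle Analytics APIs):

```python
# Minimal sketch of how Predict Value Threshold % works.
probabilities = [0.12, 0.47, 0.55, 0.73, 0.91]  # raw model scores in [0, 1]
threshold = 0.50                                # 50% threshold

# Scores above the threshold map to the Positive Class, the rest to Negative.
classes = ["Positive" if p > threshold else "Negative" for p in probabilities]
print(classes)  # ['Negative', 'Negative', 'Positive', 'Positive', 'Positive']
```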
Column Imputation [Numerical, Categorical]
This parameter allows us to specify how to handle NA or NULL values in the dataset. When columns contain NA/NULL values, we may want to impute them with valid values. For numeric columns, NULL values can be replaced with the Mean (default), Maximum, Minimum, or Median of that column. For categorical columns, NULL/NA values can be replaced with the most frequent or least frequent value in the column.
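As an analogy, here is how similar imputation choices look in scikit-learn; note that scikit-learn's SimpleImputer only covers mean, median, and most-frequent strategies, so minimum/maximum/least-frequent imputation would need custom logic:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# A scikit-learn sketch of the imputation options described above
# (Oracle Analytics performs this internally during training).
df = pd.DataFrame({"age": [22.0, None, 35.0, 58.0],
                   "region": ["Asia", "Europe", None, "Asia"]})

df[["age"]] = SimpleImputer(strategy="mean").fit_transform(df[["age"]])
df[["region"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["region"]])
print(df)  # NULLs replaced by the column mean / most frequent value
```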
Categorical Encoding Method
In order to perform Logistic Regression, all categorical variables must be encoded numerically by the algorithm. Two methods are available to do this encoding (a short sketch contrasting them follows the list):
a) Indexer: In this method, input categories are indexed as integers. For example, suppose the input is a categorical variable, say Region, with values Africa, Asia, Europe, and North America. Each of these values is coded as an integer: Africa - 1, Asia - 2, Europe - 3, and North America - 4. Further processing is performed on these encoded numerical values.
b) Onehot: In this method, each category is converted into a column of 0s and 1s. For example, with the same Region variable, Region becomes 4 columns: Region_Africa (1 for Africa, 0 otherwise), Region_Asia, Region_Europe, and Region_North_America, with 1s and 0s as encoded values.
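To make the contrast concrete, here is a small pandas sketch of both encodings (an analogy to, not the implementation of, the Oracle Analytics options):

```python
import pandas as pd

region = pd.Series(["Africa", "Asia", "Europe", "North America"])

# Indexer: each category becomes a single integer code.
indexed = region.astype("category").cat.codes + 1
print(indexed.tolist())  # [1, 2, 3, 4]

# Onehot: each category becomes its own 0/1 column.
print(pd.get_dummies(region, prefix="Region"))
```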
Number of K Folds
Cross-validation is a re-sampling procedure used to evaluate machine learning models on a limited data sample. It is a technique used to test and improve the effectiveness of the model. The procedure has a single parameter, K, that refers to the number of groups a given data sample is split into. This parameter specifies the number of folds used in K-fold validation. Splitting the data into K groups reduces the bias in the model. Too large or too small a K can produce an ineffective ML model, so K is usually suggested to be between 5 and 10. This parameter defaults to 5 in Oracle Analytics.
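A scikit-learn sketch of 5-fold cross-validation on synthetic data, mirroring the default K of 5:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Evaluate a logistic regression model with K=5 folds; each fold serves
# once as the validation set while the other four train the model.
X, y = make_classification(n_samples=750, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average accuracy across the 5 folds
```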
Train Partition Percent
During the model creation process, the input dataset is split into two parts to train and test the model. This parameter specifies the percentage of the data used to train the model; the remainder is used for internal testing. The model uses the test portion of the dataset to measure the accuracy of the model that is built. The default value is 80, meaning 80% of the data is used to train the model and 20% to test its accuracy.
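A quick sketch of the default 80/20 partition, here on a 750-row synthetic dataset (matching the size of the case-study dataset below):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Split 750 rows with Train Partition Percent = 80.
X, y = make_classification(n_samples=750, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, random_state=42)
print(len(X_train), len(X_test))  # 600 rows to train, 150 to test
```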
Standardization of Data
This parameter is used to standardize the data. A dataset can have metrics with different scales, which can affect the model training process. In such cases, you can standardize the data by setting this parameter to True.
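A short sketch of what standardization does to two differently-scaled columns:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Standardization rescales each column to mean 0 and standard deviation 1,
# so differently-scaled metrics contribute comparably during training.
X = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92]])  # e.g. age, fare
print(StandardScaler().fit_transform(X))
```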
With this understanding of the parameters, let us create a logistic regression model and see how to tune it by changing the parameters.
Case Study
To begin, let's take a look at a sample dataset. We will use the Titanic dataset (modified for this blog), which contains a list of 750 Titanic passengers and some of their details. Each row in the dataset represents one person. The goal of this exercise is to predict whether a passenger survived the disaster, using Logistic Regression.
Let us add the dataset to a dataflow. Next, add a Train Binary Classifier step.
Select Logistic Regression
We see the parameters available for this algorithm. Let us understand these parameters and their impact on the model by going through the model building process over a few iterations.
Iteration 1
First, let us select the Target column as 'Survived' and the positive class as '1'. Let us leave all other parameters at their defaults and run the dataflow.
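For readers who want a programmatic mental model, here is a rough scikit-learn analogy of this first iteration. The file name and column names are assumptions based on the standard Titanic dataset; Oracle Analytics performs the equivalent steps inside the dataflow:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic.csv")                      # hypothetical file
df["Sex"] = df["Sex"].astype("category").cat.codes   # Indexer-style encoding
df["Age"] = df["Age"].fillna(df["Age"].mean())       # default mean imputation
X, y = df[["Pclass", "Sex", "Age", "Fare"]], df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, random_state=42)          # default 80/20 partition
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))                   # test-set accuracy
```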
Once the dataflow completes successfully, let us inspect the model details by navigating to the Machine Learning tab and clicking the Model -> Inspect option.
Click on the Quality tab
Now we have the model from iteration 1 with the following statistics (a quick arithmetic check follows the list).
- Model accuracy is 80% - (True Positive + True Negative)/Total = (39+81)/150
- Precision is 70% - (True Positive)/(True Positive+False Positive) = 39/(39+17)
- Recall is 75% - (True Positive)/(True Positive+False Negative) = 39/(39+13)
- False Positive Rate is 17% - (False Positive)/(Actual Number of Negatives) = 17/98
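As a quick check, all four statistics can be reproduced from the confusion-matrix counts:

```python
# Reproducing the statistics above from the four confusion-matrix counts.
tp, tn, fp, fn = 39, 81, 17, 13

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # 120/150  = 0.80
precision = tp / (tp + fp)                   # 39/56   ~= 0.70
recall    = tp / (tp + fn)                   # 39/52    = 0.75
fpr       = fp / (fp + tn)                   # 17/98   ~= 0.17
print(accuracy, precision, recall, fpr)
```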
Iteration 2
Let us go back to the dataflow and change certain parameter values. Let us change the Predict Value Threshold % from the default of 50% to 52% and rebuild the model by executing the dataflow. On inspecting the model once again, we see Model Accuracy improve to 81%, with changes in Precision, Recall, and the False Positive Rate.
Iteration 3
Let us change the Predict Value Threshold % from 52% to 66% and rebuild the model. Now the Model Accuracy improves to 84%. Notice that Precision and Accuracy are close to each other and the False Positive Rate is small. This is a reasonably good model.
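To mimic iterations 2 and 3 programmatically, here is a sketch that sweeps the threshold over synthetic data (the exact numbers will differ from the Oracle Analytics run):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Score the held-out data once, then reclassify at each candidate threshold.
X, y = make_classification(n_samples=750, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.80, random_state=42)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for threshold in (0.50, 0.52, 0.66):
    preds = (probs >= threshold).astype(int)
    print(threshold, accuracy_score(y_te, preds))
```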
Iteration 4
Let us change the categorical encoding method from Indexer to Onehot and rebuild the model. The quality of the model does not change much because the number of categorical values is low. Datasets with a large number of categorical variables will benefit more from changing this parameter.
Iteration 5
Let us set Standardization of Data = True and rebuild the model. This changes the model statistics slightly: Precision and the False Positive Rate improve.
Iteration 6
Let us change the Number of K Folds to 7. We can see that the Accuracy and Precision numbers become equal, and Accuracy is slightly reduced. Increasing the number of cross-validation folds usually creates a more balanced model; however, this can vary by dataset and model scenario. Refer to the statistical literature on cross-validation to arrive at the right setting for your data.
After a few iterations, we arrive at a satisfactory model. We can then apply this model, with the same parameters, to a new dataset and perform the scoring process.
Conclusion
Every Machine Learning algorithm in OAC comes with a set of model-training hyper-parameters that can be tuned to improve the overall accuracy of the model created. Given that model building via dataflows is a fairly simple and intuitive process, it becomes easy to iteratively change hyper-parameter values, inspect the model quality, and arrive at a model of the desired accuracy in a reasonable amount of time.