Wednesday, March 31, 2021

Logistic Regression : Understanding model hyper-parameters

Introduction

Oracle Analytics enables data analysts to train Machine Learning (ML) models and score their datasets. It offers several algorithms that help to build different types of ML models, and with each algorithm comes a list of hyper-parameters to control the model training process. All these parameters can be manually tuned to improve the overall model design and accuracy. Understanding these hyper-parameters is critical to quickly arriving at the most accurate predictive model.

Logistic Regression is one such ML algorithm within Oracle Analytics, used to perform binary classification. In this blog, let's understand the parameters used in the creation of a Logistic Regression model.

Logistic Regression Hyper-Parameters

Following are the hyper-parameters available for Logistic Regression for Binary Classification. Let's understand what each of these means.

       

Target

The column which we are predicting.

Positive Class

This parameter allows us to specify the value of the Target column that is interpreted as the positive class. The available values vary by dataset: the Target column could contain "Yes" and "No", "1" and "0", or some other pair of binary values.

Predict Value Threshold %

Logistic Regression predicts values from 0 to 1, which in turn are classified as one of the two output classes. In other words, values closer to 0 are classified as the Negative Class and values closer to 1 as the Positive Class. Predict Value Threshold % allows us to specify the cut-off at which the predicted values are assigned to one class or the other. For example, with a threshold of 50%, any value scoring above 0.5 is classified as 'Positive Class' and any value below 0.5 as 'Negative Class'.
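To make the cut-off concrete, here is a minimal Python sketch of the thresholding step. It is only an illustration, not how Oracle Analytics implements it: the function name `classify`, the sample scores and the class labels are all hypothetical, and treating a score exactly at the cut-off as negative is an assumption.

```python
def classify(probabilities, threshold_percent=50, positive="1", negative="0"):
    """Map predicted values in [0, 1] to class labels using a percentage cut-off."""
    cutoff = threshold_percent / 100.0
    # Assumption: scores strictly above the cut-off become the positive class.
    return [positive if p > cutoff else negative for p in probabilities]


# Hypothetical predicted values for five rows.
scores = [0.12, 0.48, 0.51, 0.70, 0.93]

labels_50 = classify(scores, threshold_percent=50)
labels_66 = classify(scores, threshold_percent=66)

print(labels_50)
print(labels_66)
```

Raising the threshold from 50% to 66% moves the row scored 0.51 from the positive class to the negative class, which is why changing this parameter shifts Precision, Recall and the False Positive Rate.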

Column Imputation [Numerical, Categorical]

This parameter allows us to specify how to handle NA or NULL values in our dataset. When we have columns with NA/NULL values, we may want to impute those columns with valid values. For numeric columns, NULL values can be replaced with the Mean (default), Maximum, Minimum or Median value of that column. For categorical columns, NULL/NA values can be replaced with the most frequent or the least frequent item.
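The imputation options described above can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `impute_numeric` and `impute_categorical`, the method strings and the sample columns are hypothetical, and `None` stands in for NA/NULL.

```python
from collections import Counter
from statistics import mean, median


def impute_numeric(values, method="mean"):
    """Replace None in a numeric column with a statistic of the non-null values."""
    present = [v for v in values if v is not None]
    statistic = {"mean": mean, "median": median, "minimum": min, "maximum": max}[method]
    fill = statistic(present)
    return [fill if v is None else v for v in values]


def impute_categorical(values, method="most_frequent"):
    """Replace None in a categorical column with its most or least frequent item."""
    ranked = Counter(v for v in values if v is not None).most_common()
    # Ties are broken by first appearance here; that is an assumption of this sketch.
    fill = ranked[0][0] if method == "most_frequent" else ranked[-1][0]
    return [fill if v is None else v for v in values]


# Hypothetical columns with missing values.
ages = [22, None, 38, 30, None, 54]
ports = ["S", "C", None, "S", "Q"]

print(impute_numeric(ages))            # fills with the mean of 22, 38, 30, 54
print(impute_numeric(ages, "median"))  # fills with the median of the same values
print(impute_categorical(ports))       # fills with the most frequent item
```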

Categorical Encoding Method

In order to perform Logistic Regression, all categorical variables need to be encoded numerically by the algorithm. Two methods are available to do this encoding:

a) Indexer: In this method, input categories are indexed. For example, suppose the input is a categorical variable, say Region, with values Africa, Asia, Europe and North America. Each of these values is coded as an integer: Africa - 1, Asia - 2, Europe - 3 and North America - 4. Further processing is performed on these encoded numerical values.

b) Onehot: In this method, each category is converted into a column of 0s and 1s. Taking the same Region variable with values Africa, Asia, Europe and North America, Region becomes 4 columns: Region_Africa (1 if the value is Africa, 0 otherwise), Region_Asia, Region_Europe and Region_North_America, each with 1s and 0s as encoded values.
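The two encodings can be compared side by side with a short Python sketch using the same Region example. The function names are hypothetical, and assigning index codes in alphabetical order is an assumption made here so that the result matches the example above; the order the product actually uses is not stated.

```python
def index_encode(values):
    """Indexer: map each distinct category to an integer code starting at 1."""
    categories = sorted(set(values))  # assumption: alphabetical order
    mapping = {category: code for code, category in enumerate(categories, start=1)}
    return [mapping[v] for v in values], mapping


def onehot_encode(values, prefix):
    """Onehot: build one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category.replace(' ', '_')}": [1 if v == category else 0 for v in values]
        for category in categories
    }


regions = ["Asia", "Africa", "North America", "Europe", "Asia"]

codes, mapping = index_encode(regions)
columns = onehot_encode(regions, "Region")

print(mapping)
print(codes)
for name, column in columns.items():
    print(name, column)
```

Note the design difference: the index codes impose an ordering (Africa < Asia < Europe < North America) that has no real meaning, whereas Onehot avoids that at the cost of one extra column per category.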

Number of K Folds

Cross-validation is a re-sampling procedure used to evaluate machine learning models on a limited data sample. It is a technique used to test and improve the effectiveness of the model. The procedure has a single parameter, K, that refers to the number of groups (folds) that the data sample is split into, and this parameter sets that number. Each fold in turn is held out for validation while the model is trained on the remaining K-1 folds, which reduces the bias that comes from relying on a single split. A K that is too large or too small can produce ineffective ML models, so it is usually suggested that K be between 5 and 10. This parameter defaults to 5 in Oracle Analytics.
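As a rough illustration of how a sample is split into K folds, the following Python sketch generates the row indices for each round. It is a generic K-fold split, not the product's internal procedure: the function name is hypothetical, and rows are taken in order without shuffling to keep the example easy to follow.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_rows))
    # Spread any remainder so that fold sizes differ by at most one row.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_rows=10, k=5))

for round_number, (train, validation) in enumerate(folds, start=1):
    print(f"round {round_number}: validate on {validation}, train on {len(train)} rows")
```

Every row is used for validation exactly once and for training K-1 times, which is what makes the resulting quality estimate less dependent on one particular split.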

Train Partition Percent

During the model creation process, the input dataset is split into two parts to train and test the model. This parameter specifies the percentage of the data that is used to build the model; the remainder is used for internal testing. The model uses the test portion of the dataset to measure the accuracy of the model that is built. The default value is 80, which means 80% of the data is used to train the model and 20% to test model accuracy.
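A minimal Python sketch of such a split is shown below. The function name and the fixed seed are hypothetical, and shuffling the rows before cutting is an assumption; the blog does not say how rows are assigned to each part.

```python
import random


def train_test_split(rows, train_percent=80, seed=42):
    """Split rows into a training part and a test part by percentage."""
    shuffled = rows[:]
    # Assumption: rows are shuffled so both parts are a random sample.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for a 750-row dataset such as the one used in the case study.
passengers = list(range(750))
train, test = train_test_split(passengers, train_percent=80)

print(len(train), len(test))
```

With 750 rows and the default of 80, this leaves 600 rows for training and 150 for testing, which is consistent with the total of 150 in the quality statistics of the case study.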

Standardization of Data

This parameter is used to standardize the data. The dataset could have metrics with different scales, and that could impact the model training process. In such cases, you can standardize the data by setting this parameter to True.
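One common way to put metrics on the same scale is z-score standardization, sketched below in Python. The blog does not state which method Oracle Analytics applies, so treat the formula, the function name and the sample values as assumptions for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a numeric column to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical columns on very different scales.
fares = [7.25, 71.28, 8.05, 53.10]
ages = [22, 38, 26, 35]

scaled_fares = standardize(fares)
scaled_ages = standardize(ages)

print([round(v, 2) for v in scaled_fares])
print([round(v, 2) for v in scaled_ages])
```

After standardization both columns are centred on 0 with unit spread, so neither dominates the training process simply because its raw numbers are larger.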

With this understanding of the parameters, let us create a Logistic Regression model and see how to tune it by changing the parameters.

Case Study

To begin with, let's take a look at a sample dataset. We will use the Titanic dataset (modified for this blog), which contains a list of 750 Titanic passengers and some of their details. Each row in the dataset represents one person. The goal of this exercise is to use Logistic Regression to predict whether or not a passenger survived the disaster.


Let us add the dataset to a dataflow. Next, add the Train Binary Classifier step.

 

Select Logistic Regression 

 

We see the parameters available for this algorithm. Let us understand these parameters and their impact on the model by going through the model building process over a few iterations.

Iteration 1

First, let us select 'Survived' as the Target column and '1' as the Positive Class. Let us leave all the other parameters at their defaults and then run the dataflow.

 

Once the dataflow completes successfully, let us inspect the model details by navigating to the Machine Learning tab and clicking on the Model->Inspect option.


Click on the Quality tab

 

Now we have the model from iteration 1 with the following statistics. 

  • Model accuracy is 80% - (True Positive + True Negative)/Total = (39+81)/150
  • Precision is 70%   - (True Positive)/(True Positive+False Positive) = 39/(39+17)
  • Recall is 75%   - (True Positive)/(True Positive+False Negative) = 39/(39+13)
  • False Positive Rate is 17% - (False Positive)/(Actual Number of Negatives) = 17/98
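The arithmetic behind these four statistics can be checked with a short Python sketch. The confusion-matrix counts (39 true positives, 81 true negatives, 17 false positives, 13 false negatives) are the ones used in the formulas above; the function name is hypothetical.

```python
def quality_metrics(tp, tn, fp, fn):
    """Compute the four quality statistics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        # Actual negatives are the true negatives plus the false positives.
        "false_positive_rate": fp / (fp + tn),
    }


metrics = quality_metrics(tp=39, tn=81, fp=17, fn=13)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Rounded to whole percentages, the results agree with the figures listed above: 80% accuracy, 70% precision, 75% recall and a 17% false positive rate.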

Iteration 2

Let us go back to the dataflow and change certain parameter values. Let us change the Predict Value Threshold % from its default value of 50% to 52% and rebuild the model by executing the dataflow. On inspecting the model once again, we see Model Accuracy improve to 81%, along with changes in Precision, Recall and False Positive Rate.

Iteration 3

Let us change the Predict Value Threshold % from the previous value of 52% to 66% and rebuild the model. Model Accuracy now improves to 84%. Notice that Precision and Accuracy are close to each other and the False Positive Rate is small. This is a reasonably good model.

 

Iteration 4 

Let us change the Categorical Encoding Method from Indexer to Onehot and rebuild the model. The quality of the model does not change much, as the number of category values in this dataset is low. Datasets with a large number of categorical variables are more likely to benefit from changing this parameter.

 

Iteration 5

Let us set Standardization of Data to True and rebuild the model. This changes the model statistics slightly: Precision and False Positive Rate improve.

Iteration 6

Let us change the Number of K Folds to 7. We can see that the Accuracy and Precision numbers become equal and Accuracy is slightly reduced. Usually, increasing the number of cross-validation folds creates a more balanced model. However, the effect can differ depending on the dataset and the modeling scenario. Kindly refer to statistical documentation on cross-validation to arrive at the right parameter setting for your data.

After a few iterations, we arrive at a model that is satisfactory. We can then apply this model, with the same parameters, to a new dataset to score it.

Conclusion:

Every Machine Learning algorithm in OAC comes with a set of model training hyper-parameters that can be tuned to improve the overall accuracy of the model created. Because building models via dataflows is fairly simple and intuitive, it is easy to iteratively change the hyper-parameter values, inspect the model quality and arrive at a model of the desired accuracy in a reasonable amount of time.
