Oracle Analytics includes several machine learning algorithms that enable data analysts to identify patterns in their data and make predictions from it. These algorithms support different types of models: classification, clustering, and numeric prediction. Each algorithm offers several hyperparameters that control the model training process and can be tuned manually to improve overall model accuracy. Understanding these hyperparameters helps you tune the training process and reach the most accurate predictive model quickly.
If we look in particular at classification models, both binary and multi-classification, Oracle Analytics offers several algorithms to train them.
Each of these comes with a set of hyperparameters that can be configured. In this blog, let us look in detail at the Random Forest algorithm and review its parameters, which are fairly simple to understand and use.
Random Forest for model training
Random Forest is a classification algorithm that builds several decision tree models on the data and predicts using all of these trees, together called a 'forest of trees'. The algorithm builds each individual decision tree from a random sample of the data set. At each node of a tree, only a random sample of predictors is considered when computing the split point. This introduces variation in the data used by the different trees in the forest.
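The two sources of randomness described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the code Oracle Analytics runs: the function names, the stand-in row ids and the fixed random seed are all assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, sample_size, rng):
    """Draw sample_size rows with replacement: some rows repeat, some are left out."""
    return [rng.choice(rows) for _ in range(sample_size)]

def candidate_features(features, n_features, rng):
    """Pick the random subset of predictors considered at one tree node."""
    return rng.sample(features, n_features)

def majority_vote(votes):
    """Combine the per-tree class predictions into the forest's prediction."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(42)   # fixed seed so the sketch is repeatable
rows = list(range(20))    # stand-in row ids, not real customer data
features = ["cust_gender", "cust_marital_status", "education",
            "occupation", "household_size"]

sample = bootstrap_sample(rows, sample_size=20, rng=rng)
subset = candidate_features(features, n_features=3, rng=rng)
prediction = majority_vote(["Yes", "No", "Yes"])  # votes from three hypothetical trees
```

Each tree in a forest would get its own `sample`, and each node its own `subset`, which is what makes the trees differ from one another.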
Following is the list of hyperparameters available for this algorithm (for both binary classification and multi-classification):
- Target: This is the column containing the actual values that the model learns to predict. For binary classification, the target must have exactly two possible values (Yes/No, 0/1, and so on). For multi-classification, it can have more than two outcomes (Low/Medium/High, North/East/South/West, and so on).
- Positive Class in Target: This lets the user specify which value of the Target column is interpreted as the positive class. To classify the possible output values, the algorithm needs a positive and a negative class. ‘Yes’ is the default option.
- Number of Trees in the Forest: This is the number of trees built for the ensemble, or forest, of the model. It accepts a number between 2 and 100, with a default value of 10. As general guidance, the more trees in the forest, the better the results. A higher number of trees means more samples are drawn from the initial training dataset, which reduces bias in the data used for training and can produce more realistic results. However, increasing the number of trees has performance implications for large datasets.
- Sample Size for a Tree: This determines the size of the data sample used to build any single tree. It controls the degree of randomness: increasing the sample size results in less variation (randomness) in the data across the individual trees in the forest. A good starting point is a sample equal in size to the original dataset, drawn with replacement, so that some rows are not selected while others are selected more than once. This typically provides near-optimal performance. However, in real-world applications, you may find that adjusting the sample size iteratively improves performance. This accepts a number between 500 and 10,000, with 500 as the default value.
- Number of Features for a Tree: This accepts a number between 2 and 10, with a default value of 3. Increasing the number of features generally improves the performance of the model, as there are more options to consider at each node. However, it can also decrease the diversity of the individual trees, which is a key strength of random forest, and it slows the algorithm down. Hence, we must find the right balance and choose a suitable value.
- Minimum Node Size: This is the minimum size of each leaf node in the decision tree and accepts any value between 10 and 100, with a default value of 50. It prevents the tree from having a terminal node with fewer records than this number. Setting the limit high makes the decision trees smaller, so they take less time to compute. Setting it low leads to deeper trees, meaning more splits are performed before reaching the terminal nodes, which can provide deeper insight into the data. Lower values generally provide good results, but performance can potentially be improved by tuning it.
- Maximum Depth: This is the maximum depth of each tree in the ensemble, or forest. The depth is the number of levels in the tree's hierarchy of splits: how many levels there are from the first split to the furthest leaf. The deeper the tree, the more information about the data it captures. The objective should be to expand each tree until every leaf is as pure as possible; a pure leaf is one where all the data comes from exactly the same class. This parameter interacts with Minimum Node Size and has a direct impact on training performance: more levels require more processing time to train a model. Maximum Depth can be set up to 10, with 5 being the default.
- Maximum Confidence: This represents the confidence interval the model can take and can be used to rank the rules and hence the predictions. It is a measure of how sure the model is that the true value lies within a confidence interval computed by the model. It determines the likelihood of the predicted outcome, given that the rule has been satisfied. For example, consider a list of 1000 patients. Of these patients, 100 satisfy a given rule, say, a high fasting blood sugar level. Of these 100, 75 are likely to have heart disease and 25 are not. The confidence of the prediction (likely to have heart disease) for the cases that satisfy the rule is 75/100 (75%). The Maximum Confidence parameter accepts a value between 1 and 100, with 80 being the default.
- Train Partition Percent: This specifies the percentage of the input data that is used to build the model; the remainder is used for internal testing of the model. During the model creation process, the input dataset is split into two parts, one to train and one to test the model, based on the Train Partition Percent parameter. The test portion of the dataset is used to measure the accuracy of the model that is built. The default value is 80, which means 80% of the data is used to train the model and 20% to test model accuracy.
- Balancing Method: This is used when the classes in the given samples are imbalanced, that is, when a skewed number of records have a given value (a vast majority of 'No' and very few 'Yes'). In such situations, the algorithm applies a method to re-balance the set for better training. Under Sample, Over Sample, and None are the available options. When the dataset is imbalanced, with the majority of the records having a target value of ‘No’, under-sampling reduces the set of records where the target is ‘No’ to the same number of records as those with a value of ‘Yes’. When the dataset is small, over-sampling can be used instead; its effect is the opposite of under-sampling. Over-sampling takes the records with a target value of ‘Yes’ and over-samples them until there are as many as the records with the other target values. The default is Under Sample.
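To make the two balancing methods concrete, here is a minimal Python sketch of under-sampling and over-sampling for a binary target. It assumes random selection and that both classes are present; how Oracle Analytics actually selects or duplicates records is not described here, and the function names and toy records are made up for the example.

```python
import random

def split_by_class(records, target, positive):
    """Return (minority, majority) lists for a binary target column."""
    pos = [r for r in records if r[target] == positive]
    neg = [r for r in records if r[target] != positive]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)

def under_sample(records, target, positive, rng):
    """Shrink the majority class down to the size of the minority class."""
    minority, majority = split_by_class(records, target, positive)
    return minority + rng.sample(majority, len(minority))

def over_sample(records, target, positive, rng):
    """Repeat minority records at random until both classes are the same size."""
    minority, majority = split_by_class(records, target, positive)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

rng = random.Random(0)
# Toy imbalanced set: 90 'No' records and 10 'Yes' records
records = [{"response": "No"}] * 90 + [{"response": "Yes"}] * 10

under = under_sample(records, "response", "Yes", rng)
over = over_sample(records, "response", "Yes", rng)
```

Under-sampling leaves 20 records here (10 of each class) while over-sampling leaves 180 (90 of each), which is why over-sampling is the gentler choice when the dataset is small.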
Case Study
To explore the Random Forest algorithm and the effect of its parameters, let us consider a customer demographic dataset with 1500 records. This dataset contains customer attributes such as cust_gender, cust_marital_status, education, occupation, household_size, and so on. We would like to train a model that predicts the customer response to an affinity card program, with two possible outcomes: Yes or No.
Here’s the sample dataset:
First, upload this dataset to Oracle Analytics. Next, create a dataflow with this input dataset and add the Train Binary Classifier step. Choose Random Forest for Model Training as the script.
A list of hyperparameters for training the model is displayed, with the default values described in the previous section. Let us understand the effect of these hyperparameters on the model quality over several iterations.
Iteration 1: Let us leave these hyperparameters at their default values and execute the dataflow. A successful dataflow execution creates a machine learning model, which can be viewed from the Machine Learning tab. Right-click the model and inspect it to learn more about its quality. With the default parameter values, the model's quality is as follows:
Model accuracy is 76%, computed as (True Positive + True Negative) / Total = (65 + 163) / 300
Precision is 54%, computed as True Positive / (True Positive + False Positive) = 65 / (65 + 55)
False Positive Rate is 25%, computed as False Positive / Actual Number of Negatives = 55 / 218
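The same arithmetic can be checked with a short Python snippet. The counts are the ones quoted above. The false negative count is not shown in the post and is derived here from the total, and the variable names are chosen only for this illustration.

```python
# Counts from the model quality figures for Iteration 1
tp, tn, fp = 65, 163, 55
total = 300                      # test records scored by the model

# With 1500 input rows and Train Partition Percent = 80, 20% are held out
test_rows = 1500 * (100 - 80) // 100

fn = total - tp - tn - fp        # derived, not quoted in the post

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
false_positive_rate = fp / (fp + tn)   # fp + tn = actual negatives = 218

print(f"accuracy={accuracy:.0%} precision={precision:.0%} "
      f"false_positive_rate={false_positive_rate:.0%}")
# prints: accuracy=76% precision=54% false_positive_rate=25%
```

The held-out 20% of the 1500 records accounts for the total of 300 used in the accuracy formula.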
Iteration 2: Let us increase the Sample Size for a Tree from the default of 300 to 1500 (as we have 1500 records in our input dataset). Let’s rebuild the model by executing the dataflow. Upon inspecting the model quality, we notice that while accuracy remains at 76%, precision has improved from 54% to 55%.
Iteration 3: Let us increase the Number of Trees from the default of 10 to 30 and the Maximum Depth from the default of 5 to 7. Rebuild the model and inspect the model details.
We notice that the model accuracy has improved to 77%, precision has improved to 56%, and the False Positive Rate has dropped from 25% to 24%.
Conclusion:
Every machine learning algorithm in Oracle Analytics comes with a set of model training hyperparameters that can be tuned to improve the overall accuracy of the model created. Because building models with dataflows is a fairly simple and intuitive process, it is easy to iteratively change the hyperparameter values, inspect the model quality, and arrive at a model of the desired accuracy in a reasonable amount of time.