Wednesday, November 8, 2017

Understand Performance of Oracle DV Machine Learning models using Related Datasets feature

In this blog we discuss the Related datasets produced by Machine Learning algorithms in Oracle Data Visualization.

Related datasets are generated when we Train/Create a Machine Learning model in Oracle DV (available from the release referred to here as V4 onwards). These datasets contain details about the model such as prediction rules, accuracy metrics, the confusion matrix, and key drivers for prediction, depending on the type of algorithm. Related datasets can be found in the Inspect Model menu: Inspect Model -> Related tab.

These datasets are useful in more ways than one. They let users examine and understand the rules the model uses for prediction/classification, which in turn helps in fine-tuning the model to get better results. Related datasets are also useful for comparing models and determining which one is better at solving the same problem.

Here is a pictorial representation of the Related datasets generated by the different out-of-the-box Machine Learning algorithms in Oracle Data Visualization V4:


Different ML algorithms generate similar Related datasets, and all of them can be grouped into 8 datasets. Individual parameters and column names may change depending on the type of algorithm, but the purpose of each dataset remains the same. For example, the columns in the Statistics dataset differ between Linear Regression and Logistic Regression, but in both cases the Statistics dataset contains the accuracy metrics of the model. Here is a brief description of each of these datasets:

1) Drivers: This dataset gives information on the columns that are key determinants/drivers of the target column value. Train/Create model performs linear regression and identifies the columns that take part in predicting the values of the target column. Each identified column is assigned a coefficient and a correlation value. The coefficient indicates the weight given to that column in determining the target column value, and the correlation indicates the direction of the relationship with the target column, i.e., whether the target value increases or decreases with a corresponding change in that predictor column.

2) Residuals: This dataset gives information on the quality of the model's predictions, specifically the residuals. A residual is the difference between the measured (actual) value and the value predicted by a regression model. This dataset gives an aggregated (sum) value of the absolute difference between Actual and Predicted values for all the columns in the dataset. It is visualized as a bar graph in the Quality tab of the Inspect menu for a Linear Regression model.

3) CARTree: This dataset is a tabular representation of the Decision Tree computed to predict the target column values. It contains columns that represent the conditions and the criteria for those conditions in the decision tree, the prediction for each group, and the prediction confidence. The inbuilt Tree Diagram visualization can be used to visualize this decision tree.

4) Confusion.Matrix: A Confusion Matrix, also known as an error matrix, is a specific table (pivot) layout that allows the performance of an algorithm to be visualized. Each row of the matrix represents the instances of a predicted class, while each column represents the instances of an actual class. The table reports the number of false positives, false negatives, true positives, and true negatives, from which the precision, recall, and F1 accuracy metrics are computed.

5) Hitmap: This dataset contains information on the leaf nodes of the decision tree. Each row in the table represents a leaf node and contains the criteria/branch segment that the leaf node represents, the Segment Size, the Confidence, and the Expected # of rows, i.e., the expected number of correct predictions = Segment Size * Confidence.

6) ClassificationReport: This dataset is a tabular representation of the accuracy metrics for each distinct value of the target column. For example, if the target column can have two distinct values, 'Yes' and 'No', this dataset shows accuracy metrics such as F1, Precision, Recall, and Support (the number of rows in the Training dataset with that value) for each distinct value of the target column.

7) Summary: This dataset contains a summary of the input and optional parameters specified during model creation, along with details such as the Target name and Model name.

8) Statistics: This dataset contains metrics that quantify model accuracy. The metrics present in the dataset vary depending on the algorithm/model that generates it. Here is a list of metrics by model:

  • Linear Regression, CART numeric, Elastic Net Linear:
    • R-Square, R-Square Adjusted, Mean Absolute Error (MAE), Mean Squared Error (MSE), Relative Absolute Error (RAE), Relative Squared Error (RSE), Root Mean Squared Error (RMSE)
  • CART (Classification And Regression Trees), Naive Bayes Classification, Neural Network, Support Vector Machine (SVM), Random Forest, Logistic Regression:
    • Accuracy, Total F1
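
As a rough illustration of the regression metrics listed above, and of the aggregated absolute residual described for the Residuals dataset, here is a minimal Python sketch that uses the common textbook definitions. The function name `regression_metrics`, its `num_predictors` parameter, and any sample values are hypothetical. The exact formulas Oracle DV applies internally are not confirmed here, so the numbers in the Statistics dataset may differ from what this sketch produces.

```python
import math


def regression_metrics(actual, predicted, num_predictors=1):
    """Textbook regression metrics from paired actual/predicted values.

    Assumes len(actual) > num_predictors + 1 and that the actual values
    are not all identical (otherwise the relative errors are undefined).
    These are standard definitions, not confirmed Oracle DV formulas.
    """
    n = len(actual)
    mean_actual = sum(actual) / n

    # Residual = actual value minus predicted value.
    residuals = [a - p for a, p in zip(actual, predicted)]

    sum_abs_err = sum(abs(r) for r in residuals)   # aggregated absolute residual
    sum_sq_err = sum(r * r for r in residuals)

    # Errors of a baseline that always predicts the mean of the actual values.
    sum_abs_dev = sum(abs(a - mean_actual) for a in actual)
    sum_sq_dev = sum((a - mean_actual) ** 2 for a in actual)

    mse = sum_sq_err / n
    rse = sum_sq_err / sum_sq_dev
    r_square = 1 - rse
    r_square_adj = 1 - (1 - r_square) * (n - 1) / (n - num_predictors - 1)

    return {
        "sum_abs_residual": sum_abs_err,
        "MAE": sum_abs_err / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "RAE": sum_abs_err / sum_abs_dev,
        "RSE": rse,
        "R-Square": r_square,
        "R-Square Adjusted": r_square_adj,
    }
```

When comparing two regression models trained on the same data, lower MAE, MSE, RMSE, RAE, and RSE and a higher R-Square generally indicate the better fit.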

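The classification-side quantities described above (the confusion matrix, the per-class Precision, Recall, F1, and Support of the ClassificationReport, overall Accuracy, and the Hitmap's expected number of correct predictions) can be sketched in the same spirit. All function names here are hypothetical, the definitions are the standard ones rather than confirmed Oracle DV internals, and Total F1 is left out because the way it is aggregated is not described in this post.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count (predicted, actual) label pairs.

    Follows the layout described above: rows are predicted classes,
    columns are actual classes.
    """
    return Counter(zip(predicted, actual))


def classification_report(actual, predicted):
    """Per-class precision, recall, F1 and support (standard definitions)."""
    counts = confusion_counts(actual, predicted)
    labels = sorted(set(actual) | set(predicted))
    report = {}
    for label in labels:
        tp = counts[(label, label)]
        # Predicted as this label, but actually something else.
        fp = sum(counts[(label, a)] for a in labels if a != label)
        # Actually this label, but predicted as something else.
        fn = sum(counts[(p, label)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # rows whose actual value is this label
        }
    return report


def accuracy(actual, predicted):
    """Fraction of rows where the prediction matches the actual value."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def expected_correct(segment_size, confidence):
    """Hitmap-style expected correct predictions for one leaf node."""
    return segment_size * confidence
```

For instance, a leaf node with a hypothetical Segment Size of 200 and a Confidence of 0.9 would be expected to predict about 180 rows correctly.
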
Now you know what the Related datasets are and how they can be useful for fine tuning your Machine Learning model or for comparing two different models.

