In this blog we look at a technique for comparing the performance of two machine learning binary classification models using the Cumulative Gains chart and the Lift chart. An earlier blog, Which ML model is right for me, showed how to compare the performance of two machine learning models. This blog goes a little deeper and explains how Oracle DV Machine Learning supports more advanced model comparison techniques.
What are the Cumulative Gains chart and the Lift chart, and what are they used for?
Suppose a company wants to run a direct marketing campaign to get a response (such as a subscription or a purchase) from users. The campaign covers around 10000 users, of which only 1000 are expected to respond, but the company does not have the budget to reach out to all 10000 customers. To minimize cost, the company wants to contact as few customers as possible while still reaching most of the customers who are likely to respond. The company can build ML models to predict which users are likely to respond, and with what probability. The question then becomes: which model should I choose? Which ML model is likely to capture the most respondents while selecting as small a portion of the original population as possible? Cumulative Gains and Lift charts answer these questions.
Cumulative Gains and Lift charts measure the effectiveness of a binary classification predictive model, calculated as the ratio between the results obtained with and without the predictive model. They are visual aids for measuring model performance and contain a lift curve and a baseline. The effectiveness of a model is measured by the area between the lift curve and the baseline: the greater the area between the lift curve and the baseline, the better the model. One academic reference on how to construct these charts can be found here. Gains and Lift charts are popular techniques in direct marketing.
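The post does not spell out the formulas, so the following is a sketch of the commonly used definitions rather than the exact computation in the sample project. Here p is the fraction of customers contacted after ranking them by the model's confidence; the symbols Gain and Lift and the count of 600 respondents are assumptions introduced only for illustration.

```latex
\[
\mathrm{Gain}(p) = \frac{\text{respondents among the top } p \text{ fraction of ranked customers}}{\text{all respondents}},
\qquad
\mathrm{Lift}(p) = \frac{\mathrm{Gain}(p)}{p}
\]

% Random selection (the baseline) captures respondents in proportion to the population contacted:
\[
\mathrm{Gain}_{\text{random}}(p) = p, \qquad \mathrm{Lift}_{\text{random}}(p) = 1
\]

% Hypothetical example: 10000 customers, 1000 respondents, and a model whose
% top 2000 customers (p = 0.2) contain 600 respondents.
\[
\mathrm{Gain}(0.2) = \frac{600}{1000} = 0.6, \qquad \mathrm{Lift}(0.2) = \frac{0.6}{0.2} = 3
\]
```

In this hypothetical case, contacting 20% of the customers would reach 60% of the respondents, three times what a random selection of the same size would be expected to reach.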
Sample Project for Cumulative Gains and Lift chart computation
Oracle Analytics Store has an example project for this that was built using the Marketing Campaign data of a bank. This is what the charts look like:
Scenario: This Marketing Campaign aims to identify users who are likely to subscribe to one of the bank's financial services. The bank plans to run the campaign for close to 50,000 individuals, of whom only close to 5000 (~10%) are likely to subscribe to the service. The Marketing Campaign data is split into Training and Testing data. Using the training data, we created two ML models, one with Naive Bayes and one with Logistic Regression, to identify the likely subscribers along with a prediction confidence. (Please note that the Actual values, i.e., whether a customer actually subscribed or not, are also available in the dataset.) Now the bank wants to find out which model is better at identifying the largest number of likely subscribers while selecting a relatively small portion of the campaign base (i.e., 50,000).
The ML models were applied to the Test data to obtain the Predicted Value and Prediction Confidence for each prediction. Using this prediction data and the Actual outcome data, we created dataflows to compute cumulative gain and lift.
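The dataflows themselves are not shown in this post. As a rough illustration of the same computation outside Oracle DV, here is a minimal Python sketch. It assumes the actual outcome is encoded as 1 for Yes and 0 for No, and that the confidence is the prediction confidence for Yes; the function name `gains_and_lift` and the `bins` parameter are hypothetical.

```python
from typing import List, Tuple


def gains_and_lift(actuals: List[int],
                   confidences: List[float],
                   bins: int = 10) -> List[Tuple[float, float, float]]:
    """Return (population_fraction, cumulative_gain, lift) for each bin.

    actuals:     1 if the customer actually subscribed, else 0 (assumed encoding).
    confidences: model confidence that the customer subscribes (confidence for Yes).
    """
    if len(actuals) != len(confidences) or not actuals:
        raise ValueError("actuals and confidences must be non-empty and equal in length")
    total_positives = sum(actuals)
    if total_positives == 0:
        raise ValueError("at least one actual subscriber is required")

    # Rank customers from most to least likely to subscribe.
    ranked = [a for _, a in sorted(zip(confidences, actuals),
                                   key=lambda pair: pair[0],
                                   reverse=True)]
    n = len(ranked)

    rows = []
    for b in range(1, bins + 1):
        cutoff = (n * b) // bins          # number of customers contacted so far
        if cutoff == 0:
            continue                      # fewer customers than bins
        captured = sum(ranked[:cutoff])   # actual subscribers among them
        pop_frac = cutoff / n
        gain = captured / total_positives
        rows.append((pop_frac, gain, gain / pop_frac))
    return rows


# Tiny made-up dataset: 10 customers, 3 actual subscribers.
example_actuals = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
example_confidences = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
example_rows = gains_and_lift(example_actuals, example_confidences, bins=5)
```

Each row pairs a share of the population with the share of subscribers captured up to that point, which is what the Cumulative Gains chart plots; dividing the two gives the value plotted on the Lift chart.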
How to interpret these charts and how to measure effectiveness of a Model:
The Cumulative Gains chart plots the cumulative percentage of actual subscribers (Cumulative Actuals) on the Y-axis against the total population (50,000) on the X-axis. The curve is shown alongside random prediction (Gains Chart Baseline) and ideal prediction (Gains Chart Ideal Model Line), in which all 5000 likely subscribers are identified by selecting the first 5000 customers sorted by PredictionConfidence for Yes. A model with a greater area between its Cumulative Actuals line and the Baseline is more effective at identifying a larger portion of subscribers by selecting a relatively smaller portion of the total population.
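To make the baseline, the ideal line, and the "area between curve and baseline" concrete, here is a small Python sketch. It assumes gains are expressed as fractions between 0 and 1 and uses the trapezoid rule for the area; the function names are hypothetical and are not part of the sample project.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (population fraction, cumulative gain)


def baseline_gain(pop_frac: float) -> float:
    """Random selection captures subscribers in proportion to the population contacted."""
    return pop_frac


def ideal_gain(pop_frac: float, positive_rate: float) -> float:
    """A perfect ranking places every subscriber first, so gain rises until all are found."""
    if not 0.0 < positive_rate <= 1.0:
        raise ValueError("positive_rate must be in (0, 1]")
    return min(1.0, pop_frac / positive_rate)


def area_above_baseline(points: List[Point]) -> float:
    """Trapezoid-rule area between a gains curve and the baseline.

    points must be sorted by population fraction and should start at (0, 0).
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Height of the curve above the baseline at each end of the segment.
        area += (x1 - x0) * ((y0 - baseline_gain(x0)) + (y1 - baseline_gain(x1))) / 2.0
    return area


# Ideal model line for a ~10% subscriber rate, as in the scenario (5000 of 50,000).
ideal_points = [(0.0, 0.0), (0.1, ideal_gain(0.1, 0.1)), (1.0, ideal_gain(1.0, 0.1))]
ideal_area = area_above_baseline(ideal_points)
```

With a 10% subscriber rate the ideal line encloses an area of about 0.45 above the baseline, so a real model's area can be read as a share of that ceiling.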
The Lift chart shows how many times more respondents we are likely to reach than if we contacted a random sample of customers. For example, by contacting only 10% of customers chosen by the predictive models, we reach 2.09 times (Logistic Regression) and 3.20 times (Naive Bayes) as many respondents as we would with no model.
Max Gain shows the point at which the difference between cumulative gains and the baseline is greatest. For Logistic Regression this occurs at 23% of the population, where the recorded gain is 41.84%; for Naive Bayes it occurs at 41% of the population, where the recorded gain is 83.88%.
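A short Python sketch of how such a point could be located from a gains curve follows. The function name `max_gain_point` and the sample curve are made up for illustration; they are not the project's data or its dataflow logic.

```python
from typing import List, Tuple


def max_gain_point(points: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Find where a gains curve is furthest above the random baseline.

    points: (population_fraction, cumulative_gain) pairs, both as fractions.
    Returns (population_fraction, cumulative_gain, gain_minus_baseline).
    """
    if not points:
        raise ValueError("points must not be empty")
    # The baseline gain at a given population fraction equals that fraction.
    pop_frac, gain = max(points, key=lambda p: p[1] - p[0])
    return pop_frac, gain, gain - pop_frac


# Made-up gains curve, for illustration only.
sample_curve = [(0.0, 0.0), (0.2, 0.5), (0.4, 0.9), (0.6, 0.95), (1.0, 1.0)]
best_pop, best_gain, best_diff = max_gain_point(sample_curve)
```

For this made-up curve the largest gap occurs at 40% of the population, where the cumulative gain is 90% and the gap over the baseline is 50 percentage points.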
Simply by examining the charts, we can see that the Naive Bayes model has a larger area between its cumulative gains curve and the baseline, and is therefore the better of the two models for prediction.