Predicting Customer Satisfaction using H2O.ai AutoML running on the Azure Data Science Virtual Machine

In this tutorial, you will learn how to quickly predict customer satisfaction using H2O.ai AutoML running on the Azure Data Science Virtual Machine.

Which customers are happy customers?

Happy or Unhappy customers

About five years ago, Santander Bank launched a prediction competition on Kaggle to identify which customers are happy. The prize money was $60,000.

Overview of the use case, as described on the Kaggle competition page:

From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don’t stick around. What’s more, unhappy customers rarely voice their dissatisfaction before leaving.

Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer’s happiness before it’s too late.

In this competition, you’ll work with hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience.

Competition link: Santander Customer Satisfaction | Kaggle

The dataset

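The dataset provided by Santander Bank is anonymized and contains 371 variables, all of them numeric.

A continuous variable is one that can take any value within a range, so it has infinitely many possible values. Not every numeric column here behaves that way: as the variable breakdown below shows, many take only a few distinct values.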

The TARGET column is the variable to predict. It equals 1 (one) for unsatisfied customers and 0 for satisfied customers.

The objective of the Kaggle competition is to predict which clients are satisfied and which are unsatisfied.

Number of observations (rows):

  • Train: 76020 rows
  • Test: 75818 rows

Number of 1s (train): 3008 (about 3.96%), which makes this an imbalanced dataset problem.
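
To make the imbalance concrete, here is a minimal Python sketch that derives the share of unsatisfied customers and the ratio of satisfied to unsatisfied customers from the counts above. The function name `class_balance` is only illustrative, and Python is used here for brevity even though the demo code linked at the end of this article is written in R.

```python
def class_balance(n_rows, n_positive):
    """Return (positive_rate, negatives_per_positive) for a binary target."""
    n_negative = n_rows - n_positive
    positive_rate = n_positive / n_rows
    negatives_per_positive = n_negative / n_positive
    return positive_rate, negatives_per_positive


# Counts for the training set quoted above.
rate, ratio = class_balance(n_rows=76020, n_positive=3008)
print(f"Unsatisfied customers: {rate:.2%}")
print(f"Satisfied customers per unsatisfied customer: {ratio:.1f}")
```

With roughly 24 satisfied customers for every unsatisfied one, plain accuracy is a misleading score: a model that always predicts 0 would be right about 96% of the time. A ranking metric such as AUC is usually more informative in this situation.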

Variables:

  • 34 variables with a single value (suggested action: delete all of them)
  • 100 variables with two unique values (binary variables)
  • 157 variables with between 3 and 101 unique values (categorical variables)
  • 80 variables with more than 101 distinct values (continuous variables)
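
A breakdown like the one above can be reproduced by counting distinct values per column. The sketch below is one way to do that in plain Python, assuming the data has already been loaded into a dictionary that maps column names to lists of values. The function name `bucket_by_cardinality` and the toy columns are hypothetical; the thresholds mirror the ones in the list.

```python
def bucket_by_cardinality(columns, max_categorical=101):
    """Group column names by how many distinct values each column holds."""
    buckets = {"constant": [], "binary": [], "categorical": [], "continuous": []}
    for name, values in columns.items():
        n_unique = len(set(values))
        if n_unique <= 1:
            buckets["constant"].append(name)      # no information, safe to drop
        elif n_unique == 2:
            buckets["binary"].append(name)
        elif n_unique <= max_categorical:
            buckets["categorical"].append(name)
        else:
            buckets["continuous"].append(name)
    return buckets


# Toy data standing in for the real, anonymized columns.
toy = {
    "always_zero": [0] * 200,
    "flag": [i % 2 for i in range(200)],
    "product_count": [i % 5 for i in range(200)],
    "balance": [i * 0.5 for i in range(200)],
}
buckets = bucket_by_cardinality(toy)
for kind, names in buckets.items():
    print(kind, names)
```

Columns that land in the constant bucket carry no signal and can be dropped before training, as suggested above.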

Using H2O.ai AutoML running on the Azure Data Science Virtual Machine to predict Customer Satisfaction

In this tutorial, we will take a very quick approach to predicting which clients are satisfied and which are not: H2O.ai Automated Machine Learning running on the Azure Data Science Virtual Machine (Linux).

H2O AutoML

H2O AutoML is a function in H2O that automates the process of building a large number of models, with the goal of finding the “best” model without any prior knowledge or effort by the Data Scientist.
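
As a rough illustration of what such a run looks like, here is a minimal sketch using H2O's Python API (the demo linked at the end of this article uses the R API instead). It assumes the `h2o` package and a Java runtime are installed, that the Kaggle training file has been saved locally as `train.csv`, and that the identifier column is named `ID`. The helper names, the time budget, the seed and the choice of AUC as the leaderboard metric are illustrative assumptions, not the settings used in the demo.

```python
def feature_columns(all_columns, target="TARGET", id_column="ID"):
    """Return predictor names: every column except the target and the row ID."""
    return [c for c in all_columns if c not in (target, id_column)]


def run_automl(train_csv="train.csv", max_runtime_secs=600, seed=1):
    """Train H2O AutoML on the training file and return the AutoML object."""
    # Imported here so the helper above can be used without H2O installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start or connect to a local H2O cluster
    train = h2o.import_file(train_csv)

    target = "TARGET"
    predictors = feature_columns(train.columns, target=target)
    # AutoML treats the task as classification only if the target is a factor.
    train[target] = train[target].asfactor()

    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=seed, sort_metric="AUC")
    aml.train(x=predictors, y=target, training_frame=train)
    return aml


# The column helper works on any list of names (example names only).
print(feature_columns(["ID", "var3", "var15", "TARGET"]))

# Example usage (requires H2O, Java and the Kaggle training file):
# aml = run_automl("train.csv")
# print(aml.leaderboard.head())
```

The leaderboard ranks every model AutoML trained within the time budget, so increasing `max_runtime_secs` generally gives it room to try more candidates.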

Azure Data Science Virtual Machine

The Azure Data Science Virtual Machine (DSVM) is a virtual machine image pre-loaded with data science & machine learning tools. Use this VM to build intelligent applications for advanced analytics.

For more details about the DSVM and how to set it up on Azure and access it from your own machine, see the sections further down in this article.

The Demo

With the Azure Data Science Virtual Machine (DSVM), you get a local Jupyter notebook environment that you can access either from inside the Linux virtual machine (using X2Go) or from an external browser.

The image below shows the AutoML code running inside the VM.

Jupyter inside Azure Data Science Virtual Machine for Linux

The DSVM also comes with R preinstalled, so you can run the same code in RStudio, which most R developers already know.

RStudio inside Azure Data Science Virtual Machine for Linux

Running JupyterHub is also possible, as you can see below.

AutoML code running on JupyterHub

The AutoML code used in the demo is hosted on GitHub; the link is listed in the Links section at the end of this article.

I also have another blog post where I used Azure Automated Machine Learning to predict credit card fraud.

What is the Azure Data Science Virtual Machine for Linux and Windows?

The Data Science Virtual Machine (DSVM) is a customized VM image on the Azure cloud platform built specifically for doing data science. It has many popular data science tools preinstalled and pre-configured to jump-start building intelligent applications for advanced analytics.

The DSVM is available on:

  • Windows Server 2019
  • Ubuntu 18.04 LTS

Quickstart: Set up the Data Science Virtual Machine for Linux (Ubuntu)

To learn how to set up the VM for Linux, follow the Microsoft quickstart listed in the Links section below.

The DSVM running in Azure and accessed from my local computer using X2Go

The DSVM can also be accessed through an SSH terminal from my local machine.

SSH Terminal to access the DSVM

The example notebooks are located in /home/youruser/notebook

Example notebooks for learning about data science

Using the terminal to view the Notebooks available inside the DSVM

Notebooks available inside the DSVM

Have fun and try it yourself!

Links:

Auto ML Demo Code
https://github.com/caiomsouza/microsoft-big-data-scientist-and-ai/blob/master/samples/azure-notebooks/r/auto-ml-h2o/01_H2o_AutoML.r

Quickstart: Set up the Data Science Virtual Machine for Linux (Ubuntu)
https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro

Senior Cloud Solution Architect and Data Scientist @microsoft | PhD Student @unicomplutense (Opinions are my own)
