Bank Marketing Strategy

Posted by Christopher Mertin on August 21, 2017 in Project • 13 min read

This data comes from the UCI Machine Learning Repository and has anonymous customer bank information.

The data set contains 41,188 samples with 20 features. The features are the following:

  1. age (numeric)
  2. job: type of job (categorical: ‘admin.’,’blue-collar’,’entrepreneur’,’housemaid’,’management’,’retired’,’self-employed’,’services’,’student’,’technician’,’unemployed’,’unknown’)
  3. marital : marital status (categorical: ‘divorced’,’married’,’single’,’unknown’; note: ‘divorced’ means divorced or widowed)
  4. education (categorical: ‘basic.4y’,’basic.6y’,’basic.9y’,’high.school’,’illiterate’,’professional.course’,’university.degree’,’unknown’)
  5. default: has credit in default? (categorical: ‘no’,’yes’,’unknown’)
  6. housing: has housing loan? (categorical: ‘no’,’yes’,’unknown’)
  7. loan: has personal loan? (categorical: ‘no’,’yes’,’unknown’)
  8. contact: contact communication type (categorical: ‘cellular’,’telephone’)
  9. month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
  10. day_of_week: last contact day of the week (categorical: ‘mon’,’tue’,’wed’,’thu’,’fri’)
  11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14. previous: number of contacts performed before this campaign and for this client (numeric)
  15. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,’nonexistent’,’success’)
  16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
  17. cons.price.idx: consumer price index - monthly indicator (numeric)
  18. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  19. euribor3m: euribor 3 month rate - daily indicator (numeric)
  20. nr.employed: number of employees - quarterly indicator (numeric)

With the output variable y being if the client has subscribed to a term deposit.

The goal with this data is to create a model using this customer data to determine who is more likely to subscribe for a term deposit. This bank can utilize this model to see who to advertise a term deposit to in order to increase their Return On Investment (ROI) for their advertising resources.

Exploratory Analysis

First of all, the most important thing is to understand the underlying data. My initial suspicion was that this would be a mostly mixed distribution and would be heavily unbalanced.

Due to it being heavily unbalanced, a classifier not prone to over fitting on the unbalanced category is required. Of course things like upsampling could be used, but I wanted to see how good I could do with a model resistant to this.

Therefore, I explored the use of Support Vector Machine (SVM), Gaussian Mixture Models, Gaussian Naïve-Bayes, Convolutional Neural Network (CNN), and Random Forests.

Firstly, the most important thing is to verify that my suspicion is correct and see how the data is correlated in regards to the classes.

We can read in the data as so

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("bank-additional-full.csv", sep=';')

# Separate the data into yes/no groups
d_yes = data[data['y'] == "yes"]
d_no = data[data['y'] == "no"]

# Get the range of the ages
min_bin = min(min(d_yes["age"]), min(d_no["age"]))
max_bin = max(max(d_yes["age"]), max(d_no["age"]))

Finally, there’s two different plots we can look at for the first case. A histogram to see the density of the distribution for the given values, and then a box plot to better understand the underlying distribution of the data.

1
2
3
4
5
bins = range(min_bin, max_bin+1, 5)
plt.hist(d_no["age"], bins, alpha=0.5, label="no")
plt.hist(d_yes["age"], bins, color='g', alpha=0.55, label="yes")
plt.legend(loc="best")
plt.savefig("age_dist_hist.png", bbox_inches="tight")

Age Histogram

The above figure is a histogram of the ages. In green are those customers which did sign up for a term loan, and blue are those that didn’t. We can see that it is heavily unbalanced.

1
2
sns.boxplot(x="age", y="y", data=data)
plt.savefig("age_dist.png", bbox_inches="tight")

Age Box Plot

The above box plot shows that the distribution of the ages are relatively similar. For example, the mean and the Inner Quartile Range are roughly the same, however the “yes” category is a slightly larger range.

These plots are starting to show that it is a mixed distrbution, and it is heavily unbalanced. If we look at a pair plot, we can see that this is so for all categories (right click and go to “open image in new tab” for an enlarged picture)

1
2
3
4
sns.pairplot(data, hue='y')
fig = plt.gcf()
fig.set_size_inches(20, 20)
fig.savefig("pairplot.png", dpi=100, bbox_inches="tight")

Pairplot

As the pair plot shows, there is a lot of overlap in 2 dimensional space between the data. This could be resolved in higher dimensional space, however the goal of this project was to see how well I could do without upsampling.

Finally, we should look at how unbalanced the data is.

1
2
sns.countplot(x='y', data=data)
plt.savefig("no_yes_count.png", bbox_inches="tight")

countplot

The above shows that the ratio is roughly 7:1 for those that don’t sign up to those that do. Our goal is to therefore attempt to be able to separate the two as much as possible.

Building Models

In the first part, I am simply building basic classifiers with SVM, Gaussian Mixture, Gaussian Naïve-Bayes, CNN, and a random forest. After this, I will take the best models and customize them relative to the data to increase the accuracy on determining who would sign up for a term deposit. We can start with importing models.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import numpy as np
import pandas as pd
from copy import deepcopy
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import seaborn as sns

from keras.models import Sequential
from keras.layers import Dense, Activation, Conv1D, Flatten
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm
from keras.optimizers import Adam
from keras.utils.np_utils import to_categorical

from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, roc_curve, auc

from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn.model_selection import GridSearchCV
from sklearn.mixture import GaussianMixture
from sklearn.naive_bayes import GaussianNB

First of all we need to convert the data. Some of the categories are categorical, and others are numerical. This can be fixed using this function I wrote.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
def TransformCategorical(data, data_types):
    df = deepcopy(data)
    columns = df.columns
    categories = {}
    for idx, dtype in enumerate(data_types):
        if dtype not in ["int64", "float64", "int32", "float32"]:
            col = columns[idx]
            unique = np.sort(np.asarray(data[col].unique()))
            unique_list = list(unique)
            categories[col] = unique
            for item in unique:
                df[col].replace(item, unique_list.index(item), inplace=True)
    return df, categories

We also need a function for pulling out a validation set from our training set. This can be done with this function

1
2
3
4
5
6
7
8
9
def Make_X_y(data, col="y"):
    y = data[col].as_matrix()
    X = deepcopy(data)
    del X[col]
    X = X.as_matrix()

    X, X_v, y, y_v = train_test_split(X, y, test_size=0.2, random_state=8)

return X, y, X_v, y_v

No we can read in the data like so

1
2
3
4
5
6
7
8
data = pd.read_csv("bank-additional-full.csv", sep=';')
data_types = data.dtypes.as_matrix()
# Convert the data frame to a matrix
Xy = data.as_matrix()
# Transform categories to numerical values
Xy, categories = TransformCategorical(data, data_types)
# Extract the training set and validation set
X, y, X_v, y_v = Make_X_y(Xy, col="y")

Support Vector Machine (SVM)

The reason I chose SVMs is because they are very resistant to unbalanced data sets. This is due to their inherent nature to make the bisecting plane equal in width between both classes.

As this is a messy set, I opted to use the Radial Basis Function (rbf) kernel in order to separate the data. This is a very good kernel and is useful for sets like these. I didn’t think that a linear or polynomial kernel would fit as well without upsampling the data. Another aspect is the rbf kernel runs much quicker on this larger data set since the classic SVMs are not naturally parallel.

In order to increase the speed of the SVM classifier, we can tell it to utilize more memory/have a larger cache size. This is done by

1
svm = SVC(cache_size=7000, kernel="rbf", probability=True)

Another thing that is going to slow this classifier down is the probability attribute. While it’s not technically needed for a simple classifier, it is needed for the more complex classifier I build at the end.

We can fit and predict with our model, giving the results of

1
2
3
svm.fit(X, y)
predict_y = svm.predict(X_v)
print(classification_report(y_v, predict_y))
Precision Recall f1-score support
0 0.88 1.00 0.94 7274
1 0.27 0.00 0.01 964
avg/total 0.81 0.88 0.83 8238

The above took roughly 13 minutes to run on my system, with 12.5 minutes being confined in the fit function. As the table shows, it was only able to capture 88% of those who wouldn’t sign up and 27% of those who would. This is relatively bad for a classifier in order to target users. However, we can look at the distribution of the predicted probabilities to see how the model is actually performing on the validation set.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
svm_prob_y = svm.predict_proba(X_v)

class0_0 = []
class1_0 = []
class0_1 = []
class1_1 = []

for idx, y_ in enumerate(y_v):
    if y_ == 0:
        class0_0.append(svm_prob_y[idx,0])
        class0_1.append(svm_prob_y[idx,1])
    else:
        class1_1.append(svm_prob_y[idx,1])
        class1_0.append(svm_prob_y[idx,0])

The above code returns the prediction probabilities for the validation set and separates the values into two lists. These lists are the probabilities from SVM, and in plotting we get

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
bins = np.linspace(0,1,100)

plt.hist(class0_0, bins, alpha=0.5, label="No")
plt.hist(class1_0, bins, color='g', alpha=0.55, label="Yes")
plt.ylim(0,1000)
plt.legend(loc="upper left")
plt.xlabel("Probability")
plt.ylabel("Count")
plt.title("SVM Prediction Probability Histogram")
plt.savefig("svm_probability_hist.png", bbox_inches="tight")

svm probability histogram

The above figure shows the prediction probabilities for both the no and the yes class. This information can be used later on. This can be helpful since it can be assumed that any future data is going to be drawn from the same distribution as the test set was.

Gaussian Mixture Model

Gaussian Mixture Models are nice for the fact which they assume all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. They are also very fast at training over large datasets.

We can train our Gaussian Mixture Model as follows, where we set n_components=2 since we’re assuming two mixed distributions (yes and no). In predicting on our validation set, we get the following results.

1
2
3
4
5
gmix = GaussianMixture(n_components=2, covariance_type="spherical", tol=1e-4)
gmix.fit(X, y)

predict_y = gmix.predict(X_v)
print(classification_report(y_v, predict_y))
Precision Recall f1-score support
0 0.97 0.78 0.87 7274
1 0.34 0.82 0.48 964
avg/total 0.90 0.79 0.82 8238

Again, we can look at the predicted probabilities, giving

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
gmix_prob_y = gmix.predict_proba(X_v)

class0_0 = []
class1_0 = []
class0_1 = []
class1_1 = []

for idx, y_ in enumerate(y_v):
    if y_ == 0:
        class0_0.append(gmix_prob_y[idx,0])
        class0_1.append(gmix_prob_y[idx,1])
    else:
        class1_1.append(gmix_prob_y[idx,1])
        class1_0.append(gmix_prob_y[idx,0])

plt.hist(class0_0, bins, alpha=0.5, label="No")
plt.hist(class1_0, bins, color='g', alpha=0.55, label="Yes")
plt.ylim(0,1000)
plt.xlabel("Probability")
plt.ylabel("Count")
plt.title("Gaussian Mixture Model Prediction Probability Histogram")
plt.legend(loc="upper center")
plt.savefig("gmix_probability_histogram.png", bbox_inches="tight")

Gaussian Mixture Model Probability

Gaussian Naïve-Bayes

Gaussian Naïve-Bayes works by assuming that the variables are independent. This allows for the probability density to be characterized by a product of univariate densities.

While it isn’t really the case that the variables are completely independent, it is a good approximation for most instances as long as the variables aren’t derived from each other. It is also just as fast as the Gaussian Mixture Models.

The nice part about using the Gaussian Naïve-Bayes in scikit-learn is that you don’t have to define the priors as it learns it during training time. Therefore, you can build the model in the following way

1
2
3
4
5
gnb = GaussianNB()
gnb.fit(X, y)

predict_y = gnb.predict(X_v)
print(classification_report(y_v, predict_y))
Precision Recall f1-score support
0 0.94 0.87 0.91 7274
1 0.40 0.61 0.48 964
avg/total 0.88 0.84 0.86 8238

While this model did overall perform worse than the Gaussian Mixture Model, the overall precision for the yes class is more important. However, 40% is relatively poor for building a recommendation model, though we can look at the predicted probabilities, which are used later to build a better model.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
predict_proby = gnb.predict_proba(X_v)

class0_0 = []
class1_0 = []
class0_1 = []
class1_1 = []

for idx, y_ in enumerate(y_v):
    if y_ == 0:
        class0_0.append(predict_proby[idx,0])
        class0_1.append(predict_proby[idx,1])
    else:
        class1_1.append(predict_proby[idx,1])
        class1_0.append(predict_proby[idx,0])

  plt.hist(class0_0, bins, alpha=0.5, label="No")
  plt.hist(class1_0, bins, color='g', alpha=0.55, label="Yes")
  plt.ylim(0,1000)
  plt.xlabel("Probability")
  plt.ylabel("Count")
  plt.title("Gaussian Naïve-Bayes Prediction Probability Histogram")
  plt.legend(loc="upper center")
  plt.savefig("gnb_probability_histogram.png", bbox_inches="tight")

gnb predicted probabilities

Convolutional Neural Network (CNN)

In my original code, I had first started with a normal feedforward neural network, then built to a CNN, and then also weighted the yes class 100:1, though the results were quite lackluster. None of the models were able to pick up in successfully predicting those in the yes class. Therefore, the following code is just for a typical CNN that I used to predict with. There are a few comments in the code, though I won’t spend too much time on this section.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
# First we need to add another dimension to the input so it will work with a CNN. The output needs to become categorical as well.
XX = X[:,:,np.newaxis]
XX_v = X_v[:,:,np.newaxis]
y_cat = to_categorical(y)
y_v_cat = to_categorical(y_v)

# Building our model
model = Sequential([
Conv1D(input_shape=XX.shape[1:], nb_filter=128, filter_length=4),
Flatten(),
Dense(100, activation="relu"),
Dropout(.25),
Dense(256, activation="relu"),
Dropout(.25),
Dense(512, activation="relu"),
Dropout(.25),
Dense(1024, activation="relu"),
Dropout(.25),
Dense(2048, activation="relu"),
Dropout(.25),
Dense(2),
Activation("softmax")
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

model.fit(XX, y_cat, nb_epoch=2)

# Make prediction
pred_conv = model.predict(XX_v)

# Look at the output to see how many were "captured"
data_all = []
for x in pred_conv:
    if(x[1] > 0):
        data_all.append(1)
    else:
        data_all.append(0)

The above code builds the model and predicts with it on the validation set, though we need one final function to tie everything together to look at the output. While the accuracy could be validated by simply running model.evaluate(XX_v, y_v_cat), it doesn’t give the performance of each class.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
def GetAccuracy(true, pred):
    pos_count = 0
    neg_count = 0
    correct = 0
    for idx, given in enumerate(true):
        if given == pred[idx]:
            correct += 1
            if given == 0:
                neg_count += 1
            else:
                pos_count += 1

    print("Accuracy: " + str(correct/len(pred)) + "\nPositive correct: " + str(pos_count) + "\nNegative correct: " + str(neg_count))

Then we can check the accuracy by running GetAccuracy(y_v, data_all), which gives an accuracy of 88.3%, though nothing in the positive class, but was able to correctly predict the negative class.

Random Forest

For the last model, I decided to throw in a random forest to see how it would perform. I suspected that it would perform quite well in the prediction, as long as the given parameters were right. Therefore, I had to use Cross Validation to find the right parameters.

Random Forests are not prone to overfitting the data and deal with unbalanced sets relatively well. We can initially run our model to find the right parameters with the following

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
clf = RandomForest(n_jobs=-1, oob_score=True)
param_grid = {
  "n_estimators":[50, 100, 200, 300],
  "criterion":["gini","entropy"],
  "max_features":[3,5,7,11,15],
  "max_depth":[5,10,15,20],
  "min_samples_split":[2,4,6]
}

grid_search = GridSearchCV(clf, param_grid, n_jobs=-1)
grid_search.fit(X, y)
print(grid_search.best_params_)

This will run cross validation on the training set to see which parameters are the best for the random forest, using all iterations. The output can be seen as the parameters for the input in the next line, which gave

1
2
3
4
5
clf = RandomForest(n_jobs=-1, oob_score=True, criterion="gini", max_depth=10, max_features=11, min_samples_split=6, n_estimators=100)

clf.fit(X, y)
y_pred = clf.predict(X_v)
print(classification_report(y_v, y_pred))
Precision Recall f1-score support
0 0.94 0.96 0.95 7274
1 0.70 0.56 0.61 964
avg/total 0.92 0.92 0.91 8238

Which obtained a 70% prediction accuracy for the yes class. This is much better than the other models so far, but how does it fare in comparison to the predicted probabilities?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
y_prob = clf.predict_proba(X_v)

class0_0 = []
class1_0 = []
class0_1 = []
class1_1 = []

for idx, y_ in enumerate(y_v):
    if y_ == 0:
        class0_0.append(y_prob[idx,0])
        class0_1.append(y_prob[idx,1])
    else:
        class1_1.append(y_prob[idx,1])
        class1_0.append(y_prob[idx,0])

plt.hist(class0_0, bins, alpha=0.5, label="No")
plt.hist(class1_0, bins, color='g', alpha=0.55, label="Yes")
plt.ylim(0,1000)
plt.xlabel("Probability")
plt.ylabel("Count")
plt.title("Random Forest Prediction Probability Histogram")
plt.legend(loc="upper center")
plt.savefig("rf_probability_histogram.png", bbox_inches="tight")

random forest probability histogram

As this shows, the probabilities are much more spread out than the other models that were used above.

Building a “Better” Model

This section is using the above models in a unique way to increase the chance of classifying as a yes. As we want to optimize this as much as possible. In order to do so, the following code is needed. This code is basically the same for all the models, with a few changes, which are explained below.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
true_positive = []
false_positive = []

for c in class0_0:
    if c <= 0.8:
        false_positive.append(c)

for c in class1_0:
    if c <= 0.8:
        true_positive.append(c)

print("Captured " + str(len(true_positive))+ " true positives")
print("Captured " + str(len(false_positive)) + " false positives")
print("Percentage of True: " + str(len(true_positive)/len(class1_0)))

The above code is for the SVM model. Essentially, what we are doing, is looking at the predicted probabilities and trying to determine what to “gate” at instead. In this instance, what we’re doing is looking at all the data and classifying anything less than or equal to 0.80 probability as being in the yes class. As the SVM probability histogram showed, this would “capture” most of the yes, with minimal no as well.

This could probably be played with to make a better heuristic, but this is a good place to start. The table below denotes the accuracy and the gate parameters that were used for the models. Random Forest was left out as the yes class was very spread out and it wouldn’t be very useful.

Model Gate True Positives False Positives Accuracy Marketing Accuracy
Support Vector Machine <=0.80 748 965 0.78 0.44
Gaussian Mixture Model <=0.05 791 1507 0.82 0.34
Gaussian Naïve-Bayes <=0.05 432 482 0.45 0.47

In the above table, the “Accuracy” is defined as the number of yes users that were “captured.” It would be better to test this on an untested set, but the assumption that I’m using is that the data is going to be pulled from the same distribution.

The “Marketing Accuracy” is the number of true positives over the total number of true and false positives. This is essentially the “overall accuracy” of the model.

Using this method, we were able to increase the probabilities from 70% on the Random Forest, to up to 82%. Depending on the cost of marketing to each user, the SVM model gives the best “bang for your buck,” as it only advertises to 1713 customers. However, if you want to obtain almost 50 more adopters of a term deposit and the marketing isn’t going to cost too much per customer, the Gaussian Mixture Model is better as it obtains 82% of the base while advertising to 2298 users.

However, if the marketing cost is very high, the random forest classifier would be better as it has the best “overall accuracy.”