Coupon Usage Prediction in In-Vehicle Recommendation Systems (an ML Classification Case Study)

Devansh Verma
13 min read · Mar 8, 2022


From retailers to manufacturers to small businesses, everybody uses discount coupons in their marketing strategy, not only to boost sales but also to increase customer retention. Even as buyers, it's hard to think of an online purchase we made without some discount or offer (e-commerce, food delivery, etc.). A coupon doesn't guarantee that the consumer actually comes out ahead (we most likely don't), but it gives us the feeling of a 'deal', which is reassuring and satisfying. Here's a study showing that 90% of users actively use coupons for purchases. It therefore becomes crucial for businesses to capitalize on this marketing strategy to meet their revenue and profit goals.
Now, how does this apply to our case study? Coupon marketing costs money: the coupon itself is digital, but the cost lies in WHERE the coupon is shown, i.e. the cost of online ads, affiliate marketing, etc. The data we have is specifically about recommending coupons to users on their in-vehicle mobile systems (so think of the system provider charging some money per recommendation), and we predict whether the customer will accept the coupon or not. (Assumption: acceptance of the coupon means the user actually used it.)

PROBLEM STATEMENT
Predicting whether a user will accept a coupon is a hard problem, and because of the costs involved we cannot simply recommend coupons to everyone.
Some local businesses therefore want us to build a system that accurately predicts whether a user will accept their discount coupon, while keeping marketing costs as low as possible.

DATASET OVERVIEW
The data provided to us comes from a research paper in which a group of scholars and working professionals developed a new Bayesian Rule Set model for classification and collected this dataset from a survey on Amazon Mechanical Turk (only high-rated turkers with a score above 95% were selected) to benchmark the new model. Since the survey was conducted on actual individuals, the dataset can safely be assumed to be real-world (and independent).
Get the data (and the research paper, if you're interested) here: https://archive.ics.uci.edu/ml/datasets/in-vehicle+coupon+recommendation

The dataset consists of :

  • User features like gender, marital status, income, education level, the user's general preference for the venue type, etc.
  • Contextual features like weather, temperature, whether the user is driving in the same direction as the coupon venue, etc.
  • Some general (but very useful) features like the type of coupon, the time before the coupon expires, the driving distance to the coupon venue, etc.
  • And finally the label/target (0 = Not Accepted, 1 = Accepted)

MAPPING TO ML PROBLEM/METRIC
For a given customer, we need to predict whether he/she will accept the coupon or not using a mix of user, demographic and general contextual features. We can pose this as a simple binary classification problem.
The primary ML metric is the F1-score. Reason: we want high precision, to keep promotional marketing costs down and, to some extent, not hurt customer experience, and high recall, to increase revenue as much as possible by not missing out on potential customers.

F1-Score
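For reference, the F1-score is simply the harmonic mean of precision and recall:

```latex
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```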

The secondary ML metric is AUC. Reason: it measures how well the model discriminates between the two classes.

Area under the ROC curve (AUC)

BUSINESS CONSTRAINTS

  • High Recall and High Precision, both are very important.
  • Low prediction latency is required, as we want to recommend the coupon while the user is in their vehicle and in the neighborhood of the venue.
  • Interpretability will help but is not critical.

We will tackle this problem by breaking it into two subparts, featurization and modelling. We'll use a wide range of techniques and conduct lots of experiments to see what works. Let's go step by step.

EXPLORATORY DATA ANALYSIS
This is the very first, but also the most important, part of our case study (or any project!). It will give us interesting insights into which features are important and, above all, the general trends/patterns that decide whether a person accepts or rejects a coupon. Let's have a look at the data:

Note: not all columns are shown.

Our dataset has shape (12684, 26), just 12684 instances! This is a big problem: with such a small dataset we can easily run into issues like the curse of dimensionality and overfitting. Moreover, every single attribute in this dataset is categorical/discrete in nature. That's right, not a single numerical column. Even attributes like age and temperature were bucketized before the data was released.
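A quick look with pandas confirms this (the filename below is just a placeholder for the CSV downloaded from the UCI archive):

```python
import pandas as pd

# Placeholder filename for the CSV from the UCI repository
df = pd.read_csv("in-vehicle-coupon-recommendation.csv")

print(df.shape)                   # (12684, 26)
print(df.dtypes.value_counts())   # every column comes in as object/categorical
print(df.head())
```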

Missing values

Luckily, we don't have a lot of missing values except for a few columns. The car attribute has more than 99% missing values (probably because it was an optional question in the survey), so we will simply drop it, as imputing in so many places could risk changing the distribution of the data. The other attributes have only around 1% missing values, so we'll impute them with the mode.
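A minimal sketch of this step, assuming the column names from the UCI dataset:

```python
# Fraction of missing values per column
missing_pct = df.isna().mean().sort_values(ascending=False)
print(missing_pct.head())

# 'car' is >99% missing, so drop it outright
df = df.drop(columns=["car"])

# The remaining columns have ~1% missing values: impute with the mode
for col in df.columns[df.isna().any()]:
    df[col] = df[col].fillna(df[col].mode()[0])
```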

Distribution of classes

We don't suffer from severe class imbalance, but balancing can still be part of the modelling experiments.

Univariate Analysis
Since all attributes are categorical, we will look at the discrete distributions of various variables against the target variable.

  1. Temperature and Weather effect on target variable

As we can see, coupon acceptance is higher when the temperature is on the higher side and when people have 24 hours before the coupon expires.

2. Distribution of Coupon types in data

The largest number of coupons is for coffee houses, followed by cheap restaurants and takeaway.

3. Coupon type with target

As is evident from the plot above, cheap restaurants and carry-out/takeaway have the highest coupon acceptance rates. For coffee houses the picture is mixed, and for bars and expensive restaurants people mostly reject the coupons.
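As a sketch, plots like these can be produced with a simple countplot per category against the target (column names assumed from the UCI dataset, where the label column is Y):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Acceptance counts per coupon type, split by the target label
sns.countplot(data=df, x="coupon", hue="Y")
plt.xticks(rotation=30)
plt.title("Coupon type vs. acceptance")
plt.tight_layout()
plt.show()
```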

Bivariate Analysis
Since there are no numerical variables, we don't do bivariate analysis in the usual sense; instead we combine pairs of categorical variables and check their effect on the target variable.
1. Passenger and destination on target variable

When people are with friends and have no urgent place to go, they will most likely accept the coupon.

2. Gender and time of day

We can observe that at night, females are less likely to accept coupons than males, whereas in the evening both genders are likely to accept them.

Q. Let's ask a simple question: if people have a general interest in a particular venue (measured by the number of times they visit that venue per month), will they accept its coupon?

We can clearly see from the plots above a general trend: people who frequently visit a venue tend to accept its discount coupons.

  • For cheap restaurants, very few people have never visited one, and almost everyone accepts cheap-restaurant coupons more often than not.
  • For coffee houses, expensive restaurants and bars, people who have visited at most once or never have rejected the respective coupons. Acceptance increases as people's monthly visit frequency increases.
  • For carry-out/takeaway coupons, since it's a quick grab, most people accept them.
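These acceptance rates for feature combinations can be computed directly with a groupby (again assuming the UCI column names, e.g. gender, time and the label Y):

```python
# Acceptance rate for each (gender, time-of-day) combination
rate = (
    df.groupby(["gender", "time"])["Y"]
      .mean()
      .unstack()
      .round(2)
)
print(rate)
```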

Preprocessing
From EDA we decided to drop one of the perfectly correlated columns direction_opposite and direction_same (when one is 0 the other is 1) and to combine toCoupon_GEQ5min, toCoupon_GEQ15min and toCoupon_GEQ25min into a single column using a simple if-else.
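A rough sketch of what that preprocessing could look like (the exact column names and bucket meanings are assumptions based on the dataset description):

```python
# direction_opp is perfectly (negatively) correlated with direction_same, so keep only one
df = df.drop(columns=["direction_opp"])

# Collapse the three driving-distance indicators into one ordinal feature
def distance_bucket(row):
    if row["toCoupon_GEQ25min"] == 1:
        return 2   # venue is more than 25 minutes away
    if row["toCoupon_GEQ15min"] == 1:
        return 1   # venue is 15-25 minutes away
    return 0       # venue is 5-15 minutes away

df["toCoupon_distance"] = df.apply(distance_bucket, axis=1)
df = df.drop(columns=["toCoupon_GEQ5min", "toCoupon_GEQ15min", "toCoupon_GEQ25min"])
```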

A lot more EDA was done (like using decision trees and combining columns for visualization) and some useful inferences were drawn, so if you're interested you can have a look here.

FEATURE ENGINEERING/ENCODING

In feature engineering we create 4 separate feature-encoded datasets:

  1. One-Hot Encoding

2. K-fold target encoding. (Since the dataset is small, plain target encoding would introduce unnecessary variance into the data and lead to overfitting, so we take a cross-validation approach to the encoding as well; see the sketch below.)


3. Binary Encoding + Label Encoding

4. Mixed Encoding.
This is something I tried by combining all the above encoding methods. The logic was K-fold target encoding (for nominal variables with high cardinality) + OHE (for nominal variables with low cardinality) + label encoding (for ordinal variables). All of the functions above were reused to build it.

Note: you might wonder why I create a separate copy of the dataframe for each encoding set, since this could cause memory issues. You are absolutely right about that. I did this mainly because our dataset is extremely small; if you're dealing with a large dataset, consider adding the encodings as extra columns in the original dataframe itself, and for something like OHE, maintain a sparse matrix.
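Here is a minimal sketch of the K-fold target encoding idea (the helper name and parameters are illustrative, not the exact code from the project):

```python
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(train, col, target="Y", n_splits=5, seed=42):
    """Out-of-fold mean (target) encoding to limit leakage on a small dataset."""
    encoded = pd.Series(index=train.index, dtype=float)
    global_mean = train[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        # Category means are computed on the "fit" folds only...
        fold_means = train.iloc[fit_idx].groupby(col)[target].mean()
        # ...and applied to the held-out fold; unseen categories fall back to the global mean
        encoded.iloc[enc_idx] = (
            train.iloc[enc_idx][col].map(fold_means).fillna(global_mean).values
        )
    return encoded

# Usage (hypothetical column): df["occupation_enc"] = kfold_target_encode(df, "occupation")
```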

MODELLING

Base Model
Initially we created a base model, logistic regression on the OHE-encoded dataset, and got both an F1-score and an AUC of ~0.73. This tells us that any model that scores below this will be useless for us.
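A sketch of such a baseline (the split ratio and the variable names X_ohe and y are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(
    X_ohe, y, test_size=0.2, stratify=y, random_state=42
)

base = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("F1 :", f1_score(y_test, base.predict(X_test)))
print("AUC:", roc_auc_score(y_test, base.predict_proba(X_test)[:, 1]))
```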

Simple Models
We try simple models like KNN, decision trees, etc. without tuning on each feature-encoded dataset and check their performance:

We can see that Linear-SVM performs best on all sets, with the highest F1-score. We tune the Linear-SVM, and it will now serve as the baseline for all our advanced models.
The hyperparameter-tuned Linear-SVM gave an F1-score of 0.747.

Advanced models
To start, we used these 3 models with extensive hyperparameter tuning using randomized search. Scores are on test data.

  • Random-Forest (300 trees) gave an F1-Score = 0.780 and AUC = 0.80 (on Binary encoding + Label encoding dataset)
  • RBF-SVM gave F1-Score = 0.788 and AUC = 0.80 (on OHE dataset)
  • XGBoost gave F1-Score = 0.784 and AUC = 0.81 (on Binary encoding + Label encoding dataset)

RBF-SVM gave the highest F1-score and XGBoost the highest AUC.
One more thing to note: RBF-SVM took ~30 minutes of hyperparameter tuning for just 5 parameter combinations (n_iter=5) across all datasets, whereas XGBoost took the same time for 100 combinations (n_iter=100) across all datasets (except OHE, where the dimensionality is huge and tree-based models are very slow).
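For illustration, the randomized search for XGBoost could look something like this (the search space below is an assumption, not the exact grid used in the project):

```python
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_distributions=param_dist,
    n_iter=100, scoring="f1", cv=3, n_jobs=-1, random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```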

CATBOOST (Hero of this case-study !)


Before I show CatBoost's performance on the data, let's discuss what CatBoost is and why we should use it when we already have XGBoost.
CatBoost (categorical boosting) is a boosting framework developed by Yandex. It tries to solve the problem of overfitting using ordered boosting and ordered target encoding, which isn't addressed in other gradient boosting implementations like LightGBM, XGBoost, etc.

  • Ordered encoding and ordered boosting: CatBoost handles the encoding of categorical features automatically. It first creates a random permutation of the rows, and the target statistic for each value of a feature is then calculated using only the rows that appear above it in the permutation (kind of turning it into an artificial time series). A similar technique is applied to boosting, where the model used to compute the residual for the i-th point is trained only on the points before it. These two methods greatly reduce variance.

This is a very high-level overview of CatBoost, and there are a lot more important details involved. Explaining them here would make this article unnecessarily long, so I'll point you to 2 awesome resources that explain it far better than I could.

This algorithm is very useful for our problem, as our dataset suffers from exactly these two issues: the data is small, which encourages overfitting, AND all attributes are categorical, so proper categorical encoding without introducing variance is important.

CatBoost, after extensive hyperparameter tuning (randomized search with n_iter=300), gave an F1-score of 0.803 and an AUC of 0.835. This is by far the highest we have achieved with any model. We also didn't have to do any encoding on the dataset; the raw data was given to CatBoost directly.
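A minimal sketch of fitting CatBoost on the raw categorical data (the hyperparameter values shown are illustrative, not the tuned ones):

```python
from catboost import CatBoostClassifier

# Every remaining attribute is categorical, so pass them all as cat_features
cat_features = list(X_train.columns)

cb = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    eval_metric="F1",
    random_seed=42,
    verbose=200,
)
cb.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test))
```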

OTHER EXPERIMENTATIONS
I also tried a few other techniques like :-

  • Feature expansion: used K-modes (think of it as K-means for discrete data) to cluster the data into 4 clusters (the elbow point) and add the cluster labels as a feature. Didn't work.
  • Feature selection: I tried Relief-F feature selection after reading this excellent paper, where scholars were trying to solve a very similar coupon redemption problem using this feature selection method. Relief-F takes the target variable into account, is aware of feature interactions, makes no assumption about the independence of attributes, and works very well with tree-based algorithms. Here is a detailed article on it.
    I tried it with XGBoost and CatBoost: it increased XGBoost's performance slightly but not significantly, and CatBoost's performance stayed about the same, so this didn't work either.
  • Stacking: yes, how could we miss this! Even though this is not a Kaggle problem and the dataset is quite small (stacking might overfit on small data, so be wary), let's give it a go; a minimal sketch of the stacking setup is shown below.
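The sketch uses scikit-learn's StackingClassifier; the choice of base estimators and their settings is illustrative, not the exact set of tuned models from the project:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# Hyperparameter-tuned base models would go here; these are placeholders
base_estimators = [
    ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
    ("svm", SVC(kernel="rbf", probability=True, random_state=42)),
    ("xgb", XGBClassifier(eval_metric="logloss", random_state=42)),
]

stack = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    n_jobs=-1,
)
stack.fit(X_train, y_train)
print("Stacking F1:", f1_score(y_test, stack.predict(X_test)))
```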

I kept the number of base estimators to use in this stacking model (all were hyperparameter-tuned on each dataset and the best ones chosen) as a hyperparameter 'k', and here are the scores:

The highest the stacker could achieve was an F1-score of ~0.78. We could easily beat this with a properly tuned XGBoost, so this didn't improve performance either.

FINAL RESULTS

Results Summary on Best performers

CatBoost is clearly the winner, with the highest F1-score of 0.80 and AUC of 0.83. Interestingly, all tree-based models (except CatBoost) perform best on the Set-3 data (binary + label encoding), while the others perform best on the Set-1 data (OHE).

Note: after hyperparameter tuning, the best simple model changes from Linear-SVM to KNN, with an F1-score of 0.756 (still not a big difference). K-fold target encoding (Set-2 and Set-4) does not work well with any model.

DEPLOYMENT & PRODUCTIONIZATION

After extensive experiments and analysis of the results, the CatBoost model was finally chosen. It's the best choice for productionization even if we ignore the metrics, as prediction is lightning fast thanks to its use of oblivious trees: these are balanced trees with a single splitting condition per level, which effectively reduces each tree to a short list of conditions. CatBoost also has nice GPU support.
I built the API using Flask and the frontend (which is a disaster, I know, but I somehow made it work 😜) using HTML and Bootstrap 4, and deployed it on Heroku.
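A bare-bones sketch of what such a prediction endpoint could look like (the route, field handling and model file name are assumptions, not the deployed app's exact code):

```python
import pandas as pd
from flask import Flask, request, jsonify
from catboost import CatBoostClassifier

app = Flask(__name__)

# Assumed artifact name for the trained model
model = CatBoostClassifier()
model.load_model("catboost_coupon.cbm")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects one JSON record of raw categorical fields keyed by column name
    record = pd.DataFrame([request.get_json()])
    prediction = int(model.predict(record)[0])
    return jsonify({"accepted": prediction})

if __name__ == "__main__":
    app.run()
```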

Webpage

With the default values of the form we get an accepted/1 prediction, and when we change a few attributes, like gender = Female, time = Night and coupon expiry = 2 hrs (values inspired by the EDA), we get a not-accepted/0 prediction. The second form accepts raw test data and test labels for performance evaluation.

Here's the LINK to the site in case you want to play with it.

FUTURE WORK:

  • Since this was a machine learning case study and I have limited knowledge of deep learning, I shied away from deep-learning-based approaches. But especially for categorical attributes, one could try entity embeddings and see if they improve performance.
  • More feature encoding techniques can be tried, like weighted-mean encoding and hashing encoding. Different feature selection techniques, like entropy-based selection and variance thresholding (for binary variables), can also be experimented with.
  • Dimensionality reduction techniques and class balancing could be tried as well.

CONCLUSION

This was my first real-world, self-directed case study in machine learning. It taught me a wide range of feature engineering and feature encoding techniques, state-of-the-art ML models and, most importantly, how crucial it is not to overfit when you have little data. It's also very important to experiment as much as you can before reaching a final conclusion, rather than jumping to a model or method just because of prior knowledge or experience with it; that is what I have tried to convey through this problem.
That concludes my blog. I hope you enjoyed reading it!

Note: I tried to summarize as much of my project as I could in this blog, but many more inferences on the data, error analysis and all the code are available in this repository. If you have any concept- or code-related question about this case study, feel free to comment below or DM me on LinkedIn.

If you're someone who enjoys implementing real-world projects, models and algorithms from scratch, hit that green FOLLOW button as hard as you can!


Let's connect!
LinkedIn
GitHub
