Challenge #3 - Machine Learning


Introduction

Machine Learning Challenge #3 was held from July 22, 2017, to August 14, 2017. Unlike the last two competitions, this one allowed the formation of teams. More than 5000 participants joined the competition, but only a few figured out how to work with a large data set in limited memory. Yes, the data set was quite large, and handling it was a challenge in itself.

With this challenge, participants also became familiar with a new boosting library called CatBoost. It will be interesting to see, in the coming days, how helpful this library turns out to be.

In this challenge, CatBoost delivered one of the winning solutions.

In this post, we'll look into the approach the winners used, in detail.

Problem Given

In this challenge, participants had to build a classifier model to help a leading European affiliate network improve its CPC (cost per click) performance. Future insights into ad performance will help the network take the necessary action to avoid adverse situations.

The train data set covered 10 days (~10 million observations), and the test data covered the next 5 days. The train data consisted mostly of categorical features; this could be one reason CatBoost delivered at-par (or better) results on this problem. You can read more about the problem statement on the competition page.

Machine Learning Challenge #3 - Winners

Rank 1 - Tinku Dhull, India

Tinku Dhull is a final-year student of Aerospace Engineering at IIT, Kharagpur, but his area of interest is machine learning and data analysis. He came to know about it during his second year of college. Since then, he has been taking online courses. He finds it amazing how machine learning and data can impart immense intelligence to a business. He dreams of building a successful career in the ML / data science domain. In this challenge, Tinku used R.

We asked him a few questions to understand how he solved the problem:

1. What approach did you follow to solve this problem?

Ans: I took the usual approach. I started by filling missing values, creating new features, splitting the training data into train and test, fitting various algorithms, choosing the best algorithm, checking the importance of all variables in model building, removing some features with very low importance, and repeating these steps to improve the model.

2. What data-preprocessing / feature engineering ideas really worked? How did you discover them?

Ans: Filling missing values in browserid with "other", devid with "-999", and so on. The next important step was to create a single name for the different names of the same browser, i.e., change "chrome" and "google chrome" to a single name such as "chrome". Then came extracting the day of the week and the hour from the date-time variable. And finally, the most important part of machine learning, i.e., feature engineering. I created many new features, including many count variables over combinations of any 2, 3, or 4 features, because these types of count variables can capture some customer behavior for a particular advertisement, country, merchant, offerid, etc. I kept on adding new features to get good results.
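The steps in this answer can be sketched roughly as follows. This is only an illustration in plain Python, not Tinku's R code: the "datetime" column name, the timestamp format, the alias table, the lowercasing step, and the sample rows are all assumptions; only the fill values and the browserid, devid, offerid, and merchant names come from the answer.

```python
from collections import Counter
from datetime import datetime

# Hypothetical alias table; the real data may use other spellings.
BROWSER_ALIASES = {"google chrome": "chrome"}


def clean_row(row):
    """Return a cleaned copy of one raw record (a dict of strings)."""
    out = dict(row)
    # Fill missing values with sentinel categories, as described in the answer.
    browser = (row.get("browserid") or "other").strip().lower()
    out["devid"] = row.get("devid") or "-999"
    # Collapse different spellings of the same browser into one name.
    out["browserid"] = BROWSER_ALIASES.get(browser, browser)
    # Extract day of week (Monday = 0) and hour from the timestamp.
    ts = datetime.strptime(row["datetime"], "%Y-%m-%d %H:%M:%S")
    out["weekday"] = ts.weekday()
    out["hour"] = ts.hour
    return out


def add_count_feature(rows, keys):
    """Add a column counting how often each combination of `keys` occurs."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    name = "count_" + "_".join(keys)
    for r in rows:
        r[name] = counts[tuple(r[k] for k in keys)]
    return name


# Made-up sample rows, for illustration only.
raw = [
    {"datetime": "2017-01-14 09:42:09", "browserid": "Google Chrome",
     "devid": "Mobile", "offerid": "o1", "merchant": "m1"},
    {"datetime": "2017-01-14 10:05:00", "browserid": "chrome",
     "devid": "", "offerid": "o1", "merchant": "m1"},
    {"datetime": "2017-01-15 23:10:30", "browserid": None,
     "devid": "Desktop", "offerid": "o2", "merchant": "m1"},
]
rows = [clean_row(r) for r in raw]
feature = add_count_feature(rows, ["browserid", "offerid"])
```

Calling add_count_feature again with other key lists, such as ["merchant", "offerid", "hour"], gives the 2-, 3-, or 4-way count variables mentioned above.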

3. Did you use cloud support for model training? How did you handle the big data set?

Ans: No. I did everything on my laptop (i7). Although the data was really huge to handle on a laptop, I used some R libraries for faster manipulation and handling of such a big data set. I used the "data.table" library in R for reading and handling the data; it is very fast for a data set of this size. For the model building process, I tried "XGBoost" first, but it was very slow. So, I switched to "LightGBM", which is again very fast and more efficient than the other algorithms I tried.

4. How does your final model look? Is it a single model?

Ans: Yes, my final model is just a single model. I did not have enough time for the competition due to my academic commitments, so in the end I created many new features and trained the model using "LightGBM". It got me rank 1 on the leaderboard, and I was well ahead of rank 2. So, I decided to keep this as my final model until someone got close to me on the leaderboard. But no one got close to my score, so I did not bother making other changes to the model. Still, I believe I could have improved the accuracy further if I had had enough time.

5. According to you, what are the 5 things a participant must focus on while solving such problems?

Ans: I always tried to learn more and more about ML by reading blogs of winners of online competitions, their approaches, and solutions. From all that I have learned so far, and from what helped me win this competition, I can share my experience:

First of all, you should understand the data. Try to understand every feature in the data set and how it relates to the target variable.
Exploring the data using plots gives an idea about features, imputation of missing values (if present), data cleaning, and pre-processing. The most important part is feature engineering. Good feature engineering can give you an edge over others in any competition.
Maximum time should be devoted to creating new features.
Then, build the model with a suitable algorithm; generally, boosting algorithms win most of the competitions. Also, try to remove less important features because they can make the model overfit, and try an ensemble of 2 or more algorithms to make the model perform better on a new data set.

6. Any feedback?

Ans: I learned a lot from this competition. Obviously, winning every competition isn't a certainty, but you can learn so much from each. This was the first time I handled such a big data set. Overall, it was a great learning experience. Thanks so much for this opportunity.

Source Code: Click Here

Rank 2 - Munish Bansal, India

Munish works at a consulting firm in Bangalore, India, that provides decision support for clients. He loves travelling, playing chess and badminton, and working on anything he thinks is challenging. The whole idea of predicting something using available information always sounded cool and interesting to him, and how exactly people do that has always intrigued him. So, he started studying it and made a career of it. In his pursuit of knowledge and thrill, he started participating in different competitions to learn about various advancements and modern algorithms. In this challenge, he used Python.

Following is the approach described by Munish:

1. What approach did you follow to solve this problem?

Ans: I started with a benchmark model and played around with different features. I tried ensembling and FFM models. In the end, it was a 2-segment model that worked the best.

2. What data-preprocessing / feature engineering ideas really worked? How did you discover them?

Ans: I played around a lot with different features that might help in determining whether a user will click or not. I tried eliminating the offerids that are not present in the test data (as we had to make predictions only for the offerids in the test data). The idea of competing offers really helped in improving the AUC. I also experimented with different counts at the site, merchant, and offer level, and their interactions.
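The offerid filtering mentioned here could look something like the sketch below. It is not Munish's code: the "click" target name and the sample rows are assumptions, and rows are represented as plain dicts for simplicity.

```python
def keep_offers_seen_in_test(train_rows, test_rows, key="offerid"):
    """Keep only training rows whose offer also appears in the test set."""
    test_offers = {r[key] for r in test_rows}
    return [r for r in train_rows if r[key] in test_offers]


# Made-up rows; "click" is an assumed name for the target column.
train = [
    {"offerid": "o1", "click": 0},
    {"offerid": "o2", "click": 1},
    {"offerid": "o3", "click": 0},
]
test = [{"offerid": "o1"}, {"offerid": "o3"}, {"offerid": "o4"}]

filtered = keep_offers_seen_in_test(train, test)
```

Here the row for "o2" is dropped because no test row uses that offer, which also shrinks the training data that has to fit in memory.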

3. Did you use cloud support for model training? How did you handle the big data set?

Ans: I used a Core i7 machine with 16 GB of RAM. I had to train on a sample, as training on such big data with many features on a local machine can be a really frustrating task.
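Munish does not say how he drew his sample. One common way to sample from a file too large to load is reservoir sampling, which reads the data once and keeps only k rows in memory. The sketch below is an assumption about how this could be done, not his method, and the generated rows stand in for lines streamed from a real file.

```python
import random


def reservoir_sample(items, k, seed=0):
    """Return up to k items drawn uniformly at random from an iterable, in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Stand-in for streaming rows from a large file, e.g. iterating over its lines.
rows = (f"row-{i}" for i in range(100_000))
subset = reservoir_sample(rows, k=1_000)
```

Because the input is consumed lazily, memory use depends on k rather than on the size of the full data set.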

4. How does your final model look? Is it a single model?

Ans: It was a 2-segment model, where I made separate predictions for countries (A, B) and the rest. I think there was still scope for improvement, but it would have required a lot of time.
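The 2-segment idea can be sketched as one model per segment plus a router that sends each row to the right one. The version below is only a structural illustration: the "country" and "click" column names are assumptions, and the per-segment "model" is a placeholder that predicts the segment's average click rate rather than the actual models Munish trained.

```python
def segment_of(row, special=("A", "B")):
    """Name the segment a row belongs to, based on its country code."""
    return "special" if row["country"] in special else "rest"


def fit_mean_rate(rows):
    """Placeholder 'model': always predicts the segment's average click rate."""
    rate = sum(r["click"] for r in rows) / len(rows)
    return lambda row: rate


def fit_segmented(rows, fit=fit_mean_rate):
    """Fit one model per segment and return a predictor that routes each row."""
    groups = {}
    for r in rows:
        groups.setdefault(segment_of(r), []).append(r)
    models = {name: fit(group) for name, group in groups.items()}
    return lambda row: models[segment_of(row)](row)


# Made-up training rows, for illustration only.
train = [
    {"country": "A", "click": 1},
    {"country": "B", "click": 0},
    {"country": "C", "click": 0},
    {"country": "D", "click": 0},
    {"country": "C", "click": 1},
    {"country": "D", "click": 0},
]
predict = fit_segmented(train)
```

Any function that takes training rows and returns a predictor can be passed as fit, so a real classifier could replace the placeholder. Note that a segment with no training rows would have no model in this sketch.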

5. According to you, what are the 5 things a participant must focus on while solving such problems?

Ans: Here's my two cents' worth:

Not all problems are the same, so even if you have solved similar problems in the past, it might not work. So, thoroughly investigate the data you are dealing with.
Think about the problem and how the inputs relate to it, and look for ways to strengthen the relationship between the target and the independent features.
Overfitting can be a big curse while dealing with large data and many features. So always do a local cross-validation to remove such variables/treatments.
Follow the solutions of winners of past challenges. Try to understand their thought process while trying to solve the problems.
Practice, practice, and practice. Even if you don't win, you learn a lot from it.
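As a rough illustration of the local cross-validation advice above, the sketch below builds shuffled k-fold splits and computes AUC, the metric Munish mentions, in plain Python. The function names are hypothetical, and the pairwise AUC is quadratic in the number of rows, so it suits small validation samples only.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs covering range(n) in k shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, valid


def auc(labels, scores):
    """Pairwise AUC: chance that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


splits = list(k_fold_indices(10, k=5))
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

One way to use this is to compare the average validation AUC with and without a candidate feature and drop the feature if the score does not improve.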

6. Any feedback?

Ans: I think that, like Kaggle, HackerEarth could have a discussion forum where people can exchange ideas and maybe build a better solution.

Source Code: Click Here

Summary - What do you learn from winners?

Data Understanding: Always spend as much time as possible to understand the given data and the relationship of the features with the target variable.
Feature Engineering: Create more and more features but also do feature selection.
Big Data: For big data problems, first build a model on a sample of the data. This should give you some idea of how well the model performs before you commit to training on the full data set.

Have anything to say? Feel free to drop your suggestions, recommendations, or concerns in the comments below.

Contributed by: Manish Saraswat
