# Challenge #2 - Machine Learning

## Introduction

This was the second edition of the machine learning challenges organised by HackerEarth. The challenge was held from June 15, 2017 to June 30, 2017. More than 6000 participants from around the world tested their knowledge (and luck) on the given problem. Surprisingly, alongside the usual XGBoost and logistic regression models, the winners also used neural networks such as MLPs and LSTMs to add diversity to their base models.

We'll look into their approaches in detail in this post. In this challenge, participants built a predictive model to help Kickstarter understand which projects are going to get fully funded in the future. This would be immensely helpful for Kickstarter, as it could channel its resources to the projects that are likely to fail. You can read more about the problem description on the Challenge Page.

## Quick Preview of Dataset

In this challenge, the train data has 108129 rows and 14 columns, and the test data has 63465 rows and 12 columns. The data set wasn't big, so participants were able to train models on laptops with as little as 4 GB of RAM.

The train data consisted of information on past Kickstarter projects and whether or not they were successfully funded. The data set comprised a mix of text, categorical, and numerical features, which gave participants a lot of scope for feature engineering.

## Machine Learning Challenge #2 - Winners

### Rank 1 - Roman Pyaknov, Russia

Roman is a 23-year-old recent graduate of Saint Petersburg State University, Russia. He started using machine learning 6 months ago. He said, "A good friend of mine asked me to participate in one of these competitions, where I got introduced to the world of machine learning, and I am very enthusiastic about it."

Roman built a weighted ensemble of 5 models (3 XGBoost and 2 LightGBM), deriving the weights using linear regression. The following is a summary of his approach to this problem:

1. Normalised 'goal' to USD.
2. Feature Engineering:
   a) Created time-based features such as:
      • Differences between pairs of time features: deadline, created_at, state_changed_at, launched_at
      • Hour, day, and weekday of the time features
      • Combinations of hour + weekday, hour + country, weekday + country, and day + country of the time features, with the mean of the target calculated for each combination
   b) Divided all the features obtained in a) by the normalised goal.
   c) Created text features such as:
      • TfidfVectorizer on keywords, desc, and name with max_features equal to 3500
      • Length of desc, name, and keywords
      • Tried word2vec for text features, but this did not improve the LB and CV scores.
3. Model Building:
   • LinearSVC and LogisticRegression on Tfidf features:
      • by words (used name, desc, keywords)
      • by chars (used name, desc, keywords)
   • 3 XGBoost and 2 LightGBM on Tfidf features and time features without 2(c)
4. Blending:
   • Finally, blended all models by time (splitting the whole data set into equal parts by time intervals) with weights from linear regression.
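
The time-based features in step 2 could be sketched roughly as follows with pandas. This is an illustration only, not Roman's actual code: it assumes the four time columns hold Unix timestamps in seconds, and the column names `goal_usd` and `final_status` are hypothetical stand-ins for the normalised goal and the target. Only the difference features are scaled by the goal here, and the target mean is computed on the whole training frame for brevity; in practice it would probably need to be computed out-of-fold to avoid leaking the target.

```python
import pandas as pd

# Time columns named in the write-up; assumed to hold Unix timestamps in seconds.
TIME_COLS = ["deadline", "created_at", "state_changed_at", "launched_at"]


def add_time_features(df: pd.DataFrame, goal_col: str = "goal_usd") -> pd.DataFrame:
    """Add pairwise time differences, calendar parts, and goal-scaled differences."""
    out = df.copy()
    diff_cols = []
    # Difference (in seconds) between every pair of time columns.
    for i, a in enumerate(TIME_COLS):
        for b in TIME_COLS[i + 1:]:
            name = f"{a}_minus_{b}"
            out[name] = out[a] - out[b]
            diff_cols.append(name)
    # Hour, day of month, and weekday (Monday = 0) of each timestamp, in UTC.
    for col in TIME_COLS:
        ts = pd.to_datetime(out[col], unit="s", utc=True)
        out[f"{col}_hour"] = ts.dt.hour
        out[f"{col}_day"] = ts.dt.day
        out[f"{col}_weekday"] = ts.dt.weekday
    # Scale the difference features by the goal (hypothetical column name).
    for name in diff_cols:
        out[f"{name}_per_goal"] = out[name] / out[goal_col]
    return out


def target_mean(train: pd.DataFrame, test: pd.DataFrame, keys: list,
                target: str = "final_status") -> pd.Series:
    """Mean of the target per key combination, learned on train, mapped onto test."""
    name = "_".join(keys) + "_target_mean"
    means = train.groupby(keys)[target].mean().rename(name).reset_index()
    merged = test[keys].merge(means, on=keys, how="left")
    # Key combinations unseen in train fall back to the global training mean.
    filled = merged[name].fillna(train[target].mean())
    return pd.Series(filled.to_numpy(), index=test.index, name=name)


# Tiny made-up sample, only to show the calls.
train = pd.DataFrame({
    "deadline":         [1_500_000_000, 1_500_600_000, 1_501_000_000],
    "created_at":       [1_499_000_000, 1_499_500_000, 1_499_900_000],
    "state_changed_at": [1_500_000_000, 1_500_600_000, 1_501_000_000],
    "launched_at":      [1_499_100_000, 1_499_600_000, 1_500_000_000],
    "country":          ["US", "US", "GB"],
    "goal_usd":         [1000.0, 5000.0, 2000.0],
    "final_status":     [1, 0, 1],
})
feats = add_time_features(train)
new = pd.DataFrame({"country": ["US", "DE"]})
enc = target_mean(feats, new, ["country"])
```

The same `target_mean` helper accepts combined keys such as `["deadline_hour", "country"]`, which corresponds to the hour + country style of combination described above.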

According to him, the five things participants should focus on while solving such problems are:

• Build a nice reliable local cross validation scheme
• Write good robust code
• Create new features
• Understand where you can overfit, and avoid those places
• Learn how to build model ensembles
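
As a rough illustration of the last point, deriving blending weights with linear regression (as in Roman's blending step) might look like the sketch below. It uses scikit-learn's `LinearRegression` on held-out predictions from the base models. The data here is synthetic, the helper name `fit_blend` is made up for this example, and the time-based splitting Roman mentions is not reproduced because the write-up does not detail it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def fit_blend(holdout_preds: np.ndarray, y_holdout: np.ndarray) -> LinearRegression:
    """Fit one weight per base model (plus an intercept) on held-out predictions."""
    return LinearRegression().fit(holdout_preds, y_holdout)


# Synthetic stand-in for held-out probabilities from three base models:
# noisy copies of the labels, clipped to [0, 1].
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
base_preds = np.column_stack(
    [np.clip(y + rng.normal(0.0, noise, size=500), 0.0, 1.0)
     for noise in (0.3, 0.5, 0.8)]
)

blender = fit_blend(base_preds, y)
weights = blender.coef_                 # one weight per base model
blended = blender.predict(base_preds)   # weighted sum of base models + intercept
```

At prediction time the same `blender.predict` call would be applied to the base models' test-set predictions, stacked column-wise in the same order.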

Roman says, "My past experience of participating in such competitions helped me formulate the winning strategy for this challenge. In the past, I've participated in 10 similar contests and was able to work out a similar approach."

### Rank 2 - Sergazy Kalmurzayev, Kazakhstan

Sergazy is a senior-year computer science student at Nazarbayev University, Kazakhstan. He has 2 years of experience competing in university math competitions and ACM ICPC.

He says, "In my free time I love surfing the internet and reading interesting things. That's how I came to machine learning. As a first-year student, I attended presentations by research professors on Kazakh NLP. They explained TF-IDF for document author recognition at that time. It was a very interesting topic, but since I had no programming experience at all, I switched to studying algorithms.

But by the end of this year, one of my friends invited me to participate in one of the Kaggle competitions; that's how I actually started learning ML."

His final model was an ensemble of an LSTM and a LightGBM model, with a lot of creative feature engineering. Sergazy has shared the following insights from his approach:

1. Feature Engineering
   • I used the bag-of-words technique for creating features. It was strange, but removing stopwords gave slightly worse results. CountVectorizer was the best of all the vectorizers available in sklearn. I thought about trying out sent2vec averaged from word2vec or GloVe vectors but didn't have time to do so.
   • I also calculated the readability score of the text description. There are nice packages on GitHub for computing readability scores; they include FleschKincaidGradeLevel, Automated Readability Index, etc.
   • I counted the number of valid English words in the description with the enchant library. I also used nltk.Vader to compute sentiment scores for the description and name of the campaign. The compoundScore especially was always at the top of the feature importance chart.
   • I computed many other text features like the number of digits, commas, len, count, etc.
   • For some reason kkstid was also in the top 5 feature importances. Inspecting the project ID, I thought that it is somehow related to time or category, so I decided to leave it in anyway.
   • Here are the mysterious features: hardness and potentiality. Their formulas are money/duration and money*duration respectively. At first glance hardness should be more important than potentiality, but guess what? It is not always true. Potentiality was the top 1 feature in my model :)
   • I also tried to cluster descriptions with k-means clustering based on tf-idf vectors, but it gave me no gain.
2. Model Building
   • I found out-of-the-box LightGBM better than XGBoost both in performance and accuracy.
   • I also tried RandomForest, but it gave a 2-3% decrease in accuracy. Others I've tried are SVM, LinearSVM, Logistic Regression, Vowpal Wabbit's logistic regression (max_features = inf), AdaBoost, ExtraTrees, kNN, LSTM, RNN, FNNs, etc. None of them came close to the Gradient Boosting Machine.
   • Since I'm a newcomer to ML competitions, I didn't know how to properly stack/ensemble models and didn't save the models that gave poor results. I soon regretted that mistake, but I didn't want to rerun those complex models, so I decided to keep only one LSTM and one LightGBM.
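
A minimal sketch of the hardness and potentiality features, together with a few of the simple text counts, is shown below. It assumes that "money" means the `goal` column and that "duration" is `deadline` minus `launched_at` converted to days; neither mapping is confirmed by the write-up, and the function name is made up for illustration.

```python
import pandas as pd

SECONDS_PER_DAY = 86_400


def add_money_and_text_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add hardness, potentiality, and simple count features for the description."""
    out = df.copy()
    # Assumption: duration is the campaign length in days (deadline - launched_at).
    duration_days = (out["deadline"] - out["launched_at"]) / SECONDS_PER_DAY
    # Assumption: "money" is the funding goal.
    out["hardness"] = out["goal"] / duration_days       # money / duration
    out["potentiality"] = out["goal"] * duration_days   # money * duration
    # Simple text statistics on the description; missing text counts as empty.
    desc = out["desc"].fillna("")
    out["desc_len"] = desc.str.len()
    out["desc_digits"] = desc.str.count(r"\d")
    out["desc_commas"] = desc.str.count(",")
    out["desc_words"] = desc.str.split().str.len()
    return out


# Made-up rows, only to show the call.
sample = pd.DataFrame({
    "goal": [1000.0, 250.0],
    "launched_at": [0, 86_400],
    "deadline": [864_000, 259_200],
    "desc": ["Raised 3,000 in 2 days, wow", None],
})
sample_feats = add_money_and_text_features(sample)
```

A campaign with zero duration would produce an infinite hardness value, so real data would likely need a guard for that case.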

His suggestions for participants to do well in current and upcoming competitions:

• Time! One should always use time efficiently, and I mean computing time. Building lightweight models is very important. I think I gave up at the end of the competition because I found testing my ideas very time-inefficient.
• Features. One should focus on generating features that worked in other similar competitions. For example, I got some ideas from the top solutions in the Quora challenge on Kaggle.
• Code. I really messed up writing good code in this competition. I started thinking about writing a general ensembling method for Python, like StackNet in Java.
• Reproducibility. As stated in the rules of any ML challenge, your code should be reproducible. I had problems with that because of using LightGBM.

Finally, many thanks to the HackerEarth team for these challenges. I know that hosting challenges is very tough work and requires many more skills than winning them.