Machine Learning
In very simple terms, Machine Learning is about training computers to make decisions or take actions without being explicitly programmed to do so. For example, whenever you read a tweet or a movie review, you can tell whether the views expressed are positive or negative. But can you teach a computer to determine the sentiment of that text? This has many real-life applications. For instance, when Donald Trump makes a speech, Twitter responds with a range of sentiments, and his campaign team can assess the overall reaction using machine learning.
Another example: Baidu predicted that Germany would win the 2014 World Cup even before the match was played.
Weather Problem
Consider this small dataset of favorable weather conditions for playing a game. The goal is to forecast whether one can play the game based on the given conditions.
Outlook | Temperature | Humidity | Windy | Play |
---|---|---|---|---|
Sunny | Hot | High | False | No |
Rainy | Mild | High | False | Yes |
Sunny | Cool | Normal | False | Yes |
Definitions
Feature/Attribute: Outlook, Temperature, Humidity, and Windy are features or attributes that influence the outcome.
Outcome/Target: The result to be predicted, i.e., whether you can play or not.
Vector: A row in the dataset representing an ordered collection of features (e.g., Sunny, Hot, High, False).
ML Model: What a learning algorithm produces from the training data; common algorithms include Decision Trees, SVM, and Naive Bayes.
Error Metric/Evaluation Metric: A measure of how accurate an ML model's predictions are. Different problem types call for different metrics.
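The definitions above can be made concrete with a small sketch in plain Python (no ML library), using the three rows of the weather dataset; the variable names here are illustrative, not part of any API:

```python
# Each row of the weather dataset is a feature vector plus a target value.
dataset = [
    {"Outlook": "Sunny", "Temperature": "Hot",  "Humidity": "High",   "Windy": False, "Play": "No"},
    {"Outlook": "Rainy", "Temperature": "Mild", "Humidity": "High",   "Windy": False, "Play": "Yes"},
    {"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "Normal", "Windy": False, "Play": "Yes"},
]

features = ["Outlook", "Temperature", "Humidity", "Windy"]  # the attributes
target = "Play"                                             # the outcome to predict

# The first row as a feature vector and its outcome:
x0 = [dataset[0][f] for f in features]  # ['Sunny', 'Hot', 'High', False]
y0 = dataset[0][target]                 # 'No'
```

An ML model would learn a mapping from vectors like `x0` to outcomes like `y0`.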
Supporting ML Problems on HackerEarth
HackerEarth’s ML platform supports a typical machine learning flow. A dataset is split into training and test sets. Users train their models on the training set and predict outcomes on the test set. The test set does not include the target variable.
Example Dataset
Outlook | Temperature | Humidity | Windy | Play |
---|---|---|---|---|
Sunny | Hot | High | False | No |
Rainy | Mild | High | False | Yes |
Sunny | Cool | Normal | False | Yes |
Overcast | Hot | High | False | Yes |
Rainy | Mild | High | False | Yes |
Overcast | Hot | Normal | False | Yes |
Sunny | Mild | Normal | True | Yes |
Sunny | Mild | High | False | No |
Overcast | Cool | Normal | True | Yes |
Rainy | Mild | High | True | Yes |
Train Dataset (train.csv)
Outlook | Temperature | Humidity | Windy | Play |
---|---|---|---|---|
Sunny | Hot | High | False | No |
Rainy | Mild | High | False | Yes |
Sunny | Cool | Normal | False | Yes |
Overcast | Hot | High | False | Yes |
Rainy | Mild | High | False | Yes |
Overcast | Hot | Normal | False | Yes |
Test Dataset (test.csv)
Id | Outlook | Temperature | Humidity | Windy |
---|---|---|---|---|
1 | Sunny | Mild | Normal | True |
2 | Sunny | Mild | High | False |
3 | Overcast | Cool | Normal | True |
4 | Rainy | Mild | High | True |
Notice the absence of the target variable in the test data.
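The train-then-predict flow can be sketched with a deliberately trivial baseline model: predict the majority class seen in train.csv. A real submission would train something like a decision tree instead; this only illustrates the shape of the workflow:

```python
from collections import Counter

# train.csv: four feature values plus the Play target per row.
train = [
    ("Sunny",    "Hot",  "High",   False, "No"),
    ("Rainy",    "Mild", "High",   False, "Yes"),
    ("Sunny",    "Cool", "Normal", False, "Yes"),
    ("Overcast", "Hot",  "High",   False, "Yes"),
    ("Rainy",    "Mild", "High",   False, "Yes"),
    ("Overcast", "Hot",  "Normal", False, "Yes"),
]

# "Training" the baseline: find the most common target value.
majority = Counter(row[-1] for row in train).most_common(1)[0][0]

# test.csv: an Id plus the four features, with no target column.
test_rows = [
    (1, "Sunny",    "Mild", "Normal", True),
    (2, "Sunny",    "Mild", "High",   False),
    (3, "Overcast", "Cool", "Normal", True),
    (4, "Rainy",    "Mild", "High",   True),
]

# Predict one outcome per test Id; written out, this becomes the
# submission file (user_prediction.csv).
predictions = [(row[0], majority) for row in test_rows]
```

Here the baseline predicts "Yes" for every Id, since "Yes" dominates the training set.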
User Prediction File (user_prediction.csv)
Id | Play |
---|---|
1 | Yes |
2 | Yes |
3 | No |
4 | No |
Correct Prediction File (correct_prediction.csv)
Id | Play |
---|---|
1 | Yes |
2 | No |
3 | Yes |
4 | Yes |
Evaluation Metric
During the contest, only 50% of the test dataset is used for evaluation to discourage overfitting. The evaluation metric is defined as:
Score = Number of correct predictions / Total rows
Suppose the online evaluation uses IDs 1 and 2. Of those two, only ID 1 is predicted correctly, so:
Score online = 1 / 2 = 0.5
After the contest, the model is evaluated on the full test dataset:
Score offline = 1 / 4 = 0.25
This demonstrates how overfitting can reduce real-world model performance. Online evaluations using partial data help encourage more generalizable solutions.
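The scoring rule above can be reproduced directly from the two prediction files, assuming the online split is IDs 1 and 2:

```python
# Predictions keyed by Id, taken from the two files above.
user    = {1: "Yes", 2: "Yes", 3: "No",  4: "No"}   # user_prediction.csv
correct = {1: "Yes", 2: "No",  3: "Yes", 4: "Yes"}  # correct_prediction.csv

def score(ids):
    """Fraction of Ids on which the user's prediction matches the answer."""
    return sum(user[i] == correct[i] for i in ids) / len(ids)

online_score  = score([1, 2])        # 50% of the test set, used during the contest
offline_score = score([1, 2, 3, 4])  # full test set, used after the contest
# online_score  -> 0.5
# offline_score -> 0.25
```

The gap between the two scores is exactly the effect the partial online evaluation is meant to expose.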