HackerEarth wants to improve its customer experience by suggesting tags for any idea submitted by a participant for a given hackathon. Currently, tags can only be added manually by a participant. HackerEarth wants to automate this process with the help of machine learning. To help the machine learning community grow and enhance its skills by working on real-world problems, HackerEarth challenges all machine learning developers to build a model that can predict or generate tags relevant to the idea/article submitted by a participant.
You are provided with approximately 1 million technology-related articles mapped to relevant tags. You need to build a model that can generate relevant tags from the given set of articles.
The dataset consists of 'train.csv', 'test.csv', and 'sample_submission.csv'. A description of the columns in the dataset is given below:
| Variable | Description |
| --- | --- |
| id | Unique id for each article |
| title | Title of the article |
| article | Description of the article (raw format) |
| tags | Tags associated with the respective article. If multiple tags are associated with an article, they are separated by '\|'. |
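The '|'-separated tags column is easiest to work with once split into lists. Below is a minimal sketch using pandas; the inline CSV is a tiny stand-in for the real `train.csv` (same columns, made-up article text), so the snippet runs on its own.

```python
import io

import pandas as pd

# A tiny stand-in for train.csv (the real file has the same columns).
csv_text = """id,title,article,tags
HE-efbc27d,Templating in Java,Using FreeMarker templates in a servlet,java|freemarker
HE-d3ab268,Dojo form widgets,Working with select inputs in Dojo forms,forms|select|dojo
"""

train = pd.read_csv(io.StringIO(csv_text))

# Multiple tags per article are '|'-separated; split them into Python lists.
train["tag_list"] = train["tags"].str.split("|")

print(train.loc[0, "tag_list"])  # ['java', 'freemarker']
```

In the real setting you would replace the `io.StringIO` buffer with `pd.read_csv("train.csv")`.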
The submission file submitted by the candidate for evaluation has to be in the given format. The submission file is in .csv format. Check sample_submission.csv for details. Remember, in case of multiple tags for a given article, they are separated by '|'.
id,tags
HE-efbc27d,java|freemarker
HE-d1fd267,phpunit|pear|osx-mountain-lion
HE-ffd4152,javascript|jquery|ajax|onclick
HE-d3ab268,forms|select|dojo
HE-ed2fa45,php|mysql|login|locking|ip-address
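A submission in this format can be assembled with pandas once you have predictions per article. The `predictions` dictionary below is purely illustrative (the ids are taken from the sample rows above, the tags are hypothetical model output):

```python
import pandas as pd

# Hypothetical model output: article id -> list of predicted tags.
predictions = {
    "HE-efbc27d": ["java", "freemarker"],
    "HE-ffd4152": ["javascript", "jquery", "ajax", "onclick"],
}

# Join multiple tags with '|' as the submission format requires.
submission = pd.DataFrame(
    {
        "id": list(predictions.keys()),
        "tags": ["|".join(tags) for tags in predictions.values()],
    }
)
submission.to_csv("submission.csv", index=False)
```

`index=False` keeps the file to the two required columns, `id` and `tags`.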
For challenge related queries, discussions and announcements join our Slack channel.
The predicted tags will be evaluated using the F1 score metric. For each article, the F1 score is calculated as

$$\text{F1 score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
where,
$$\text{Precision}(u) = \frac{|\text{Recommended}(u) \cap \text{Testing}(u)|}{|\text{Recommended}(u)|}$$

$$\text{Recall}(u) = \frac{|\text{Recommended}(u) \cap \text{Testing}(u)|}{|\text{Testing}(u)|}$$
The final score is calculated as:
$$\text{Leaderboard score} = \frac{1}{n} \sum_{i=1}^{n} (\text{F1 score})_i$$
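The metric above can be sketched directly from the definitions: per-article precision and recall over tag sets, their harmonic mean as the F1 score, and the leaderboard score as the mean over all articles. A minimal sketch (function and variable names are my own, not part of the official scorer):

```python
def f1_per_article(recommended, actual):
    """F1 between the set of predicted tags and the ground-truth tags."""
    recommended, actual = set(recommended), set(actual)
    overlap = len(recommended & actual)
    if overlap == 0:
        return 0.0  # precision and recall are both 0, so F1 is 0
    precision = overlap / len(recommended)
    recall = overlap / len(actual)
    return 2 * precision * recall / (precision + recall)


def leaderboard_score(pred_lists, true_lists):
    """Mean per-article F1 over all n articles."""
    scores = [f1_per_article(p, t) for p, t in zip(pred_lists, true_lists)]
    return sum(scores) / len(scores)


# Example with two articles:
preds = [["java", "freemarker"], ["php", "mysql"]]
truth = [["java", "freemarker"], ["php", "mysql", "login"]]
print(leaderboard_score(preds, truth))  # 0.9 (article F1s: 1.0 and 0.8)
```

The second article illustrates the trade-off: both predicted tags are correct (precision 1.0) but one true tag is missed (recall 2/3), giving F1 = 0.8.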
---
Update: 16th October 2018: Corrections to the dataset have been made, please re-download the dataset.
Update: 21st October 2018: The leaderboard metric has been updated to F1 score.
Update: 10th December 2018 : The final leaderboard has been updated. You can check your final standings.