Kaggle: TalkingData

A brief retrospective of my submission to a Kaggle data science competition on predicting the gender and age group of a smartphone user from their usage patterns.


TalkingData is China’s largest third-party mobile data platform. Using an SDK that is integrated with smartphone apps, they collect events generated by the smartphone user. This information is used for targeted advertising. The objective of this competition is to build a model that predicts the smartphone user’s gender and age from their app usage, geolocation and smartphone properties.



gender_age: This is the training data. Each smartphone is identified by device_id and comes with the gender, age and age group of its user. The age group is the label to be predicted and can take these values: F23-, F24-26, F27-28, F29-32, F33-42, F43+, M22-, M23-26, M27-28, M29-31, M32-38, M39+

events: Each smartphone can generate one or more events, identified by event_id. Each event is tagged with a timestamp and the latitude/longitude coordinates of the location.

app_events: For each event, the status of all the mobile apps that are integrated with the SDK is captured. There can be one or more apps for each event_id.

app_labels and label_categories: Categories of the mobile apps. An app can have one or more labels.

phone_brand_device_model: Attributes of the smartphone

More details of the data can be found here


My solution was based on the Apache Spark 2.0 ML pipeline and coded in Scala. The Spark jobs were executed on an AWS EMR cluster to generate the predictions. The EMR cluster was a 10-node cluster running on r3.xlarge EC2 instances, using a mix of on-demand and spot instances. The data files were read from S3 and the output was written back to S3. The Scala code for this project is available on GitHub.


From the events data, the following features were derived for each device_id:

  • Earliest hour of use (min_hour)
  • Latest hour of use (max_hour)
  • Count of events for each day of the week (mon_count, tue_count, wed_count,… sun_count)
  • Count of events during weekdays (weekday_count)
  • Count of events during weekends (weekend_count)
  • Count of events in AM and PM (am_count, pm_count)
  • Count of events in each hour of day (h0_count, h1_count, h2_count…h23_count)
  • Number of events (eventsPerDevice_count)
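
As a rough illustration of how these counts could be derived for a single device, here is a minimal sketch in plain Scala rather than the Spark job itself. The object name `EventTimeFeatures`, the timestamp layout `yyyy-MM-dd HH:mm:ss`, and the AM/PM split at noon are all assumptions made for the example, not details taken from the project code.

```scala
import java.time.{DayOfWeek, LocalDateTime}
import java.time.format.DateTimeFormatter

// Hypothetical helper, not the project's code.
object EventTimeFeatures {
  // Assumed timestamp layout, e.g. "2016-05-01 00:55:25"; adjust to the real data.
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  /** Derives the time-based counts from the raw timestamps of one device's events. */
  def derive(timestamps: Seq[String]): Map[String, Int] = {
    require(timestamps.nonEmpty, "a device needs at least one event")
    val times = timestamps.map(t => LocalDateTime.parse(t, fmt))
    val hours = times.map(_.getHour)
    val days = times.map(_.getDayOfWeek)
    val weekend = Set(DayOfWeek.SATURDAY, DayOfWeek.SUNDAY)

    // "MONDAY" becomes "mon_count", "TUESDAY" becomes "tue_count", and so on.
    val dayCounts = DayOfWeek.values.toSeq.map { d =>
      s"${d.name.take(3).toLowerCase}_count" -> days.count(_ == d)
    }
    // One counter per hour of the day: h0_count ... h23_count.
    val hourCounts = (0 to 23).map(h => s"h${h}_count" -> hours.count(_ == h))

    Map(
      "min_hour" -> hours.min,
      "max_hour" -> hours.max,
      "weekday_count" -> days.count(d => !weekend(d)),
      "weekend_count" -> days.count(weekend),
      "am_count" -> hours.count(_ < 12),   // assumption: AM means before noon
      "pm_count" -> hours.count(_ >= 12),
      "eventsPerDevice_count" -> timestamps.size
    ) ++ dayCounts ++ hourCounts
  }
}
```

Calling `EventTimeFeatures.derive` once per device_id, on that device's event timestamps, yields one row of features keyed by the names used in the list above.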

The gender_age, events and app_events datasets were joined, and for each device_id the following features were derived:

  • Number of apps per device (appsPerDevice_count)
  • Number of apps active per device (appsActivePerDevice_count)
  • Average number of apps per event (appsPerEvent_avg)
  • Average number of apps active during an event (appsActivePerEvent_avg)
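
One way to read these four features is sketched below in plain Scala for a single device. The `AppEvent` row type, its field names, and the choice to count distinct apps per device are assumptions made for illustration; the project code may define the counts differently. Events that have no app rows at all do not appear in the input and so are not part of the averages here.

```scala
// Hypothetical helper, not the project's code.
object AppUsageFeatures {
  // Assumed shape of one joined events/app_events row for a single device.
  final case class AppEvent(eventId: Long, appId: Long, isActive: Boolean)

  final case class Features(
    appsPerDeviceCount: Int,
    appsActivePerDeviceCount: Int,
    appsPerEventAvg: Double,
    appsActivePerEventAvg: Double
  )

  def derive(rows: Seq[AppEvent]): Features = {
    require(rows.nonEmpty, "a device needs at least one app event row")
    // Number of distinct events that reported at least one app.
    val eventCount = rows.map(_.eventId).distinct.size.toDouble
    Features(
      // Distinct apps ever seen on the device.
      appsPerDeviceCount = rows.map(_.appId).distinct.size,
      // Distinct apps seen active at least once.
      appsActivePerDeviceCount = rows.filter(_.isActive).map(_.appId).distinct.size,
      // Mean number of app rows per event.
      appsPerEventAvg = rows.size / eventCount,
      // Mean number of active app rows per event.
      appsActivePerEventAvg = rows.count(_.isActive) / eventCount
    )
  }
}
```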

ML Algorithms

This submission required a probability to be calculated for each age group given a device id. I started with a Naive Bayes classifier. The Naive Bayes model ran faster but suffered from poor accuracy in this scenario, so I switched to the Random Forest and Decision Tree classifiers. Of the two, the Decision Tree classifier ran faster, but the Random Forest classifier was more accurate. The sweet spot for Decision Tree was reached with maxBins=(32,50), maxDepth=10, maxIterations=20 and impurity=gini. The best score came from the Random Forest classifier with maxBins=(32,50), maxDepth=10, numTrees=20 and impurity=gini.
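
For reference, the two tree-based settings reported above could be expressed with the Spark ML estimators roughly as follows. This is a sketch that needs spark-mllib 2.x on the classpath, not the project's actual pipeline: the object name `ClassifierConfig` and the column names `label` and `features` are assumptions, maxBins is set to 32 as one of the two values mentioned, and maxIterations is left out because it is unclear which Spark setting it corresponds to. The split criterion is set through the estimators' `impurity` parameter.

```scala
import org.apache.spark.ml.classification.{DecisionTreeClassifier, RandomForestClassifier}

// Hypothetical holder for the two estimator configurations.
object ClassifierConfig {
  // Random Forest: the configuration reported as giving the best score.
  val randomForest: RandomForestClassifier = new RandomForestClassifier()
    .setLabelCol("label")        // assumed name of the indexed age-group column
    .setFeaturesCol("features")  // assumed name of the assembled feature vector
    .setMaxBins(32)              // 50 was the other value mentioned
    .setMaxDepth(10)
    .setNumTrees(20)
    .setImpurity("gini")

  // Decision Tree: faster to train, less accurate in this competition.
  val decisionTree: DecisionTreeClassifier = new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setMaxBins(32)
    .setMaxDepth(10)
    .setImpurity("gini")
}
```

After fitting either estimator, the resulting model adds a `probability` column by default when it transforms a dataset, which holds the per-class probabilities needed for the submission.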
