A brief retrospective of my submission for the Kaggle data science competition on forecasting inventory demand for Grupo Bimbo.
Grupo Bimbo is a bakery product manufacturer that supplies bread and bakery products to its clients in Mexico on a weekly basis. Usually a sales agent calculates the supply of each product for each store. Each transaction consists of sales and returns; returns are products that went unsold and expired. The demand for a product in a given week is defined as that week's sales minus the following week's returns. In this competition, the objective is to forecast the demand for a product at a particular store for a given week.
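The target variable follows directly from this definition. A minimal sketch in plain Scala (the column names come from the dataset; the floor at zero reflects that demand is never negative):

```scala
// Adjusted demand for a product at a store in a given week:
// units sold this week (Venta_uni_hoy) minus units returned the
// following week (Dev_uni_proxima), floored at zero — a return
// larger than the sale implies zero demand, not negative demand.
def adjustedDemand(ventaUniHoy: Int, devUniProxima: Int): Int =
  math.max(0, ventaUniHoy - devUniProxima)
```

For example, 5 units sold with 2 returned next week gives a demand of 3, while 1 unit sold with 4 returned gives 0.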
The training and test data sets, along with related tables, are shown below. More details about the dataset are available here.
My solution was based on Apache Spark ML pipelines and coded in Scala. I started with Spark 1.6.2; Spark 2.0 was released during the course of this project, so I migrated the code over to Spark 2.0. The Spark jobs were executed on an AWS EMR cluster of 8 nodes running on r3.xlarge EC2 instances, using a combination of on-demand and spot instances. Input data files for the jobs reside on S3, and outputs were written back to S3 as well. The Scala code for this project is available on GitHub.
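A job like this is typically launched from the EMR master node with `spark-submit`. The sketch below is illustrative only; the bucket, jar, and class names are placeholders, not the project's actual values:

```shell
# Hypothetical spark-submit invocation on the EMR master node.
# Reads input from S3 and writes predictions back to S3.
spark-submit \
  --class com.example.bimbo.DemandForecast \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20g \
  s3://my-bucket/jars/bimbo-forecast.jar \
  s3://my-bucket/input/train.csv \
  s3://my-bucket/output/predictions/
```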
- From the dataset the following features were available straight away – Semana, Agencia_ID, Canal_ID, Ruta_SAK, Cliente_ID, Producto_ID.
- Zip code was derived by joining the train/test dataset with the town dataset and taking a substring of the town description
- The training/test data set was grouped by each of Agencia_ID, Canal_ID, Ruta_SAK, Cliente_ID and Producto_ID individually, and the standard deviation of Venta_uni_hoy, Venta_hoy, Dev_uni_proxima and Dev_proxima was derived within each grouping (5 groupings × 4 columns = 20 features)
- The training/test data set was grouped again, this time by each two-column combination of Agencia_ID, Canal_ID, Ruta_SAK, Cliente_ID and Producto_ID, and the standard deviation of the same four columns was derived within each grouping, yielding 40 additional features (10 pairs × 4 columns)
- In total there were 67 features used to predict Demanda_uni_equil (the label)
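In the actual pipeline these aggregates were computed with Spark DataFrame operations; the per-group statistic itself can be sketched in plain Scala (helper names are mine, and this uses the population standard deviation):

```scala
// Population standard deviation of a numeric sequence.
def stdDev(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / xs.size)
}

// Group rows by a key (e.g. Producto_ID) and derive the standard
// deviation of one column (e.g. Venta_uni_hoy) per group — the same
// shape of aggregation used to build the 20 + 40 derived features.
def groupedStdDev[K](rows: Seq[(K, Double)]): Map[K, Double] =
  rows.groupBy(_._1).map { case (k, vs) => k -> stdDev(vs.map(_._2)) }
```

For instance, `groupedStdDev(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))` yields a standard deviation of 1.0 for group "a" and 0.0 for group "b".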
The label to be predicted – Demanda_uni_equil – is a continuous numeric value, so regression algorithms from Spark's Scala libraries were used. Three different regression algorithms were tried with various parameters – Linear Regression, Random Forest and Gradient-Boosted Trees.
Out of the three algorithms, Linear Regression was ruled out early in the project due to lower accuracy and longer execution time. The lower accuracy indicated that the feature vectors and the predicted label do not exhibit a linear relationship.
Random Forest regression ran the fastest of the three algorithms, although it was not the most accurate. The best outcome for Random Forest was achieved by tuning the algorithm to maxBins=[32,50], maxDepth=10 and maxIterations=20.
The best accuracy among my submissions came from Gradient-Boosted Trees (GBT). GBT was more accurate than Linear Regression, although not as fast as the Random Forest regressor. The best outcome for GBT was achieved by tuning the algorithm to maxBins=[70,90], maxDepth=10 and maxIterations=20.
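As a sketch of how these settings map onto Spark 2.0's ML API (the feature-assembly stage is omitted, and the `features` column name is an assumption, not taken from the project's code):

```scala
import org.apache.spark.ml.regression.GBTRegressor

// Sketch only: assumes a DataFrame with a "features" vector column
// assembled earlier in the pipeline, and the dataset's label column.
val gbt = new GBTRegressor()
  .setLabelCol("Demanda_uni_equil")
  .setFeaturesCol("features")
  .setMaxBins(70)   // one of the two values tried: 70 and 90
  .setMaxDepth(10)
  .setMaxIter(20)
```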
Overall, for my submission, I found Random Forest regression to be the fastest, while GBT came closest to the expected outcome.