Kaggle: TalkingData

A brief retrospective of my submission to the Kaggle data science competition on predicting the gender and age group of a smartphone user from their usage patterns. Continue reading

Kaggle: Grupo Bimbo

A brief retrospective of my submission to the Kaggle data science competition on forecasting inventory demand for Grupo Bimbo. Continue reading

Common Type 2 SCD Anti-patterns


Slowly Changing Dimensions (SCDs) are great for tracking historical changes to dimension attributes. SCDs have evolved over the years: besides the conventional type 1 (update), type 2 (add row) and type 3 (add column), there are now extensions up to type 7, as well as type 0. Almost every DW/BI project has at least a few type 2 dimensions, where a change to an attribute causes the current dimension record to be end-dated and a new record to be created with the new value. Continue Reading
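To make the type 2 behaviour concrete, here is a minimal in-memory sketch in Python. The `CustomerRow` layout, the column names and the `apply_type2_change` helper are all hypothetical and chosen only for illustration; a real warehouse would do this with SQL or an ETL tool rather than Python lists.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerRow:
    """One row of a hypothetical customer dimension."""
    surrogate_key: int
    customer_id: str                  # natural/business key
    city: str                         # the tracked attribute
    start_date: date
    end_date: Optional[date] = None   # None marks the current row


def apply_type2_change(rows: List[CustomerRow], customer_id: str,
                       new_city: str, change_date: date) -> List[CustomerRow]:
    """End-date the current row and append a new current row (type 2)."""
    result = []
    changed = False
    for row in rows:
        is_current = row.customer_id == customer_id and row.end_date is None
        if is_current and row.city != new_city:
            # Close the old version instead of overwriting it.
            result.append(replace(row, end_date=change_date))
            changed = True
        else:
            result.append(row)
    if changed:
        next_key = max(r.surrogate_key for r in rows) + 1
        result.append(CustomerRow(next_key, customer_id, new_city, change_date))
    return result


dimension = [CustomerRow(1, "C001", "Sydney", date(2015, 1, 1))]
dimension = apply_type2_change(dimension, "C001", "Melbourne", date(2016, 6, 1))
for row in dimension:
    print(row)
```

After the call, the dimension holds two rows for the same customer: the Sydney row with an end date and a new open-ended Melbourne row, so history is preserved.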

Forecasting Exchange Rates Using R Time Series

A time series is a sequence of data points collected at regular intervals of time. Statistical tools like R use forecasting models to analyse historical time series data and predict future values with reasonable accuracy. In this post I will use R time series to forecast the exchange rate of the Australian dollar, based on its daily closing rate collected over a period of two years. Continue reading

Energy Rating Analysis of Air Conditioners using R Decision Trees

A decision tree is a data mining model that graphically represents the parameters most likely to influence an outcome, along with the extent of their influence. The output resembles a tree or flowchart with nodes, branches and leaves. The nodes represent the parameters, the branches represent the classification question/decision and the leaves represent the outcome (Screen Capture 1). Internally, the decision tree algorithm performs a recursive classification on the input dataset and assigns each record to the segment of the tree where it fits closest. Continue Reading

R: Box Plot

A box plot is an effective way to visualize the distribution of your data. It only takes a few lines of code in R to come up with a basic box plot. Continue Reading
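As a rough illustration of what a box plot summarises, the sketch below computes the quartiles, whisker ends and outliers in Python using only the standard library. The `box_plot_stats` helper and the sample values are made up for this example, and it assumes the common 1.5 × IQR whisker rule; quartile conventions differ between tools, so the numbers may not match R's output exactly.

```python
import statistics


def box_plot_stats(values):
    """Return the numbers a basic box plot draws (1.5 * IQR whisker rule)."""
    data = sorted(values)
    # "inclusive" treats the data as the whole population; other
    # quartile conventions give slightly different results.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # lowest point within the fences
        "whisker_high": max(inside),   # highest point within the fences
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


print(box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
```

The box spans `q1` to `q3`, the line inside it is the median, and any value beyond the fences is drawn as an individual outlier point.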

Pig: Using CUBE Operator to Analyse Energy Rating of Air Conditioners


The CUBE operator in Pig computes all possible combinations of the specified fields. In this post I will demonstrate the use of the CUBE operator to analyse the energy rating of air conditioners in Hortonworks Data Platform (HDP). Continue Reading
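The idea of "all possible combinations" can be sketched outside Pig. The Python example below is not Pig code; it only mimics the grouping behaviour, with `None` standing in for a rolled-up dimension. The `cube` helper, the field names and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cube(records, dims, measure):
    """Average `measure` for every combination of the `dims` fields."""
    groups = defaultdict(list)
    for rec in records:
        # Every subset of dims, from none kept (grand total) to all kept.
        for size in range(len(dims) + 1):
            for kept in combinations(dims, size):
                key = tuple(rec[d] if d in kept else None for d in dims)
                groups[key].append(rec[measure])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Made-up sample data: brand, unit type and star rating.
records = [
    {"brand": "A", "type": "split", "stars": 4.0},
    {"brand": "A", "type": "window", "stars": 2.0},
    {"brand": "B", "type": "split", "stars": 3.0},
]

for key, avg in sorted(cube(records, ["brand", "type"], "stars").items(),
                       key=lambda kv: str(kv[0])):
    print(key, avg)
```

With two dimensions, each record contributes to four groupings: (brand, type), (brand only), (type only) and the grand total.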