A brief retrospective of my submission for Kaggle data science competition that predicts the gender and age group of a smartphone user based on their usage pattern. Continue reading
Slowly Changing Dimension (SCD) is great for tracking historical changes to dimension attributes. SCDs have evolved over the years and besides the conventional type 1 (update), type 2 (add row) and type 3 (add column), now there are extensions up to type 7 including type 0. Almost every DW/BI project has at least few type 2 dimensions where a change to an attribute causes the current dimension record to be end dated and creates a new record with the new value. Continue Reading
Time Series is the historical representation of data points collected at periodic intervals of time. Statistical tools like R use forecasting models to analyse historical time series data to predict future values with reasonable accuracy. In this post I will be using R time series to forecast the exchange rate of Australian dollar using daily closing rate of the dollar collected over a period of two years. Continue reading
Decision tree is a data mining model that graphically represents the parameters that are most likely to influence the outcome and the extent of influence. The output is similar to a tree/flowchart with nodes, branches and leaves. The nodes represent the parameters, the branches represent the classification question/decision and the leaves represent the outcome (Screen Capture 1). Internally, decision tree algorithm performs a recursive classification on the input dataset and assigns each record to a segment of the tree where it fits closest. Continue Reading
Box plot is an effective way to visualize the distribution of your data.It only takes a few lines of code in R to come up with a basic box plot. Continue Reading
CUBE operator in Pig computes all possible combination of the specified fields. In this post I will demonstrate the use of Cube operator to analyse energy rating of air conditioners in Hortonworks Data Platform (HDP). Continue Reading
In this post I will demonstrate how to use Pig’s GROUP operator to analyse credit card expenses and determine the top expenses for the year and their percentage of the total expense. This exercise was done in Hortonworks Data Platform (HDP). Continue Reading
Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstracts the implementation of mappers and reducers with a simple yet powerful SQL like query language. To demonstrate the inbuilt capabilities of HiveQL, I will be analysing hashtags from a twitter feed on Hortonworks Data Platform (HDP). Continue Reading
Power Query can discover and import data from websites. Often data warehouses rely on external data which is readily available in public websites for e.g. public holidays, school holidays, daylight savings, SIC codes, SWIFT codes, post codes etc. Power Query is perfectly suitable for such situations. Power Query can discover, fetch and transform data from a HTML table in a web page into a format that can be easily imported into data warehouses using SSIS package. It’s like an ETL tool for the web page data source.
Storage cluster (HDFS) in Hadoop is also the Processing cluster (MapReduce). Azure provides two different options to store data:
Option 1: Use HDInsight cluster to store data as well as to process MapReduce requests. For e.g. a Hive database hosted in an HDInsight cluster which also executes HiveQL MapReduce queries. In this instance data is stored in the cluster’s HDFS.
Option 2: Use HDInsight cluster to only process MapReduce requests whereas data is stored in Azure blob storage. For e.g. the Hive data is stored in Azure storage while the HDInsight cluster executes HiveQL MapReduce queries. Here the metadata of Hive database is stored in the cluster whereas the actual data is stored in Azure storage. The HDInsight cluster is co-located in the same datacentre as the Azure storage and connected by high speed network.
There are several advantage of using Azure storage (Option 2). Continue Reading
Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstracts the implementation of mappers and reducers with a simple yet powerful SQL like query language. To demonstrate the inbuilt capabilities of HiveQL, I will be analysing hashtags from a twitter feed on Azure HDInsight platform. Continue Reading
This post is a tutorial to get started on Hive in HDInsight.
The steps to be followed are given below. As a pre-requisite you would need a subscription to Microsoft Azure to try out these steps
- Provision Azure Storage Account
- Provision HDInsight Cluster
- Create Hive Database and Tables
- Prepare Data as Ctrl-A separated Text Files
- Upload Text Files to Azure Storage
- Load Data to Hive
- Execute HiveQL DML Jobs
Partitions improve the query response of SSAS cube by narrowing the MDX query to a pre-defined subset of data. Usually cubes are developed with few basic partitions to begin with and new partitions are added over time. Partitions are created using XMLA commands. The rest of this post explains the steps to create partitions using metadata and SSIS package. Continue Reading
SSAS Partitions are primarily geared towards better query performance by narrowing the MDX query to a pre-defined subset of data. Each partition must be in processed state for the cube to be available for browsing. However in a delta ETL run, it is more than likely that only some partitions would be refreshed with data while the rest remain unchanged. For example in a hospital admissions cube partitioned by month, data movements are frequent for admissions made during the current/last month compared to earlier months. This means if the partitions updated by the ETL run can be identified, then we can tailor SSIS package to dynamically process only those partitions that require data refresh while leaving the remaining partitions untouched. Continue Reading
Canned reports/dashboards aka custom reports or out of box reports are inevitable part of any BI ecosystem. It’s a good point of reference for conversations with business users especially at the start of a project when requirements are often a bit hazy. In many instances, business users see report layouts as an effective and convenient medium of communication with BI project teams. This however should not be considered as THE business requirement. Instead of dwelling on canned reports, the focus should be on building a well-designed DW/BI system that empowers users to perform analysis on a self-service basis. The quality and the range of analysis that a self-service BI platform provides far outweighs the benefits of just rolling out a set of canned reports. It empowers business users to do their own analysis while the BI team can work towards expanding and refining the analytical platform instead of operating as a “reporting team”. Continue Reading
The execution plan of SSAS processing engine triggers a sequence of processing tasks based on the object’s current processed state and its dependencies with other objects. This is what happens when the SSAS cube is processed from SQL Server Management Studio (SSMS). It is possible to replicate this execution plan close enough in an SSIS package so that the cube is always processed and refreshed with latest data.
While SSAS allows you to build a cube straight of the fact and dimension tables, however there are benefits in creating the cube from database views that sits on top of the underlying fact and dimension tables. Here’s why. Continue reading
In SSRS report, when a report parameter is role playing dimension then the parameter value passed to each dataset has to be customized depending on the context in which the dimension is being used. In an earlier post, I explained how this is done in PerformancePoint dashboard. In this post an equivalent solution for reporting services is provided. Continue Reading
Date dimension is inevitably used in most dashboards. Due to its role playing nature, a specific date dimension used as slicer in one report is different from the date dimension used for another report on the same dashboard page. For example sales reports are based on order date whereas delivery reports are based on delivery dates. On a dashboard that displays both sales and delivery reports, it would not be good for navigation purposes to have both order date and deliver date filters on the same page. The choice of which date dimension to use as filter among the available dates is a tactical decision. Having chosen a particular date dimension, the challenge is to provide filter value to those reports that use a different date dimension. This is where PerformancePoint filter Connection Formula comes into play as explained in the example below.
Dashboard filters is one of the key components of BI Platform that drive reports and KPI’s in various dashboards. MDX expressions to a great extent enables to customize PerformancePoint filters and enhance the end user experience.
First – a quick look at member selection option. Member selection is the easiest way to build dashboard filters which does not require MDX expressions (Screen capture 1). Member selection option allows to select specific members, although this is not usually recommended as the filter can become quite static. A better method is to select Children, Leaves or Descendants with additional option to include or exclude default member. For hierarchies, member selection option also allows selection members from a particular level of hierarchy.
Screen Capture 1 – Member Selection
While member selection option may be sufficient for some business scenarios, using MDX Expressions (Screen Capture 2) enables to customize dashboards without hardcoding the member selection.
Screen Capture 2 – MDX Expression Option for PerformancePoint Filters
Some common and often repeated dashboarding scenarios where MDX expressions for PerformancePoint filters are useful is discussed below
Display only members that have data
It’s not particularly intuitive for business users to select an item from drop down only to find that there is no data for the dashboard to display.
The following example uses FILTER function to select only Product Categories that have a sale.
FILTER([Product].[Category].Children,[Measures].[Internet Order Count] > 0)
Screen Capture 3 – MDX Expression for PerformancePoint Filter
Display only last N time periods
Often business is only keen on trend analysis for the last few years. All time periods from beginning of time need not be displayed in dashboard.
The following MDX expression displays the last 4 fiscal years from current fiscal year. When the fiscal year changes so does the filter data. Note that for this MDX expression to work, the date dimension must have current attributes defined.
(EXISTS([Date].[Fiscal Year].Children,[Date].[Current Fiscal Year].&[Y]).Item(0).LAG(-4) : EXISTS([Date].[Fiscal Year].Children,[Date].[Current Fiscal Year].&[Y]).Item(0))
Display time periods in reverse chronology – latest dates first
Another common requirement for business users is to see the latest dates first.
Using the ORDER function in the previous MDX expression, the latest fiscal years can be displayed as the first item in the drop down.
ORDER((EXISTS([Date].[Fiscal Year].Children,[Date].[Current Fiscal Year].&[Y]).Item(0).LAG(-4) : EXISTS([Date].[Fiscal Year].Children,[Date].[Current Fiscal Year].&[Y]).Item(0)) , [Date].[Fiscal Year].CurrentMember.Properties('key'),DESC)
Hide unknown members
It a good to handle unknown dimension members in datamarts which in turn helps to address data quality issues. However bad data need not be displayed in dashboard filters.
The following MDX expression gets all the Sales Reason members except the unknown members.
([Sales Reason].[Sales Reason].Children -[Sales Reason].[Sales Reason].&[UNAVAILABLE] -[Sales Reason].[Sales Reason].&[NOT APPLICABLE] -[Sales Reason].[Sales Reason].&[UNKNOWN])
The above examples demonstrate the ease with which seriously rocking PerformancePoint filters can be created using simple MDX expressions that create business friendly dashboards.
João Lopes (@SQLSniper) January 15, 2013
Derek Goodridge (@workerthread) January 14, 2013
In a data warehouse, Periodic Snapshot table is a type of fact table that “freezes” the measure value at recurring reporting periods. This enables reporting of measures AS AT a point in time. Periodic Snapshots are usually created from Transaction Fact tables that support reporting FOR the period. Using MDX it is possible to create snapshot values for measures from transaction fact without creating Periodic Snapshot tables. Continue Reading
Often there is requirement to calculate measures based on two different date dimensions. For e.g. how many orders that were delivered this month was actually ordered last month? How many orders were delivered in the same month as they were ordered? How many orders were carried forward to next month?
Analysis Service supports role based model to secure cube data. Data can be secured at object level, cell level (Cell Data Access) and at member level (Dimension Data Access). There are relative merits of each approach, which isn’t the topic of this post but more details can be found here. A user can be a member of more than one role and it is important to understand the behaviour of Analysis Services role permissions in such a scenario. Analysis Services role permissions are additive in nature. This means if you are a member of a role that has been granted access to an Analysis Service object; you have access to that object even if you belong to another role which denies access to the same object. In other word the least restrictive roles permissions override the most restrictive role permissions. This could throw some nasty surprises that you never expected. Continue Reading
The greatest strength of PowerPivot is it’s ability to create relationship with different sources of data and piggyback on this relationship for various kinds of analysis. PowerPivot relationship requires one-to-many relationship between the parent and child tables. But what happens if the relationship between parent and child tables is many-to-many instead? For e.g. you might want to establish a relationship between a fact table that has a Business Key to the dimension table and that dimension table maintains history. In this post I explain how to establish a relationship between two tables that do not exhibit one-to-many relationship on their Business Key using a technique that is similar to an ETL Type 2 SCD lookup using Data Analysis Expressions (DAX). Continue Reading
SSIS comes with an out-of-box SCD Wizard to handle Type 1 and Type 2 Slowly Changing Dimensions (SCD) which is a fundamental ETL requirement. However the SCD wizard component has some serious drawbacks – both from operational and functional perspectives that make it unusable for practical purposes. A good summary on the shortcomings of SCD Wizard component can be found here
Several workarounds have evolved over time and in this post I would like explore the different alternative options to handle Type 1 and Type 2 SCD without using the out-of-box SCD Wizard Component. Continue Reading