Time Series is the historical representation of data points collected at periodic intervals of time. Statistical tools like R use forecasting models to analyse historical time series data to predict future values with reasonable accuracy. In this post I will be using R time series to forecast the exchange rate of Australian dollar using daily closing rate of the dollar collected over a period of two years. Continue reading
Decision tree is a data mining model that graphically represents the parameters that are most likely to influence the outcome and the extent of influence. The output is similar to a tree/flowchart with nodes, branches and leaves. The nodes represent the parameters, the branches represent the classification question/decision and the leaves represent the outcome (Screen Capture 1). Internally, decision tree algorithm performs a recursive classification on the input dataset and assigns each record to a segment of the tree where it fits closest. Continue Reading
Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstracts the implementation of mappers and reducers with a simple yet powerful SQL like query language. To demonstrate the inbuilt capabilities of HiveQL, I will be analysing hashtags from a twitter feed on Azure HDInsight platform. Continue Reading
Partitions improve the query response of SSAS cube by narrowing the MDX query to a pre-defined subset of data. Usually cubes are developed with few basic partitions to begin with and new partitions are added over time. Partitions are created using XMLA commands. The rest of this post explains the steps to create partitions using metadata and SSIS package. Continue Reading
SSAS Partitions are primarily geared towards better query performance by narrowing the MDX query to a pre-defined subset of data. Each partition must be in processed state for the cube to be available for browsing. However in a delta ETL run, it is more than likely that only some partitions would be refreshed with data while the rest remain unchanged. For example in a hospital admissions cube partitioned by month, data movements are frequent for admissions made during the current/last month compared to earlier months. This means if the partitions updated by the ETL run can be identified, then we can tailor SSIS package to dynamically process only those partitions that require data refresh while leaving the remaining partitions untouched. Continue Reading
The execution plan of SSAS processing engine triggers a sequence of processing tasks based on the object’s current processed state and its dependencies with other objects. This is what happens when the SSAS cube is processed from SQL Server Management Studio (SSMS). It is possible to replicate this execution plan close enough in an SSIS package so that the cube is always processed and refreshed with latest data.
In SSRS report, when a report parameter is role playing dimension then the parameter value passed to each dataset has to be customized depending on the context in which the dimension is being used. In an earlier post, I explained how this is done in PerformancePoint dashboard. In this post an equivalent solution for reporting services is provided. Continue Reading
Date dimension is inevitably used in most dashboards. Due to its role playing nature, a specific date dimension used as slicer in one report is different from the date dimension used for another report on the same dashboard page. For example sales reports are based on order date whereas delivery reports are based on delivery dates. On a dashboard that displays both sales and delivery reports, it would not be good for navigation purposes to have both order date and deliver date filters on the same page. The choice of which date dimension to use as filter among the available dates is a tactical decision. Having chosen a particular date dimension, the challenge is to provide filter value to those reports that use a different date dimension. This is where PerformancePoint filter Connection Formula comes into play as explained in the example below.
Dashboard filters is one of the key components of BI Platform that drive reports and KPI’s in various dashboards. MDX expressions to a great extent enables to customize PerformancePoint filters and enhance the end user experience.
First – a quick look at member selection option. Member selection is the easiest way to build dashboard filters which does not require MDX expressions (Screen capture 1). Member selection option allows to select specific members, although this is not usually recommended as the filter can become quite static. A better method is to select Children, Leaves or Descendants with additional option to include or exclude default member. For hierarchies, member selection option also allows selection members from a particular level of hierarchy.
Screen Capture 1 – Member Selection
While member selection option may be sufficient for some business scenarios, using MDX Expressions (Screen Capture 2) enables to customize dashboards without hardcoding the member selection.
Screen Capture 2 – MDX Expression Option for PerformancePoint Filters
Some common and often repeated dashboarding scenarios where MDX expressions for PerformancePoint filters are useful is discussed below
Display only members that have data
It’s not particularly intuitive for business users to select an item from drop down only to find that there is no data for the dashboard to display.
The following example uses FILTER function to select only Product Categories that have a sale.
FILTER([Product].[Category].Children,[Measures].[Internet Order Count] > 0)
Screen Capture 3 – MDX Expression for PerformancePoint Filter
Display only last N time periods
Often business is only keen on trend analysis for the last few years. All time periods from beginning of time need not be displayed in dashboard.
The following MDX expression displays the last 4 fiscal years from current fiscal year. When the fiscal year changes so does the filter data. Note that for this MDX expression to work, the date dimension must have current attributes defined.
(EXISTS([Date].[Fiscal Year].Children,[Date].[Current Fiscal Year].&[Y]).Item(0).LAG(-4) : EXISTS([Date].[Fiscal Year].Children,[Date].[Current Fiscal Year].&[Y]).Item(0))
Display time periods in reverse chronology – latest dates first
Another common requirement for business users is to see the latest dates first.
Using the ORDER function in the previous MDX expression, the latest fiscal years can be displayed as the first item in the drop down.
ORDER((EXISTS([Date].[Fiscal Year].Children,[Date].[Current Fiscal Year].&[Y]).Item(0).LAG(-4) : EXISTS([Date].[Fiscal Year].Children,[Date].[Current Fiscal Year].&[Y]).Item(0)) , [Date].[Fiscal Year].CurrentMember.Properties('key'),DESC)
Hide unknown members
It a good to handle unknown dimension members in datamarts which in turn helps to address data quality issues. However bad data need not be displayed in dashboard filters.
The following MDX expression gets all the Sales Reason members except the unknown members.
([Sales Reason].[Sales Reason].Children -[Sales Reason].[Sales Reason].&[UNAVAILABLE] -[Sales Reason].[Sales Reason].&[NOT APPLICABLE] -[Sales Reason].[Sales Reason].&[UNKNOWN])
The above examples demonstrate the ease with which seriously rocking PerformancePoint filters can be created using simple MDX expressions that create business friendly dashboards.
João Lopes (@SQLSniper) January 15, 2013
Derek Goodridge (@workerthread) January 14, 2013
In a data warehouse, Periodic Snapshot table is a type of fact table that “freezes” the measure value at recurring reporting periods. This enables reporting of measures AS AT a point in time. Periodic Snapshots are usually created from Transaction Fact tables that support reporting FOR the period. Using MDX it is possible to create snapshot values for measures from transaction fact without creating Periodic Snapshot tables. Continue Reading
Often there is requirement to calculate measures based on two different date dimensions. For e.g. how many orders that were delivered this month was actually ordered last month? How many orders were delivered in the same month as they were ordered? How many orders were carried forward to next month?
Analysis Service supports role based model to secure cube data. Data can be secured at object level, cell level (Cell Data Access) and at member level (Dimension Data Access). There are relative merits of each approach, which isn’t the topic of this post but more details can be found here. A user can be a member of more than one role and it is important to understand the behaviour of Analysis Services role permissions in such a scenario. Analysis Services role permissions are additive in nature. This means if you are a member of a role that has been granted access to an Analysis Service object; you have access to that object even if you belong to another role which denies access to the same object. In other word the least restrictive roles permissions override the most restrictive role permissions. This could throw some nasty surprises that you never expected. Continue Reading
The greatest strength of PowerPivot is it’s ability to create relationship with different sources of data and piggyback on this relationship for various kinds of analysis. PowerPivot relationship requires one-to-many relationship between the parent and child tables. But what happens if the relationship between parent and child tables is many-to-many instead? For e.g. you might want to establish a relationship between a fact table that has a Business Key to the dimension table and that dimension table maintains history. In this post I explain how to establish a relationship between two tables that do not exhibit one-to-many relationship on their Business Key using a technique that is similar to an ETL Type 2 SCD lookup using Data Analysis Expressions (DAX). Continue Reading
SSIS comes with an out-of-box SCD Wizard to handle Type 1 and Type 2 Slowly Changing Dimensions (SCD) which is a fundamental ETL requirement. However the SCD wizard component has some serious drawbacks – both from operational and functional perspectives that make it unusable for practical purposes. A good summary on the shortcomings of SCD Wizard component can be found here
Several workarounds have evolved over time and in this post I would like explore the different alternative options to handle Type 1 and Type 2 SCD without using the out-of-box SCD Wizard Component. Continue Reading
A brief retrospective of my submission for Kaggle data science competition that predicts the gender and age group of a smartphone user based on their usage pattern. Continue reading
Slowly Changing Dimension (SCD) is great for tracking historical changes to dimension attributes. SCDs have evolved over the years and besides the conventional type 1 (update), type 2 (add row) and type 3 (add column), now there are extensions up to type 7 including type 0. Almost every DW/BI project has at least few type 2 dimensions where a change to an attribute causes the current dimension record to be end dated and creates a new record with the new value. Continue Reading
Box plot is an effective way to visualize the distribution of your data.It only takes a few lines of code in R to come up with a basic box plot. Continue Reading
CUBE operator in Pig computes all possible combination of the specified fields. In this post I will demonstrate the use of Cube operator to analyse energy rating of air conditioners in Hortonworks Data Platform (HDP). Continue Reading