HdfsBridge::recordReaderFillBuffer – Unexpected error encountered filling record reader buffer: IllegalArgumentException: Must be 12 bytes

Parquet is my preferred format for storing files in data lake. Parquet’s columnar storage and compression makes it very efficient for in-memory processing tasks like Spark/Databricks notebooks while saving cost on storage. Parquet also supports almost all encoding schemes out there. Perhaps the coolest thing in Parquet is unlike CSV there is no such thing as column/row separator. So there is no need to escape those characters if they are part of data.

Azure SQL Data Warehouse supports Parquet data format for External (PolyBase) tables. External tables reference the underlying storage blobs and gives an option to query the data lake using SQL. In fact this is recommended in Microsoft’s reference architecture. With some Parquet files this error gets thrown when the External Table is queried

HdfsBridge::recordReaderFillBuffer – Unexpected error encountered filling record reader buffer: IllegalArgumentException: Must be 12 bytes

In the absence of clear exception message it took a while to figure this out. This error usually happens on a Timestamp column specifically when the data is in yyyy/MM/dd hh:mm:ss format. For some reason SQL Data Warehouse expects the Timestamp data to be in yyyy-MM-dd hh:mm:ss format. Changing the date separator from / to – resolved this issue, although it must be mentioned the underlying file is a perfectly valid Parquet file.

Forecasting Exchange Rates Using R Time Series

Time Series is the historical representation of data points collected at periodic intervals of time. Statistical tools like R use forecasting models to analyse historical time series data to predict future values with reasonable accuracy. In this post I will be using R time series to forecast the exchange rate of Australian dollar using daily closing rate of the dollar collected over a period of two years. Continue reading

Energy Rating Analysis of Air conditioners using R Decision Trees

Decision tree is a data mining model that graphically represents the parameters that are most likely to influence the outcome and the extent of influence. The output is similar to a tree/flowchart with nodes, branches and leaves. The nodes represent the parameters, the branches represent the classification question/decision and the leaves represent the outcome (Screen Capture 1). Internally, decision tree algorithm performs a recursive classification on the input dataset and assigns each record to a segment of the tree where it fits closest. Continue Reading

HDInsight: Analysing Twitter Hashtags using Hive

Twitter

Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstracts the implementation of mappers and reducers with a simple yet powerful SQL like query language. To demonstrate the inbuilt capabilities of HiveQL, I will be analysing hashtags from a twitter feed on Azure HDInsight platform. Continue Reading

SSIS: Creating Dynamic Data-driven SSAS Partitions

Partitions improve the query response of SSAS cube by narrowing the MDX query to a pre-defined subset of data. Usually cubes are developed with few basic partitions to begin with and new partitions are added over time. Partitions are created using XMLA commands. The rest of this post explains the steps to create partitions using metadata and SSIS package. Continue Reading

SSIS: Dynamic Processing of SSAS Partitions

SSAS Partitions are primarily geared towards better query performance by narrowing the MDX query to a pre-defined subset of data. Each partition must be in processed state for the cube to be available for browsing. However in a delta ETL run, it is more than likely that only some partitions would be refreshed with data while the rest remain unchanged. For example in a hospital admissions cube partitioned by month, data movements are frequent for admissions made during the current/last month compared to earlier months. This means if the partitions updated by the ETL run can be identified, then we can tailor SSIS package to dynamically process only those partitions that require data refresh while leaving the remaining partitions untouched. Continue Reading

SSIS Package to Process SSAS Cube

The execution plan of SSAS processing engine triggers a sequence of processing tasks based on the object’s current processed state and its dependencies with other objects. This is what happens when the SSAS cube is processed from SQL Server Management Studio (SSMS). It is possible to replicate this execution plan close enough in an SSIS package so that the cube is always processed and refreshed with latest data.

Continue Reading

SSRS: Parameter Expression for MDX Datasets Having Role Playing Dimension

In SSRS report, when a report parameter is role playing dimension then the parameter value passed to each dataset has to be customized depending on the context in which the dimension is being used. In an earlier post, I explained how this is done in PerformancePoint dashboard. In this post an equivalent solution for reporting services is provided. Continue Reading

PerformancePoint: Using Connection Formula to Pass Filter Values to Report Having Role Playing Dimension

Date dimension is inevitably used in most dashboards. Due to its role playing nature, a specific date dimension used as slicer in one report is different from the date dimension used for another report on the same dashboard page. For example sales reports are based on order date whereas delivery reports are based on delivery dates. On a dashboard that displays both sales and delivery reports, it would not be good for navigation purposes to have both order date and deliver date filters on the same page. The choice of which date dimension to use as filter among the available dates is a tactical decision. Having chosen a particular date dimension, the challenge is to provide filter value to those reports that use a different date dimension. This is where PerformancePoint filter Connection Formula comes into play as explained in the example below.

Continue Reading

SharePoint: MDX Expressions for PerformancePoint Filters

Dashboard filters is one of the key components of BI Platform that drive reports and KPI’s in various dashboards. MDX expressions to a great extent enables to customize PerformancePoint filters and enhance the end user experience.

First – a quick look at member selection option. Member selection is the easiest way to build dashboard filters which does not require MDX expressions (Screen capture 1). Member selection option allows to select specific members, although this is not usually recommended as the filter can become quite static. A better method is to select Children, Leaves or Descendants with additional option to include or exclude default member. For hierarchies, member selection option also allows selection members from a particular level of hierarchy.

Screen Capture 1 – Member Selection

While member selection option may be sufficient for some business scenarios, using MDX Expressions (Screen Capture 2) enables to customize dashboards without hardcoding the member selection.

Screen Capture 2 – MDX Expression Option for PerformancePoint Filters

Some common and often repeated dashboarding scenarios where MDX expressions for PerformancePoint filters are useful is discussed below

Display only members that have data

It’s not particularly intuitive for business users to select an item from drop down only to find that there is no data for the dashboard to display.

The following example uses FILTER function to select only Product Categories that have a sale.

FILTER([Product].[Category].Children,[Measures].[Internet Order Count] > 0)

Screen Capture 3 – MDX Expression for PerformancePoint Filter

Display only last N time periods

Often business is only keen on trend analysis for the last few years. All time periods from beginning of time need not be displayed in dashboard.

The following MDX expression displays the last 4 fiscal years from current fiscal year. When the fiscal year changes so does the filter data. Note that for this MDX expression to work, the date dimension must have current attributes defined.

(EXISTS([Date].[Fiscal Year].Children,[Date].[Current Fiscal Year].&[Y]).Item(0).LAG(-4)
: EXISTS([Date].[Fiscal Year].Children,[Date].[Current Fiscal Year].&[Y]).Item(0))

Display time periods in reverse chronology – latest dates first

Another common requirement for business users is to see the latest dates first.

Using the ORDER function in the previous MDX expression, the latest fiscal years can be displayed as the first item in the drop down.

ORDER((EXISTS([Date].[Fiscal Year].Children,[Date].[Current Fiscal Year].&[Y]).Item(0).LAG(-4)
: EXISTS([Date].[Fiscal Year].Children,[Date].[Current Fiscal Year].&[Y]).Item(0))
, [Date].[Fiscal Year].CurrentMember.Properties('key'),DESC)

Hide unknown members

It a good to handle unknown dimension members in datamarts which in turn helps to address data quality issues. However bad data need not be displayed in dashboard filters.

The following MDX expression gets all the Sales Reason members except the unknown members.

([Sales Reason].[Sales Reason].Children
-[Sales Reason].[Sales Reason].&[UNAVAILABLE]
-[Sales Reason].[Sales Reason].&[NOT APPLICABLE]
-[Sales Reason].[Sales Reason].&[UNKNOWN])

The above examples demonstrate the ease with which seriously rocking PerformancePoint filters can be created using simple MDX expressions that create business friendly dashboards.

Benny Austin


SSAS: Using MDX to Calculate Snapshot Measures

In a data warehouse, Periodic Snapshot table is a type of fact table that “freezes” the measure value at recurring reporting periods. This enables reporting of measures AS AT a point in time. Periodic Snapshots are usually created from Transaction Fact tables that support reporting FOR the period. Using MDX it is possible to create snapshot values for measures from transaction fact without creating Periodic Snapshot tables. Continue Reading

SSAS: MDX Calculated Measures that Require Date Comparison

Often there is requirement to calculate measures based on two different date dimensions. For e.g. how many orders that were delivered this month was actually ordered last month? How many orders were delivered in the same month as they were ordered? How many orders were carried forward to next month?

Using MDX set functions EXISTS(),FILTER() and Range operator these calculations can be easily achieved provided the Date Dimension conform to same key columns. Continue Reading

SSAS: Membership in Multiple Roles Could Throw Unexpected Surprises

Analysis Service supports role based model to secure cube data. Data can be secured at object level, cell level (Cell Data Access) and at member level (Dimension Data Access). There are relative merits of each approach, which isn’t the topic of this post but more details can be found here. A user can be a member of more than one role and it is important to understand the behaviour of Analysis Services role permissions in such a scenario. Analysis Services role permissions are additive in nature. This means if you are a member of a role that has been granted access to an Analysis Service object; you have access to that object even if you belong to another role which denies access to the same object. In other word the least restrictive roles permissions override the most restrictive role permissions. This could throw some nasty surprises that you never expected. Continue Reading

PowerPivot: DAX Expression for Type 2 SCD Lookup

The greatest strength of PowerPivot is it’s ability to create relationship with different sources of data and piggyback on this relationship for various kinds of analysis. PowerPivot relationship requires one-to-many relationship between the parent and child tables. But what happens if the relationship between parent and child tables is many-to-many instead? For e.g. you might want to establish a relationship between a fact table that has a Business Key to the dimension table and that dimension table maintains history. In this post I explain how to establish a relationship between two tables that do not exhibit one-to-many relationship on their Business Key using a technique that is similar to an ETL Type 2 SCD lookup using Data Analysis Expressions (DAX). Continue Reading

Alternatives to SSIS SCD Wizard Component

SSIS comes with an out-of-box SCD Wizard to handle Type 1 and Type 2 Slowly Changing Dimensions (SCD) which is a fundamental ETL requirement. However the SCD wizard component has some serious drawbacks – both from operational and functional perspectives that make it unusable for practical purposes. A good summary on the shortcomings of SCD Wizard component can be found here

Several workarounds have evolved over time and in this post I would like explore the different alternative options to handle Type 1 and Type 2 SCD without using the out-of-box SCD Wizard Component.   Continue Reading

Kaggle: TalkingData

A brief retrospective of my submission for Kaggle data science competition that predicts the gender and age group of a smartphone user based on their usage pattern. Continue reading

Kaggle: Grupo Bimbo

A brief retrospective of my submission for Kaggle data science competition that forecasts inventory demand for Grupo Bimbo. Continue reading

Common Type 2 SCD Anti-patterns

Anti-pattern

Slowly Changing Dimension (SCD) is great for tracking historical changes to dimension attributes. SCDs have evolved over the years and besides the conventional type 1 (update), type 2 (add row) and type 3 (add column), now there are extensions up to type 7 including type 0. Almost every DW/BI project has at least few type 2 dimensions where a change to an attribute causes the current dimension record to be end dated and creates a new record with the new value. Continue Reading

R: Box Plot

Box plot is an effective way to visualize the distribution of your data.It only takes a few lines of code in R to come up with a basic box plot. Continue Reading