Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstracts the implementation of mappers and reducers with a simple yet powerful SQL like query language. To demonstrate the inbuilt capabilities of HiveQL, I will be analysing hashtags from a twitter feed on Azure HDInsight platform.
This post is a tutorial to get started on Hive in HDInsight. The steps to be followed are given below. As a pre-requisite you would need a subscription to Microsoft Azure to try out these steps
1. Provision Azure Storage Account
2. Provision HDInsight Cluster
3. Create Hive Database and Tables
4. Prepare Data as Ctrl-A separated Text Files
5. Upload Text Files to Azure Storage
6. Load Data to Hive
7. Execute HiveQL DML Jobs
Partitions improve the query response of SSAS cube by narrowing the MDX query to a pre-defined subset of data. Usually cubes are developed with few basic partitions to begin with and new partitions are added over time. Partitions are created using XMLA commands. The rest of this post explains the steps to create partitions using metadata and SSIS package.
SSAS Partitions are primarily geared towards better query performance by narrowing the MDX query to a pre-defined subset of data. Each partition must be in processed state for the cube to be available for browsing. However in a delta ETL run, it is more than likely that only some partitions would be refreshed with data while the rest remain unchanged. For example in a hospital admissions cube partitioned by month, data movements are frequent for admissions made during the current/last month compared to earlier months. This means if the partitions updated by the ETL run can be identified, then we can tailor SSIS package to dynamically process only those partitions that require data refresh while leaving the remaining partitions untouched.
Canned reports/dashboards aka custom reports or out of box reports are inevitable part of any BI ecosystem. It’s a good point of reference for conversations with business users especially at the start of a project when requirements are often a bit hazy. In many instances, business users see report layouts as an effective and convenient medium of communication with BI project teams. This however should not be considered as THE business requirement. Instead of dwelling on canned reports, the focus should be on building a well-designed DW/BI system that empowers users to perform analysis on a self-service basis. The quality and the range of analysis that a self-service BI platform provides far outweighs the benefits of just rolling out a set of canned reports. It empowers business users to do their own analysis while the BI team can work towards expanding and refining the analytical platform instead of operating as a “reporting team”. A well-designed DW/BI will satisfy most of the analytical requirements of business which includes but not restricted to canned reports. Canned reports are a given, not the end game to any BI strategy. Reporting tools available today are incredibly easy to use. A power user with a little bit of training can build reports and dashboards with reasonable ease while BI teams can focus on expanding and enriching the self-service BI Platform.
The execution plan of SSAS processing engine triggers a sequence of processing tasks based on the object’s current processed state and its dependencies with other objects. This is what happens when the SSAS cube is processed from SSMS. It is possible to replicate this execution plan close enough in an SSIS package so that the cube is always processed and refreshed with latest data.
While SSAS allows you to build a cube straight of the fact and dimension tables, however there are benefits in creating the cube from database views that sits on top of the underlying fact and dimension tables. Here’s why.
It’s quite common to analyse a measure by categorizing it into different buckets or bands. For example to list the customers who are due for payment in the next week or fortnight, to decide which income group to target for a particular product campaign, which age group of students are most likely to buy a smartphone and so on. These bands are usually not supplied by source systems but derived from the underlying measure and they exist only for analytic purposes. While this banding could be achieved in reports or dashboards, there is significant value to model them as dimensions especially to promote self-service BI.
Dashboard filters is one of the key components of BI Platform that drive reports and KPI’s in various dashboards. MDX expressions to a great extent enables to customize PerformancePoint filters and enhance the end user experience.
Dimension Tables usually have a member with Surrogate Key (SK) value of -1 to handle unknown members. Unknown SK is usually set when SSIS look-up on a dimension table fails to return a match based on a business/natural key. Given that several factors contribute to unknown SKs, it would be beneficial to customize the unknown SK into different types rather than clubbing into a generic -1 value
PerformancePoint Dashboard Designer is usually synchronized with the SharePoint Site from where it was first downloaded. The PerformancePoint Content – Data Connections, Filters, Reports, Scorecards, KPIs and Dashboards from the workspace gets deployed to this SharePoint Site. In most implementations often there is a need to deploy the PerformancePoint Content to a different Business Intelligence SharePoint Site. For instance, deploy dashboards from test to production environment.If you need to deploy your PerformancePoint content to a different SharePoint Site, follow the steps in this post.
Often there is requirement to calculate measures based on two different date dimensions. For e.g. how many orders that were delivered this month was actually ordered last month? How many orders were delivered in the same month as they were ordered? How many orders were carried forward to next month?. Using MDX set functions EXISTS(), FILTER() and Range operator these calculations can be easily achieved provided the Date Dimension conform to same key columns.
The default SSIS log provider for SQL Server captures package trace in sysssislog table. It does capture some useful but limited trace information about the package execution. This trace is of little use without user defined variable values like record count, status flags, control metadata, selection from and to dates etc. By combining your execution log with sysssislog, you could take advantage of SSIS logging capabilities and customize it for your specific logging requirements.
Analysis Service supports role based model to secure cube data. Data can be secured at object level, cell level (Cell Data Access) and at member level (Dimension Data Access). A user can be a member of more than one role and it is important to understand the behaviour of Analysis Services role permissions in such a scenario. Analysis Services role permissions are additive in nature. This means if you are a member of a role that has been granted access to an Analysis Service object; you have access to that object even if you belong to another role which denies access to the same object. In other word the least restrictive roles permissions override the most restrictive role permissions. This could throw some nasty surprises that you never expected.
The greatest strength of PowerPivot is the ability to create relationship with different sources of data and piggyback on this relationship for various kinds of analysis. PowerPivot relationship requires one-to-many relationship between the parent and child tables. But what happens if the relationship between parent and child tables is many-to-many instead? For e.g. you might want to establish a relationship between a fact table that has a Business Key to the dimension table and that dimension table maintains history. In this post I explain how to establish a relationship between two tables that do not exhibit one-to-many relationship on their Business Key using a technique that is similar to an ETL Type 2 SCD lookup using Data Analysis Expressions (DAX).
SSIS comes with an out-of-box SCD Wizard to handle Type 1 and Type 2 Slowly Changing Dimensions (SCD) which is a fundamental ETL requirement. However the SCD wizard component has some serious drawbacks – both from operational and functional perspectives that make it unusable for practical purposes. Several workarounds have evolved over time and in this post I would like explore the different alternative options to handle Type 1 and Type 2 SCD without using the out-of-box SCD Wizard Component.