Big Data: Create and Load Hive Tables on Windows Azure HDInsight

Update 18-June-2014

Please refer to Working with Hive in HDInsight which the updated version of this post. The orginal post was written when HDInsight was a separate entity and in preview mode. Since then HDInsight has been completely integrated into Microsoft Azure cloud services. The concepts explained in this post still holds, however some of the instructions and screen captures have changed significantly. So I would recommend you to refer to the updated version of this post

Original post

This post gives step by step instruction to create and load a Hive table on Windows Azure HDInsight. Steps to be executed in sequence

  1. Provision a Hadoop Cluster
  2. HiveQL DDL Statement
  3. Data preparation/ETL
  4. HiveQL DML for Data Loading
  5. HiveQL DML for Data Verification

Step1: Provision a Hadoop Cluster

For this post, preview version of Windows Azure HDInsight is used. At the time of writing this post, access to preview version is available by invitation. Once you get access to Windows Azure HDInsight, the first step is to create a Hadoop Cluster. Follow the instruction in this article to create a Hadoop cluster. I created a Hadoop cluster with DNS Name northstep for this exercise
(Screen Capture 1).

Screen Capture 1 – Hadoop Cluster

Step 2: HiveQL DDL Statement

The next step is to create a Hive table using a HiveQL DDL statement. For this demonstration I am using a subset of data from BOCSAR. Specifically I am using crime data by Premise Type by Offence which is available for download in csv format.

Login to your Hadoop Cluster and navigate to Interactive Console ->Hive. Copy the HiveQL DDL given below in the text box area of Interactive Hive and hit evaluate (Screen Capture 2).

CREATE TABLE IF NOT EXISTS incident_by_premise_type(
	incident_year INT,
	division STRING,
	incident STRING,
	premise_type STRING,
	incident_count INT)
	ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001';

The above HiveQL DDL statement creates the table incident_by_premise_type in the default database if it does not already exist. Rows are delimited by newline character and columns are delimited by ctrl-A character.

Check out these wikis for more information on Hive primitive data types and DDL Statements

Screen Capture 2 – Interactive Hive – DDL

Step 3: Data Preparation/ETL

Hive tables are usually loaded from txt files or from another HiveQL DML. I used an SSIS package to prepare BOCSAR data in the format required by Hive tables. Hive tables expect column data to be delimited by ctrl-A which is ‘/001’ in or char (1). I used SSIS script component as destination after unpivoting BOCSAR data. The code for the script component is given below. It must be noted that column headers are not included as first row of the text file.

#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.IO;

#endregion

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{

   StreamWriter hiveFileStream;

    public override void PreExecute()
    {
        base.PreExecute();

        try
        {
            String hiveFileName = Variables.hiveFilePath +  Variables.csvFileName.Replace(".csv", ".txt");
            if (File.Exists(hiveFileName))
            {File.Delete(hiveFileName);}

            hiveFileStream = File.CreateText(hiveFileName);

        }
        catch (UnauthorizedAccessException uauth)
        { throw uauth; }
        catch (DirectoryNotFoundException notfound)
        { throw notfound; }
        catch (IOException ioException)
        { throw ioException; }

    }

    public override void PostExecute()
    {
        base.PostExecute();
        hiveFileStream.Close();
    }

 public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        String rowData;

        rowData = Row.dcYEAR.ToString() + Convert.ToChar(1).ToString() +
            Row.DIVISION.ToString() +Convert.ToChar(1).ToString() +
            Row.INCIDENT.ToString() +Convert.ToChar(1).ToString() +
            Row.dcPREMISETYPE.ToString() +Convert.ToChar(1).ToString() +
            Row.INCIDENTCOUNT.ToString() + Convert.ToChar(1).ToString();

        hiveFileStream.WriteLine(rowData);
    }

}

The hive data files created from the SSIS package can be downloaded from here

Step 4: HiveQL DML for Data Loading

When data is loaded to a Hive table, all it does is move the data from ctrl-A delimited file to a HDFS folder structure which is in form of database>table>partition. In this exercise I am loading data from a local folder on Hadoop cluster. To move the ctrl-A formatted hive data files created by SSIS package to Hadoop Cluster, remote desktop using the cluster user id/password (Screen Capture 3). Create a local folder called bocsar_data in C: and copy the files over to this folder.

Screen Capture 3 – Remote Desktop

Navigate to Interactive Console -> Hive and execute the following HiveQL DML statement. Note that multiple HiveQL statements can be executed from the interactive console provided each statement is separated by semicolon

LOAD DATA LOCAL INPATH '/bocsar_data/Arson_Sydney_TressPass_Data_By_PremiseType.txt' INTO TABLE incident_by_premise_type;
 LOAD DATA LOCAL INPATH '/bocsar_data/TransportOffence_Sydney_Data_By_PremiseType.txt' INTO TABLE incident_by_premise_type;
 LOAD DATA LOCAL INPATH '/bocsar_data/Tresspass_Sydney_Data_By_PremiseType.txt' INTO TABLE incident_by_premise_type;

Check out this wiki for more information on Hive Load Statement

Screen Capture 4- Load Data

Step 5: HiveQL DML for Data Verification

Now that the data is loaded into the Hive table incident_by_premise_type, finally verify the record counts by executing the following HiveQL Select statement. Check out this wiki for more information on HiveQL Select Statement

select
incident,
count(1)

select incident, count(1)
 from incident_by_premise_type
 group by incident;

That’s it – you now have a Hive table loaded with data.

One thought on “Big Data: Create and Load Hive Tables on Windows Azure HDInsight

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s