This post is a tutorial to get started on Hive in HDInsight.
The steps to be followed are given below. As a pre-requisite you would need a subscription to Microsoft Azure to try out these steps
- Provision Azure Storage Account
- Provision HDInsight Cluster
- Create Hive Database and Tables
- Prepare Data as Ctrl-A separated Text Files
- Upload Text Files to Azure Storage
- Load Data to Hive
- Execute HiveQL DML Jobs
Step 1: Provision Azure Storage Account
Login to Azure Management Portal and create a storage account by following these steps
Step 2: Provision HDInsight Cluster
Create a HDInsight cluster using the storage account created in step 1. In Hadoop compute cluster is also the storage cluster. So create the HDInsight cluster in the same region as the Storage Account. From Azure Management Portal follow these steps to create a HDInsight Cluster.
Step 3: Create Hive Database and Tables
At the end of steps 1 and 2, you should have a storage account and HDInsight cluster up and running as (Screen Capture 1)
Screen Capture 1 – HDInsight Cluster is now provisioned
Select the cluster and click Manage Cluster icon locate at the bottom of the page (Screen Capture 2)
Screen Capture 2 – Manage HDInsight Cluster
This will bring up the Hive Page (Screen Capture 3) from where you can issue HiveQL statements as jobs.
Screen Capture 3 – Hive Page
Copy and paste the following HiveQL DDL statement in the text area and click submit. This will create a Hive database called bocsar and a Hive table called incident_by_premise_type
CREATE DATABASE IF NOT EXISTS bocsar LOCATION '/hive/warehouse' ; CREATE TABLE IF NOT EXISTS bocsar.incident_by_premise_type( incident_year INT, division STRING, incident STRING, premise_type STRING, incident_count INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '01';
The job status refreshes every 30s and changes from Queued, Running and Succeeded (Screen Capture 4)
Screen Capture 4 – HiveQL Create Table Job
Step 4: Prepare Data
For this tutorial I am loading data to Hive tables using text files with columns delimited by ctrl-A which is “/001” or char (1). I used SSIS script task to convert data published by BOCSAR into ctrl-A delimited text files.
The Hive data files can be downloaded from here
Step 5: Upload Text Files to Azure Storage
The Hive data files have to be copied to Azure Storage Account from your local drive. I am using Azure Storage Explorer to upload the text files. Azure Storage Explorer can be downloaded from here.
Follow these instructions to configure Azure Storage Explorer and upload the text files. Once you have added your storage account to Azure Storage Explorer, select the HDInsight Cluster then click Blobs followed by Upload button and select the three text files (Screen Capture 5)
Screen Capture 5 – Azure Storage Explorer
Once the files are uploaded they are available at the Home directory of Azure Storage Account. You can verify this from the Hive File Browser page (Screen Capture 6)
Screen Capture 6 – Hive File Browser – View Files Uploaded from Azure Storage Explorer
Step 6: Load Data to Hive
Navigate to Hive Editor and run the following Hive statements to populate the Hive table – bocsar.incident_by_premise_type
LOAD DATA INPATH '/Arson_Sydney_TressPass_Data_By_PremiseType.txt' INTO TABLE bocsar.incident_by_premise_type; LOAD DATA INPATH '/TransportOffence_Sydney_Data_By_PremiseType.txt' INTO TABLE bocsar.incident_by_premise_type; LOAD DATA INPATH '/Tresspass_Sydney_Data_By_PremiseType.txt' INTO TABLE bocsar.incident_by_premise_type;
Screen Capture 7 – Load Data HiveQL
Once the job is successful, you will notice that the text files have been moved from the home directory to \hive\warehouse\ incident_by_premise_type (Screen Capture 8)
Screen Capture 8- Hive Data
Step 7: Execute HiveQL DML Jobs
Now that the hive table is populated, it can be queried by submitting HiveQL DML jobs
select incident, count(1) from bocsar.incident_by_premise_type group by incident;
Click View Details to see the query result.
Screen Capture 9 – HiveQL DML
This completes the tutorial. A complete reference of HiveQL statements is available here