HDP: Analysing Twitter Hashtags using Hive

Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstracts the implementation of mappers and reducers with a simple yet powerful SQL like query language. To demonstrate the inbuilt capabilities of HiveQL, I will be analysing hashtags from a twitter feed on Hortonworks Data Platform (HDP). I wrote a similar post using Azure HDInsight, but this time it’s on HDP.

For this post, I am using Hortonworks Sandbox on Hyper-V. The data I am using is from Twitter archive which can be downloaded by following these instructions. The twitter archive contains all your tweets from day one and information about each tweet which will help to determine the structure of Hive table.

From the Hyper-V Manager, connect to the Hortonworks Sandbox.

Login to sandbox and click Beeswax which is the Hive UI (Screen Capture 1)

Screen Capture 1 – Beeswax

From the query editor execute the following DDL statement to create a Hive database

CREATE DATABASE IF NOT EXISTS twitter LOCATION '/hive/warehouse';

Switch the database from default to twitter (Screen Capture 2)

Screen Capture 2 – Change Database

Execute the following DDL statement to create a Hive table.

CREATE TABLE IF NOT EXISTS twitter.tweets(
    tweet_id BIGINT,
    in_reply_to_status_id BIGINT,
    in_reply_to_user_id INT,
    status_timestamp TIMESTAMP,
    source STRING,
	text STRING,
	retweeted_status_id BIGINT,
	retweeted_status_user_id INT,
	retweeted_status_timestamp TIMESTAMP,
	expanded_urls STRING
	)
    ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t';

Convert the tweets.csv to a tab separated text file and remove the header row. Copy the text file to HDFS using the File Browser (Screen Capture 3)

Screen Capture 3 – File Browser

Once the data file is uploaded to HDFS, load the data into Hive table using this statement

LOAD DATA INPATH '/user/hue/tweets_formatted.txt' INTO TABLE twitter.tweets;

Now execute the following Hive DML Statement from query editor.

SELECT text, word FROM twitter.tweets
LATERAL VIEW explode(split(text,' ')) text_ex as word;

The explode() is a Hive built-in User Defined Table-Generating Function (UDTF) that breaks down a array into its elements. In this case the tweet gets broken into words. The LATERAL VIEW joins the output of explode() to the input row (tweet) creating a result set that contains n rows (words) for each tweet as shown below

Tweet Output
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS SSIS:
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS Creating
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS Dynamic
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS Data-driven
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS SSAS
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS Partitions
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS http://t.co/dmC9RdnZ1H
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS #SSIS
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS #SSAS

This result set can be easily filtered to separate hashtags from all other words to give a simple statistics about the commonly used hashtags.

SELECT word, count(1) as wcount FROM twitter.tweets
LATERAL VIEW explode(split(text,' ')) text_ex as word
WHERE word LIKE '#%'
GROUP BY word
ORDER BY wcount DESC LIMIT 5

The output would be something like this

#SSAS 85
#MDX 72
#SSIS 61
#Hive 15
#HDInsight 8

If you wish, you can save the query to a HDFS folder for later use.

This simple example illustrates the capabilities of in-built Hive statements for common analytical tasks which could be accomplished without the need to write custom mappers and reducers.

Related Links

HiveQL

Installing Hortonworks Sandbox on Hyper-V

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s