Power Query: Import Data from Web Pages

Power Query can discover and import data from websites. Often data warehouses rely on external data which is readily available in public websites for e.g. public holidays, school holidays, daylight savings, SIC codes, SWIFT codes, post codes etc. Power Query is perfectly suitable for such situations. Power Query can discover, fetch and transform data from a HTML table in a web page into a format that can be easily imported into data warehouses using SSIS package. It’s like an ETL tool for the web page data source.

To illustrate this, I am using Power Query to extract NSW Public holidays from industrial relations website (Screen Capture 1)

Screen Capture 1 – NSW Public Holidays Web Page

This website publishes public holidays for the next 3 years in a pivoted format. Using Power Query, I would like to extract each year’s public holidays as a separate data set. The resulting 3 data sets will then be appended into one list that can be imported by an SSIS package to the data warehouse.

Let’s get started. From Excel click Power Query ribbon and then From Web (Screen Capture 2)

Screen Capture 2 – Power Query Import from Web

This will bring a dialog where you will enter the web page URL (Screen Capture 3)

Screen Capture 3 – Web Page URL

Power Query will begin its discovery cycle and present all the HTML tables available in the web page in the Navigator pane. Hover around the tables to get a preview of the data (Screen Capture 4)

Screen Capture 4 – Navigator

Select the table you are interested in and import. This will being the data to Power Query area in Excel. Once inside Power Query the data can be transformed in a series of steps. The following actions have been taken to arrive at an intermediate result as shown in Screen Capture 5

  • Renamed dataset name from Table0 to PH2016.
  • Promoted first Rows as Header
  • Renamed columns as appropriate
  • Removed columns 2014 and 2015, leaving only 2016
  • Split the week day name part from the date using comma delimiter and removed this column, leaving just the date part

All these action are recorded in the Applied Steps area. You could undo any step by deleting the action and any subsequent actions

Screen Capture 5 – Intermediate Result

The date is still missing the year part. To get around this create a custom column. Power Query allows you to create custom column from existing columns. Go to Transform and click Insert Custom Column (Screen Capture 6)

Screen Capture 6 – Insert Custom Column

Create a new column called Date using a simple expression as shown in Screen Capture 7

Date= [Day] & ” 2016″

Screen Capture 7 – Custom Column

The new column Date is still a text field, so change Type to Date (Screen Capture 8)

Screen Capture 8 – Change Type

Remove the intermediate date column and the first data set for year 2016 is ready (Screen Capture 9)

Screen Capture 9 -Dataset

Repeat the same steps to get datasets for years 2014 and 2015.

Now append the 3 datasets into one to get the public holidays as a list. From excel click Power Query and Append (Screen Capture 10). Select the datasets to append and click OK.

Screen Capture 10 – Append

This will produce a single list of all public holidays from 2014-16 as shown below

Screen Capture 11 – NSW Public Holidays 2014-16

Delete the records with no data. Now this spreadsheet is ready to be imported using SSIS into your data warehouse.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s