Power Query can discover and import data from websites. Often data warehouses rely on external data which is readily available in public websites for e.g. public holidays, school holidays, daylight savings, SIC codes, SWIFT codes, post codes etc. Power Query is perfectly suitable for such situations. Power Query can discover, fetch and transform data from a HTML table in a web page into a format that can be easily imported into data warehouses using SSIS package. It’s like an ETL tool for the web page data source.
To illustrate this, I am using Power Query to extract NSW Public holidays from industrial relations website (Screen Capture 1)
Screen Capture 1 – NSW Public Holidays Web Page
This website publishes public holidays for the next 3 years in a pivoted format. Using Power Query, I would like to extract each year’s public holidays as a separate data set. The resulting 3 data sets will then be appended into one list that can be imported by an SSIS package to the data warehouse.
Let’s get started. From Excel click Power Query ribbon and then From Web (Screen Capture 2)
Screen Capture 2 – Power Query Import from Web
This will bring a dialog where you will enter the web page URL (Screen Capture 3)
Screen Capture 3 – Web Page URL
Power Query will begin its discovery cycle and present all the HTML tables available in the web page in the Navigator pane. Hover around the tables to get a preview of the data (Screen Capture 4)
Screen Capture 4 – Navigator
Select the table you are interested in and import. This will being the data to Power Query area in Excel. Once inside Power Query the data can be transformed in a series of steps. The following actions have been taken to arrive at an intermediate result as shown in Screen Capture 5
- Renamed dataset name from Table0 to PH2016.
- Promoted first Rows as Header
- Renamed columns as appropriate
- Removed columns 2014 and 2015, leaving only 2016
- Split the week day name part from the date using comma delimiter and removed this column, leaving just the date part
All these action are recorded in the Applied Steps area. You could undo any step by deleting the action and any subsequent actions
Screen Capture 5 – Intermediate Result
The date is still missing the year part. To get around this create a custom column. Power Query allows you to create custom column from existing columns. Go to Transform and click Insert Custom Column (Screen Capture 6)
Screen Capture 6 – Insert Custom Column
Create a new column called Date using a simple expression as shown in Screen Capture 7
Date= [Day] & ” 2016″
Screen Capture 7 – Custom Column
The new column Date is still a text field, so change Type to Date (Screen Capture 8)
Screen Capture 8 – Change Type
Remove the intermediate date column and the first data set for year 2016 is ready (Screen Capture 9)
Screen Capture 9 -Dataset
Repeat the same steps to get datasets for years 2014 and 2015.
Now append the 3 datasets into one to get the public holidays as a list. From excel click Power Query and Append (Screen Capture 10). Select the datasets to append and click OK.
Screen Capture 10 – Append
This will produce a single list of all public holidays from 2014-16 as shown below
Screen Capture 11 – NSW Public Holidays 2014-16
Delete the records with no data. Now this spreadsheet is ready to be imported using SSIS into your data warehouse.
Benny Austin (@BennyAustin) September 02, 2014