Benny Austin

SSIS: Optimizing the ETL for Late Arriving Dimensions

 

Late Arriving Dimension/Early Arriving Fact as the name implies is a dimension record which is not available when the ETL loads the related fact record but becomes available at a later time.

This scenario could happen for a variety of reasons;

  • It might be a perfectly valid business process. For e.g. in an emergency ward, the patient interventions could be recorded first as priority while the patient details that is collected could be recorded into the system at a later time.
  • Sometimes due to operational reasons, the data might arrive at different times from their respective source systems. For e.g. The dimensions could be extracted from Master Data repository while the facts are extracted directly from the transaction system at different periods.
  • It might be due a failed ETL run.

The ETL assigns an unknown surrogate key for the Late Arriving Dimension in the related fact table (such as -1) when it does not find the dimension record. The fact table is then revisited later at set intervals to re-determine whether a valid dimension record is now available and the unknown surrogate key is updated with a real surrogate key, if available.

Typical challenges for the SSIS package that does the re-determination of the surrogate key are

  • Scanning potentially big fact tables for the few records with unknowns surrogate key.
  • Natural Key/ Business Key of the dimension record is not usually stored in fact record. It has to be sourced again to re-determine the surrogate key.
  • Might involve row-by-row updates for the re-determined records.

Few tips to tackle the above challenges and optimize the ETL for Late Arriving Dimensions are discussed below

1. Use Filtered Index on Fact Tables

Since the unknown surrogate key is well-defined subset of data, use Filtered Index to fetch fact records that have unknown SK.  Filtered Index is smaller in size and has accurate statistics resulting in better query plan and significant improvement in query response time. This avoids table scan.

--Filtered Index
CREATE NONCLUSTERED INDEX FX_CLIENT
ON dbo.FactEmergency (EVENT_NK,EVENT_START_DATE) WHERE (CLIENT_SK =-1)

Structure the source query of the SSIS dataflow to use the Filtered Index.

2. Use Skinny Staging Tables

Fact tables usually do not store the Natural Key of the dimension. So the Natural Key of the dimension has to be fetched from the source system. Create a skinny staging table that maps the fact record to dimension’s natural key and the re-determined surrogate key. Using the example of the patient in an Emergency ward, the staging table could be something like this

CREATE Table Staging.EmergencyClient
(
EVENT_NK Varchar(20) null,
EVENT_START_DATE Datetime,
CLIENT_ID Varchar(20),
CLIENT_SK INT null
)

The dataflow could be something like the one shown in screen capture 1

Screen Capture 1 – ETL for Late Arriving Dimension

Note that in the above SSIS package,

  • The Natural/Business Key of the dimension (CLIENT_ID in this case) has to be fetched from the source system
  • The ETL has to cater for the scenario where the client information may not be available yet from the source system. So do the lookup only if client’s Natural Key is available for the previously unknown clients.
  • Insert into the staging table only when the dimension surrogate key is found. In this way the staging table has only the records that require update.

3. Avoid Row-by-row updates

By populating the staging table only when a genuine dimension record is found, the staging table contains only the records that require update, the fact record can be updated in single batch.

–Single Update Statement to Avoid Row By Row Update

--Single Update Statement to Avoid Row By Row Update
UPDATE emerg
SET emerg.CLIENT_SK =stage.CLIENT_SK
FROM dbo.FactEmergency AS emerg
	INNER JOIN Staging.EmergencyClient As Stage
	ON stage.EVENT_NK =emerg.EVENT_NK
WHERE emerg.CLIENT_SK=-1

The use of staging tables facilitates batch updates instead of row-by-row updates.

These optimizations would keep ETL for Late Arriving Dimensions in good shape.

 

Benny Austin

About these ads

2 comments on “SSIS: Optimizing the ETL for Late Arriving Dimensions

  1. Kumar
    April 7, 2012

    Since updating the fact tables is more costly than updating the dimension table, I think it is more efficient to first load the dimension with new natural key and set the rest of attributes in dimension table to ‘null’ or some other default. Use the new generated surrogate key to insert in the fact table. Later parse the dimension tables for any such infered rows and update them with the actual data if it has arrived.

    • Benny Austin
      April 8, 2012

      Good idea and will work if the natural key of the dim is known at the time of inserting the fact record. In most late arriving instances this will not be the case.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Copyright

© Benny Austin.
Follow

Get every new post delivered to your Inbox.

Join 183 other followers

%d bloggers like this: