Late Arriving Dimension/Early Arriving Fact as the name implies is a dimension record which is not available when the ETL loads the related fact record but becomes available at a later time.
This scenario could happen for a variety of reasons;
The ETL assigns an unknown surrogate key for the Late Arriving Dimension in the related fact table (such as -1) when it does not find the dimension record. The fact table is then revisited later at set intervals to re-determine whether a valid dimension record is now available and the unknown surrogate key is updated with a real surrogate key, if available.
Typical challenges for the SSIS package that does the re-determination of the surrogate key are
Few tips to tackle the above challenges and optimize the ETL for Late Arriving Dimensions are discussed below
Since the unknown surrogate key is well-defined subset of data, use Filtered Index to fetch fact records that have unknown SK. Filtered Index is smaller in size and has accurate statistics resulting in better query plan and significant improvement in query response time. This avoids table scan.
--Filtered Index CREATE NONCLUSTERED INDEX FX_CLIENT ON dbo.FactEmergency (EVENT_NK,EVENT_START_DATE) WHERE (CLIENT_SK =-1)
Structure the source query of the SSIS dataflow to use the Filtered Index.
Fact tables usually do not store the Natural Key of the dimension. So the Natural Key of the dimension has to be fetched from the source system. Create a skinny staging table that maps the fact record to dimension’s natural key and the re-determined surrogate key. Using the example of the patient in an Emergency ward, the staging table could be something like this
CREATE Table Staging.EmergencyClient ( EVENT_NK Varchar(20) null, EVENT_START_DATE Datetime, CLIENT_ID Varchar(20), CLIENT_SK INT null )
The dataflow could be something like the one shown in screen capture 1
Screen Capture 1 – ETL for Late Arriving Dimension
Note that in the above SSIS package,
By populating the staging table only when a genuine dimension record is found, the staging table contains only the records that require update, the fact record can be updated in single batch.
–Single Update Statement to Avoid Row By Row Update
--Single Update Statement to Avoid Row By Row Update UPDATE emerg SET emerg.CLIENT_SK =stage.CLIENT_SK FROM dbo.FactEmergency AS emerg INNER JOIN Staging.EmergencyClient As Stage ON stage.EVENT_NK =emerg.EVENT_NK WHERE emerg.CLIENT_SK=-1
The use of staging tables facilitates batch updates instead of row-by-row updates.
These optimizations would keep ETL for Late Arriving Dimensions in good shape.
Since updating the fact tables is more costly than updating the dimension table, I think it is more efficient to first load the dimension with new natural key and set the rest of attributes in dimension table to ‘null’ or some other default. Use the new generated surrogate key to insert in the fact table. Later parse the dimension tables for any such infered rows and update them with the actual data if it has arrived.
Good idea and will work if the natural key of the dim is known at the time of inserting the fact record. In most late arriving instances this will not be the case.