This post explains an SSIS technique that transforms large volumes of data in less time by using Raw Files.
I was working on a task that required me to create a snapshot table by aggregating data from existing fact tables using an SSIS package. There were two fact tables with a total of 20 million records.
The simple solution would be to create a data flow that aggregates data from the fact tables and then inserts it into the snapshot table. While this method is relatively straightforward, database aggregate functions (SUM, MAX, MIN, etc.) degrade in performance over large datasets, which means the package execution time could potentially exceed the allocated job window.
To work around this potential problem, I used raw files instead of database tables as the source of the data to be aggregated.
Moving the aggregation to raw files provides the following benefits:
- The use of database aggregation functions is avoided. Instead, the aggregation happens in the memory of Integration Services, which is significantly faster.
- Because raw files are disconnected data sources, they do not hold up database resources.
Using this approach, the package transformed 20 million records in less than 10 minutes on a 32-bit Windows Server 2003 machine with 4 GB of memory and an x86 Intel Xeon 3.4 GHz processor.
This is the structure of the package (Figure 1):
The Script control at the top uses a package-level variable, RefreshMode, to switch the control flow between FULL and INCREMENTAL load. This enables the package to be reused both for first-time initialization (FULL) and as a daily job (INCREMENTAL).
The control flow consists of:
- Script control to truncate (RefreshMode=FULL) or conditionally delete from (RefreshMode=INCREMENTAL) the snapshot table.
- Dataflow control to create the raw file.
- Dataflow control to aggregate the raw file data and insert it into the snapshot table.
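The FULL/INCREMENTAL branching above can be sketched as follows. This is an illustrative Python sketch, not the actual SSIS script control (which would be written in VB.NET or C#); the table name `FactSnapshot` and column `SnapshotDate` are assumptions for illustration.

```python
from datetime import date

def build_cleanup_sql(refresh_mode: str, cutoff: date) -> str:
    """Return the SQL the script control would issue against the snapshot
    table. FactSnapshot and SnapshotDate are hypothetical names."""
    if refresh_mode == "FULL":
        # First-time initialization: wipe the snapshot table entirely.
        return "TRUNCATE TABLE FactSnapshot"
    elif refresh_mode == "INCREMENTAL":
        # Daily job: remove only the rows that are about to be reloaded.
        return (f"DELETE FROM FactSnapshot "
                f"WHERE SnapshotDate >= '{cutoff.isoformat()}'")
    raise ValueError(f"Unknown RefreshMode: {refresh_mode}")

print(build_cleanup_sql("FULL", date(2010, 1, 1)))
# TRUNCATE TABLE FactSnapshot
```

Keeping this decision in a single script control, driven by one package variable, is what lets the same package serve both load modes.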
The control flow is similar along each path, so I’ll just explain the important Dataflow controls from one of the paths.
The Create Raw File Dataflow consists of an OLEDB source and a raw file destination (shown in Figure 4). The OLEDB source fetches the data from the fact table and writes it to the raw file destination.
The property of the raw file destination is configured as shown below.
Once the raw file is created, the subsequent data flow uses it as a data source. Note that the metadata of the raw file is available only after the file is physically created, so when developing the package, remember to execute the Data Flow that creates the raw file before using it as a source.
💡Tip: Filter your source table to create a raw file with, say, 10 records; all you need for building the package is the metadata of the raw file.
The raw file source is configured as shown below.
In this dataflow, the data from the raw file is aggregated (Sum), and a few derived columns, such as a timestamp, are added to each record before it is inserted into the snapshot table.
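Conceptually, the Aggregate and Derived Column transformations do something like the following in memory. A minimal Python sketch, assuming hypothetical column names (`ProductKey`, `SalesAmount`); the real work is done by the SSIS transformations, not hand-written code.

```python
from collections import defaultdict
from datetime import datetime

# Rows as they would stream out of the raw file source: (ProductKey, SalesAmount).
rows = [(101, 50.0), (102, 20.0), (101, 25.0), (102, 5.0)]

# Aggregate (Sum) transformation: group by key and sum entirely in memory,
# avoiding a database-side GROUP BY over millions of rows.
totals = defaultdict(float)
for key, amount in rows:
    totals[key] += amount

# Derived Column transformation: attach a load timestamp to each record
# before it is inserted into the snapshot table.
load_time = datetime.now()
snapshot_rows = [(key, total, load_time) for key, total in sorted(totals.items())]

print(snapshot_rows[0][:2])
# (101, 75.0)
```

Because the grouping happens in the Integration Services process rather than the database engine, the database is touched only once, for the final insert.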
There is a bit of overhead in creating the raw files, but the overall performance benefit far outweighs it.
A few points to be aware of when using raw files:
- As mentioned earlier, a raw file data source requires the metadata of the raw file to build the data flow at design time, so the physical raw file should be created before a raw file data source is used. Check out How to Create Raw File for Use as SSIS Data Source?
- Raw files are not particularly easy to read. Fortunately, there is a reader available at the SQL Server Community Blog. For larger files, my preference would be to write a simple dataflow to import the raw file into a local database table and query the table.
- It’s good practice to set up Checkpoints in your package, so that if the package fails after creating the raw file, it can skip the raw file creation step in the next run. This is particularly handy for large raw files.
- Raw files tend to grow big, so create them on network storage.
- Use parallel execution paths to take advantage of the multi-threading capabilities of SSIS packages; this improves performance.
- Since much of the transformation happens in memory, a poorly designed package can exhaust it. If you look at my package, I have intentionally set the last Dataflow to execute after the Dataflow from another parallel path, because I did experience memory issues on my dev server (it may not be an issue on 64-bit production servers).
It takes very little time to get the hang of raw files, and they are definitely worth considering for your next heavy-lifting ETL.