Monday, June 9, 2014

SSIS Tips: Implementing SCD Functionality Using Hashing Method

There are numerous ways in which you can implement slowly changing dimension processing in SSIS. There is also a standard task available in SSIS called SCD Task for this. But the task will cause a performance  bottleneck for large tables. So quite often people make use of alternatives like using a combination of lookup task with conditional split to do the SCD processing. But problem with this approach is that it would require comparison on each of the non-key columns to check if the value has changed and would be cumbersome for large tables with large number of attribute columns. So a better alternative here would be to apply hashing function over the columns and use the generated value for comparison. This blog discusses two methods by which you can utilize the hashing algorithms to implement SCD processing.

 Method 1 : Applying hashing in SSIS
There is a third party task called Multiple hash available in codeplex ( that can be utilized for this.
The task has a set of algorithms available like SHA1,MD5 etc which can used for generating the hash values.
The package utilizing the Mutiple Hash task would be as below

The source file will look like this

The requirement is to transfer the details from the source file to dimension table. All except ReservationType attribute has to be undergo Type 1 processing and Reservation Type has to be Type 2 processed.
The package workflow would be as below
1. We will have a flat file source that points to the source file to extract the data from it.
2. A Multiple Hash task is used next to generate two hashing values one based on the Reservation Type column and second one based on other non key columns in the table. The algorithm used in this example is SHA1 with safe null handling enabled. We utilize these two hash values for comparison and doing Type 2 / Type 1 processing of the associated columns.
The Multiple Hash task configuration for the two hash values look like below

3. There are three lookup tasks used one after the other
The first lookup would be based on primary key field(s) ie ID in above case
This is to understand whether the record is a new record or an existing one. The NoMatch output of this lookup would be linked to OLEDB destination which will insert the new records.
In the case of  Match output records ie existing ones , we have three conditions
a. Records with no changes in values of any of the non key values ie hash values beng same we need to ignore these.
b. Records with change in ReservationType field value. These have to be Type 2 processed (add new entry + close existing entry). The table has two fields ValidFrom and ValidTo to indicate the validity of each record for historical (Type 2) representation. At any time only one record will have ValidTo as NULL which will store the present values of the Type 2 attributes.
c. Records with changes to any of the other non key columns. These need to be Type 1 processed (simple update)

For this purpose we link Match output of the primary key lookup task to second lookup which will try to match on ReservationType hash value field. The NonMatch output of this would include two OLEDB command steps one to close down existing record by updating ValidTo field with the current date value and second one to insert a new record with ReservationType as new value, ValidFrom as current date and ValidTo as NULL. The Matched output would then link to the third lookup task to see if atleast the other nonkey fields value have changed. The NoMatch output of this will be linked to OLEDB Command to update the values of other non key attributes which have changed. The Match output will be ignored as we dont want to do any processing for the unchanged records.

Method 2 : Applying HASHBYTES T-SQL function

T-SQL also has a HASHBYTES function which can be utilized in the above scenario.For that the first step would be to get the flat file data onto a staging table. This can be done using a simple data flow task in SSIS. Once the data is there in staging table, You can use another data flow task where you can do same set of lookups as in the above case to check for hash values and identify the column value changes.
The second data flow task would look almost same as the above except for the Multiple Hash step.

We need to be careful on couple of things while using the HASHBYTES function

1. Make sure we handle NULLs appropriately. The presence of NULL value in one of the columns can cause value to become NULL after concatenation so we need to make sure we use ISNULL or COALESCE to convert NULLs to blank values

2. Another thing we need to keep in mind is to make sure we identify column values separately by introducing a delimiter in between while concatenating. This is needed because there's a slight possibility of  concatenated value pattern being same though individual values are not

See this as an illustration

DECLARE @Col1 varchar(100)= 'Test',@Col2 varchar(100) = 'Code',@col3 varchar(100)='In Progress'
DECLARE @ColA varchar(100)= 'Test Code',@ColB varchar(100) = 'In',@colC varchar(100)='Progress'

SELECT HASHBYTES('SHA1',@Col1+ '  ' + @Col2+ '  ' + @col3),HASHBYTES('SHA1',@ColA+ '  ' + @ColB+ '  ' + @colC)

You can see that two hashed values match though column values are not the same

Now add a delimiter in between and see the result

DECLARE @Col1 varchar(100)= 'Test',@Col2 varchar(100) = 'Code',@col3 varchar(100)='In Progress'
DECLARE @ColA varchar(100)= 'Test Code',@ColB varchar(100) = 'In',@colC varchar(100)='Progress'

SELECT HASHBYTES('SHA1',@Col1+ ', ' + @Col2+ ', ' + @col3),HASHBYTES('SHA1',@ColA+ ', ' + @ColB+ ' ,' + @colC)

This method can be used to ensure uniqueness of hash value based on the values of the individual columns.

You may use any of the above methods to apply hashing algorithm for SCD processing in SSIS. The choice of method depends on whether its feasible to use third party component, whether staging environment is available etc.
Feel free to revert for any clarification. Enjoy coding!