Wednesday, July 2, 2014

SSIS Tips: Handling Inconsistent Text Qualifiers in Flat File Source

Last week I blogged about how to handle embedded text qualifiers coming in flat file source data, in the below blog.
http://visakhm.blogspot.in/2014/06/ssis-tips-handling-embedded-text.html
One piece of feedback I got was from an ETL developer who asked whether it's possible to handle files with inconsistent text qualifiers using the flat file source in SSIS. He had also sent a sample file, which I've adapted here to illustrate the solution I provided for him.
The file looks like below
As you can see, the file has inconsistent text qualifiers, i.e. the 3rd column uses the combination "| as its text qualifier while the other columns use | alone. It also has " characters coming inside the data itself (see row 3).
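For reference, a couple of hypothetical rows following the structure described above (invented values) would look like this:

|1|,|Product1|,"|Standard description|",|10.00|
|3|,|Product3|,"|Has "quoted" text inside|",|30.00|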
In order to handle this, the basic approach is the same as what we discussed previously. We will add a separate data flow task to handle the inconsistent text qualifiers and make them consistent throughout.
The first data flow looks like below
The flat file source here reads the entire row content of the file as a single column. This is achieved by specifying the row delimiter as {CR}{LF} with no column delimiter. The derived column task then makes the text qualifier consistent by doing a pattern replace using the REPLACE function. The expression used would be this

REPLACE(REPLACE(DataColumn,",\"|",",|"),"|\",","|,")
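
The inner REPLACE turns every ,"| sequence into ,| and the outer one turns |", into |,, so the stray " adjoining the | qualifier on the 3rd column is stripped while genuine " characters inside the data are left untouched.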

This is then saved temporarily into a new flat file, which would have a structure as below
You can see from the above that the text qualifier is now consistent (|).
Now you can use a new data flow task with the above file as the source, choose , as the column delimiter and | as the text qualifier, and it will work fine, populating the data to your destination table.
This is an approach you can apply to extract data out of a flat file with inconsistent text qualifiers and get it loaded into your database table using SSIS.
The sample file and package for this can be found in the below link. 
Replace the connection strings with your actual values and you will be able to execute and test the package for yourself.
Feel free to revert for any further clarification. Happy coding!

Tuesday, July 1, 2014

T-SQL Tips: Retrieve Connectionstring Details from SSIS Package

I've published a new article on the MSDN Wiki which explains how connection string details can be extracted from an existing SSIS package.
You can find the article in the below link

http://social.technet.microsoft.com/wiki/contents/articles/25150.t-sql-retrieve-connectionstring-details-from-a-ssis-package.aspx

This approach can be used when you want to audit existing SSIS packages and understand details like the connection strings they use, which SQL Server instances they try to connect to, etc. It comes in handy in maintenance projects where you inherit an existing system with little or no documentation.
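As a rough illustration of the idea, a query along these lines pulls the connection strings out of packages stored in msdb (a minimal sketch assuming the package deployment model and the SSIS 2012+ package XML format; see the article for the complete version):

;WITH XMLNAMESPACES ('www.microsoft.com/SqlServer/Dts' AS DTS)
SELECT p.[name] AS PackageName,
       c.value('@DTS:ObjectName', 'varchar(200)') AS ConnectionManagerName,
       c.value('(DTS:ObjectData/DTS:ConnectionManager/@DTS:ConnectionString)[1]', 'varchar(4000)') AS ConnectionString
FROM msdb.dbo.sysssispackages p
-- the package definition is stored as an image column, so cast it to XML first
CROSS APPLY (SELECT CAST(CAST(p.packagedata AS varbinary(max)) AS xml) AS PackageXML) x
-- shred out one row per connection manager defined in the package
CROSS APPLY x.PackageXML.nodes('/DTS:Executable/DTS:ConnectionManagers/DTS:ConnectionManager') t(c)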

Feel free to revert for any clarification

Monday, June 23, 2014

SSIS Tips: Handling Embedded Text Qualifiers in SSIS Flat File Source

Recently in one of my projects there was a case where I had to transfer data from a flat file to a SQLServer table. The file was generated by an automated process at the client side and was FTPed to our landing location. I created a simple package to do the data transfer with a single data flow task, which looks like below.
On executing the package it broke with the below error
I checked the source file and it resembled the below structure (actual data replaced by sample for illustration)
On careful analysis I understood the issue. It was due to the presence of embedded text qualifier characters within the data, i.e. the Description value "Descr3 "Descr4" desc5". Now the question was how to handle this and do the data transfer without any issues. In my case it was not possible to fix this at the source as the file generation step was automated. This post deals with a workaround which can be applied to get around the issue.
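To make the problem concrete, here is a hypothetical file along the lines of the one above (invented column names and values, apart from the Description value quoted from it):

ID,Description,Amount
1,"Descr1",100
2,"Descr2",200
3,"Descr3 "Descr4" desc5",300

The flat file source chokes on row 3 because the embedded " characters end the qualified field early, so the remaining text no longer lines up with the column layout.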
I added a new data flow task before the current one. The purpose of this data flow task was to change the text qualifier character (in this case ") to some other character (here the | character) so that it would not be confused with the embedded " characters in the data. The data flow has a structure as below


What this does is read the entire row content of the file as a single column (just don't select anything for the column delimiter), then replace the text qualifier character " with | and save the result as a new file.
The connection manager would be configured as below

Notice the row and column delimiter settings used
Now for the derived column task use an expression as below

SUBSTRING(REPLACE(REPLACE("," + DataColumn + ",","\",","|,"),",\"",",|"),2,LEN("," + DataColumn + ",") - 2)

What this does is replace occurrences of the ", and ," patterns with |, and ,| respectively, which changes the text qualifier character from " to |. A comma (,) is added to the beginning and end of the string to make the pattern consistent, and after the replacement we drop those two added characters using the SUBSTRING function. The embedded " characters inside the data are untouched because they are not adjacent to a comma.
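If you want to see the pattern logic in isolation, here is the same expression emulated in T-SQL over a hypothetical sample row (for illustration only; in the package this happens in the derived column):

DECLARE @DataColumn varchar(200) = '1,"Descr3 "Descr4" desc5",100'

SELECT SUBSTRING(
       REPLACE(REPLACE(',' + @DataColumn + ',', '",', '|,'), ',"', ',|'),
       2, LEN(',' + @DataColumn + ',') - 2)
-- returns 1,|Descr3 "Descr4" desc5|,100 : the qualifiers are now | and the embedded "s remain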
Once this is saved to the new file we will get the modified file structure as below
Now in our main data flow point to the above saved file and configure the settings as below


Now link this to an OLEDB destination to populate your table and on executing you'll get the below data in the table

which clearly shows that the file data was properly transferred to the table.
This is a good workaround you can apply to deal with embedded qualifier characters coming in your data and get your data successfully transferred.
The package and sample file can be accessed from the below link
Package
https://drive.google.com/file/d/0B4ZDNhljf8tQQVRmTkJ4U01BLTA/edit?usp=sharing
Sample file
https://docs.google.com/spreadsheets/d/1HPIaz1VSR9BXfZo27KWxvq2qv12pzLNDRQHRTuCwx0U/edit?usp=sharing
Please revert if you want more clarification on the above and I will be happy to help. Also let me know any other comments you may have. Happy coding!

Monday, June 9, 2014

SSIS Tips: Implementing SCD Functionality Using Hashing Method

There are numerous ways in which you can implement slowly changing dimension processing in SSIS. There is also a standard task available in SSIS for this, called the SCD Task, but it becomes a performance bottleneck for large tables. So quite often people make use of alternatives like a combination of the lookup task with a conditional split to do the SCD processing. The problem with this approach is that it requires a comparison on each of the non-key columns to check whether a value has changed, which is cumbersome for tables with a large number of attribute columns. A better alternative is to apply a hashing function over the columns and use the generated value for comparison. This blog discusses two methods by which you can utilize hashing algorithms to implement SCD processing.

 Method 1 : Applying hashing in SSIS
--------------------------------------------
There is a third party task called Multiple Hash available on CodePlex (http://ssismhash.codeplex.com/) that can be utilized for this.
The task supports a set of algorithms like SHA1, MD5 etc. which can be used for generating the hash values.
The package utilizing the Multiple Hash task would be as below


The source file will look like this

The requirement is to transfer the details from the source file to a dimension table. All attributes except ReservationType have to undergo Type 1 processing, and ReservationType has to be Type 2 processed.
The package workflow would be as below
1. We will have a flat file source that points to the source file to extract the data from it.
2. A Multiple Hash task is used next to generate two hash values: one based on the ReservationType column and a second based on the other non-key columns in the table. The algorithm used in this example is SHA1 with safe null handling enabled. We utilize these two hash values for comparison when doing Type 2 / Type 1 processing of the associated columns.
The Multiple Hash task configuration for the two hash values looks like below

3. There are three lookup tasks used one after the other.
The first lookup is based on the primary key field(s), i.e. ID in the above case.
This is to determine whether the record is a new record or an existing one. The NoMatch output of this lookup is linked to an OLEDB destination which inserts the new records.
In the case of Match output records, i.e. existing ones, we have three conditions:
a. Records with no changes in the values of any of the non-key columns, i.e. the hash values being the same; we need to ignore these.
b. Records with a change in the ReservationType field value. These have to be Type 2 processed (add a new entry + close the existing entry). The table has two fields, ValidFrom and ValidTo, to indicate the validity of each record for historical (Type 2) representation. At any time only one record will have ValidTo as NULL, and it stores the present values of the Type 2 attributes.
c. Records with changes to any of the other non-key columns. These need to be Type 1 processed (a simple update).

For this purpose we link the Match output of the primary key lookup to a second lookup which tries to match on the ReservationType hash value field. The NoMatch output of this includes two OLEDB Command steps: one to close down the existing record by updating the ValidTo field with the current date value, and a second one to insert a new record with ReservationType as the new value, ValidFrom as the current date and ValidTo as NULL (the statements involved are sketched below). The Match output then links to the third lookup task to see if at least the other non-key fields' values have changed. The NoMatch output of this is linked to an OLEDB Command to update the values of the other non-key attributes which have changed. The Match output is ignored as we don't want to do any processing for the unchanged records.
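For concreteness, the statements executed by these steps would be roughly like the below (hypothetical table and column names; the ? placeholders are mapped to pipeline columns in the OLEDB Command task):

-- b. Type 2: close the current version of the record
UPDATE dbo.DimReservation
SET ValidTo = GETDATE()
WHERE ID = ? AND ValidTo IS NULL;

-- b. Type 2: insert the new version with the changed ReservationType
INSERT INTO dbo.DimReservation (ID, ReservationType, ValidFrom, ValidTo)
VALUES (?, ?, GETDATE(), NULL);

-- c. Type 1: overwrite the other non-key attributes in place
UPDATE dbo.DimReservation
SET OtherAttribute1 = ?, OtherAttribute2 = ?
WHERE ID = ? AND ValidTo IS NULL;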

Method 2 : Applying HASHBYTES T-SQL function
-------------------------------------------------------------------------

T-SQL also has a HASHBYTES function which can be utilized in the above scenario. For that, the first step would be to get the flat file data onto a staging table. This can be done using a simple data flow task in SSIS. Once the data is there in the staging table, you can use another data flow task where you do the same set of lookups as in the above case to check the hash values and identify the column value changes.
The second data flow task would look almost the same as the above except for the Multiple Hash step.

We need to be careful about a couple of things while using the HASHBYTES function:

1. Make sure we handle NULLs appropriately. The presence of a NULL value in one of the columns will cause the whole concatenated value to become NULL, so we need to make sure we use ISNULL or COALESCE to convert NULLs to blank values.

2. Another thing to keep in mind is to identify the column values separately by introducing a delimiter in between while concatenating. This is needed because there's a possibility of the concatenated value pattern being the same even though the individual values are not.

See this as an illustration

DECLARE @Col1 varchar(100) = 'Test', @Col2 varchar(100) = 'Code', @col3 varchar(100) = 'In Progress'
DECLARE @ColA varchar(100) = 'Test Code', @ColB varchar(100) = 'In', @colC varchar(100) = 'Progress'

-- both concatenations yield 'Test Code In Progress', so the two hashes collide
SELECT HASHBYTES('SHA1', @Col1 + ' ' + @Col2 + ' ' + @col3), HASHBYTES('SHA1', @ColA + ' ' + @ColB + ' ' + @colC)

You can see that the two hashed values match even though the column values are not the same


Now add a delimiter in between and see the result

DECLARE @Col1 varchar(100) = 'Test', @Col2 varchar(100) = 'Code', @col3 varchar(100) = 'In Progress'
DECLARE @ColA varchar(100) = 'Test Code', @ColB varchar(100) = 'In', @colC varchar(100) = 'Progress'

-- 'Test, Code, In Progress' vs 'Test Code, In, Progress' : the hashes now differ
SELECT HASHBYTES('SHA1', @Col1 + ', ' + @Col2 + ', ' + @col3), HASHBYTES('SHA1', @ColA + ', ' + @ColB + ', ' + @colC)



Adding a delimiter this way ties the hash value to the values of the individual columns and avoids such false matches.
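Putting the NULL handling and delimiter rules together, the hash computation over the staging table might look like this (a sketch with hypothetical table and column names, mirroring the two hash values from Method 1):

SELECT ID,
       -- hash over the Type 2 attribute
       HASHBYTES('SHA1', ISNULL(ReservationType, '')) AS Type2Hash,
       -- delimited hash over the remaining non-key attributes
       HASHBYTES('SHA1', ISNULL(OtherAttribute1, '') + '|' + ISNULL(OtherAttribute2, '')) AS Type1Hash
FROM dbo.StagingReservation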

You may use either of the above methods to apply a hashing algorithm for SCD processing in SSIS. The choice of method depends on whether it's feasible to use a third party component, whether a staging environment is available, etc.
Feel free to revert for any clarification. Enjoy coding!

Monday, May 26, 2014

String Comparison in SSIS Lookup Task: SSIS vs T-SQL Engine

Recently a colleague of mine came to me with an issue he was facing. He was trying to map data from a flat file source against a reference table to get the conversion codes corresponding to the incoming data values. He was using the lookup task and was finding that lots of rows were going through the unmatched output in spite of having matching details in the reference table.
I analyzed his package, tweaked a few options and found out that this was happening only with the default setting of the lookup task, which is full cache mode. When I tried the partial or no cache option it could find the matching details. So I understood it's the difference in the way comparison works in these two cases.
I analyzed further and understood the reason. In full cache mode the comparison happens inside the SSIS engine itself, so it does a case sensitive comparison. In contrast, in partial and no cache modes the comparison takes place in the T-SQL engine, so it is case insensitive by default unless you've changed the default setting to use a case sensitive collation instead.
So in the above scenario, if you still want the comparison to work in full cache mode you need to do a small tweak to unify the casing of the values before the comparison. This can be done by adding a derived column task after the flat file source with an expression like UPPER(ColumnName) to unify the casing of the incoming values, and then using UPPER(ColumnName) in the query used inside the lookup task so the reference values are in upper case too. Once you do that, the comparison works as intended, ignoring the actual case of the values.
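As a sketch (using the ProductName column from the illustration below, with a hypothetical reference table and code column), the derived column expression and the lookup query would be along these lines:

Derived column expression: UPPER(ProductName)

Lookup query:

-- return the reference values in upper case so both sides of the comparison match
SELECT UPPER(ProductName) AS ProductName, ConversionCode
FROM dbo.ReferenceTable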
See the below illustration to see it in action
The source file looks like below
And the reference table has the below data
The attempt is to look up against the table based on the value of the ProductName field. It's quite evident from the above that the file has matching values in the reference table. Now let's see what happens when we execute the package using the full cache option. The package execution looks like below

As you see from the above, all rows from the source were passed through the no match output, indicating that the lookup failed. This is expected behavior, as in the file the values are in mixed case whereas in the table they're in upper case. Since we used the full cache option the comparison took place in the SSIS engine, which is why it did a case sensitive search and the lookup failed to find the matches.
Change the cache mode to partial or no cache mode, and on running the package the execution result looks like below

The above screenshot clearly shows that the lookup matched all the incoming data, indicating that it did a case insensitive search as the comparison was done in SQL Server.
The workaround to obtain the same functionality using full cache mode is as below

The difference here is the addition of a derived column transform before the lookup to convert the input values to upper case to match those present in the reference table. This causes the lookup to match the values even when the comparison takes place inside SSIS, as the casing is now the same between the file and table data.
This is something to take care of while working with the lookup task in SSIS.