Saturday, August 5, 2017

SSIS Tips: Handling UTF 8 Based File Data


Quite a few times there were cases where we had to transfer CSV data containing non English based characters to SQLServer database. Though it seems straightforward there are few things we need to consider while working with CSV containing non English characters. This post explains the steps that shall be followed while transferring UTF 8 based data from a CSV onto a SQLServer based relational database system


Consider the case where we've a CSV file with the below data

As you see from the above the file has a column which stores non English based data. The requirement was to get this data transferred to a SQLServer database for doing further manipulations with them.
Given this requirement our straightforward way would be to use a data flow task with a flat file source and an OLEDB destination to connect to the SQLServer db.
Lets see what happens when we try to use the standard DFT task for this case
Given below is the screenshot of the standard data flow task using non-unicode datatype I tried to use

As you see from the above it complains of data conversion issue between unicode and non-unicode data types. 
Now lets try using an unicode datatype and see what happens

Now we get an error stating that it cant convert between the two code pages. This clearly implies we cant use the default OLEDB destination in the case of UTF - 8 based file data.
Now lets try using the next possible option i.e. with ADO .Net destination. 
The data flow for this case is as below

The column will be of unicode datatype (nchar, nvarchar) in the database.
Once executed you can check the table to see that data is getting populated correctly

As seen from the above results data got transferred correctly this time to the SQLServer table.


By virtue of the above illustration, its clear that for transferring UTF-8 based file data to SQLServer using SSIS we need to do the below

1. Use UTF-8 based codepage (65001) in the Flat File connection manager
2. Use ADO .Net destination task inside data flow task
3. Use a Unicode based datatype for the table column like nchar, nvarchar