Monday, September 9, 2013

Implementing Transaction over File System Objects

Its a common scenario in ETL processes to transfer files from remote location to landing folder and do some processing with them. Recently we had a requirement wherein we were asked to implement transaction processing for the file transfer package. The purpose behind this was to track and rollback the entire batch of files upon any failure as they wanted batch wise atomic behavior to be implemented.
The obvious idea that came to my mind was to make use transaction property of SSIS for achieving this. So I setup a package with following control flow. For sake of simplicity I'm just illustrating part which deals with file transfer without referring much onto processing part.

The control flow will contain a for each loop which points to source directory for iterating through the files in it. Inside loop we've a File System Task to to movement of file from remote location to our landing folder. As highlighted above we've set the transaction option as required for the for each loop to make sure loop starts a transaction for the file transfer. You need to make sure Distributed Transaction Coordinator service is running in machine before you can implement transactions in SSIS.

The file system task looks like this

To mimic an error scenario I've set OverwriteDestination option to False and intentionally copied a file to destination from source before hand so that it fails the loop.
Source directory looks like this
The attempt is to move files to Arch folder.
Running the package will give us following output
This is expected behavior due to presence of file already in the Arch folder. Now lets go and check if transaction action was performed by for loop. To our surprise we will see that files transferred in the current batch prior to error are not rolled back
This reveals the fact that SSIS is not able to rollback changes done by File System Task. I tried the same in 2008,2008 R2 and 2012 versions and in all cases the behavior was the same.
I analyzed further and came up with another approach. I created a file table which points to the destination folder and made ETL package like this.

The caveat is that you need to use 2012 version for this as FileTable is only available from 2012. You can understand more about FileTables from below posts which I published earlier.

The Execute SQL Task will have a variable based dynamic query to add the files to the directory (FileTable). The expression used for query would be.

"INSERT INTO MyFileTable (name,file_stream)
SELECT '"+ @[User::FileName] + "',*
FROM OPENROWSET(BULK '" + @[User::FullFilePath]  +"', SINGLE_BLOB) AS FileData"

The variables FileName get name of file to be transfered from within For Each Loop and FullFilePath will append directory info to get full path.
Now lets see how this new logic works. I've put a breakpoint on ExecuteSQL Task to check intermediate results.
The first iteration looks like this

As you see it fetches first file (Crosslink.jpg) and inserts it to FileTable (Directory). Lets check the intermediate results.

As expected the file got added to directory as highlighted in red above.
Continue running package and it will fail eventually due to file LatestWeekDay being already present.
Now go back and check the folder and you would find all newly added files removed by current package indicating that transaction rollback was successfully done.

By this approach I was able to implement transaction over file transfer process.
This is an approach that can come handy in situations like this.