Saturday, September 13, 2014

Troubleshooting SQLServer CDC Service for Oracle Capturing Only Changed Column Values for Updates Issue

Recently, for one of my data warehouse projects, we had a scenario which required getting data from an Oracle server. Since this Oracle server was a high end OLTP system with high frequency DML operations, the impact on it had to be kept to a minimum.
For this purpose we made use of the CDC Service for Oracle to replicate the changes from the Oracle server to our SQLServer data warehouse. This worked fine for us until we hit an issue with how data modifications at the Oracle end were being captured.
We could see that CDC captured only the changed columns for UPDATE operations happening on the Oracle server, leaving all the other columns in the table as NULL. This threw off our logic for capturing the changed information from the Oracle delta data.
What we didn't understand was what made this issue suddenly crop up, as everything was working well at the time of implementation. I did some analysis, found the reason, and fixed the issue. I'm sharing it via this blog to benefit others who may get stuck with similar issues.
The key requirement for CDC to work is that the Oracle server should be set up in ARCHIVELOG mode with supplemental logging enabled. This is because CDC makes use of the redo logs to capture the change data and replicate it to the SQLServer database. So my analysis centered on the supplemental logging settings at the Oracle server.
Through the below link
http://cglendenningoracle.blogspot.in/2009/09/what-is-oracle-supplemental-logging.html
I learned that there are two levels of logging, and that for CDC to get enough information from the redo logs and capture it outside of the Oracle instance, database level identification key logging has to be enabled. This information can be obtained from the V$DATABASE view in Oracle, which has a column corresponding to each of the available options, such as
SUPPLEMENTAL_LOG_DATA_FK,
SUPPLEMENTAL_LOG_DATA_PK, 
SUPPLEMENTAL_LOG_DATA_ALL etc. 
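A quick way to check the current settings is a query along these lines (a minimal sketch; the column list may vary slightly by Oracle version):

SELECT supplemental_log_data_min,
       supplemental_log_data_pk,
       supplemental_log_data_fk,
       supplemental_log_data_all
FROM   v$database;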
I checked and found that SUPPLEMENTAL_LOG_DATA_ALL was set to NO. So I enabled it using the below statement

ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
I made some changes in the source tables to check the outcome and found that CDC was able to pick up data in the other columns too from then on, which confirmed that the issue had been fixed by tweaking this setting. Apparently what had happened was that the archive log had got full on the Oracle server, and after that was fixed the supplemental logging settings somehow got reset to their default values. Setting them back to their original state fixed the issue. So whenever you face similar issues with CDC not being able to get the required data from the redo logs, make sure you check the V$DATABASE view for the supplemental logging settings for the various types of columns and, in case they have been reset, fix them using the ALTER DATABASE command.
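For reference, the statements to re-enable the various database level supplemental logging options look like the below (re-enable only the ones that were originally in place for your setup):

ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (FOREIGN KEY) COLUMNS;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;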

Monday, August 25, 2014

T-SQL Tips: Be Cautious With Your Delete Queries

This blog post covers some of the critical things you need to be aware of while implementing delete logic in your project.
The impetus for writing it was a recent experience we had due to a small oversight in a piece of delete code implemented by a fellow team member. By the time we realized the mistake, a considerable amount of damage was already done and we had to do a good amount of work to put everything back in place.
Now onto the details.
There are a few things that need to be taken care of when you're developing a DELETE procedure to clean up data in your project. The main checks you need to do can be summed up as below

1. Make sure the logic is correct
This is the most basic thing you need to take care of :) . Do a point by point comparison of the DELETE query logic against the actual requirement and make sure the query covers all the conditions stated by the requirement.

2. Test with a sample data
Make sure you have a sandbox with sample data available in which you can test the delete logic and confirm it works exactly as per your requirement.
In cases where you don't have such a setup available, i.e. you have just a single environment to work with (which is extremely dangerous IMO!), you can follow the below approach.
Wrap the script within a dummy transaction as below

BEGIN TRAN TestDelete

-- your DELETE statement goes here
DELETE FROM Table
WHERE ...

-- verify what is left in the table before deciding
SELECT * FROM Table

--COMMIT TRAN TestDelete
--ROLLBACK TRAN TestDelete

Check the output from the table after the delete operation; if it matches your expectation, uncomment and execute the COMMIT TRAN statement, otherwise uncomment and execute the ROLLBACK TRAN statement.
This can be applied in cases where you are running a manual delete script.
For automated jobs the above is not possible, so you can take a different approach.
You can make use of the OUTPUT clause to capture the contents of the DELETED table into another staging table and check it after the DELETE. If it doesn't match your expectation you can ROLLBACK the transaction, otherwise you can COMMIT it.
The stub will look like below

BEGIN TRAN TestDelete

DELETE t
OUTPUT DELETED.* INTO OtherTable
FROM Table t
...

-- @ExpectedDeleteCount below is a placeholder for the number of rows you expect the delete to affect
IF (SELECT COUNT(*) FROM OtherTable) <> @ExpectedDeleteCount
 ROLLBACK TRAN TestDelete
ELSE
 COMMIT TRAN TestDelete


3. Always fully qualify the object names used in DELETE statements
This is the most critical of the steps and the one which gave us the most trouble. Make sure you use appropriate aliases and fully qualify the columns, otherwise you may end up flushing out the entire contents of the table as shown below. This is what happened to us, and we had to spend a good amount of time reworking and putting the lost data back in place.
See the below illustration to understand the issue

declare @t table
(
id int,
val varchar(50)
)

insert @t
values (1,'Test'),
(2,'jkgj'),
(1,'grwreg'),
(1,'sdvad'),
(1,'svdb')


declare @t1 table
(
Col1 int,
col2 int
)
insert @t1
values (12,123),(2,234),(5,213)

select * from @t
select * from @t1

delete 
from @t
where id in (select id from @t1)


select * from @t
select * from @t1

If you look at the code above carefully, it has the wrong column placed within the subquery: @t1 has no id column at all. Little did we realize what this would cause.
Checking the output you can see that it deletes all the records from the outer table @t, because the unqualified id in the subquery resolves to the outer table's id column, turning the condition into a trivially true check.
This behavior is by design and is explained in the KB article below

http://support.microsoft.com/kb/298674/en-us

Rewriting the query as below, you can see that it will complain about the invalid column

delete t
from @t t
where t.id in (select t1.id from @t1 t1)

This avoids any possibility of the table getting emptied as happened above.
So the bottom line is: always fully qualify your objects.
Hopefully this gives you some insight into the things to be taken care of while developing and implementing delete queries, and helps you avoid any unintended deletions.

Tuesday, July 29, 2014

SSIS Tips: Extract and Populate Multiple Table Relational Data from Multipart File

I'm back with another scenario regarding text file data transfer.
The impetus for writing this post was a recent mail I got from one of my blog followers, who had a scenario he wanted a solution for.
The scenario was a multipart file containing the details of orders. The file has three different types of rows indicating the order header information, the details and the summary information. The metadata of the three row types differs, so it was not possible to use a standard data flow task with a flat file source to process the file and get the data out of it. Also, the file data has to go into three separate tables in the database, which have referential integrity set up through foreign key constraints. As always with requirements of this type, they had little control over how the file was generated at the source.
I suggested a workaround for handling the scenario, which I'm sharing here.
The file looked like this
As you can see from the above, the file consists of three types of rows. The rows starting with H designate the order header information and consist of column values for OrderNo, OrderDate and CustomerName. The rows starting with D indicate the details of the order, which include ItemName, PartNo for the item, Quantity ordered, UnitPrice of the item and the Shipset number (a shipset represents a unit of delivery for the order). Finally, the rows starting with S represent the order summary information with the latest delivery date (of the last shipset) and the total order value. So a group of rows from H up to the following S represents the details of a single order and has to be stored against the same reference ID value (OrderHeaderID).
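To make the layout concrete, a file of this shape would look something like the snippet below (the values are made up for illustration):

H,ORD1001,2014-07-01,ABC Corp
D,Widget,P-100,5,20.00,1
D,Gadget,P-200,2,35.50,1
S,2014-07-20,171.00
H,ORD1002,2014-07-02,XYZ Ltd
D,Bracket,P-300,10,7.25,1
S,2014-07-25,72.50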
The data from the above file has to go into the below tables in database
OrderHeader (OrderHeaderID (PK IDENTITY),OrderNo,OrderDate,CustomerName)
OrderDetails (OrderHeaderID(FK to OrderHeader),ItemName,PartNo,Qty,UnitPrice,ShipsetNo)
OrderSummary(OrderHeaderID(FK to OrderHeader),LatestDeliveryDate,OrderTotal)
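For reference, a minimal DDL sketch of these tables could look like the below (data types and sizes are assumed, not taken from the actual project):

CREATE TABLE OrderHeader
(
OrderHeaderID int IDENTITY(1,1) PRIMARY KEY,
OrderNo varchar(50),
OrderDate date,
CustomerName varchar(100)
)

CREATE TABLE OrderDetails
(
OrderHeaderID int FOREIGN KEY REFERENCES OrderHeader(OrderHeaderID),
ItemName varchar(100),
PartNo varchar(50),
Qty int,
UnitPrice decimal(18,2),
ShipsetNo int
)

CREATE TABLE OrderSummary
(
OrderHeaderID int FOREIGN KEY REFERENCES OrderHeader(OrderHeaderID),
LatestDeliveryDate date,
OrderTotal decimal(18,2)
)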

The package for this looks like below

The steps involved are
1. Execute SQL task for truncating and preparing StagingTable for data load. The Staging table will have a RowID column with IDENTITY property 
2. Data Flow Task to load the staging table with the file data, reading each row into a single data column in comma separated format (a minimal sketch of the staging table is given below)
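A minimal sketch of the staging table assumed by the queries that follow (data types are my assumption):

CREATE TABLE StagingTable
(
RowID int IDENTITY(1,1) PRIMARY KEY,
Grp int NULL,
DataColumn varchar(max)
)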
3. Execute SQL tasks to generate the group ids identifying which header, detail and summary rows belong to each group. This is done in two steps: first the header rows are assigned a sequential value, which is then applied to the detail and summary rows that follow them.
The logic looks like below
-- assign a sequential group id to each header (H) row, in file order
UPDATE t
SET Grp = Seq
FROM
(
SELECT *, ROW_NUMBER() OVER (ORDER BY RowID) AS Seq
FROM StagingTable
WHERE LEFT(DataColumn, CHARINDEX(',', DataColumn) - 1) = 'H'
) t

-- propagate each header's group id to the detail and summary rows that follow it
UPDATE t
SET Grp = t1.Grp
FROM StagingTable t
CROSS APPLY (SELECT TOP 1 Grp
             FROM StagingTable
             WHERE Grp IS NOT NULL
             AND RowID < t.RowID
             ORDER BY RowID DESC) t1
WHERE t.Grp IS NULL

The logic orders the records on the basis of the RowID value, generates a sequence ID using the ROW_NUMBER function and assigns it to the header (H) rows. A further update statement then assigns the same group id to the detail and summary rows that come after each header row. So at the end of this step all rows belonging to a single order will have the same group id value. This is required for applying the newly generated OrderHeaderID (IDENTITY value) from the OrderHeader table to each of the detail and summary rows of the order during the final data population.
4. Execute SQL task to execute a procedure that does the final data load, with logic as below

CREATE PROC PopulateOrderTableData
AS

DECLARE @INSERTED_ORDERS table
(
OrderHeaderID int,
GrpID int
)

IF OBJECT_ID('tempdb..#Headers') IS NOT NULL
DROP TABLE #Headers
IF OBJECT_ID('tempdb..#Details') IS NOT NULL
DROP TABLE #Details
IF OBJECT_ID('tempdb..#Summary') IS NOT NULL
DROP TABLE #Summary

-- split the staged rows into typed sets: ParseValues breaks the comma separated
-- DataColumn into (ID,Val) pairs and PIVOT spreads them into columns [1],[2],...
SELECT RowID,Grp,[1],[2],[3],[4]
INTO #Headers
FROM StagingTable s
CROSS APPLY dbo.ParseValues(s.DataColumn,',')f
PIVOT (MAX(Val) FOR ID IN ([1],[2],[3],[4]))p
WHERE [1] = 'H'

SELECT RowID,Grp,[1],[2],[3],[4],[5],[6]
INTO #Details
FROM StagingTable s
CROSS APPLY dbo.ParseValues(s.DataColumn,',')f
PIVOT (MAX(Val) FOR ID IN ([1],[2],[3],[4],[5],[6]))p
WHERE [1] = 'D'

SELECT RowID,Grp,[1],[2],[3]
INTO #Summary
FROM StagingTable s
CROSS APPLY dbo.ParseValues(s.DataColumn,',')f
PIVOT (MAX(Val) FOR ID IN ([1],[2],[3]))p
WHERE [1] = 'S'

-- dummy MERGE (ON 1=2 never matches) so the OUTPUT clause can capture both the
-- generated IDENTITY value and the source Grp value in a single statement
MERGE INTO OrderHeader t
USING #Headers s
ON 1=2 
WHEN NOT MATCHED THEN
INSERT (OrderNo ,OrderDate ,CustomerName)
VALUES (s.[2],s.[3],s.[4])
OUTPUT INSERTED.OrderHeaderID,s.Grp INTO @INSERTED_ORDERS;

INSERT OrderDetails
SELECT i.OrderHeaderID,[2],[3],[4],[5],[6] 
FROM #Details d
INNER JOIN @INSERTED_ORDERS i
ON i.GrpID = d.Grp

INSERT OrderSummary
SELECT i.OrderHeaderID,[2],[3] 
FROM #Summary d
INNER JOIN @INSERTED_ORDERS i
ON i.GrpID = d.Grp

The procedure creates three temporary tables, #Headers, #Details and #Summary, and populates them with the header, detail and summary rows of the file data. The logic makes use of the UDF ParseValues explained here http://visakhm.blogspot.in/2010/02/parsing-delimited-string.html
to parse the comma separated data in the staging column into separate values, which are then pivoted into multiple columns.
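Going by how the UDF is used above, it returns an (ID, Val) pair per parsed token, so a call like the below (sample value made up) would return (1,'H'), (2,'1001'), (3,'2014-07-01') and (4,'ABC Corp'):

SELECT ID, Val
FROM dbo.ParseValues('H,1001,2014-07-01,ABC Corp', ',')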
Then it makes use of a dummy MERGE statement to do the insertion into the OrderHeader table and captures the generated OrderHeaderID values along with the corresponding Grp values into a table variable. The MERGE statement is necessary because we need to capture the Grp value from the source table along with the IDENTITY value using the OUTPUT clause, and this is possible only through a MERGE statement. Once this is done, we populate the rest of the tables from the corresponding temporary tables, using a join to the table variable on the Grp value to get the generated identity value inserted into the tables.
At the end you can run a set of select statements against your tables and see the populated data as below

As seen from the above, the data got populated into our tables as expected.
This is a good approach you can use to extract data out of a multipart file and populate a set of related tables in the database while maintaining the referential integrity.
The main point to remember here is the generation of a group id value to tie the rows of the same order together, which makes sure the generated OrderHeaderID gets assigned correctly to each of them.

The package and sample file used in the illustration can be downloaded from the below link

Package :

Sample file:

Feel free to get back for any comments/clarification you may have on the above

Wednesday, July 2, 2014

SSIS Tips: Handling Inconsistent Text Qualifiers in Flat File Source

Last week I blogged about how to handle embedded text qualifiers in flat file source data:
http://visakhm.blogspot.in/2014/06/ssis-tips-handling-embedded-text.html
One piece of feedback I got was from an ETL developer who asked whether it's possible to handle files with inconsistent text qualifiers using the flat file source in SSIS. He also sent a sample file, which I've adapted here to illustrate the solution I provided for him.
The file looks like below
As you can see, the file has inconsistent text qualifiers, i.e. the 3rd column has the combination "| as its text qualifier while the other columns use | alone. It also has " characters coming inside the data itself (see row 3).
In order to handle this, the basic approach is the same as what we discussed previously. We will add a separate data flow task to handle the inconsistent text qualifiers and make them consistent throughout.
The first data flow looks like below
The flat file source here reads the entire row contents of the file as a single column. This is achieved by specifying the row delimiter as {CR}{LF} and no column delimiter. The derived column task then makes the text qualifier consistent by doing a pattern replace using the REPLACE function. The expression used is this

REPLACE(REPLACE(DataColumn,",\"|",",|"),"|\",","|,")
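As an illustration with made-up values, a row like the first line below comes out of the derived column as the second line; the "| at the start and |" at the end of the 3rd column collapse into plain | qualifiers, while the " characters embedded in the data are left untouched:

|10|,|John "JJ" Smith|,"|Sales, East|",|2500|
|10|,|John "JJ" Smith|,|Sales, East|,|2500|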

This is then saved into a new flat file temporarily, which would have a structure as below.
You can see from the above that the text qualifier is now consistent (|).
Now you can use a new data flow task with the above file as the source, choose , as the delimiter and | as the text qualifier, and it will work fine, populating the data to your destination table.
This is an approach you can apply to extract data out of flat file with inconsistent text qualifiers and get it loaded into your database table using SSIS.
The sample file and package for this can be found in the below link. 
Replace the connectionstrings with your actual values and you will be able to execute and test the package for yourself.
Feel free to revert for any further clarification. Happy coding!

Tuesday, July 1, 2014

T-SQL Tips: Retrieve Connectionstring Details from SSIS Package

I've published a new article on MSDN Wiki which explains how connectionstring details can be extracted from an existing SSIS package.
You can find the article in the below link

http://social.technet.microsoft.com/wiki/contents/articles/25150.t-sql-retrieve-connectionstring-details-from-a-ssis-package.aspx

This approach can be used when you want to audit existing SSIS packages and understand details like the connection strings they use and which SQLServer instances they try to connect to. It comes in handy in maintenance projects where you inherit an existing system with little or no documentation.
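As a rough sketch of the idea (not the exact query from the article), the package XML can be loaded and queried for connection manager details. The element and attribute names below assume the SSIS 2012+ .dtsx format and a hypothetical file path, so adjust them for your environment and SSIS version:

DECLARE @pkg xml;

-- load the .dtsx file (path is a placeholder) and cast it to XML
SELECT @pkg = CAST(BulkColumn AS xml)
FROM OPENROWSET(BULK 'C:\SSIS\SamplePackage.dtsx', SINGLE_BLOB) AS x;

-- pull out each connection manager's name and connection string
WITH XMLNAMESPACES ('www.microsoft.com/SqlServer/Dts' AS DTS)
SELECT cm.value('@DTS:ObjectName', 'varchar(200)') AS ConnectionManagerName,
       cm.value('(DTS:ObjectData/DTS:ConnectionManager/@DTS:ConnectionString)[1]', 'varchar(4000)') AS ConnectionString
FROM @pkg.nodes('/DTS:Executable/DTS:ConnectionManagers/DTS:ConnectionManager') AS t(cm);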

Feel free to revert for any clarification