Friday, September 26, 2014

T-SQL Tips: Multifaceted MERGE Statement

The MERGE statement was a very useful addition introduced in SQL Server 2008. It comes in handy in many day-to-day scenarios such as UPSERT (UPDATE + INSERT) operations, slowly changing dimension processing, etc.
This blog explains two not-so-common scenarios where the MERGE statement comes in handy.

1. Generating an "Incremental" Data Population Script
In a typical project there is a need to create data scripts for populating generic values like configuration entries, master data, audit/control table entries etc. Over the course of the project these data scripts keep growing as new values get added. Normally this is handled by adding a new INSERT script each time, so at the end of the project, when doing the final migration to production, we have a set of INSERT statements which may even span multiple files.
This can be replaced by a single MERGE statement to which we keep adding values; each time it is executed against the database it applies only the incremental changes (the ON clause should compare the natural key of the data).
The statement would look like the below

MERGE INTO Table  AS d
USING (VALUES('Value1','Value2','Value3')) AS s (Col1, Col2, Col3)
ON s.Col1 = d.Col1
WHEN MATCHED
AND (
s.Col2 <> d.Col2
OR s.Col3 <> d.Col3
)
THEN 
UPDATE SET Col2 = s.Col2, Col3 = s.Col3
WHEN NOT MATCHED
THEN 
INSERT (Col1,Col2,Col3)
VALUES(s.Col1,s.Col2,s.Col3);
Illustration
Consider the code below

declare @t table
(
id int identity(1,1),
col1 varchar(50),
col2 varchar(50),
col3 int
)

MERGE @t AS d
USING (VALUES('Value1','Cat1',100),
('Value2','Cat1',150),
('Value1','Cat2',300),
('Value2','Cat2',225),
('Value3','Cat2',430),
('Value1','Cat3',520)
) AS s(col1,col2,col3)
ON s.col1 = d.col1
AND s.col2 = d.col2 -- match on the (col1, col2) natural key so the script can be rerun incrementally
WHEN MATCHED
AND COALESCE(s.col3,0) <> COALESCE(d.col3,0)
THEN 
UPDATE 
SET col3 = s.col3
WHEN NOT MATCHED BY TARGET
THEN
INSERT (col1,col2,col3)
VALUES(s.col1,s.col2,s.col3);

SELECT * FROM @t

Now suppose you want to add three more rows; you can just extend the same MERGE statement as below

declare @t table
(
id int identity(1,1),
col1 varchar(50),
col2 varchar(50),
col3 int
)

MERGE @t AS d
USING (VALUES('Value1','Cat1',100),
('Value2','Cat1',150),
('Value1','Cat2',300),
('Value2','Cat2',225),
('Value3','Cat2',430),
('Value1','Cat3',550),
('Value2','Cat3',610),
('Value1','Cat4',735),
('Value2','Cat4',821)
) AS s(col1,col2,col3)
ON s.col1 = d.col1
AND s.col2 = d.col2 -- match on the (col1, col2) natural key so the script can be rerun incrementally
WHEN MATCHED
AND COALESCE(s.col3,0) <> COALESCE(d.col3,0)
THEN 
UPDATE 
SET col3 = s.col3
WHEN NOT MATCHED BY TARGET
THEN
INSERT (col1,col2,col3)
VALUES(s.col1,s.col2,s.col3);


SELECT * FROM @t

In the extended statement above I've added three new rows and changed the value of one existing row (the original post highlighted the changed rows in green). Executing this against a target that already holds the earlier values applies only the incremental changes, i.e. the new rows plus the modified one. You can keep extending the script like this, and at the end you'll have a single MERGE statement in a single file holding all the values you need to migrate to your production environment, rather than a collection of small incremental INSERT scripts.
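Note that in the illustrations the target is a table variable, so each snippet starts from an empty table; the incremental behaviour only shows up against a permanent table that keeps its data between runs. A minimal sketch, assuming a hypothetical permanent table named dbo.ConfigValues (the name is mine, not from the original script):

-- Hypothetical permanent target for the data population script
CREATE TABLE dbo.ConfigValues
(
id int IDENTITY(1,1) PRIMARY KEY,
col1 varchar(50),
col2 varchar(50),
col3 int
)

-- Run the first MERGE against dbo.ConfigValues: six rows get inserted.
-- Run the extended MERGE against the same table: only the delta is applied,
-- i.e. three new rows are inserted and ('Value1','Cat3') is updated from 520 to 550.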

2. Populating Master Child Tables

Another major application of the MERGE statement is populating master-child tables from external data. I had previously blogged about a similar scenario here

http://visakhm.blogspot.in/2010/04/using-xml-to-batch-load-master-child.html

Now let's see how the MERGE statement can be used to achieve the same requirement. MERGE provides a much more compact way of doing this without the need for any intermediate temporary table.

The XML looks like the one in the file below
https://drive.google.com/file/d/0B4ZDNhljf8tQbHdfZUFteXNRcDg/edit?usp=sharing
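The linked file isn't reproduced here, so to keep the example self-contained, below is a hypothetical @x XML variable whose element names match the ones referenced in the MERGE source query further down (the values are purely illustrative):

DECLARE @x xml =
'<Orders>
  <Order>
    <OrderDesc>Office supplies order</OrderDesc>
    <OrderDate>2014-07-01T00:00:00</OrderDate>
    <CustomerDesc>ABC Traders</CustomerDesc>
    <ShippedDate>2014-07-10T00:00:00</ShippedDate>
    <WebOrder>1</WebOrder>
    <OrderItem>
      <ItemDesc>Stapler</ItemDesc>
      <Qty>10</Qty>
      <ItemPrice>5.25</ItemPrice>
    </OrderItem>
    <OrderItem>
      <ItemDesc>Printer paper</ItemDesc>
      <Qty>20</Qty>
      <ItemPrice>3.10</ItemPrice>
    </OrderItem>
  </Order>
</Orders>'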

The tables can be given as below
declare @order table
(
OrderID int  IDENTITY(100000,1),
OrderDesc varchar(1000),
OrderDate datetime,
CustomerDesc varchar(1000),
ShippedDate datetime,
WebOrder bit
)
declare @OrderItems table
(
OrderItemID int IDENTITY(1000,1),
OrderID int,
ItemDesc varchar(100),
Qty int,
ItemPrice decimal(15,2)
)

For illustration I've given them as table variables, but in the actual case these would be permanent tables.
Now let's see how MERGE can be applied in the above scenario.

Illustration
The code will look like below


INSERT INTO @OrderItems
(
OrderID,
ItemDesc,
Qty,
ItemPrice
)
SELECT *
FROM
(
MERGE INTO @order AS d
USING (SELECT t.u.value('OrderDesc[1]','varchar(100)') AS OrderDesc,
       t.u.value('OrderDate[1]','datetime') AS OrderDate,
       t.u.value('CustomerDesc[1]','varchar(1000)') AS CustomerDesc,
       t.u.value('ShippedDate[1]','datetime') AS ShippedDate,
       t.u.value('WebOrder[1]','bit') AS WebOrder,
       m.n.value('ItemDesc[1]','varchar(100)') AS ItemDesc,
       m.n.value('Qty[1]','int') AS Qty,
       m.n.value('ItemPrice[1]','decimal(15,2)') AS ItemPrice  
       FROM @x.nodes('/Orders/Order')t(u)
       CROSS APPLY u.nodes('OrderItem')m(n)) AS s
 ON d.OrderDesc = s.OrderDesc
 AND d.OrderDate = s.OrderDate
 WHEN NOT MATCHED
 THEN
 INSERT ( 
 OrderDesc ,
OrderDate ,
CustomerDesc ,
ShippedDate ,
WebOrder
)
VALUES
(
s.OrderDesc ,
s.OrderDate ,
s.CustomerDesc ,
s.ShippedDate ,
s.WebOrder
)
OUTPUT inserted.OrderID,
s.ItemDesc,
s.Qty,
s.ItemPrice
)t (OrderID,
ItemDesc,
Qty,
ItemPrice);

Now check the output and you can see the below



As seen above, the data gets populated to the master and the child table preserving the referential integrity. This is made possible through the use of the OUTPUT clause with MERGE. The OUTPUT clause, when used with MERGE, exposes the columns of the source table/query in addition to the INSERTED and DELETED columns. This enables us to capture the generated ID values from the parent table, as well as all the required column values for the child table from the source query, within the same statement; the captured rows are then fed to a nested INSERT that populates the child table. This is a scenario where MERGE comes in really handy.
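For contrast, a plain INSERT ... OUTPUT cannot reference columns of its source query in the OUTPUT clause, which is exactly why the MERGE-based form is needed; a sketch of the form that does not work (same tables and columns as above):

-- Not allowed: the OUTPUT clause of a plain INSERT can only expose columns
-- of the inserted rows, so the s.* columns below cannot be bound.
-- INSERT INTO @order (OrderDesc, OrderDate, CustomerDesc, ShippedDate, WebOrder)
-- OUTPUT inserted.OrderID, s.ItemDesc, s.Qty, s.ItemPrice
-- SELECT OrderDesc, OrderDate, CustomerDesc, ShippedDate, WebOrder
-- FROM (...) AS s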
Hope you enjoyed this article on two not-so-common practical scenarios where MERGE can be applied. As always, feel free to get back with your clarifications/comments.
Will be back with a new T-SQL tip soon!

Wednesday, September 17, 2014

T-SQL Tips: Fun with date FORMATing

FORMAT is a new function introduced in SQL Server 2012 which can be used to format date and numeric values into a variety of formats and styles.
This blog discusses some cool date formatting tips that can be achieved using the FORMAT function.

1. Display the date value based on a language/locale culture

declare @dt datetime = GETDATE()

SELECT FORMAT(@dt,'d','en-GB') AS BritishFormat,FORMAT(@dt,'D','en-US') AS UsFormat,
FORMAT(@dt,'D','zh-cn') AS ChineseFormat,FORMAT(@dt,'D','ta-IN') AS TamilIndiaFormat,
FORMAT(@dt,'D','ml-IN') AS MalayalamIndiaFormat



2. Display date as per any of the custom formats

declare @dt datetime = GETDATE()

SELECT FORMAT(@dt,'dd/MM/yyyy hh:mm:ss tt') AS DateWithTime12h,
FORMAT(@dt,'dd MMM yyyy HH:mm:ss') AS DateWithTime24h,
FORMAT(@dt,'HH:mm:ss.fff') AS TimeAloneWithMillisec,
FORMAT(@dt,'yyyy-MM-dd HH:mm:ss.fff zz') AS DateTimeWithOffset,
FORMAT(@dt,'dddd dd MMMM yyyy gg') AS DayNameWithDateEra


3. Display component parts of a date value based on custom formats

declare @dt datetime = GETDATE()

SELECT FORMAT(@dt,'yyyy\MM') AS Period,
FORMAT(@dt,'hh:mm:ss tt') AS Time12h,
FORMAT(@dt,'HH:mm:ss') AS Time24h,
FORMAT(@dt,'dddd dd\t\h MMMM') AS DayMonthDate,
FORMAT(@dt,'HH \h mm \min ss \sec') AS TimeElapsedSinceDayStart,
FORMAT(@dt,'yyyy-MM-ddTHH:mm:sszzz') AS XMLDateFormat,
FORMAT(@dt,'yyyy-MM-ddTHH:mm:ssZ') AS XMLDateFormatUTC


From the above we can see that FORMAT is a very useful function which provides the ability to generate flexible presentation formats from a given date value.

Saturday, September 13, 2014

Troubleshooting SQLServer CDC Service for Oracle Capturing Only Changed Column Values for Updates Issue

Recently, for one of my data warehouse projects, we had a scenario which required getting data from an Oracle server. Since this Oracle server was a high-end OLTP system with high-frequency DML operations, the impact on it had to be kept to a minimum.
For this purpose we made use of the CDC Service for Oracle to replicate the changes from the Oracle server to our SQL Server data warehouse. This worked fine for us until we faced an issue with data modifications happening at the Oracle end.
We could see that CDC captured only the changed columns for UPDATE operations happening on the Oracle server, leaving all the other columns of the table with NULL values. This threw off our logic for capturing the changed information from the Oracle delta data.
What we didn't understand was what made this issue suddenly crop up, as everything was working well at the time of implementation. I did some analysis, found the reason, and that helped us fix the issue. I'm sharing it via this blog to benefit others who may get stuck with similar issues.
The key requirement for CDC to work is that the Oracle server should be set up in ARCHIVELOG mode with supplemental logging enabled. This is because CDC makes use of the redo logs to capture the change data and replicate it to the SQL Server database. So my analysis was around the supplemental logging settings on the Oracle server.
Through the below link
http://cglendenningoracle.blogspot.in/2009/09/what-is-oracle-supplemental-logging.html
I learned that there are two levels of logging, and that for CDC to get enough information from the redo logs and capture it outside the Oracle instance, database-level identification key logging has to be enabled. This information can be obtained from the V$DATABASE view in Oracle, which has a column for each of the available options, such as
SUPPLEMENTAL_LOG_DATA_FK,
SUPPLEMENTAL_LOG_DATA_PK, 
SUPPLEMENTAL_LOG_DATA_ALL etc. 
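A quick way to inspect these flags (along with the archive log mode mentioned earlier) is a query along the lines of the one below, run on the Oracle instance by a user with access to the V$ views:

SELECT log_mode,
       supplemental_log_data_min,
       supplemental_log_data_pk,
       supplemental_log_data_fk,
       supplemental_log_data_all
FROM   v$database;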
I checked and found that SUPPLEMENTAL_LOG_DATA_ALL was set to NO, so I tried setting it to YES using the below statement

ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
I made some changes in the source tables to check the outcome and found that CDC was able to pick up the data in the other columns too from then on, which confirmed that the issue had been fixed by tweaking this setting. Apparently what had happened was that the archive log had filled up on the Oracle server, and after that was fixed the supplemental logging settings somehow got reset to their default values. Setting them back to the original state fixed the issue. So whenever you face similar issues with CDC not getting the required data from the redo logs, make sure you check the V$DATABASE view for the supplemental logging settings of the various column types and, in case they have been reset, fix them using the ALTER DATABASE command.

Monday, August 25, 2014

T-SQL Tips: Be Cautious With Your Delete Queries

This blog speaks about some of the critical things you need to be aware of while implementing delete logic for your project.
The impetus for writing it was a recent experience we had due to a small oversight in a piece of delete code implemented by a fellow team member. By the time we realized the mistake, a considerable amount of damage had already been done and it took a good amount of work to put everything back in place.
Now onto the details.
There are a few things that need to be taken care of when you're developing a DELETE procedure to clean up the data in your project. The main checks can be summed up as below

1. Make sure the logic is correct
This is the most basic thing to take care of :). Do a part-by-part comparison of the DELETE query logic against the actual requirement and make sure the query covers all the conditions stated by the requirement.

2. Test with a sample data
Make sure you have a sandbox with sample data available in which you can test the delete logic and confirm it works exactly as per your requirement.
In cases where you don't have such a setup available, i.e. you have just a single environment to work with (which is extremely dangerous IMO!), you can follow the below approach.
Wrap the script within a dummy transaction as below

BEGIN TRAN TestDelete

-- <your DELETE statement goes here>

SELECT * FROM Table

--COMMIT TRAN TestDelete
--ROLLBACK TRAN TestDelete

Check the output from the table after the delete operation; if it matches your expectation, uncomment and execute the COMMIT TRAN statement, otherwise uncomment and execute the ROLLBACK TRAN statement.
This can be applied in cases where you run the delete script manually.
For automated jobs the above is not possible, so you need a different approach.
You can make use of the OUTPUT clause to capture the contents of the DELETED table into another staging table and check it after the DELETE. If it doesn't match your expectation you can ROLLBACK the transaction, otherwise you COMMIT it.
The stub will look like below

BEGIN TRAN TestDelete

DELETE t
OUTPUT DELETED.* INTO OtherTable
FROM Table t
...

IF (SELECT COUNT(*) FROM OtherTable) <> @ExpectedRowCount -- placeholder: substitute your own expected count or sanity check
 ROLLBACK TRAN TestDelete
ELSE
 COMMIT TRAN TestDelete


3. Always fully qualify the object names used in the DELETE statement
This is the most critical of the steps and the one that gave us the most trouble. Make sure you use appropriate aliases and fully qualify the columns, otherwise you may end up flushing out the entire data of the table as shown below. This is what happened to us, and we had to spend a good amount of time reworking and putting the lost data back in place.
See the below illustration to understand the issue

declare @t table
(
id int,
val varchar(50)
)

insert @t
values (1,'Test'),
(2,'jkgj'),
(1,'grwreg'),
(1,'sdvad'),
(1,'svdb')


declare @t1 table
(
Col1 int,
col2 int
)
insert @t1
values (12,123),(2,234),(5,213)

select * from @t
select * from @t1

delete 
from @t
where id in (select id from @t1)


select * from @t
select * from @t1

If you look at the code above carefully, it has a wrong column placed within the subquery (@t1 has no id column). Little did we realize the damage this would cause.
Checking the output, you can see that it deletes all the records from the outer table @t, because the unqualified id resolves to the outer table and the condition becomes trivially true.
This behavior is by design and is explained in the KB article below

http://support.microsoft.com/kb/298674/en-us

Rewriting the query as below, you can see that it will complain about the wrong column

delete t
from @t t
where t.id in (select t1.id from @t1 t1)

This avoids any possibility of the table getting emptied as happened above.
So the bottom line is: always fully qualify your column references.
Hopefully this gives you some insight into the things to be taken care of while developing and implementing delete queries, and helps you avoid any unintended deletions.

Tuesday, July 29, 2014

SSIS Tips: Extract and Populate Multiple Table Relational Data from Multipart File

I'm back with another scenario regarding text file data transfer.
The impetus for writing this blog was a recent mail I got from one of my blog followers, describing a scenario for which he wanted a solution.
The scenario was a multipart file containing the details of orders. The file has three different types of rows, carrying the order header information, the order details and the order summary information. The metadata of the three row types is different, so it was not possible to use a standard data flow task with a flat file source to process the file and get the data out of it. Also, the file data has to go into three separate tables in the database which have referential integrity set up through foreign key constraints. As always with requirements of this type, they had little control over how the file was being generated at the source.
I suggested a workaround for handling the scenario, which I'm sharing here.
The file looked like this
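(The original post showed a screenshot of the file; it isn't reproduced here, so the extract below uses made-up values laid out in the format described next.)

H,ORD1001,2014-07-01,ABC Traders
D,Stapler,P-100,10,5.25,1
D,Printer paper,P-200,20,3.10,2
S,2014-07-20,114.50
H,ORD1002,2014-07-03,XYZ Retail
D,Toner cartridge,P-300,2,45.00,1
S,2014-07-12,90.00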
As you can see from the above, the file consists of three types of rows. The rows starting with H carry the order header information and consist of values for the OrderNo, OrderDate and CustomerName columns. The rows starting with D hold the details of the order, which include the ItemName, the PartNo of the item, the Quantity ordered, the UnitPrice of the item and the Shipset number (a shipset represents a unit of delivery for the order). Finally, the rows starting with S represent the order summary information with the latest delivery date (of the last shipset) and the total order value. So a group of rows from an H row up to the following S row represents a single order, and all of them have to be stored against the same reference ID value (OrderHeaderID).
The data from the above file has to go into the below tables in database
OrderHeader (OrderHeaderID (PK IDENTITY),OrderNo,OrderDate,CustomerName)
OrderDetails (OrderHeaderID(FK to OrderHeader),ItemName,PartNo,Qty,UnitPrice,ShipsetNo)
OrderSummary(OrderHeaderID(FK to OrderHeader),LatestDeliveryDate,OrderTotal)
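The original post doesn't give the exact DDL, so below is a possible sketch of these tables, plus the staging table referenced in the steps further down (all data types are assumptions):

CREATE TABLE OrderHeader
(
OrderHeaderID int IDENTITY(1,1) PRIMARY KEY,
OrderNo varchar(50),
OrderDate datetime,
CustomerName varchar(100)
)

CREATE TABLE OrderDetails
(
OrderHeaderID int FOREIGN KEY REFERENCES OrderHeader(OrderHeaderID),
ItemName varchar(100),
PartNo varchar(50),
Qty int,
UnitPrice decimal(15,2),
ShipsetNo int
)

CREATE TABLE OrderSummary
(
OrderHeaderID int FOREIGN KEY REFERENCES OrderHeader(OrderHeaderID),
LatestDeliveryDate datetime,
OrderTotal decimal(15,2)
)

-- Staging table used by the package: each file row lands in DataColumn as a
-- single comma-separated string, and Grp is filled in later to tag all the
-- rows belonging to one order.
CREATE TABLE StagingTable
(
RowID int IDENTITY(1,1) PRIMARY KEY,
DataColumn varchar(max),
Grp int NULL
)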

The package for this looks like below

The steps involved are
1. Execute SQL task for truncating and preparing StagingTable for data load. The Staging table will have a RowID column with IDENTITY property 
2. Data Flow Task to load the staging table with file data into a single data column in comma separated format
3. Execute SQL task to generate the group IDs that identify which header, detail and summary rows belong to each group. This is done in two steps: first the header rows are assigned a sequential value, which is then applied to the following detail and summary rows.
The logic looks like below
UPDATE   t
SET Grp =Seq
FROM
(
SELECT *,ROW_NUMBER() OVER (ORDER BY RowID) AS Seq
FROM StagingTable
WHERE LEFT(DataColumn,CHARINDEX(',',DataColumn)-1) = 'H'
)t

UPDATE t
SET Grp= t1.Grp
FROM StagingTable t
CROSS APPLY (SELECT TOP 1 Grp
FROM StagingTable
WHERE Grp IS NOT NULL
AND RowID < t.RowID
ORDER BY RowID DESC) t1
WHERE t.Grp IS NULL

The logic orders the records on the basis of the RowID value, generates a sequence value using the ROW_NUMBER function and assigns it to the header (H) rows. This is followed by a further update statement which assigns the group ID to the corresponding detail and summary rows that come after each header row. So at the end of this step all rows belonging to a single order get the same group ID value. This is required for applying the newly generated OrderHeaderID (IDENTITY value) from the OrderHeader table to each of the detail and summary rows of the order during the final data population.
4. Execute SQL task to execute the procedure that does the final data load, with logic as below

CREATE PROC PopulateOrderTableData
AS

DECLARE @INSERTED_ORDERS table
(
OrderHeaderID int,
GrpID int
)

IF OBJECT_ID('tempdb..#Headers') IS NOT NULL
DROP TABLE #Headers
IF OBJECT_ID('tempdb..#Details') IS NOT NULL
DROP TABLE #Details
IF OBJECT_ID('tempdb..#Summary') IS NOT NULL
DROP TABLE #Summary

SELECT RowID,Grp,[1],[2],[3],[4]
INTO #Headers
FROM StagingTable s
CROSS APPLY dbo.ParseValues(s.DataColumn,',')f
PIVOT (MAX(Val) FOR ID IN ([1],[2],[3],[4]))p
WHERE [1] = 'H'

SELECT RowID,Grp,[1],[2],[3],[4],[5],[6]
INTO #Details
FROM StagingTable s
CROSS APPLY dbo.ParseValues(s.DataColumn,',')f
PIVOT (MAX(Val) FOR ID IN ([1],[2],[3],[4],[5],[6]))p
WHERE [1] = 'D'

SELECT RowID,Grp,[1],[2],[3]
INTO #Summary
FROM StagingTable s
CROSS APPLY dbo.ParseValues(s.DataColumn,',')f
PIVOT (MAX(Val) FOR ID IN ([1],[2],[3]))p
WHERE [1] = 'S'

MERGE INTO OrderHeader t
USING #Headers s
ON 1=2 
WHEN NOT MATCHED THEN
INSERT (OrderNo ,OrderDate ,CustomerName)
VALUES (s.[2],s.[3],s.[4])
OUTPUT INSERTED.OrderHeaderID,s.Grp INTO @INSERTED_ORDERS;

INSERT OrderDetails
SELECT i.OrderHeaderID,[2],[3],[4],[5],[6] 
FROM #Details d
INNER JOIN @INSERTED_ORDERS i
ON i.GrpID = d.Grp

INSERT OrderSummary
SELECT i.OrderHeaderID,[2],[3] 
FROM #Summary d
INNER JOIN @INSERTED_ORDERS i
ON i.GrpID = d.Grp

The procedure creates three temporary tables #Headers, #Details and #Summary and populates them with the header, detail and summary rows of the file data. The logic makes use of the UDF ParseValues, explained here http://visakhm.blogspot.in/2010/02/parsing-delimited-string.html,
to parse the comma-separated data in the staging table field into separate values, which are then pivoted into multiple columns.
It then makes use of a dummy MERGE statement (ON 1=2, so every source row falls into the NOT MATCHED branch) to do the insertion into the OrderHeader table, capturing the generated OrderHeaderID values along with the corresponding Grp values into a table variable. The MERGE statement is necessary because we need to capture the Grp value from the source table along with the IDENTITY value using the OUTPUT clause, and this is possible only through MERGE. Once this is done, the remaining tables are populated from the corresponding temporary tables, using a join to the table variable on the Grp value to bring the generated identity values into the child tables.
At the end you can run a set of select statements in your table and see the populated data as below

As seen from the above, the data gets populated to our tables as expected.
This is a good approach you can use to extract data out of a multipart file and populate a set of related tables in the database while maintaining referential integrity.
The main point to remember here is the generation of a group ID value to keep the rows of the same order together, so that the generated OrderHeaderID gets assigned correctly to each of them.

The package and sample file used in the illustration can be downloaded from the below link

Package :

Sample file:

Feel free to get back for any comments/clarification you may have on the above