Saturday, May 25, 2013

Enforcing effective data validation checks in T-SQL

I’ve seen on quite a few occasions of people asking ways to validate the source data especially in integration and ETL type of projects. The usual suggestion in those cases is to leverage upon the set of standard validation functions that SQL Server has provided like ISDATE, ISNUMERIC etc. Whilst this sounds sensible, most people are unaware of an inherent issue that this may introduce. This blog is focused on explaining the issue and gives the readers a better insight on the correct implementation of data validation logics.
Let’s start with the ISDATE() function usage.
Consider a scenario where a date field is coming from a source system csv extract in YYYYMMDD format and requirement to check for valid date format before populating the data to our system. The common approach to this problem is to implement a filter based logic that utilizes the ISDATE function to cleanse out the invalid date values. The filter would look like
WHERE ISDATE(DateField) = 1
Though we expect the above function to strip out all invalid date entries, it doesn’t exactly work that way. To illustrate, let’s assume two rows in source file with incomplete date values say 2012 and 201301. Let’s now try parsing them using above function

Description: cid:image001.png@01CE587B.ED424CD0

As we see it clears both the values as valid dates whereas as per our definition they are incomplete invalid values. The reason why it finds them as valid is because ISDATE tries to convert it to valid date format. So even an incomplete value of 2012 gets interpreted as date of 1905-07-06 and 201103 as 2020-11-03. This can be confirmed by below query.

So inorder for us to validate date functions effectively as per our business logic we need to add additional conditions based on date value lengths, formats etc.. That’s why ISDATE is not fully reliable for date value validations.
Similar is the case with ISNUMERIC function. Consider source files containing some records with some values like d, $ , - etc. coming for the Numeric field. Let’s see how ISNUMERIC validates them by using below query

As you see in all the above cases ISNUMERIC finds them as valid as the values are all patterns which are allowed under Numeric data type, first value closely resembling  alternate scientific notation (refer  and other symbols used as part of numeric values to denote currency, thousand separator etc.
This would compel us to put additional checks to ensure values are in exact format as per our business rule like filter presence of d, e characters, checking numeric data length etc. hence ISNUMERIC on most occasions is not fully reliable for doing efficient numeric data validation as per our business rule.
The conclusion is that functions like ISDATE, ISNUMERIC etc. provides us with generic validation rules for corresponding datatypes however we may need to extend and add additional checks to suit our specific business requirements.
Please let me know your valuable comments or any additional clarification you may require on the above.