Monday, July 8, 2013

T-SQL to DAX: Simulating PERCENTILE_CONT analytic function

PERCENTILE_CONT is used for calculating a percentile based on a continuous distribution of the column value. 
From ORACLE SQL documentation:
The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile
value (P) and the number of rows (N) in the aggregation group, we compute the row number we are interested in after ordering the
rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1+ (P*(N-1)). The
final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN =
CEILING(RN) and FRN = FLOOR(RN).
 
The final result will be:
 If (CRN = FRN = RN) then the result is
    (value of expression from row at RN)
  Otherwise the result is
    (CRN – RN) * (value of expression for row at FRN) +
    (RN – FRN) * (value of expression for row at CRN)


Now lets see how we can simulate this in DAX by the illustration below
Consider the following Powerpivot sheet


The DAX calculations used are as follows

RowNo is generated as below

=CALCULATE(DISTINCTCOUNT(StudentTable[ID])+1,FILTER(StudentTable,EARLIER(StudentTable[Subject])=StudentTable[Subject]  && (EARLIER(StudentTable[Marks])

CALCULATE function is used for applying DISTINCTCOUNT over resultset obtained by filtering StudentTable based on Subject group and finding count of records with Marks > current rows Mark value and in case Marks are same ordering based on ID value.

Once RowNo is determined we can apply the formula to find out RN from the formula for PERCENTILE_CONT as

=1+(0.9 *(CALCULATE(COUNTROWS(StudentTable),FILTER(StudentTable,EARLIER(StudentTable[Subject]) = StudentTable[Subject]))-1)) giving value as 4.6

Then find out ceiling and floor of the RN value by applying corresponding functions

=CEILING([RN],1) giving value as 5

=FLOOR([RN],1) with value 4

Once this is done, we need to determine value at ceiling and floor rows using the below formulas

Floor value is calculated as

=SUMX(FILTER(StudentTable,StudentTable[Row_No]=StudentTable[FRN]  && EARLIER(StudentTable[Subject])=StudentTable[Subject]),StudentTable[Marks]) * (StudentTable[CRN]-StudentTable[RN])

and Ceiling value as

=SUMX(FILTER(StudentTable,StudentTable[Row_No]=StudentTable[CRN]  && EARLIER(StudentTable[Subject])=StudentTable[Subject]),StudentTable[Marks]) * (StudentTable[RN]-StudentTable[FRN])

Final PERCENTILE_CONT can be calculated as sum of these two values

=[Value_CRN]+[Value_FRN]

Now lets try the corresponding calculation in a T-SQL query

DECLARE @Students table
(
ID int,
Name varchar(100),
[Subject] varchar(20),
Marks int
)
INSERT @Students
VALUES
(1, 'Praveen', 'Maths', 98),
(3, 'Mohan', 'Physics', 88),
(5, 'Mary', 'Maths', 92),
(9, 'Mathew', 'Physics', 90),
(11, 'Prithi', 'Maths', 85),
(17 ,'Mithun', 'Maths', 77),
(23, 'Martin', 'Physics', 82),
(25, 'Kiran', 'Maths', 93),
(27, 'Kumar', 'Chemistry' ,75),
(30, 'Sita', 'Physics', 88),
(32, 'Jophin', 'Chemistry', 62),
(33, 'Gladis', 'Chemistry', 83),
(35, 'Kunal', 'Chemistry', 88),
(36, 'Manu', 'Physics' ,93),
(40, 'Madhur', 'Chemistry' ,91)


SELECT *,
PERCENTILE_CONT(0.9) WITHIN GROUP( ORDER BY Marks DESC)  OVER (PARTITION BY [Subject]) AS PercentileCont
FROM @Students
ORDER BY ID

The result is as follows



As we see from above, the values are same as what we got earlier indicating that the formula and simulation is correct