*V* for a given percentile *p* in a given set of numbers. We looked at the mathematical formula that Excel uses for its PERCENTILE() function and then put together some T-SQL scripts that used those formulas.

In this blog entry, we are going to put together the *inverse* of the PERCENTILE() function, which Excel calls the PERCENTRANK() function. In other words, given a value *V*, we are going to calculate its percentile *p*.

For those of you who can’t get enough of algebra and logic, you can look at the mathematical formula presented in the last blog entry and see how the following formula for PERCENTRANK() is derived.

Given a set of *N* ordered values, {*v[1], v[2], …, v[N]*}, the percentile *p* of the value *V* is calculated as follows:

Find the first value *v[k]* that is exactly equal to *V*. If found, set *d* equal to 0.00. Otherwise, find the two adjacent values *v[k]* and *v[k+1]* such that *v[k]* < *V* < *v[k+1]*, and then calculate *d* as follows:

*d = (V – v[k]) / (v[k+1] – v[k])*

Using these values of *k* and *d*, the percentile can be calculated:

*p = (k + d – 1) / (N – 1)*

As an example, let’s try the value of 11.75 and see what percentile it represents in the set {1, 5, 9, 20}. Since there are 4 values in the set, *N* is equal to 4. Our value *V* of 11.75 is between 9 and 20, which are the 3rd and 4th entries in the ordered set, so *k* is 3 and *k+1* is 4. And so…

*d = (11.75 – v[3]) / (v[4] – v[3]) = (11.75 – 9) / (20 – 9) = 0.25*

And therefore…

*p = (3 + 0.25 – 1) / (4 – 1) = 0.75*

So 11.75 represents the 75th percentile of {1,5,9,20}.
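For readers who want to check the arithmetic outside of Excel or SQL Server, the formula can be sketched in a few lines of Python. This is my own sketch (the function name `percent_rank` is not from any library), and it assumes *V* lies between the minimum and maximum of the set:

```python
def percent_rank(values, v):
    """Excel-style PERCENTRANK: percentile p of value v within a set.

    Assumes min(values) <= v <= max(values)."""
    s = sorted(values)
    n = len(s)
    if v in s:
        # v[k] found exactly: d = 0.00, k is the 1-based position of the match
        k = s.index(v) + 1
        d = 0.0
    else:
        # adjacent v[k] < v < v[k+1]: interpolate d between them
        k = sum(1 for x in s if x < v)          # 1-based index of v[k]
        d = (v - s[k - 1]) / (s[k] - s[k - 1])
    return (k + d - 1) / (n - 1)

print(percent_rank([1, 5, 9, 20], 11.75))  # 0.75, matching the worked example
```

Running it on the example set reproduces the hand calculation above: *k* = 3, *d* = 0.25, *p* = 0.75.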

Before attacking this in T-SQL, let’s again build a temporary table called #Set which contains our 4 values:

if object_id('tempdb..#Set') is not null drop table #Set
go
create table #Set (SetValue int)
go
insert #Set select 1
union all select 5
union all select 9
union all select 20

The hardest part of this is determining where the value *V* falls in the set of values. Let’s attack the problem by taking the original set and creating a derived table that consists of the ordered ranges of those values. First a CTE called SequencedData orders the items in the set; then we JOIN that CTE with itself in order to create ranges of adjacent values in the set:

;with SequencedData as
(
select SetValue
      ,SeqNo=row_number() over (order by SetValue)
from #Set
)
select FromValue=Cur.SetValue
      ,ToValue=Nxt.SetValue
      ,FromSeqNo=Cur.SeqNo
      ,ToSeqNo=Nxt.SeqNo
from SequencedData Cur
full join SequencedData Nxt on Cur.SeqNo=Nxt.SeqNo-1
/*
FromValue  ToValue  FromSeqNo  ToSeqNo
---------- -------- ---------- --------
NULL       1        NULL       1
1          5        1          2
5          9        2          3
9          20       3          4
20         NULL     4          NULL
*/

So we now know that the values 1 thru 5 represent the 1st and 2nd items in the set, values 5 thru 9 represent the 2nd and 3rd items, and so on. Now that we have that information, we can find out which range contains our value *V*:

declare @V decimal(7,2)
set @V=11.75
;with SequencedData as
(
select SetValue
      ,SeqNo=row_number() over (order by SetValue)
from #Set
)
,DataRanges as
(
select FromValue=Cur.SetValue
      ,ToValue=Nxt.SetValue
      ,FromSeqNo=Cur.SeqNo
      ,ToSeqNo=Nxt.SeqNo
from SequencedData Cur
full join SequencedData Nxt on Cur.SeqNo=Nxt.SeqNo-1
)
select FromValue
      ,ToValue
      ,FromSeqNo
      ,ToSeqNo
from DataRanges
where (FromValue is null or @V>FromValue)
  and (@V<=ToValue or ToValue is null)
/*
FromValue  ToValue  FromSeqNo  ToSeqNo
---------- -------- ---------- --------
9          20       3          4
*/

So our value of 11.75 lies between 9 and 20, which are the 3rd and 4th items.

Notice that the WHERE clause specifically asked for @V to be less than OR EQUAL TO ToValue. This is how we can see whether our value *V* is actually IN the set of values. If @V is exactly equal to ToValue, then we can set *k* equal to ToSeqNo and set *d* equal to 0.00. If they are not equal, then @V falls between FromValue and ToValue, so *k* is equal to FromSeqNo and *k+1* is equal to ToSeqNo, and we can calculate *d* as indicated by our mathematical formula. Then, once we have *k* and *d*, we can calculate our percentile *p*:

declare @V decimal(7,2)
set @V=11.75
;with SequencedData as
(
select SetValue
      ,SeqNo=row_number() over (order by SetValue)
from #Set
)
,DataRanges as
(
select FromValue=Cur.SetValue
      ,ToValue=Nxt.SetValue
      ,FromSeqNo=Cur.SeqNo
      ,ToSeqNo=Nxt.SeqNo
from SequencedData Cur
full join SequencedData Nxt on Cur.SeqNo=Nxt.SeqNo-1
)
,TargetRange as
(
select FromValue
      ,ToValue
      ,FromSeqNo
      ,ToSeqNo
from DataRanges
where (FromValue is null or @V>FromValue)
  and (@V<=ToValue or ToValue is null)
)
select p
from TargetRange
cross apply (select N=count(*) from #Set) F1
cross apply (select k=case
                       when ToValue=@V
                       then ToSeqNo
                       else FromSeqNo
                     end
                   ,d=case
                       when ToValue=@V
                       then 0.00
                       else 1.0*(@V-FromValue)/(ToValue-FromValue)
                     end) F2
cross apply (select p=(k+d-1)/(N-1)) F3
/*
p
--------------------
0.75000000000000000
*/

Well, that looks great. However, once again, the query plan is not so great: JOINing the SequencedData CTE to itself to get the ranges forces SQL Server to sort the table 2 times.

Consider another approach… Our problem is that we’re trying to find the values *v[k]* and *v[k+1]* such that *v[k]* < *V* <= *v[k+1]*. But if you sit and think about it, *v[k]* is simply the maximum value in the entire set that is less than *V*, and *v[k+1]* is simply the minimum value in the entire set that is greater than or equal to *V*. If we find that *v[k+1]* is exactly equal to *V*, then we can set *k* equal to the quantity of entries in the set that are less than *V*, plus 1. Otherwise, if *v[k+1]* is greater than *V*, then we can set *k* equal to the quantity of entries less than *V*. We have all the information we need right there without having to sort anything… we can just use aggregate functions.
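That aggregate-only reasoning can be expressed as a short Python sketch (a sketch of mine, with names like `qty_less` mirroring the query’s columns; it makes the same assumption as before, that *V* is within the range of the set):

```python
def percent_rank_agg(values, v):
    """PERCENTRANK via counts and min/max aggregates: no sorting required."""
    n = len(values)
    qty_less = sum(1 for x in values if x < v)    # entries below V
    qty_equal = sum(1 for x in values if x == v)  # entries equal to V
    if qty_equal > 0:
        # V is in the set: k = (count below V) + 1, d = 0.00
        k, d = qty_less + 1, 0.0
    else:
        vk = max(x for x in values if x < v)      # v[k]: largest value below V
        vk1 = min(x for x in values if x >= v)    # v[k+1]: smallest value at or above V
        k, d = qty_less, (v - vk) / (vk1 - vk)
    return (k + d - 1) / (n - 1)

print(percent_rank_agg([9, 1, 20, 5], 11.75))  # 0.75 (note: input order doesn't matter)
```

Because every quantity comes from a single pass of counts and extremes, the input never needs to be ordered, which is exactly what lets the T-SQL version below avoid the sorts.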

So here’s a new query that addresses all of the above. The first CTE calculates *N*, *v[k]*, and *v[k+1]*, and it also calculates the number of entries that are less than *V* and the number that are equal to *V*. Then the main query calculates *k* and *d* based on whether *V* is in the set of values or not, and finally calculates the percentile *p*:

declare @V decimal(7,2)
set @V=11.75
;with ValuesAndCounts as
(
select N=count(*)
      ,[v(k)]=max(case when SetValue<@V then SetValue end)
      ,[v(k+1)]=min(case when @V<=SetValue then SetValue end)
      ,QtyLessThan=sum(case when SetValue<@V then 1 else 0 end)
      ,QtyEqualTo=sum(case when SetValue=@V then 1 else 0 end)
from #Set
)
select p
from ValuesAndCounts
cross apply (select k=case
                       when QtyEqualTo>0
                       then QtyLessThan+1
                       else QtyLessThan
                     end
                   ,d=case
                       when QtyEqualTo>0
                       then 0.00
                       else 1.0*(@V-[v(k)])/([v(k+1)]-[v(k)])
                     end) F2
cross apply (select p=(k+d-1)/(N-1)) F3
/*
p
---------------------------
0.750000000000000000000000
*/

I ran a test on a set of 100,000 values, comparing this query to the self-JOINing ROW_NUMBER() CTE approach, and this query performed much better: CPU decreased by 67%, Reads decreased by 67%, and Execution Time decreased by 75%.

Now that we have the basic query approach in place, we can apply this concept to the AdventureWorks database. In my last blog entry, we calculated both the median (50th percentile) and the mean (the T-SQL AVG()) of Total Sales Dollars By Customer (only looking at Customers who were Stores rather than Individuals). We found that the median was $62,257 and the mean was $170,498.

So what percentile does that mean value of $170,498 represent?

declare @V money
set @V=170498
;with BaseData as
(
select c.CustomerID
      ,CustDollars=sum(h.TotalDue)
from Sales.SalesOrderHeader h
join Sales.Customer c on h.CustomerID=c.CustomerID
where c.CustomerType='S'
group by c.CustomerID
)
,ValuesAndCounts as
(
select N=count(*)
      ,[v(k)]=max(case when CustDollars<@V then CustDollars end)
      ,[v(k+1)]=min(case when @V<=CustDollars then CustDollars end)
      ,QtyLessThan=sum(case when CustDollars<@V then 1 else 0 end)
      ,QtyEqualTo=sum(case when CustDollars=@V then 1 else 0 end)
from BaseData
)
select p
from ValuesAndCounts
cross apply (select k=case
                       when QtyEqualTo>0
                       then QtyLessThan+1
                       else QtyLessThan
                     end
                   ,d=case
                       when QtyEqualTo>0
                       then 0.00
                       else 1.0*(@V-[v(k)])/([v(k+1)]-[v(k)])
                     end) F2
cross apply (select p=(k+d-1)/(N-1)) F3
/*
p
--------------------
0.66802891020021456
*/

So $170,498 is between the 66th and 67th percentile… it represents a value that is higher than 66.8% of all Customers.

Everything looks great… All is well with the world… we have successfully duplicated the Excel PERCENTRANK() function.

But…

Don't sit back and relax yet.

I personally have a problem with the way that Excel calculates it… specifically, I don’t like the way it handles (or *doesn’t* handle) the value *V* being in the set multiple times. Let me give you an example. Let’s create a table of scores for the SAT test. Only 100 people took the test… 5 of them got the lowest score you can possibly get on the SAT (200), and 5 of them got the highest possible score (2400). The remaining 90 people got a very respectable score of 1900:

if object_id('tempdb..#SAT') is not null drop table #SAT
go
create table #SAT (Score int)
go
insert #SAT
select 200
go 5 --SpongeBob,Curly,Sluggo,Gilligan,Shemp
insert #SAT
select 2400
go 5 --Plato,Galileo,Leonardo,Wolfgang,Albert
insert #SAT
select 1900
go 90 --The rest of us

Now we have a set of 100 values: {200, 200, 200, 200, 200, 1900, 1900, …, 1900, 1900, 2400, 2400, 2400, 2400, 2400}. What is the 80th percentile of this set? It’s 1900. What is the 60th percentile? Also 1900. So are the 20th and 40th percentiles. In fact, EVERY percentile from 6 to 94 has a value of 1900.

So what does Excel (and our new query) think the percentile represented by 1900 is?

declare @V int
set @V=1900
;with ValuesAndCounts as
(
select N=count(*)
      ,[v(k)]=max(case when Score<@V then Score end)
      ,[v(k+1)]=min(case when @V<=Score then Score end)
      ,QtyLessThan=sum(case when Score<@V then 1 else 0 end)
      ,QtyEqualTo=sum(case when Score=@V then 1 else 0 end)
from #SAT
)
select p
from ValuesAndCounts
cross apply (select k=case
                       when QtyEqualTo>0
                       then QtyLessThan+1
                       else QtyLessThan
                     end
                   ,d=case
                       when QtyEqualTo>0
                       then 0.00
                       else 1.0*(@V-[v(k)])/([v(k+1)]-[v(k)])
                     end) F2
cross apply (select p=(k+d-1)/(N-1)) F3
/*
p
--------------------------
0.05050505050505050505050
*/

It thinks that 1900 is just a smidgen above the 5th percentile. I guess you could argue that this is correct, since 1900 was only better than 5 scores out of 100. But it was better than OR EQUAL TO **95** scores out of 100. Shouldn’t that be taken into account somehow? It seems like there should be a compromise of some sort, considering that 1900 represents *every* percentile from 6 to 94.

By making a small adjustment to the query (where *d* is calculated), we can come to such a compromise:

declare @V int
set @V=1900
;with ValuesAndCounts as
(
select N=count(*)
      ,[v(k)]=max(case when Score<@V then Score end)
      ,[v(k+1)]=min(case when @V<=Score then Score end)
      ,QtyLessThan=sum(case when Score<@V then 1 else 0 end)
      ,QtyEqualTo=sum(case when Score=@V then 1 else 0 end)
from #SAT
)
select p
from ValuesAndCounts
cross apply (select k=case
                       when QtyEqualTo>0
                       then QtyLessThan+1
                       else QtyLessThan
                     end
                   ,d=case
                       when QtyEqualTo>0
                       then (QtyEqualTo-1)/2.0 --instead of 0.00
                       else 1.0*(@V-[v(k)])/([v(k+1)]-[v(k)])
                     end) F2
cross apply (select p=(k+d-1)/(N-1)) F3
/*
p
--------------------------
0.50000000000000000000000
*/

That’s a better answer, in my opinion. Those 90 people who took the test and got a 1900 all fell *as a group* in the middle of the set of *all* scores, so we can report back to them that they were in the 50th percentile.
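The tie-handling adjustment is just as easy to express in a quick Python sketch of mine: the only difference from the Excel-style calculation is the *d* used when the value is tied, which centers the tied group instead of pinning it to its first occurrence:

```python
def percent_rank_ties(values, v):
    """PERCENTRANK variant that centers duplicate values within their tied group."""
    n = len(values)
    qty_less = sum(1 for x in values if x < v)
    qty_equal = sum(1 for x in values if x == v)
    if qty_equal > 0:
        # adjusted: d = (QtyEqualTo - 1) / 2 rather than 0.00
        k, d = qty_less + 1, (qty_equal - 1) / 2.0
    else:
        vk = max(x for x in values if x < v)      # v[k]
        vk1 = min(x for x in values if x >= v)    # v[k+1]
        k, d = qty_less, (v - vk) / (vk1 - vk)
    return (k + d - 1) / (n - 1)

sat = [200] * 5 + [1900] * 90 + [2400] * 5
print(percent_rank_ties(sat, 1900))  # 0.5
```

With no duplicates the result is unchanged from the Excel formula, since `qty_equal` of 1 makes the adjustment term zero.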

I hope you’ve enjoyed the exercise of implementing (and improving) Excel’s PERCENTILE() and PERCENTRANK() functions in T-SQL. These statistical concepts can be very useful in analyzing your data. I also urge you to explore the web, as there are other alternate formulas out there for calculating percentiles and percentile ranks. Most use the same core approach, but differ in how they perform interpolations and minor calculations. You'll also come to appreciate why there is no agreed-upon definition of how to calculate percentiles.
