*V* for a given percentile *p* in a given set of numbers. We looked at the mathematical formula that Excel uses for its PERCENTILE() function and then put together some T-SQL scripts that used those formulas.

In this blog entry, we are going to put together the *inverse* of the PERCENTILE() function, which Excel calls the PERCENTRANK() function. In other words, given a value *V*, we are going to calculate its percentile *p*.

For those of you who can’t get enough of algebra and logic, you can look at the mathematical formula presented in the last blog entry and see how the following formula for PERCENTRANK() is derived.

Given a set of *N* ordered values, {*v[1], v[2], …, v[N]*}, the percentile *p* of the value *V* is calculated as follows:

Find the first value *v[k]* that is exactly equal to *V*. If found, set *d* equal to 0.00. Otherwise, find the two adjacent values *v[k]* and *v[k+1]* such that *v[k]* < *V* < *v[k+1]*, and then calculate *d* as follows:

*d = (V – v[k]) / (v[k+1] – v[k])*

Using these values of *k* and *d*, the percentile can be calculated:

*p = (k + d – 1) / (N – 1)*

As an example, let’s try the value of 11.75 and see what percentile it represents in the set {1, 5, 9, 20}. Since there are 4 values in the set, *N* is equal to 4. Our value *V* of 11.75 is between 9 and 20, which are the 3rd and 4th entries in the ordered set, so *k* is 3 and *k+1* is 4. And so…

*d = (11.75 – v[3]) / (v[4] – v[3]) = (11.75 – 9) / (20 – 9) = 0.25*

And therefore…

*p = (3 + 0.25 – 1) / (4 – 1) = 0.75*

So 11.75 represents the 75th percentile of {1,5,9,20}.
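For readers who want to check the arithmetic outside of Excel or SQL Server, the formula can be sketched in a few lines of Python. This is my own sketch (the function name `percent_rank` is not from any library), and it assumes *V* lies between the minimum and maximum of the set:

```python
def percent_rank(values, v):
    """Excel-style PERCENTRANK: percentile p of value v within a set.

    Assumes min(values) <= v <= max(values)."""
    s = sorted(values)
    n = len(s)
    if v in s:
        # v[k] found exactly: d = 0.00, k is the 1-based position of the match
        k = s.index(v) + 1
        d = 0.0
    else:
        # adjacent v[k] < v < v[k+1]: interpolate d between them
        k = sum(1 for x in s if x < v)          # 1-based index of v[k]
        d = (v - s[k - 1]) / (s[k] - s[k - 1])
    return (k + d - 1) / (n - 1)

print(percent_rank([1, 5, 9, 20], 11.75))  # 0.75, matching the worked example
```

Running it on the example set reproduces the hand calculation above: *k* = 3, *d* = 0.25, *p* = 0.75.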

Before attacking this in T-SQL, let’s again build a temporary table called #Set which contains our 4 values:

if object_id('tempdb..#Set') is not null drop table #Set
go
create table #Set (SetValue int)
go
insert #Set select 1
union all select 5
union all select 9
union all select 20

The hardest part of this is determining where the value *V* falls in the set of values. Let’s attack the problem by taking the original set and creating a derived table that consists of the ordered ranges of those values. First a CTE called SequencedData orders the items in the set; then we JOIN that CTE with itself in order to create ranges of adjacent values in the set:

;with SequencedData as
(
select SetValue
      ,SeqNo=row_number() over (order by SetValue)
from #Set
)
select FromValue=Cur.SetValue
      ,ToValue=Nxt.SetValue
      ,FromSeqNo=Cur.SeqNo
      ,ToSeqNo=Nxt.SeqNo
from SequencedData Cur
full join SequencedData Nxt on Cur.SeqNo=Nxt.SeqNo-1
/*
FromValue  ToValue  FromSeqNo  ToSeqNo
---------- -------- ---------- --------
NULL       1        NULL       1
1          5        1          2
5          9        2          3
9          20       3          4
20         NULL     4          NULL
*/

So we now know that the values 1 thru 5 represent the 1st and 2nd items in the set, values 5 thru 9 represent the 2nd and 3rd items, and so on. Now that we have that information, we can find out which range contains our value *V*:

declare @V decimal(7,2)
set @V=11.75
;with SequencedData as
(
select SetValue
      ,SeqNo=row_number() over (order by SetValue)
from #Set
)
,DataRanges as
(
select FromValue=Cur.SetValue
      ,ToValue=Nxt.SetValue
      ,FromSeqNo=Cur.SeqNo
      ,ToSeqNo=Nxt.SeqNo
from SequencedData Cur
full join SequencedData Nxt on Cur.SeqNo=Nxt.SeqNo-1
)
select FromValue
      ,ToValue
      ,FromSeqNo
      ,ToSeqNo
from DataRanges
where (FromValue is null or @V>FromValue)
  and (@V<=ToValue or ToValue is null)
/*
FromValue  ToValue  FromSeqNo  ToSeqNo
---------- -------- ---------- --------
9          20       3          4
*/

So our value of 11.75 lies between 9 and 20, which are the 3rd and 4th items.

Notice that the WHERE clause specifically asked for @V to be less than OR EQUAL TO ToValue. This is how we can see whether our value *V* is actually IN the set of values. If @V is exactly equal to ToValue, then we can set *k* equal to ToSeqNo and set *d* equal to 0.00. If they are not equal, then @V falls between FromValue and ToValue, so *k* is equal to FromSeqNo and *k+1* is equal to ToSeqNo, and we can calculate *d* as indicated by our mathematical formula. Then, once we have *k* and *d*, we can calculate our percentile *p*:

declare @V decimal(7,2)
set @V=11.75
;with SequencedData as
(
select SetValue
      ,SeqNo=row_number() over (order by SetValue)
from #Set
)
,DataRanges as
(
select FromValue=Cur.SetValue
      ,ToValue=Nxt.SetValue
      ,FromSeqNo=Cur.SeqNo
      ,ToSeqNo=Nxt.SeqNo
from SequencedData Cur
full join SequencedData Nxt on Cur.SeqNo=Nxt.SeqNo-1
)
,TargetRange as
(
select FromValue
      ,ToValue
      ,FromSeqNo
      ,ToSeqNo
from DataRanges
where (FromValue is null or @V>FromValue)
  and (@V<=ToValue or ToValue is null)
)
select p
from TargetRange
cross apply (select N=count(*) from #Set) F1
cross apply (select k=case
                       when ToValue=@V
                       then ToSeqNo
                       else FromSeqNo
                     end
                   ,d=case
                       when ToValue=@V
                       then 0.00
                       else 1.0*(@V-FromValue)/(ToValue-FromValue)
                     end) F2
cross apply (select p=(k+d-1)/(N-1)) F3
/*
p
--------------------
0.75000000000000000
*/

Well, that looks great. However, once again, the query plan is not so great: JOINing the SequencedData CTE to itself to get the ranges forces SQL Server to sort the table 2 times.

Consider another approach… Our problem is that we’re trying to find the values *v[k]* and *v[k+1]* such that *v[k]* < *V* <= *v[k+1]*. But if you sit and think about it, *v[k]* is simply the maximum value in the entire set that is less than *V*, and *v[k+1]* is simply the minimum value in the entire set that is greater than or equal to *V*. If we find that *v[k+1]* is exactly equal to *V*, then we can set *k* equal to the quantity of entries in the set that are less than *V*, plus 1. Otherwise, if *v[k+1]* is greater than *V*, then we can set *k* equal to the quantity of entries less than *V*. We have all the information we need right there without having to sort anything… we can just use aggregate functions.
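That aggregate-only reasoning can be expressed as a short Python sketch (a sketch of mine, with names like `qty_less` mirroring the query’s columns; it makes the same assumption as before, that *V* is within the range of the set):

```python
def percent_rank_agg(values, v):
    """PERCENTRANK via counts and min/max aggregates: no sorting required."""
    n = len(values)
    qty_less = sum(1 for x in values if x < v)    # entries below V
    qty_equal = sum(1 for x in values if x == v)  # entries equal to V
    if qty_equal > 0:
        # V is in the set: k = (count below V) + 1, d = 0.00
        k, d = qty_less + 1, 0.0
    else:
        vk = max(x for x in values if x < v)      # v[k]: largest value below V
        vk1 = min(x for x in values if x >= v)    # v[k+1]: smallest value at or above V
        k, d = qty_less, (v - vk) / (vk1 - vk)
    return (k + d - 1) / (n - 1)

print(percent_rank_agg([9, 1, 20, 5], 11.75))  # 0.75 (note: input order doesn't matter)
```

Because every quantity comes from a single pass of counts and extremes, the input never needs to be ordered, which is exactly what lets the T-SQL version below avoid the sorts.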

So here’s a new query that addresses all of the above. The first CTE calculates *N*, *v[k]*, and *v[k+1]*, and it also calculates the number of entries that are less than *V* and the number that are equal to *V*. Then the main query calculates *k* and *d* based on whether *V* is in the set of values or not, and finally calculates the percentile *p*:

declare @V decimal(7,2)
set @V=11.75
;with ValuesAndCounts as
(
select N=count(*)
      ,[v(k)]=max(case when SetValue<@V then SetValue end)
      ,[v(k+1)]=min(case when @V<=SetValue then SetValue end)
      ,QtyLessThan=sum(case when SetValue<@V then 1 else 0 end)
      ,QtyEqualTo=sum(case when SetValue=@V then 1 else 0 end)
from #Set
)
select p
from ValuesAndCounts
cross apply (select k=case
                       when QtyEqualTo>0
                       then QtyLessThan+1
                       else QtyLessThan
                     end
                   ,d=case
                       when QtyEqualTo>0
                       then 0.00
                       else 1.0*(@V-[v(k)])/([v(k+1)]-[v(k)])
                     end) F2
cross apply (select p=(k+d-1)/(N-1)) F3
/*
p
---------------------------
0.750000000000000000000000
*/

I ran a test on a set of 100,000 values, comparing this query to the self-JOINing ROW_NUMBER() CTE approach, and this query performed much better: CPU decreased by 67%, Reads decreased by 67%, and Execution Time decreased by 75%.

Now that we have the basic query approach in place, we can apply this concept to the AdventureWorks database. In my last blog entry, we calculated both the median (50th percentile) and the mean (the T-SQL AVG()) of Total Sales Dollars By Customer (only looking at Customers who were Stores rather than Individuals). We found that the median was $62,257 and the mean was $170,498.

So what percentile does that mean value of $170,498 represent?

declare @V money
set @V=170498
;with BaseData as
(
select c.CustomerID
      ,CustDollars=sum(h.TotalDue)
from Sales.SalesOrderHeader h
join Sales.Customer c on h.CustomerID=c.CustomerID
where c.CustomerType='S'
group by c.CustomerID
)
,ValuesAndCounts as
(
select N=count(*)
      ,[v(k)]=max(case when CustDollars<@V then CustDollars end)
      ,[v(k+1)]=min(case when @V<=CustDollars then CustDollars end)
      ,QtyLessThan=sum(case when CustDollars<@V then 1 else 0 end)
      ,QtyEqualTo=sum(case when CustDollars=@V then 1 else 0 end)
from BaseData
)
select p
from ValuesAndCounts
cross apply (select k=case
                       when QtyEqualTo>0
                       then QtyLessThan+1
                       else QtyLessThan
                     end
                   ,d=case
                       when QtyEqualTo>0
                       then 0.00
                       else 1.0*(@V-[v(k)])/([v(k+1)]-[v(k)])
                     end) F2
cross apply (select p=(k+d-1)/(N-1)) F3
/*
p
--------------------
0.66802891020021456
*/

So $170,498 is between the 66th and 67th percentile… it represents a value that is higher than 66.8% of all Customers.

Everything looks great… All is well with the world… we have successfully duplicated the Excel PERCENTRANK() function.

But…

Don't sit back and relax yet.

I personally have a problem with the way that Excel calculates it… specifically, I don’t like the way it handles (or *doesn’t* handle) the value *V* being in the set multiple times. Let me give you an example. Let’s create a table of scores for the SAT test. Only 100 people took the test… 5 of them got the lowest score you can possibly get on the SAT (200), and 5 of them got the highest possible score (2400). The remaining 90 people got a very respectable score of 1900:

if object_id('tempdb..#SAT') is not null drop table #SAT
go
create table #SAT (Score int)
go
insert #SAT
select 200
go 5 --SpongeBob,Curly,Sluggo,Gilligan,Shemp
insert #SAT
select 2400
go 5 --Plato,Galileo,Leonardo,Wolfgang,Albert
insert #SAT
select 1900
go 90 --The rest of us

Now we have a set of 100 values: {200, 200, 200, 200, 200, 1900, 1900, …, 1900, 1900, 2400, 2400, 2400, 2400, 2400}. What is the 80th percentile of this set? It’s 1900. What is the 60th percentile? Also 1900. So are the 20th and 40th percentiles. In fact, EVERY percentile from 6 to 94 has a value of 1900.

So what does Excel (and our new query) think the percentile represented by 1900 is?

declare @V int
set @V=1900
;with ValuesAndCounts as
(
select N=count(*)
      ,[v(k)]=max(case when Score<@V then Score end)
      ,[v(k+1)]=min(case when @V<=Score then Score end)
      ,QtyLessThan=sum(case when Score<@V then 1 else 0 end)
      ,QtyEqualTo=sum(case when Score=@V then 1 else 0 end)
from #SAT
)
select p
from ValuesAndCounts
cross apply (select k=case
                       when QtyEqualTo>0
                       then QtyLessThan+1
                       else QtyLessThan
                     end
                   ,d=case
                       when QtyEqualTo>0
                       then 0.00
                       else 1.0*(@V-[v(k)])/([v(k+1)]-[v(k)])
                     end) F2
cross apply (select p=(k+d-1)/(N-1)) F3
/*
p
--------------------------
0.05050505050505050505050
*/

It thinks that 1900 is just a smidgen above the 5th percentile. I guess you could argue that this is correct, since 1900 was only better than 5 scores out of 100. But it was better than OR EQUAL TO **95** scores out of 100. Shouldn’t that be taken into account somehow? It seems like there should be a compromise of some sort, considering that 1900 represents *every* percentile from 6 to 94.

By making a small adjustment to the query (where *d* is calculated), we can come to such a compromise:

declare @V int
set @V=1900
;with ValuesAndCounts as
(
select N=count(*)
      ,[v(k)]=max(case when Score<@V then Score end)
      ,[v(k+1)]=min(case when @V<=Score then Score end)
      ,QtyLessThan=sum(case when Score<@V then 1 else 0 end)
      ,QtyEqualTo=sum(case when Score=@V then 1 else 0 end)
from #SAT
)
select p
from ValuesAndCounts
cross apply (select k=case
                       when QtyEqualTo>0
                       then QtyLessThan+1
                       else QtyLessThan
                     end
                   ,d=case
                       when QtyEqualTo>0
                       then (QtyEqualTo-1)/2.0 --instead of 0.00
                       else 1.0*(@V-[v(k)])/([v(k+1)]-[v(k)])
                     end) F2
cross apply (select p=(k+d-1)/(N-1)) F3
/*
p
--------------------------
0.50000000000000000000000
*/

That’s a better answer, in my opinion. Those 90 people who took the test and got a 1900 all fell *as a group* in the middle of the set of *all* scores, so we can report back to them that they were in the 50th percentile.
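The tie-handling adjustment is just as easy to express in a quick Python sketch of mine: the only difference from the Excel-style calculation is the *d* used when the value is tied, which centers the tied group instead of pinning it to its first occurrence:

```python
def percent_rank_ties(values, v):
    """PERCENTRANK variant that centers duplicate values within their tied group."""
    n = len(values)
    qty_less = sum(1 for x in values if x < v)
    qty_equal = sum(1 for x in values if x == v)
    if qty_equal > 0:
        # adjusted: d = (QtyEqualTo - 1) / 2 rather than 0.00
        k, d = qty_less + 1, (qty_equal - 1) / 2.0
    else:
        vk = max(x for x in values if x < v)      # v[k]
        vk1 = min(x for x in values if x >= v)    # v[k+1]
        k, d = qty_less, (v - vk) / (vk1 - vk)
    return (k + d - 1) / (n - 1)

sat = [200] * 5 + [1900] * 90 + [2400] * 5
print(percent_rank_ties(sat, 1900))  # 0.5
```

With no duplicates the result is unchanged from the Excel formula, since `qty_equal` of 1 makes the adjustment term zero.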

I hope you’ve enjoyed the exercise of implementing (and improving) Excel’s PERCENTILE() and PERCENTRANK() functions in T-SQL. These statistical concepts can be very useful in analyzing your data. I also urge you to explore the web, as there are other alternate formulas out there for calculating percentiles and percentile ranks. Most use the same core approach, but differ in how they perform interpolations and minor calculations. You'll also come to appreciate why there is no agreed-upon definition of how to calculate percentiles.
