select *, GroupValue=ntile(100) over (order by Dollars)
from MyTable

…it will split the rows into 100 equal-sized subsets (of 1000 rows each) by sorting the rows by Dollars and then assigning a value of 1, 2, 3, … , 100 to the column GroupValue.

But despite the enticing name of the function, NTILE() doesn’t do anything for you as far as calculating *percentile* values. Percentiles are the 99 data values that mark the *boundaries* between those 100 subsets. For example, the 80th percentile marks the boundary where 80% of the values lie at or below it and 20% of the values lie at or above it. Perhaps the most “famous” of the percentile values is the 50th percentile, which you may know as the *median*.

There doesn’t seem to be any standard definition of *percentile*. Many references will say that the *n*th percentile has *n*% of the values *at or below* it (as I did above), and many will say that it has *n*% of the values just *below* it. But that’s kind of nitpicky stuff.

Unfortunately, though, there doesn’t seem to be a standard *calculation* of a percentile either. For example, let’s say you have a set of 3 numbers {1,5,9}. The 50th percentile is 5, because it’s right there in the middle. But what if you have a set of 4 numbers {1,5,9,20}? There is no number in the middle here, so you have to take the average of the middle two numbers. Therefore, the 50th percentile of this set is (5 + 9) / 2 = 7. That seems straightforward.
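That “average the middle two” behavior is what most statistics libraries implement for the median. As a quick cross-check (shown here in Python purely for illustration; any general-purpose language would do):

```python
import statistics

# Odd-sized set: the middle value is the median.
print(statistics.median([1, 5, 9]))       # 5

# Even-sized set: the median is the average of the two middle values.
print(statistics.median([1, 5, 9, 20]))   # 7.0, i.e. (5 + 9) / 2
```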

But now what’s the value of the 75th percentile? Is it the average of the 3rd and 4th numbers (which would be 14.5)? Or is it something else? This is where the definitions differ. For example, Microsoft Excel would calculate the 75th percentile of that set of 4 numbers to be 11.75.

For the purposes of this exercise, in order to be consistent across Microsoft products, I’ll use the definition that Microsoft Excel uses.

Excel has a PERCENTILE() function that accepts 2 parameters: an array, and a percentile value which must be in the range of 0 to 1, with 0.75 representing the 75th percentile, for example. Note that you can pass any real number between 0 and 1 to the function… If, for some bizarre reason, you wanted to find out the 56.27th percentile, you could pass 0.5627. For our familiar set of 4 numbers, =PERCENTILE({1,5,9,20},0.75) in Excel returns 11.75.

How does Excel calculate this function? Okay, here goes…

Given a set of *N* ordered values, {*v[1], v[2], … , v[N]*}, the value *V* of the percentile *p* (expressed as a decimal) is calculated as follows:

First, calculate an intermediate result:

*I = p(N – 1) + 1*

That intermediate result is split into an integer component, *k*, and a decimal component, *d*, such that *k + d = I*.

Now the value of the percentile can be calculated like so:

*V = v[k] + d(v[k+1] – v[k])*

Whew! That’s a lot of variables, but it’s easier when you see an example in action. Let's try the 75th percentile of our set of {1,5,9,20}. In this case *p* is 0.75 and *N* is 4. So…

*I = 0.75(4 – 1) + 1 = 3.25*

That splits up into *k*=3 and *d*=0.25. So our final answer is:

*V = v[3] + 0.25(v[4] – v[3]) = 9 + 0.25(20 – 9) = 11.75*
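If you'd like to double-check that arithmetic outside of Excel, here is a minimal Python sketch of the same formula (the function name excel_percentile is just my own label for it):

```python
import math

def excel_percentile(values, p):
    """Linear interpolation between ordered values, per Excel's PERCENTILE()."""
    v = sorted(values)        # v[1]..v[N] in the formula; 0-based here
    n = len(v)
    i = p * (n - 1) + 1       # intermediate result I
    k = math.floor(i)         # integer component k
    d = i - k                 # decimal component d
    if d == 0:                # I landed exactly on an entry (covers p=0 and p=1)
        return v[k - 1]
    return v[k - 1] + d * (v[k] - v[k - 1])   # V = v[k] + d*(v[k+1] - v[k])

print(excel_percentile([1, 5, 9, 20], 0.75))  # 11.75
print(excel_percentile([1, 5, 9, 20], 0.50))  # 7.0
```

For what it's worth, Python 3.8's statistics.quantiles(…, method='inclusive') implements this same definition (the one Excel later renamed PERCENTILE.INC), though that's incidental here.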

So, let’s translate that whole thing into T-SQL, shall we?

First we’ll create a temporary table called #Set, which will contain our set of 4 values:

if object_id('tempdb..#Set') is not null drop table #Set
go
create table #Set (SetValue int)
go
insert #Set select 1
union all select 5
union all select 9
union all select 20

Calculating the intermediate result is easy. Our *N* value is the number of rows in our table, the intermediate result, *I*, is calculated from that, and then the *k* and *d* values are calculated from that. (If you’re confused about the CROSS APPLYs, please check out my blog entry Cool CROSS APPLY Tricks, Part 2.)

declare @p decimal(8,7)
set @p=0.75
select N,I,k,d
from (select N=count(*) from #Set) F1
cross apply (select I=@p*(N-1)+1) F2
cross apply (select k=floor(I)
,d=I-floor(I)) F3
/*
N I k d
--- ---------- --- ----------
4 3.2500000 3 0.2500000
*/

Now that we have a value for *k*, we can get the *k*th and (*k+1*)th entries out of our #Set table. We will use the ROW_NUMBER() function to order the values and assign a SeqNo to each value in order to acquire those 2 specific entries (i.e. the ones where SeqNo is equal to *k* and *k+1*). Once we get those two entries, we calculate our final result, *V*.

declare @p decimal(8,7)
set @p=0.75
;with SequencedData as
(
select SetValue
,SeqNo=row_number() over (order by SetValue)
from #Set
)
select V
from (select N=count(*) from #Set) F1
cross apply (select I=@p*(N-1)+1) F2
cross apply (select k=floor(I)
,d=I-floor(I)) F3
cross apply (select [v(k)]=(select SetValue
from SequencedData
where SeqNo=k)
,[v(k+1)]=(select SetValue
from SequencedData
where SeqNo=k+1)) F4
cross apply (select V=[v(k)]+d*([v(k+1)]-[v(k)])) F5
/*
V
----------
11.750000
*/

This works fine, and it all flows intuitively, but the way that we acquire *v[k]* and *v[k+1]* is not ideal. The query plan shows 2 Table Scans and 2 Sorts, because we’re doing two individual SELECTs from a CTE that involves ROW_NUMBER().

So let’s change that F4 CROSS APPLY so that it only does a single SELECT instead of 2 sub-query SELECTs:

declare @p decimal(8,7)
set @p=0.75
;with SequencedData as
(
select SetValue
,SeqNo=row_number() over (order by SetValue)
from #Set
)
select V
from (select N=count(*) from #Set) F1
cross apply (select I=@p*(N-1)+1) F2
cross apply (select k=floor(I)
,d=I-floor(I)) F3
cross apply (select [v(k)]=min(SetValue)
,[v(k+1)]=max(SetValue)
from SequencedData
where SeqNo between k and k+1) F4
cross apply (select V=[v(k)]+d*([v(k+1)]-[v(k)])) F5
/*
V
----------
11.750000
*/

That makes the query a bit more efficient. I ran a test, loading the #Set table with 100,000 entries, and the difference between the two queries was quite satisfying: CPU decreased by 50%, Reads decreased by 33%, and Execution Time decreased by 44%.
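Why does that single-SELECT F4 work? Because SequencedData is ordered by value, the two rows where SeqNo is between *k* and *k+1* are exactly the *k*th and (*k+1*)th entries, and since the data is sorted, MIN necessarily returns v[k] and MAX returns v[k+1]. A tiny Python sketch of that idea (illustrative only, using the 75th-percentile numbers from above):

```python
v = sorted([20, 1, 9, 5])          # plays the role of SequencedData; SeqNo = 1-based position
k = 3                              # from the 75th-percentile calculation (I = 3.25)
window = v[k - 1 : k + 1]          # the rows where SeqNo is between k and k+1
vk, vk1 = min(window), max(window) # sorted data, so min is v[k] and max is v[k+1]
print(vk, vk1)                     # 9 20
```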

Now let’s put this concept into practical use. Let’s get Total Sales Dollars by Customer for all Orders in the AdventureWorks database. We will only look at the Customers who are Stores (CustomerType='S') rather than Individuals (CustomerType='I'). That will be our base data. Then we will find the 3 Quartiles (i.e. the 25th, 50th, and 75th percentiles) and, just for kicks, we will also find the minimum and maximum Customer Sales Dollars by finding the 0th and 100th percentiles. In theory, there’s no such thing as a 0th or 100th percentile, but handling them gracefully is one side benefit of Excel's calculation method:

;with BaseData as
(
select c.CustomerID
,CustDollars=sum(h.TotalDue)
from Sales.SalesOrderHeader h
join Sales.Customer c on h.CustomerID=c.CustomerID
where c.CustomerType='S'
group by c.CustomerID
)
,SequencedData as
(
select CustDollars
,SeqNo=row_number() over (order by CustDollars)
from BaseData
)
select Percentile=p
,[Value]=V
from (select N=count(*) from BaseData) F1
cross apply (select 0.00 union all
select 0.25 union all
select 0.50 union all
select 0.75 union all
select 1.00) Percentiles(p)
cross apply (select I=p*(N-1)+1) F2
cross apply (select k=floor(I)
,d=I-floor(I)) F3
cross apply (select [v(k)]=min(CustDollars)
,[v(k+1)]=max(CustDollars)
from SequencedData
where SeqNo between k and k+1) F4
cross apply (select V=[v(k)]+d*([v(k+1)]-[v(k)])) F5
/*
Percentile Value
----------- ---------------
0.00 1.518300
0.25 10079.041700
0.50 62256.990300
0.75 237845.936350
1.00 1179857.465700
*/

Hmmm… half of our customers have Dollar Totals of about $62,000 or less. But then the 75th percentile leaps up to about $238,000, with the biggest customer generating about $1.18 million. Those are big leaps. Just out of curiosity, how does the median of $62,000 compare to the Average Sales Dollars Per Customer (also known as the *mean*)?

;with BaseData as
(
select c.CustomerID
,CustDollars=sum(h.TotalDue)
from Sales.SalesOrderHeader h
join Sales.Customer c on h.CustomerID=c.CustomerID
where c.CustomerType='S'
group by c.CustomerID
)
select AverageDollars=avg(CustDollars)
from BaseData
/*
AverageDollars
---------------
170498.0247
*/

So this tells us that the mean lies somewhere between the 50th and 75th percentiles. The numbers indicate that our customers are not evenly spread across the spectrum and that the data is skewed to the lower side. In other words, a large number of our customers generate lower sales dollars. So which figure, $62,000 or $170,000, is more representative of the “average” customer? I'll leave that up to AdventureWorks management. An entire book could be written on the interpretation of the median and the mean. (But I'll bet that the CEO of AdventureWorks brags to his girlfriend using the higher number.)

In my next blog entry, we’ll take a look at Excel’s PERCENTRANK() function, which is essentially the inverse of the PERCENTILE() function we looked at in this entry. That way, we can take the $170,498 mean that we calculated above and see what percentile it represents.

Nice example.

Good post Brad. I love how you solved this problem, as it made it easy to understand even for those not familiar with calculating percentiles. One thing you may want to consider is to rely less on CROSS APPLY and use more standard joins. The query may be more difficult to read with standard joins, but it should perform better.

Here is a sample:

;WITH BaseData AS

(

select c.CustomerID

,CustDollars=sum(h.TotalDue)

,SeqNo=row_number() over (order by sum(h.TotalDue))

,COUNT(*) OVER() AS cnt

from Sales.SalesOrderHeader h

join Sales.Customer c on h.CustomerID=c.CustomerID

where c.CustomerType='S'

group by c.CustomerID

)

SELECT

p,

[v]=MIN(CustDollars) + ((p*(cnt-1)+1)

- FLOOR(p*(cnt-1)+1))*(MAX(CustDollars)- MIN(CustDollars))

FROM BaseData

CROSS JOIN(

SELECT 0.00 AS P UNION ALL

SELECT 0.25 UNION ALL

SELECT 0.50 UNION ALL

SELECT 0.75 UNION ALL

SELECT 1.00

) AS Percentiles

WHERE

SeqNo BETWEEN FLOOR(p*(cnt-1)+1)

AND FLOOR(p*(cnt-1)+1) + 1

GROUP BY p,cnt
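Adam's version folds everything into a single pass over the data: COUNT(*) OVER() puts the set size on every row, so N never requires a second scan, and the WHERE clause grabs just the two rows that bracket each percentile. The same one-pass shape can be sketched outside of SQL like so (a minimal Python sketch of the logic, not the query plan, reusing the small {1,5,9,20} set):

```python
import math

data = sorted([1, 5, 9, 20])   # the ordered CTE; SeqNo is the 1-based position
cnt = len(data)                # what COUNT(*) OVER() gives every row

results = {}
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    i = p * (cnt - 1) + 1                 # intermediate result I
    k = math.floor(i)                     # SeqNo of the lower bracketing row
    lo = data[k - 1]                      # MIN over SeqNo k..k+1
    hi = data[min(k, cnt - 1)]            # MAX over SeqNo k..k+1 (clamped so p=1.0 works)
    results[p] = lo + (i - k) * (hi - lo)

print(results)   # {0.0: 1.0, 0.25: 4.0, 0.5: 7.0, 0.75: 11.75, 1.0: 20.0}
```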

Thanks for your input, Adam.

I think the reason your query runs faster is because you are effectively doing 1 run through the table. Mine ends up doing two, because of that SELECT N=COUNT(*) in addition to just processing the SequencedData CTE.

I know I fooled around with a solution that included a COUNT(*) OVER () as you did in your query (attempting only 1 pass through the table), and I swear it came out taking longer for some reason, but I'll bet I just didn't have the JOINs or GROUP BYs arranged in an appropriate manner.

Thanks again.

Hi Adam...

Just a followup... here's your query dressed up to use the variables I originally had in the mathematical formulas so that it is easier to follow. It produces a virtually identical query plan to your original query; just a "Compute Scalar" or two get thrown in (which incurs essentially zero cost). Comparing your query and this one, they come out at 50%/50%.

I'm providing a link below to the query at my old website so that it's easier to read in terms of indentation, etc.:

http://www.stockciphering.com/percentile.htm

Brad,

Thanks for the followup :). I definitely agree that your version is much easier to read. It is great to see that it performs well! Performance and scalability are two of my biggest concerns (aside from security) when rolling out code. I look forward to reading part 2 of this series.

Thanks.

Dude, you rock. This is exactly what I was looking for!

ReplyDelete@Nicholas:

Thanks! I never updated this blog post with the faster code that was suggested by Adam in the comments section here. Please make sure you check this out for the final best approach:

http://www.stockciphering.com/percentile.htm


Great article - how would you implement this as a function?
