Sunday, September 6, 2009

Fun With Percentiles, Part 1

Let’s say you have 100,000 rows in a table that has a Dollars column. If you perform the following…

select *, GroupValue=ntile(100) over (order by Dollars)
from MyTable
… it will split the rows into 100 equal sized subsets (of 1000 rows each) by sorting the rows by Dollars and then assigning a value of 1, 2, 3, … , 100 to the column GroupValue.

But despite the enticing name of the function, NTILE() doesn’t do anything for you as far as calculating percentile values. Percentiles are the 99 data values that mark the boundaries between those 100 subsets. For example, the 80th percentile marks the boundary where 80% of the values lie at or below it and 20% of the values lie at or above it. Perhaps the most “famous” of the percentile values is the 50th percentile, which you may know as the median.

There doesn’t seem to be any standard definition of percentile. Many references will say that the nth percentile has n% of the values at or below it (as I did above), and many will say that it has n% of the values just below it. But that’s kind of nitpicky stuff.

Unfortunately, though, there doesn’t seem to be a standard calculation of a percentile either. For example, let’s say you have a set of 3 numbers {1,5,9}. The 50th percentile is 5, because it’s right there in the middle. But what if you have a set of 4 numbers {1,5,9,20}? There is no number in the middle here, so you have to take the average of the middle two numbers. Therefore, the 50th percentile of this set is (5 + 9) / 2 = 7. That seems straightforward.

But now what’s the value of the 75th percentile? Is it the average of the 3rd and 4th numbers (which would be 14.5)? Or is it something else? This is where the definitions differ. For example, Microsoft Excel would calculate the 75th percentile of that set of 4 numbers to be 11.75.

For the purposes of this exercise, in order to be consistent across Microsoft products, I’ll use the definition that Microsoft Excel uses.

Excel has a PERCENTILE() function that accepts 2 parameters: an array, and a percentile value which must be in the range of 0 to 1, with 0.75 representing the 75th percentile, for example. Note that you can pass any real number between 0 and 1 to the function… If, for some bizarre reason, you wanted to find out the 56.27th percentile, you can pass 0.5627. The illustration below shows the PERCENTILE() function being used in Excel on our familiar set of 4 numbers.



How does Excel calculate this function? Okay, here goes…

Given a set of N ordered values, {v[1], v[2], ... , v[N]}, the value V of the percentile p (expressed as a decimal) is calculated as follows:

First, calculate an intermediate result:

I = p(N – 1) + 1

That intermediate result is split into an integer component, k, and a decimal component, d, such that k + d = I

Now the value of the percentile can be calculated like so:

V = v[k] + d(v[k+1] – v[k])

Whew! That’s a lot of variables, but it’s easier when you see an example in action. Let's try the 75th percentile of our set of {1,5,9,20}. In this case p is 0.75 and N is 4. So…

I = 0.75(4 – 1) + 1 = 3.25

That splits up into k=3 and d=0.25. So our final answer is:

V = v[3] + 0.25(v[4] – v[3]) = 9 + 0.25(20 – 9) = 11.75

So, let’s translate that whole thing into T-SQL, shall we?

First we’ll create a temporary table called #Set, which will contain our set of 4 values:

if object_id('tempdb..#Set') is not null drop table #Set
go
create table #Set (SetValue int)
go
insert #Set select 1
union all select 5
union all select 9
union all select 20
Calculating the intermediate result is easy. Our N value is the number of rows in our table, and the intermediate result, I, is calculated from that, and then the k and d values are calculated from that. (If you’re confused about the CROSS APPLYs, please check out my blog entry Cool CROSS APPLY Tricks, Part 2).

declare @p decimal(8,7)
set @p=0.75
select N,I,k,d
from (select N=count(*) from #Set) F1
cross apply (select I=@p*(N-1)+1) F2
cross apply (select k=floor(I)
,d=I-floor(I)) F3
/*
N I k d
--- ---------- --- ----------
4 3.2500000 3 0.2500000
*/
Now that we have a value for k, we can get the kth and (k+1)th entries out of our #Set table. We will use the ROW_NUMBER() function to order the values and assign a SeqNo to each value in order to acquire those 2 specific entries (i.e. the ones where SeqNo is equal to k and k+1). Once we get those two entries, we calculate our final result, V.

declare @p decimal(8,7)
set @p=0.75
;with SequencedData as
(
select SetValue
,SeqNo=row_number() over (order by SetValue)
from #Set
)
select V
from (select N=count(*) from #Set) F1
cross apply (select I=@p*(N-1)+1) F2
cross apply (select k=floor(I)
,d=I-floor(I)) F3
cross apply (select [v(k)]=(select SetValue
from SequencedData
where SeqNo=k)
,[v(k+1)]=(select SetValue
from SequencedData
where SeqNo=k+1)) F4
cross apply (select V=[v(k)]+d*([v(k+1)]-[v(k)])) F5
/*
V
----------
11.750000
*/
This works fine, and it all flows intuitively, but the way that we acquire v[k] and v[k+1] is not ideal. The query plan shows 2 Table Scans and 2 Sorts, because we’re doing two individual SELECTs from a CTE that involves ROW_NUMBER().

So let’s change that F4 CROSS APPLY so that it only does a single SELECT instead of 2 sub-query SELECTs:

declare @p decimal(8,7)
set @p=0.75
;with SequencedData as
(
select SetValue
,SeqNo=row_number() over (order by SetValue)
from #Set
)
select V
from (select N=count(*) from #Set) F1
cross apply (select I=@p*(N-1)+1) F2
cross apply (select k=floor(I)
,d=I-floor(I)) F3
cross apply (select [v(k)]=min(SetValue)
,[v(k+1)]=max(SetValue)
from SequencedData
where SeqNo between k and k+1) F4
cross apply (select V=[v(k)]+d*([v(k+1)]-[v(k)])) F5
/*
V
----------
11.750000
*/
That makes the query a bit more efficient. I ran a test, loading the #Set table with 100,000 entries, and the differences between those two queries was quite satisfying. The CPU decreased by 50%, the Reads decreased by 33%, and the Execution Time decreased by 44%.

Now let’s put this concept into practical use. Let’s get Total Sales Dollars by Customer for all Orders in the AdventureWorks database. We will only look at the Customers who are Stores (CustomerType=’S’) rather than Individuals (CustomerType=’I’). That will be our base data. Then we will find the 3 Quartiles (i.e. 25th, 50th, and 75th percentiles) and, just for kicks, we will also find the minimum and maximum Customer Sales Dollars by finding the 0th and 100th percentiles. In theory, there’s no such thing as the 0th and 100th percentile, but that is one side benefit to Excel's calculation method:

;with BaseData as
(
select c.CustomerID
,CustDollars=sum(h.TotalDue)
from Sales.SalesOrderHeader h
join Sales.Customer c on h.CustomerID=c.CustomerID
where c.CustomerType='S'
group by c.CustomerID
)
,
SequencedData as
(
select CustDollars
,SeqNo=row_number() over (order by CustDollars)
from BaseData
)
select Percentile=p
,[Value]=V
from (select N=count(*) from BaseData) F1
cross apply (select 0.00 union all
select 0.25 union all
select 0.50 union all
select 0.75 union all
select 1.00) Percentiles(p)
cross
apply (select I=p*(N-1)+1) F2
cross apply (select k=floor(I)
,d=I-floor(I)) F3
cross apply (select [v(k)]=min(CustDollars)
,[v(k+1)]=max(CustDollars)
from SequencedData
where SeqNo between k and k+1) F4
cross apply (select V=[v(k)]+d*([v(k+1)]-[v(k)])) F5
/*
Percentile Value
----------- ---------------
0.00 1.518300
0.25 10079.041700
0.50 62256.990300
0.75 237845.936350
1.00 1179857.465700
*/
Hmmm… half of our customers have Dollar Totals of about $62,000 or less. But then the 75th percentile leaps up to about $238,000, with the biggest customer generating about $1.18 million. Those are big leaps. Just out of curiosity, how does the median of $62,000 compare to the Average Sales Dollars Per Customer (also known as the mean)?

;with BaseData as
(
select c.CustomerID
,CustDollars=sum(h.TotalDue)
from Sales.SalesOrderHeader h
join Sales.Customer c on h.CustomerID=c.CustomerID
where c.CustomerType='S'
group by c.CustomerID
)
select AverageDollars=avg(CustDollars)
from BaseData
/*
AverageDollars
---------------
170498.0247
*/
So this tells us that the mean lies somewhere between the 50th and 75th percentile. The numbers indicate that our customers are not evenly spread across the spectrum and that the data is skewed to the lower side. In other words, a large number of our customers generate lower sales dollars. So which figure, $62,000 or $170,000, is more representative of the “average” customer? I'll leave that up to AdventureWorks management. An entire book could be written on the interpretation of the median and the mean. (But I'll bet that the CEO of AdventureWorks brags to his girlfriend using the higher number).

In my next blog entry, we’ll take a look at Excel’s PERCENTRANK() function. This is essentially the inverse of the PERCENTILE() function we looked at in this entry. This way we’ll take a look at that $170,498 that we calculated above and see what percentile that represents.

9 comments:

  1. Good post Brad. I love how you solved this problem, as it made it easy to understand even for those not familar with calculating percentiles. One thing you may want to consider is to rely less on the cross apply and use more standard joins. This query may be more difficult to read, with standard joins, but should perform better.

    Here is a sample:

    ;WITH BaseData AS
    (
    select c.CustomerID
    ,CustDollars=sum(h.TotalDue)
    ,SeqNo=row_number() over (order by sum(h.TotalDue))
    ,COUNT(*) OVER() AS cnt
    from Sales.SalesOrderHeader h
    join Sales.Customer c on h.CustomerID=c.CustomerID
    where c.CustomerType='S'
    group by c.CustomerID
    )
    SELECT
    p,
    [v]=MIN(CustDollars) + ((p*(cnt-1)+1)
    - FLOOR(p*(cnt-1)+1))*(MAX(CustDollars)- MIN(CustDollars))
    FROM BaseData
    CROSS JOIN(
    SELECT 0.00 AS P UNION ALL
    SELECT 0.25 UNION ALL
    SELECT 0.50 UNION ALL
    SELECT 0.75 UNION ALL
    SELECT 1.00
    ) AS Percentiles
    WHERE
    SeqNo BETWEEN FLOOR(p*(cnt-1)+1)
    AND FLOOR(p*(cnt-1)+1) + 1
    GROUP BY p,cnt

    ReplyDelete
  2. Thanks for your input, Adam.

    I think the reason your query runs faster is because you are effectively doing 1 run through the table. Mine ends up doing two, because of that SELECT N=COUNT(*) in addition to just processing the SequencedData CTE.

    I know I fooled around with a solution that included a COUNT(*) OVER () as you did in your query (attempting only 1 pass through the table), and I swear it came out taking longer for some reason, but I'll bet I just didn't have the JOINs or GROUP BYs arranged in an appropriate manner.

    Thanks again.

    ReplyDelete
  3. Hi Adam...

    Just a followup... here's your query dressed up to use the variables I originally had in the mathematical formulas so that it is easier to follow. It brings about a virtually identical query plan as your original query, just a "Compute Scalar" or two get thrown in (which incurs essentially zero cost). Comparing your query and this one, they come out at 50%/50%.

    I'm providing a link below to the query at my old website so that it's easier to read in terms of indentation, etc.:

    http://www.stockciphering.com/percentile.htm

    ReplyDelete
  4. Brad,
    Thanks for the followup :). I defintely agree that your version is much easier to read. It is great to see that it performs well! Performance and scalabilty are two of my biggest concerns (aside from security), when rolling out code. I look forward to reading part 2 of this series.
    Thanks.

    ReplyDelete
  5. Dude, you rock. This is exactly what I was looking for!

    ReplyDelete
  6. @Nicholas:

    Thanks! I never updated this blog post with the faster code that was suggested by Adam in the comments section here. Please make sure you check this out for the final best approach:

    http://www.stockciphering.com/percentile.htm

    ReplyDelete
  7. This comment has been removed by the author.

    ReplyDelete
  8. Great article - how will you implement this into a function?

    ReplyDelete