
Tuesday, March 8, 2011

T-SQL Tuesday #016: AWF-Inspiring Queries

This blog entry is participating in T-SQL Tuesday #016, hosted this month by Jes Schultz Borland (also known as grrlgeek). And no, we are not related, because you can plainly see that she puts an extraneous “t” in Schulz. (Sorry, Jes, I couldn’t resist).

You are invited to visit her blog and join the party and read more blogs participating in this month’s theme: Aggregates.

In May and June of 2010, I had the pleasure of giving a talk on Window Functions to the San Francisco and Silicon Valley SQL Server User Groups. I’ve been meaning to put together a blog post or two consisting of highlights of those presentations, but I just never got around to it.

Jes’ T-SQL Tuesday invitation was the perfect catalyst… And so this blog post will be specifically about Aggregate Window Functions (AWF) and some tips and tricks and traps.



Who Am I? Where Do I Fit In?

Consider the following query, which gives a detailed list of all the orders in AdventureWorks. The result is sorted by CustomerID and SalesOrderID for our convenience:

select CustomerID
,SalesOrderID
,SalesPersonID
,OrderDate=convert(date,OrderDate)
,OrderDollars=TotalDue
from AdventureWorks.Sales.SalesOrderHeader
order by CustomerID
,SalesOrderID
/*
CustomerID  SalesOrderID  SalesPersonID  OrderDate   OrderDollars
----------  ------------  -------------  ----------  ------------
1           43860         280            2001-08-01  14603.7393
1           44501         280            2001-11-01  26128.8674
1           45283         280            2002-02-01  37643.1378
1           46042         280            2002-05-01  34722.9906
2           46976         283            2002-08-01  10184.0774
2           47997         283            2002-11-01  5469.5941
. . .
(31465 Rows Total)
*/
Take a look at the first row in this result set… Order Number 43860. This row has an identity complex.

Sure, it knows that it’s an order for CustomerID 1, and it was handled by SalesPersonID 280, and it was ordered in the month of August 2001, and it was worth about $14600.

But what does all that really mean? The row is wondering: “What is my place in the world? How do I stack up against other rows? Who am I? I know that I belong to CustomerID 1, but am I that customer’s only order? Or am I one of many? What about my order amount of $14600? Am I a small order for that customer? Or am I huge? What percentage of CustomerID 1’s orders is attributable to me? What about my SalesPerson? Is my order amount above or below average for that SalesPerson? What about compared to ALL orders in AdventureWorks?”

This row has a lot of questions about the world outside of its little 1-row universe.

Since we sorted the result by CustomerID, it’s easy, from our perspective, to see the partition of rows for CustomerID 1. We can quickly see that Order 43860 is just 1 of 4 orders for this Customer, and, if we’re quick with mental calculations, that the average dollars for that Customer’s orders is around $28000, which makes Order 43860 below average.

We’ve always had the ability to analyze partitions (or groups) of data with the GROUP BY clause, like so:

select CustomerID
,NumOrders=count(*)
,TotalDollars=sum(TotalDue)
,AverageDollars=avg(TotalDue)
from AdventureWorks.Sales.SalesOrderHeader
group by CustomerID
order by CustomerID
/*
CustomerID  NumOrders  TotalDollars  AverageDollars
----------  ---------  ------------  --------------
1           4          113098.7351   28274.6837
2           8          32733.9695    4091.7461
3           12         479506.3256   39958.8604
4           8          780035.2121   97504.4015
5           8          114218.8006   14277.35
. . .
(19119 Rows Total)
*/
And we can also get information on the large single partition of ALL orders:

select NumOrders=count(*)
,TotalDollars=sum(TotalDue)
,AverageDollars=avg(TotalDue)
from AdventureWorks.Sales.SalesOrderHeader
/*
NumOrders  TotalDollars    AverageDollars
---------  --------------  --------------
31465      140707584.8246  4471.8762
*/
But in getting that summary information, we lose the detail.

The Window Functions introduced in SQL2005 allow us to “marry” the detail and the summary information… and we also are not limited to partitioning the data in only one way.

There are Ranking Window Functions like ROW_NUMBER, RANK, DENSE_RANK, and NTILE, but I want to focus instead on Aggregate Window Functions, like SUM, COUNT, AVG, MIN, MAX, etc.

What differentiates a regular Aggregate from a Windowed Aggregate is the OVER clause, where we can specify how we want to partition the data:

Aggregate_Function( <parameter> ) OVER ( [PARTITION BY <partition_list>] )
The OVER clause defines how we want to partition the data into groups of rows, and the Aggregate Function is applied to each of those groups, returning a value for that group.

Note that the PARTITION BY clause is optional; if it is left out, the aggregate is applied to a single partition consisting of the entire result set.
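For example, here’s a minimal sketch (the column aliases are just illustrative) contrasting an aggregate partitioned by CustomerID with one using an empty OVER clause, which operates on the entire result set:

select CustomerID
,SalesOrderID
,CustAvgDollars=avg(TotalDue) over (partition by CustomerID) /* average within this row's customer */
,OverallAvgDollars=avg(TotalDue) over () /* average across the whole result set */
from AdventureWorks.Sales.SalesOrderHeader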

So, with that in mind, let’s look at our original query once again, but this time, through Aggregate Window Functions, let’s answer some of Order 43860’s questions. How many orders does its customer have? What percentage of the customer’s order total is it responsible for? Is its value above or below average for the customer? What about for the salesperson? What about the entire company?

select CstID=CustomerID
,OrdID=SalesOrderID
,SPsnID=SalesPersonID
,Dollars=TotalDue
,[#CstOrds]=count(*) over (partition by CustomerID)
,[%CustTot]=TotalDue / sum(TotalDue) over (partition by CustomerID) * 100
,[+/-CustAvg]=TotalDue - avg(TotalDue) over (partition by CustomerID)
,[+/-SlsPsnAvg]=TotalDue - avg(TotalDue) over (partition by SalesPersonID)
,[+/-OverallAvg]=TotalDue - avg(TotalDue) over ()
from AdventureWorks.Sales.SalesOrderHeader
order by CustomerID, SalesOrderID
/*
CstID  OrdID  SPsnID  Dollars     #CstOrds  %CustTot  +/-CustAvg   +/-SlsPsnAvg  +/-OverallAvg
-----  -----  ------  ----------  --------  --------  -----------  ------------  -------------
1      43860  280     14603.7393  4         12.91     -13670.9444  -32270.8068   10131.8631
1      44501  280     26128.8674  4         23.10     -2145.8163   -20745.6787   21656.9912
1      45283  280     37643.1378  4         33.28     9368.4541    -9231.4083    33171.2616
1      46042  280     34722.9906  4         30.70     6448.3069    -12151.5555   30251.1144
2      46976  283     10184.0774  8         31.11     6092.3313    -16428.8692   5712.2012
2      47997  283     5469.5941   8         16.70     1377.848     -21143.3525   997.7179
2      49054  283     1739.4078   8         5.31      -2352.3383   -24873.5388   -2732.4684
. . .
(31465 Rows Total)
*/
This is great! Now Order 43860 knows so much more about itself and how it fits into the world, and we have relieved it of the burden of expensive therapy sessions.

What is a Result Set?

It’s important to know, however, that these Window Functions are applied to the Result Set, and not to the original source data (i.e. the tables) of the query.

For example, if we add a WHERE clause to that last query to only look at orders placed in the year 2001, then that completely changes the Result Set…

select CstID=CustomerID
,OrdID=SalesOrderID
,SPsnID=SalesPersonID
,Dollars=TotalDue
,[#CstOrds]=count(*) over (partition by CustomerID)
,[%CustTot]=TotalDue / sum(TotalDue) over (partition by CustomerID) * 100
,[+/-CustAvg]=TotalDue - avg(TotalDue) over (partition by CustomerID)
,[+/-SlsPsnAvg]=TotalDue - avg(TotalDue) over (partition by SalesPersonID)
,[+/-OverallAvg]=TotalDue - avg(TotalDue) over ()
from AdventureWorks.Sales.SalesOrderHeader
where OrderDate>='20010101' and OrderDate<'20020101'
order by CustomerID, SalesOrderID
/*
CstID  OrdID  SPsnID  Dollars     #CstOrds  %CustTot  +/-CustAvg   +/-SlsPsnAvg  +/-OverallAvg
-----  -----  ------  ----------  --------  --------  -----------  ------------  -------------
1      43860  280     14603.7393  2         35.85     -5762.564    -25709.6621   4213.9263
1      44501  280     26128.8674  2         64.14     5762.5641    -14184.534    15739.0544
3      44124  277     20504.1755  2         51.57     627.7545     -9620.1415    10114.3625
3      44791  277     19248.6666  2         48.42     -627.7544    -10875.6504   8858.8536
11     43853  282     34946.6266  2         86.60     14771.4029   7888.1527     24556.8136
11     44495  282     5403.8208   2         13.39     -14771.4029  -21654.6531   -4985.9922
. . .
(1379 Rows Total)
*/
…And we see that Order 43860’s perspective completely changes. Now he’s only one of 2 orders of CustomerID 1, and he now makes up a full 35.85% of that customer’s order total (as opposed to only 12.91%). And he’s only $4000 above the overall average of all orders in AdventureWorks (as opposed to $10000 above average).

Because Window Functions are applied to partitions of the Result Set, it’s important to understand the logical processing order of a query so that we really know what a “Result Set” is.

In simplest terms, the logical processing order of a query’s clauses is like so:

/*
FROM and JOIN/APPLY/PIVOT/UNPIVOT
WHERE
GROUP BY
HAVING
(At this point we have a Result Set)
SELECT
ORDER BY
TOP
*/
So Window Functions only come into play after we’ve done the FROM and JOINs and WHEREs and the GROUPing and HAVING stuff.

And this is why the Window Functions can only be used in the SELECT or ORDER BY clauses of a query.

What About the WHERE Clause?

You may NOT use Window Functions in a WHERE clause.

For example, let’s consider orders placed in January of 2004… Which orders had an above-average dollar amount (for that time period)? The following query, which employs a subquery to get the average for the month, will work fine:

select SalesOrderID 
,CustomerID
,OrderDate=convert(date,OrderDate)
,OrderDollars=TotalDue
from AdventureWorks.Sales.SalesOrderHeader
where OrderDate>='20040101' and OrderDate<'20040201'
and TotalDue>(select avg(TotalDue)
from AdventureWorks.Sales.SalesOrderHeader
where OrderDate>='20040101' and OrderDate<'20040201')
order by SalesOrderID
/*
SalesOrderID  CustomerID  OrderDate   OrderDollars
------------  ----------  ----------  ------------
61173         676         2004-01-01  38511.2882
61174         622         2004-01-01  9128.8743
61175         34          2004-01-01  34218.425
61177         442         2004-01-01  28287.8223
61178         510         2004-01-01  3565.8474
. . .
(498 Rows Total)
*/
But don’t you just hate having to duplicate the WHERE clause in the subquery? One might be tempted to use a Window Function to get the average of the result set and put it in the WHERE clause, like so:

select SalesOrderID 
,CustomerID
,OrderDate=convert(date,OrderDate)
,OrderDollars=TotalDue
from AdventureWorks.Sales.SalesOrderHeader
where OrderDate>='20040101' and OrderDate<'20040201'
and TotalDue>avg(TotalDue) over ()
order by SalesOrderID
/*
Msg 4108, Level 15, State 1, Line 7
Windowed functions can only appear in the SELECT or ORDER BY clauses.
*/
Well, that error message makes things clear, doesn’t it? We simply can’t do it that way. So how do we accomplish our goal? We have to put the Window Function in a CTE, and then, in the query that uses the CTE, we can filter for the rows we want:

with SourceData as
(
select SalesOrderID
,CustomerID
,OrderDate=convert(date,OrderDate)
,OrderDollars=TotalDue
,OverallAverageDollars=avg(TotalDue) over ()
from AdventureWorks.Sales.SalesOrderHeader
where OrderDate>='20040101' and OrderDate<'20040201'
)
select SalesOrderID
,CustomerID
,OrderDate
,OrderDollars
from SourceData
where OrderDollars>OverallAverageDollars
order by SalesOrderID
/*
SalesOrderID  CustomerID  OrderDate   OrderDollars
------------  ----------  ----------  ------------
61173         676         2004-01-01  38511.2882
61174         622         2004-01-01  9128.8743
61175         34          2004-01-01  34218.425
61177         442         2004-01-01  28287.8223
61178         510         2004-01-01  3565.8474
. . .
(498 Rows Total)
*/
There, that’s better.
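(Incidentally, if you’re not a fan of CTEs, a derived table works just as well here, since either way the Window Function is evaluated before the outer query’s WHERE clause gets to see it; here’s the same query sketched that way:)

select SalesOrderID
,CustomerID
,OrderDate
,OrderDollars
from (select SalesOrderID
,CustomerID
,OrderDate=convert(date,OrderDate)
,OrderDollars=TotalDue
,OverallAverageDollars=avg(TotalDue) over ()
from AdventureWorks.Sales.SalesOrderHeader
where OrderDate>='20040101' and OrderDate<'20040201') SourceData
where OrderDollars>OverallAverageDollars
order by SalesOrderID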

The GROUP BY Puzzle

This knowledge of when the Window Functions come into play is especially important when doing a GROUP BY query.

Let’s take a look at quantities sold by Region/Territory:

select Region=t.CountryRegionCode
,Territory=t.Name
,OrderTotal=sum(OrderQty)
from AdventureWorks.Sales.SalesOrderDetail d
join AdventureWorks.Sales.SalesOrderHeader h on d.SalesOrderID=h.SalesOrderID
join AdventureWorks.Sales.SalesTerritory t on h.TerritoryID=t.TerritoryID
group by t.CountryRegionCode
,t.Name
order by Region
,Territory
/*
Region  Territory       OrderTotal
------  --------------  ----------
AU      Australia       18293
CA      Canada          49381
DE      Germany         13143
FR      France          19906
GB      United Kingdom  20099
US      Central         19493
US      Northeast       19843
US      Northwest       36776
US      Southeast       18875
US      Southwest       59105
*/
We get a request to show how each territory contributes (percentage-wise) to the grand total. We might be tempted to do that by using the SUM Aggregate Window Function to calculate the grand total of the entire result set and dividing each row’s total by that grand total to get a percentage.

select Region=t.CountryRegionCode
,Territory=t.Name
,OrderTotal=sum(OrderQty)
,PercentOfTotal=100.0 * sum(OrderQty) / sum(OrderQty) over ()
from AdventureWorks.Sales.SalesOrderDetail d
join AdventureWorks.Sales.SalesOrderHeader h on d.SalesOrderID=h.SalesOrderID
join AdventureWorks.Sales.SalesTerritory t on h.TerritoryID=t.TerritoryID
group by t.CountryRegionCode
,t.Name
order by Region
,Territory
/*
Msg 8120, Level 16, State 1, Line 4
Column 'AdventureWorks.Sales.SalesOrderDetail.OrderQty' is invalid in the select list
because it is not contained in either an aggregate function or the GROUP BY clause.
*/
Huh? What does that error message mean? OrderQty is not in the SELECT list… it’s part of an aggregate. Why does the optimizer think it’s in the SELECT list?

Let’s take a step back for a moment. What was the result set of our original query? It was 3 columns: Region, Territory, and OrderTotal. That OrderTotal column was simply SUM(OrderQty). So then what would the Grand Total of the entire result set be? Ah-hah! It would be the SUM of all the SUM(OrderQty)’s. So that’s what we have to use in our Aggregate Window Function:

select Region=t.CountryRegionCode
,Territory=t.Name
,OrderTotal=sum(OrderQty)
,PercentOfTotal=100.0 * sum(OrderQty) / sum(sum(OrderQty)) over ()
from AdventureWorks.Sales.SalesOrderDetail d
join AdventureWorks.Sales.SalesOrderHeader h on d.SalesOrderID=h.SalesOrderID
join AdventureWorks.Sales.SalesTerritory t on h.TerritoryID=t.TerritoryID
group by t.CountryRegionCode
,t.Name
order by Region
,Territory
/*
Region  Territory       OrderTotal  PercentOfTotal
------  --------------  ----------  ---------------
AU      Australia       18293       6.654080912576
CA      Canada          49381       17.962344587761
DE      Germany         13143       4.780767803749
FR      France          19906       7.240809853263
GB      United Kingdom  20099       7.311013626079
US      Central         19493       7.090581054438
US      Northeast       19843       7.217893595815
US      Northwest       36776       13.377274347614
US      Southeast       18875       6.865783481379
US      Southwest       59105       21.499450737321
*/
So, in short, we can’t apply an Aggregate Window Function to the expression OrderQty, because by the time that the GROUP BY does its job, the OrderQty column is not available anymore… It is not in the Result Set. However, SUM(OrderQty) was in our Result Set, so we can apply our SUM function to that expression… thus, SUM(SUM(OrderQty)).
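If the nested SUM(SUM(OrderQty)) looks strange, it may help to see a logically equivalent two-step sketch (TerritoryTotals is just an illustrative name), where the GROUP BY is done in a CTE first, and the Aggregate Window Function is then applied to that CTE’s result set:

;with TerritoryTotals as
(
select Region=t.CountryRegionCode
,Territory=t.Name
,OrderTotal=sum(OrderQty)
from AdventureWorks.Sales.SalesOrderDetail d
join AdventureWorks.Sales.SalesOrderHeader h on d.SalesOrderID=h.SalesOrderID
join AdventureWorks.Sales.SalesTerritory t on h.TerritoryID=t.TerritoryID
group by t.CountryRegionCode
,t.Name
)
select Region
,Territory
,OrderTotal
,PercentOfTotal=100.0 * OrderTotal / sum(OrderTotal) over () /* the SUM of the SUMs */
from TerritoryTotals
order by Region
,Territory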

ORDER BY Uses

One final word on Aggregate Window Functions: Using an Aggregate Window Function in the SELECT clause makes sense. But how the heck would you use one in an ORDER BY?

Here’s an interesting example of how that might be useful. Let’s take a look at the top 100 orders (in terms of dollars). Let’s see which SalesPersons are associated with those orders:

select top 100 SalesPersonID
,FirstName
,LastName
,SalesOrderID
,OrderDate=convert(date,OrderDate)
,TotalDue
from AdventureWorks.Sales.SalesOrderHeader soh
join AdventureWorks.Person.Contact c on soh.SalesPersonID=c.ContactID
order by TotalDue desc
/*
SalesPersonID  FirstName  LastName  SalesOrderID  OrderDate   TotalDue
-------------  ---------  --------  ------------  ----------  -----------
281            Jauna      Elson     51131         2003-07-01  247913.9138
281            Jauna      Elson     55282         2003-10-01  227737.7215
285            Gail       Erickson  46616         2002-07-01  207058.3754
286            Mark       Erickson  46981         2002-08-01  201490.4144
275            Maciej     Dusza     47395         2002-09-01  198628.3054
. . .
286            Mark       Erickson  51735         2003-08-01  123218.8124
286            Mark       Erickson  53497         2003-09-01  122462.2807
275            Maciej     Dusza     51153         2003-07-01  122035.2702
281            Jauna      Elson     44127         2001-09-01  121141.6984
285            Gail       Erickson  46959         2002-08-01  120843.8322
(100 Rows Total)
*/
That gives us the detail about those Top 100 Orders. But it doesn’t give us any kind of perspective. We can see that Jauna Elson had the two biggest orders in the company, but does she have any others? What if we sort the result so that the SalesPerson who has the most orders is at the top of the result, and the SalesPerson with the second most orders is next, and so on? All we have to do is ORDER BY the COUNT of orders of each partition of SalesPersonID, like so:

select top 100 SalesPersonID
,FirstName
,LastName
,SalesOrderID
,OrderDate=convert(date,OrderDate)
,TotalDue
from AdventureWorks.Sales.SalesOrderHeader soh
join AdventureWorks.Person.Contact c on soh.SalesPersonID=c.ContactID
order by count(*) over (partition by SalesPersonID) desc
/*
SalesPersonID  FirstName  LastName  SalesOrderID  OrderDate   TotalDue
-------------  ---------  --------  ------------  ----------  ----------
277            Linda      Ecoffey   71894         2004-06-01  92952.5948
277            Linda      Ecoffey   71908         2004-06-01  2791.3805
277            Linda      Ecoffey   71939         2004-06-01  972.785
277            Linda      Ecoffey   71940         2004-06-01  668.3798
277            Linda      Ecoffey   71929         2004-06-01  66.2255
. . .
277            Linda      Ecoffey   63274         2004-02-01  486.3925
277            Linda      Ecoffey   63286         2004-02-01  449.506
277            Linda      Ecoffey   63221         2004-02-01  50.2703
277            Linda      Ecoffey   63226         2004-02-01  50.2703
277            Linda      Ecoffey   63232         2004-02-01  102.1309
(100 Rows Total)
*/
Isn’t that cool? We see that Linda Ecoffey has the most orders, so she’s at the top, and at the bottom of the list, we see---

Wait a minute…

Something’s wrong here. Linda Ecoffey is the only one in the report. Where’s Jauna Elson? Why are piddly little $50 orders showing up in our query?

Any idea?

It’s the Logical Processing Order of the query coming into play again! What is the very last thing that happens in a query? The TOP operator! Our COUNT(*) Window Function was applied to SalesPersonID partitions of ALL orders… and THEN the TOP operator was applied.

So once again, we have to split the query up and get the Top 100 Orders in a CTE, and THEN use the ORDER BY in the main query that uses that CTE:

with Top100Sales as
(
select top 100 SalesPersonID
,FirstName
,LastName
,SalesOrderID
,OrderDate=convert(date,OrderDate)
,TotalDue
from AdventureWorks.Sales.SalesOrderHeader soh
join AdventureWorks.Person.Contact c on soh.SalesPersonID=c.ContactID
order by TotalDue desc
)
select *
from Top100Sales
order by count(*) over (partition by SalesPersonID) desc
/*
SalesPersonID  FirstName  LastName   SalesOrderID  OrderDate   TotalDue
-------------  ---------  ---------  ------------  ----------  -----------
277            Linda      Ecoffey    43884         2001-08-01  154912.0712
277            Linda      Ecoffey    44528         2001-11-01  144355.8792
277            Linda      Ecoffey    46660         2002-07-01  162629.7468
277            Linda      Ecoffey    53506         2003-09-01  142587.761
277            Linda      Ecoffey    47027         2002-08-01  141111.7487
277            Linda      Ecoffey    50683         2003-06-01  140349.1998
277            Linda      Ecoffey    44530         2001-11-01  139414.5241
277            Linda      Ecoffey    46067         2002-05-01  135306.5844
277            Linda      Ecoffey    49888         2003-04-01  128123.3556
277            Linda      Ecoffey    47658         2002-10-01  132199.8023
277            Linda      Ecoffey    46642         2002-07-01  130825.509
277            Linda      Ecoffey    46066         2002-05-01  133271.8758
277            Linda      Ecoffey    47712         2002-10-01  133229.5585
277            Linda      Ecoffey    47990         2002-11-01  133056.7092
277            Linda      Ecoffey    46629         2002-07-01  132987.326
277            Linda      Ecoffey    48048         2002-11-01  126309.4039
276            Shelley    Dyck       50304         2003-05-01  123326.9867
276            Shelley    Dyck       69471         2004-05-01  131279.032
276            Shelley    Dyck       44777         2001-12-01  132235.1616
. . .
283            Terry      Eminhizer  51123         2003-07-01  125068.34
283            Terry      Eminhizer  46643         2002-07-01  149897.3647
283            Terry      Eminhizer  51711         2003-08-01  136046.4425
287            Martha     Espinoza   50297         2003-05-01  137108.3865
287            Martha     Espinoza   69508         2004-05-01  140734.4875
289            Janeth     Esteves    55254         2003-10-01  140521.5049
289            Janeth     Esteves    51761         2003-08-01  146293.234
280            Shannon    Elliott    47033         2002-08-01  125254.4859
280            Shannon    Elliott    67297         2004-04-01  125144.0084
268            Gary       Drury      51830         2003-08-01  149861.0659
(100 Rows Total)
*/
So that’s more like it. Linda Ecoffey is the top sales person in terms of having the most orders of the Top 100. Then comes Shelley Dyck and her orders (not all listed). And at the bottom of the list is Gary Drury with only one order in the Top 100.

So this is a novel way of looking at a query result and seeing at a glance that Linda is our big gun and she really leaves Gary in the dust.

But is this fair? Maybe the quantity of orders is not a good representation of who our big gun is. Perhaps the SalesPerson who brings in the highest average dollar amount per order should be at the top. Easy enough… We’ll just change our ORDER BY clause:

with Top100Sales as
(
select top 100 SalesPersonID
,FirstName
,LastName
,SalesOrderID
,OrderDate=convert(date,OrderDate)
,TotalDue
from AdventureWorks.Sales.SalesOrderHeader soh
join AdventureWorks.Person.Contact c on soh.SalesPersonID=c.ContactID
order by TotalDue desc
)
select *
from Top100Sales
order by avg(TotalDue) over (partition by SalesPersonID) desc
/*
SalesPersonID  FirstName  LastName   SalesOrderID  OrderDate   TotalDue
-------------  ---------  ---------  ------------  ----------  -----------
281            Jauna      Elson      44762         2001-12-01  123265.0706
281            Jauna      Elson      61184         2004-01-01  125364.2684
281            Jauna      Elson      50252         2003-05-01  126565.2584
281            Jauna      Elson      44127         2001-09-01  121141.6984
281            Jauna      Elson      47441         2002-09-01  153055.214
281            Jauna      Elson      47018         2002-08-01  144864.8759
281            Jauna      Elson      47369         2002-09-01  189198.6156
281            Jauna      Elson      51131         2003-07-01  247913.9138
281            Jauna      Elson      55282         2003-10-01  227737.7215
281            Jauna      Elson      67305         2004-04-01  155260.0958
281            Jauna      Elson      48043         2002-11-01  129398.8662
281            Jauna      Elson      44795         2001-12-01  139562.6866
268            Gary       Drury      51830         2003-08-01  149861.0659
286            Mark       Erickson   57143         2003-11-01  126349.7887
286            Mark       Erickson   50222         2003-05-01  124696.565
286            Mark       Erickson   51735         2003-08-01  123218.8124
. . .
277            Linda      Ecoffey    47990         2002-11-01  133056.7092
277            Linda      Ecoffey    46629         2002-07-01  132987.326
277            Linda      Ecoffey    48048         2002-11-01  126309.4039
283            Terry      Eminhizer  51123         2003-07-01  125068.34
283            Terry      Eminhizer  46643         2002-07-01  149897.3647
283            Terry      Eminhizer  51711         2003-08-01  136046.4425
280            Shannon    Elliott    47033         2002-08-01  125254.4859
280            Shannon    Elliott    67297         2004-04-01  125144.0084
(100 Rows Total)
*/
So Jauna is the champion in terms of highest-dollar average orders, and our friend Gary Drury, who ranked dead last in the previous query, redeems himself in this one because, although he only has one order, it is for a large amount, making him the second highest SalesPerson in terms of average dollars. Linda Ecoffey comes out towards the bottom.

In Conclusion

I hope this article gave you some insight into Aggregate Window Functions (AWF)… If you can’t get enough of this stuff, please also read a blog post I wrote last year called OVER Exposed, in which I talk in more detail about AWF’s and more gotchas and what really goes on under the hood with them.

Now go out there and produce some AWF-Inspiring Queries!

Thursday, March 25, 2010

OVER Exposed

The introduction of the OVER clause in SQL2005 gave us some terrific new capabilities in ranking data (via ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE()) and in aggregating data (COUNT(), SUM(), AVG(), etc). In this blog entry, I specifically want to talk about the OVER clause with respect to aggregates.

OVER Clause Overview

Let’s take a look at AdventureWorks Sales Order data in the United States (which is represented by TerritoryID’s 1 thru 5):

select sod.SalesOrderID
,CustomerID
,ProductID
,UnitPrice
from AdventureWorks.Sales.SalesOrderDetail sod
join AdventureWorks.Sales.SalesOrderHeader soh on sod.SalesOrderID=soh.SalesOrderID
where TerritoryID between 1 and 5
order by SalesOrderID
,ProductID
/*
SalesOrderID  CustomerID  ProductID  UnitPrice
------------  ----------  ---------  ---------
43659         676         709        5.70
43659         676         711        20.1865
43659         676         712        5.1865
43659         676         714        28.8404
43659         676         716        28.8404
43659         676         771        2039.994
43659         676         772        2039.994
...
(60153 Rows Total)
*/
Thanks to the OVER clause, we can effortlessly incorporate a column into that query that shows us each ProductID’s Average UnitPrice in that result:

select sod.SalesOrderID
,CustomerID
,ProductID
,UnitPrice
,AvgPrice=avg(UnitPrice) over (partition by ProductID)
from AdventureWorks.Sales.SalesOrderDetail sod
join AdventureWorks.Sales.SalesOrderHeader soh on sod.SalesOrderID=soh.SalesOrderID
where TerritoryID between 1 and 5
order by SalesOrderID
,ProductID
/*
SalesOrderID  CustomerID  ProductID  UnitPrice  AvgPrice
------------  ----------  ---------  ---------  ---------
43659         676         709        5.70       5.6525
43659         676         711        20.1865    28.2153
43659         676         712        5.1865     7.0082
43659         676         714        28.8404    34.4369
43659         676         716        28.8404    35.1093
43659         676         771        2039.994   2085.9938
43659         676         772        2039.994   2043.917
...
(60153 Rows Total)
*/
Let’s add one additional column that shows the UnitPrice’s Percent of the Average:

select sod.SalesOrderID
,CustomerID
,ProductID
,UnitPrice
,AvgPrice=avg(UnitPrice) over (partition by ProductID)
,PercentOfAvg=UnitPrice/avg(UnitPrice) over (partition by ProductID)*100
from AdventureWorks.Sales.SalesOrderDetail sod
join AdventureWorks.Sales.SalesOrderHeader soh on sod.SalesOrderID=soh.SalesOrderID
where TerritoryID between 1 and 5
order by SalesOrderID
,ProductID
/*
SalesOrderID  CustomerID  ProductID  UnitPrice  AvgPrice   PercentOfAvg
------------  ----------  ---------  ---------  ---------  ------------
43659         676         709        5.70       5.6525     100.84
43659         676         711        20.1865    28.2153    71.54
43659         676         712        5.1865     7.0082     74.00
43659         676         714        28.8404    34.4369    83.74
43659         676         716        28.8404    35.1093    82.14
43659         676         771        2039.994   2085.9938  97.79
43659         676         772        2039.994   2043.917   99.80
...
(60153 Rows Total)
*/
And finally, let’s just look at the 50 Line Items that had the lowest Percentages by changing our ORDER BY and adding a TOP 50 to the query:

select 
top 50 sod.SalesOrderID
,CustomerID
,ProductID
,UnitPrice
,AvgPrice=avg(UnitPrice) over (partition by ProductID)
,PercentOfAvg=UnitPrice/avg(UnitPrice) over (partition by ProductID)*100
from AdventureWorks.Sales.SalesOrderDetail sod
join AdventureWorks.Sales.SalesOrderHeader soh on sod.SalesOrderID=soh.SalesOrderID
where TerritoryID between 1 and 5
order by PercentOfAvg
,SalesOrderID
,ProductID
/*
SalesOrderID  CustomerID  ProductID  UnitPrice  AvgPrice  PercentOfAvg
------------  ----------  ---------  ---------  --------  ------------
69408         236         987        112.998    337.3797  33.49
69413         43          987        112.998    337.3797  33.49
69418         381         987        112.998    337.3797  33.49
69442         233         987        112.998    337.3797  33.49
...
69418         381         984        112.998    330.3681  34.20
69422         650         984        112.998    330.3681  34.20
69452         9           984        112.998    330.3681  34.20
69468         237         984        112.998    330.3681  34.20
(50 Rows Total)
*/
Well, that was easy. The OVER clause made our code nice and compact. If we didn’t have the OVER clause available to us, we would have had to write the query like the one below, JOINing our source data to a derived table where we computed the aggregate via a GROUP BY clause:

;with SourceData as
(
select sod.SalesOrderID
,CustomerID
,ProductID
,UnitPrice
from AdventureWorks.Sales.SalesOrderDetail sod
join AdventureWorks.Sales.SalesOrderHeader soh on sod.SalesOrderID=soh.SalesOrderID
where soh.TerritoryID between 1 and 5
)
select
top 50 SalesOrderID
,CustomerID
,S.ProductID
,UnitPrice
,AvgPrice
,PercentOfAvg=UnitPrice/AvgPrice*100
from SourceData S
join (select ProductID
,AvgPrice=avg(UnitPrice)
from SourceData
group by ProductID) G on S.ProductID=G.ProductID
order by PercentOfAvg
,SalesOrderID
,ProductID
/*
SalesOrderID  CustomerID  ProductID  UnitPrice  AvgPrice  PercentOfAvg
------------  ----------  ---------  ---------  --------  ------------
69408         236         987        112.998    337.3797  33.49
69413         43          987        112.998    337.3797  33.49
69418         381         987        112.998    337.3797  33.49
69442         233         987        112.998    337.3797  33.49
...
69418         381         984        112.998    330.3681  34.20
69422         650         984        112.998    330.3681  34.20
69452         9           984        112.998    330.3681  34.20
69468         237         984        112.998    330.3681  34.20
(50 Rows Total)
*/
That JOIN/GROUP BY approach took a lot more code and it looks like a lot of work, mainly because it has to scan the SalesOrderHeader and SalesOrderDetail tables twice, as we can see in the query plan:

JOIN/GROUP BY Query Plan

The query plan of the OVER Clause query, on the other hand, only scans those tables once:

OVER Query Plan

So it must be a lot more efficient, right?

Wrong.

Let’s take a look at the statistics for the two queries, collected by Profiler (with Query Results Discarded so that we are purely looking at processing time of the query and not the time to render the output):

/*
Description         Reads   Writes  CPU  Duration
------------------  ------  ------  ---  --------
OVER Clause         125801  2       617  1406
JOIN with GROUP BY  3916    0       308  1176
*/
The OVER clause was a real pig as far as the Number of Reads is concerned. And it had higher CPU and Duration than the JOIN approach.
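(By the way, if you don’t have Profiler handy, you can collect comparable Reads and CPU numbers yourself with SET STATISTICS; a quick sketch is below. Note that the reads performed by the spools are reported against a “Worktable” in the STATISTICS IO output.)

set statistics io on;
set statistics time on;

/* ...run the OVER Clause query or the JOIN/GROUP BY query here... */

set statistics io off;
set statistics time off;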

The OVER Clause Query Plan In Depth

Let’s take a closer look at the OVER clause actual execution plan once more, where I’ve added some annotations (please click on the image below to see a larger version in another browser window):

OVER Query Plan With Annotations

Looking at the outer loop (operators #5 thru #10), the rows that come from the merged table scans (#8, #9, #10) get sorted (#7) by ProductID. The Segment operator (#6) receives those rows and feeds them, groups (or segments) at a time, to the Table Spool operator (#5).

As the Table Spool (#5) receives those rows in a group, it saves a copy of the rows and then sends a single row to the Nested Loops operator (#4), indicating that it has done its job of saving the group’s rows.

Now that the Nested Loops operator (#4) has received its outer input, it calls for the inner input, which will come from operators #11 thru #15. The two other table spools in the plan (#14 and #15) are the same spool as #5, but they are replaying the copies of the rows that #5 had saved. The Stream Aggregate (#13) calculates the SUM(UnitPrice) and COUNT(*) and the Compute Scalar (#12) divides those two values to come up with the AVG(UnitPrice). This single value is JOINed (via the #11 Nested Loops operator) with the rows replayed by the Spool #15.

And then the Spool (#5) truncates its saved data and gets to work on the next group that it receives from the Segment operator (#6). And so on and so on, for each fresh group of ProductID's.

So, in a logical sense, this plan is also doing a GROUP BY and a JOIN, except it’s doing it with a single scan of the source data. However, to accomplish that, it has to do a triple scan of the spooled data, thus producing those hundreds of thousands of reads.
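In other words, the Stream Aggregate and Compute Scalar pair is just computing the average the old-fashioned way; conceptually (this is only a sketch of the logic, not the literal internals) it is doing something like this for each group:

select ProductID
,AvgPrice=sum(UnitPrice)/count(*) /* the SUM divided by the COUNT, like operators #13 and #12 */
from AdventureWorks.Sales.SalesOrderDetail
group by ProductID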

Memory Usage

If you hover your mouse over the SELECT operator (#1) in the actual query plan of our OVER Clause query, you can see that it requests a Memory Grant of 5288 Kbytes:

OVER Query's Memory Grant

By comparison, our traditional JOIN/GROUP BY query only requests 1544 Kbytes.

In the case of the JOIN/GROUP BY query, the memory grant is for the “Top N” Sort and the two Hashing operators. As Adam Haines demonstrated in a recent blog entry, a TOP value of 100 or less does not require that much memory. It would only request 1024K. The additional memory required for the Hashes brought its request up to 1544K.

The OVER query, though, is a different story. It also does a “Top N” Sort (#2), but, in addition, it does a full Sort (#7) of the source data, which requires a lot more memory than a “Top N” Sort. Thus the much larger memory grant request.
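(Hovering over the plan isn’t the only way to see these numbers, by the way. While a query is executing, you can also peek at its grant from another connection via the sys.dm_exec_query_memory_grants DMV, something like this:)

select session_id
,requested_memory_kb
,granted_memory_kb
,used_memory_kb
,max_used_memory_kb
from sys.dm_exec_query_memory_grants
/* only shows queries that currently hold (or are waiting on) a memory grant */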

Things get a lot worse as we increase the size of our result set. Let’s add a lot more columns to our output, increasing the row size that the Sort operator must process from 27 bytes to 402 bytes…

select 
top 50 sod.SalesOrderID
,CustomerID
,ProductID
,UnitPrice
,AvgPrice=avg(UnitPrice) over (partition by ProductID)
,PercentOfAvg=UnitPrice/avg(UnitPrice) over (partition by ProductID)*100
/* Let's add a bunch of other columns */
,soh.AccountNumber,soh.BillToAddressID,soh.Comment,soh.ContactID
,soh.CreditCardApprovalCode,soh.CreditCardID,soh.CurrencyRateID,soh.DueDate
,soh.Freight,soh.ModifiedDate,soh.OnlineOrderFlag,soh.OrderDate
,soh.PurchaseOrderNumber,soh.RevisionNumber,soh.rowguid,soh.SalesOrderNumber
,soh.SalesPersonID,soh.ShipDate,soh.ShipMethodID,soh.ShipToAddressID
,soh.SubTotal,soh.TaxAmt,soh.TerritoryID,soh.TotalDue,sod.CarrierTrackingNumber
,sod.OrderQty,sod.SalesOrderDetailID,sod.SpecialOfferID,sod.UnitPriceDiscount
from AdventureWorks.Sales.SalesOrderDetail sod
join AdventureWorks.Sales.SalesOrderHeader soh on sod.SalesOrderID=soh.SalesOrderID
where TerritoryID between 1 and 5
order by PercentOfAvg
,SalesOrderID
,ProductID
For this, the Memory Grant request balloons to 25,264K. If we add all those extra columns to our traditional JOIN/GROUP BY query, its Memory Grant request still remains at 1544K.

Collecting Profiler statistics for the two queries with all those extra columns shows more interesting information:

/*
Description         Reads   Writes  CPU   Duration
------------------  ------  ------  ----  --------
OVER Clause         144936  790     1696  2125
JOIN with GROUP BY  3916    0       850   1725
*/
Note all the extra writes that were required for the OVER query. The row size of the data that we needed to sort is too big to do completely in memory and it spilled over into TempDB, writing 790 pages worth of data.

The lesson here is that the JOIN/GROUP BY approach is faster and more efficient and less of a memory hog if you’re processing a large number of rows and/or if your rows are wide.

For our JOIN/GROUP BY query, the optimizer has the flexibility to choose from a variety of plans to come up with the most cost-effective approach. In our examples, the optimizer determined that it was more cost-effective to use a Hash Aggregation and a Hash Join as opposed to sorting the data. The Hash operators are extremely efficient in handling large gobs of data… and that data does not have to be sorted. The OVER query, on the other hand, requires the data to be sorted, because it always uses a Segment operator (#6) and a Stream Aggregate operator (#13) which only work on pre-sorted data.

Taking the SORT out of the Equation

Okay, so let’s do another experiment. Let’s see how our two approaches work on data that is pre-sorted. Just for kicks, instead of PARTITIONing BY (or GROUPing BY) ProductID, let’s do it by SalesOrderID. Since both the SalesOrderHeader and SalesOrderDetail tables have a clustered index on SalesOrderID, the system can scan both tables in SalesOrderID order, and do a Merge Join to join the two tables together, and we won’t require any sorts at all. And let’s also take out our TOP 50 (and the ORDER BY) so that no “Top N” Sort has to be performed in either query. So neither query will require any memory grants at all, and no expensive Sort operations. Here are the revised queries:

/* OVER Approach */
select sod.SalesOrderID
,CustomerID
,ProductID
,UnitPrice
/* Let's add a bunch of other columns */
,soh.AccountNumber,soh.BillToAddressID,soh.Comment,soh.ContactID
,soh.CreditCardApprovalCode,soh.CreditCardID,soh.CurrencyRateID,soh.DueDate
,soh.Freight,soh.ModifiedDate,soh.OnlineOrderFlag,soh.OrderDate
,soh.PurchaseOrderNumber,soh.RevisionNumber,soh.rowguid,soh.SalesOrderNumber
,soh.SalesPersonID,soh.ShipDate,soh.ShipMethodID,soh.ShipToAddressID
,soh.SubTotal,soh.TaxAmt,soh.TerritoryID,soh.TotalDue,sod.CarrierTrackingNumber
,sod.OrderQty,sod.SalesOrderDetailID,sod.SpecialOfferID,sod.UnitPriceDiscount
,AvgPrice=avg(UnitPrice) over (partition by sod.SalesOrderID)
from AdventureWorks.Sales.SalesOrderDetail sod
join AdventureWorks.Sales.SalesOrderHeader soh on sod.SalesOrderID=soh.SalesOrderID
where TerritoryID between 1 and 5

/* JOIN/GROUP BY Approach */
;with SourceData as
(
select sod.SalesOrderID
,CustomerID
,ProductID
,UnitPrice
/* Let's add a bunch of other columns */
,soh.AccountNumber,soh.BillToAddressID,soh.Comment,soh.ContactID
,soh.CreditCardApprovalCode,soh.CreditCardID,soh.CurrencyRateID,soh.DueDate
,soh.Freight,soh.ModifiedDate,soh.OnlineOrderFlag,soh.OrderDate
,soh.PurchaseOrderNumber,soh.RevisionNumber,soh.rowguid,soh.SalesOrderNumber
,soh.SalesPersonID,soh.ShipDate,soh.ShipMethodID,soh.ShipToAddressID
,soh.SubTotal,soh.TaxAmt,soh.TerritoryID,soh.TotalDue,sod.CarrierTrackingNumber
,sod.OrderQty,sod.SalesOrderDetailID,sod.SpecialOfferID,sod.UnitPriceDiscount
from AdventureWorks.Sales.SalesOrderDetail sod
join AdventureWorks.Sales.SalesOrderHeader soh on sod.SalesOrderID=soh.SalesOrderID
where soh.TerritoryID between 1 and 5
)
select S.*
,AvgPrice
from SourceData S
join (select SalesOrderID
,AvgPrice=avg(UnitPrice)
from SourceData S
group by SalesOrderID) G on S.SalesOrderID=G.SalesOrderID
Now let’s compare the performance of those two approaches:

/*
Description         Reads   Writes  CPU   Duration
------------------  ------  ------  ----  --------
OVER Clause         192017  0       1445  1945
JOIN with GROUP BY  3914    0       683   1184
*/
The JOIN/GROUP BY query still outperforms the OVER query in every respect. The OVER query still had to do truckloads of Reads, because of the Spooling. Note that the reads this time were even higher than they were in our original example, because there are more Groups of SalesOrderID’s (12041) than there are Groups of ProductID’s (265), so the Spools worked overtime, saving and replaying all those extra groups of data.

So it seems that the OVER clause is very convenient as far as coding and clarity, but unless you have pre-sorted data in a small number of groups with a small number of rows, the JOIN/GROUP BY approach may very well provide a better plan. That’s something to think about.

If you want the most efficient query possible, check out both approaches for performance. There may be some times when the optimizer will look at a JOIN/GROUP BY query and will construct a plan that is exactly like the OVER query anyway. For example, the following JOIN/GROUP BY query will do just that:

;with SourceData as
(
select sod.SalesOrderID
,CustomerID
,ProductID
,UnitPrice
from AdventureWorks.Sales.SalesOrderDetail sod
join AdventureWorks.Sales.SalesOrderHeader soh on sod.SalesOrderID=soh.SalesOrderID
where soh.TerritoryID between 1 and 5
)
select SalesOrderID
,CustomerID
,S.ProductID
,UnitPrice
,AvgPrice
,PercentOfAvg=UnitPrice/AvgPrice*100
from SourceData S
join (select ProductID
,AvgPrice=avg(UnitPrice)
from SourceData
group by ProductID) G on S.ProductID=G.ProductID
order by ProductID
Interestingly enough, it ONLY creates an OVER-type plan for the above because I specified that I wanted the output to be ORDERed BY ProductID, which is our GROUP BY column. Because of that, the optimizer figured it was more cost-effective to SORT the data by ProductID up front and do the spooling stuff than it was to use the Hashing approach and SORT the final data by ProductID at the end. If you take the ORDER BY out of the query above, then the optimizer will revert to a plan with Hash operators and no spooling.

But I DISTINCTly Asked For…

I know I’ve kind of trashed the OVER clause a bit during the course of this article, and I hate to kick someone when they’re down, but I’m afraid I have to bring something else up.

The OVER clause does not accept DISTINCT in its aggregate functions. In other words, the following query will bring about an error:

select SalesOrderID
,ProductID
,UnitPrice
,NumDistinctPrices=count(distinct UnitPrice) over (partition by ProductID)
,AvgDistinctPrice=avg(distinct UnitPrice) over (partition by ProductID)
from AdventureWorks.Sales.SalesOrderDetail
/*
Msg 102, Level 15, State 1, Line 4
Incorrect syntax near 'distinct'.
*/
If you want DISTINCT aggregates, you are forced to use the JOIN/GROUP BY approach to accomplish this.
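For example, against a stable source like the full SalesOrderDetail table, a JOIN/GROUP BY version of that illegal query might be sketched like this:

select sod.SalesOrderID
,sod.ProductID
,sod.UnitPrice
,G.NumDistinctPrices
,G.AvgDistinctPrice
from AdventureWorks.Sales.SalesOrderDetail sod
join (select ProductID
,NumDistinctPrices=count(distinct UnitPrice)
,AvgDistinctPrice=avg(distinct UnitPrice)
from AdventureWorks.Sales.SalesOrderDetail
group by ProductID) G on sod.ProductID=G.ProductID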

However, there may be times when, for whatever reason, you must scan the data only once. To be honest, the whole catalyst for my writing this article was a situation just like that. My source data was a derived table that produced random rows… something like this:

;with RandomRowNumsAdded as
(
select SalesOrderID
,ProductID
,UnitPrice
,RowNum=row_number() over (order by newid())
from AdventureWorks.Sales.SalesOrderDetail
)
select SalesOrderID
,ProductID
,UnitPrice
from RandomRowNumsAdded
where RowNum<=50
/*
SalesOrderID  ProductID  UnitPrice
------------  ---------  ---------
47439         762        469.794
72888         873        2.29
47640         780        2071.4196
51143         883        31.3142
. . .
48393         798        600.2625
63980         711        34.99
64590         780        2319.99
64262         877        7.95
(50 Rows Total)
*/
(If you’re wondering why the heck I’d want something like this, then stay tuned… you’ll see why in a future blog post… my craziest one yet).

Clearly, that query will come out differently every time it is executed. So if I wanted to use a JOIN/GROUP BY to calculate a COUNT(DISTINCT) or AVG(DISTINCT), I couldn’t, because the following JOIN/GROUP BY approach scans the information twice, and those two scans would produce two different sets of ProductID’s and UnitPrices:

;with RandomRowNumsAdded as
(
select SalesOrderID
,ProductID
,UnitPrice
,RowNum=row_number() over (order by newid())
from AdventureWorks.Sales.SalesOrderDetail
)
,
SourceData as
(
select SalesOrderID
,ProductID
,UnitPrice
from RandomRowNumsAdded
where RowNum<=50
)
select S.*
,NumDistinctPrices
,AvgDistinctPrice
from SourceData S
join (select ProductID
,NumDistinctPrices=count(distinct UnitPrice)
,AvgDistinctPrice=avg(distinct UnitPrice)
from SourceData
group by ProductID) G on S.ProductID=G.ProductID
/*
Because of the two SCANs of the data, the JOIN above will
produce different numbers of rows (sometimes perhaps even
ZERO rows!) every time this query is run.
*/
So I had to come up with a different way to do DISTINCT aggregates by only doing a single scan on the source data.

The answer ended up being… ta-dah!... the OVER clause, ironically enough. Even though you can’t do DISTINCT Aggregates directly with OVER, you can accomplish the task by first performing a ranking function (ROW_NUMBER()), and then follow that with a traditional non-DISTINCT aggregation. Here’s how, using my random-row source data:

;with RandomRowNumsAdded as
(
select SalesOrderID
,ProductID
,UnitPrice
,RowNum=row_number() over (order by newid())
from AdventureWorks.Sales.SalesOrderDetail
)
,
SourceData as
(
select SalesOrderID
,ProductID
,UnitPrice
,PriceSeq=row_number() over (partition by ProductID,UnitPrice order by UnitPrice)
from RandomRowNumsAdded
where RowNum<=50
)
select SalesOrderID
,ProductID
,UnitPrice
,NumDistinctPrices=count(case when PriceSeq=1 then UnitPrice end)
over (partition by ProductID)
,AvgDistinctPrice=avg(case when PriceSeq=1 then UnitPrice end)
over (partition by ProductID)
from SourceData
/*
SalesOrderID  ProductID  UnitPrice  NumDistinctPrices  AvgDistinctPrice
------------  ---------  ---------  -----------------  ----------------
73078         708        34.99      1                  34.99
74332         708        34.99      1                  34.99
69526         711        19.2445    2                  27.1172
64795         711        34.99      2                  27.1172
. . .
51810         940        48.594     1                  48.594
68514         955        2384.07    1                  2384.07
71835         964        445.41     1                  445.41
51823         968        1430.442   1                  1430.442
(50 Rows Total)
*/
We use ROW_NUMBER() to introduce a PriceSeq column, which resets to a value of 1 for each ProductID and UnitPrice combination. And then we only aggregate on those UnitPrices that have a PriceSeq of 1, since any rows with a PriceSeq of 2 and above will just be duplicate (non-DISTINCT) repeats of a ProductID and UnitPrice combination.

We already saw how an OVER aggregation query requires a Sort. In order to calculate DISTINCT aggregations, we require two sorts… one for the ROW_NUMBER() function to calculate our PriceSeq column and a second to do the aggregation. You can see this in the query plan:

ROW_NUMBER() and Aggregation Query Plan

This is a little odd, because the first sort orders the data by ProductID and UnitPrice, and the second sort orders the data by just ProductID. But that second sort is not necessary, because the first sort already put the data in a ProductID order anyway. It seems that the optimizer is not “smart enough” to know this and take advantage of it.

However, all is not completely lost. If the only DISTINCT aggregation you require in your query is a COUNT(), then you can use yet another clever approach to accomplish this:

;with RandomRowNumsAdded as
(
select SalesOrderID
,ProductID
,UnitPrice
,RowNum=row_number() over (order by newid())
from AdventureWorks.Sales.SalesOrderDetail
)
,
SourceData as
(
select SalesOrderID
,ProductID
,UnitPrice
,PriceDenseRank=dense_rank() over (partition by ProductID order by UnitPrice)
from RandomRowNumsAdded
where RowNum<=50
)
select SalesOrderID
,ProductID
,UnitPrice
,NumDistinctPrices=max(PriceDenseRank) over (partition by ProductID)
from SourceData
/*
SalesOrderID  ProductID  UnitPrice  NumDistinctPrices
------------  ---------  ---------  -----------------
67354         707        34.99      1
51653         707        34.99      1
58908         708        20.994     2
66632         708        34.99      2
. . .
51140         975        1020.594   1
68191         976        1700.99    1
72706         984        564.99     1
53606         987        338.994    1
(50 Rows Total)
*/
This applies a DENSE_RANK() value to each different UnitPrice found within each ProductID, and we can then calculate the COUNT(DISTINCT) by just getting the MAX() of that PriceDenseRank value.

Since the DENSE_RANK() and the MAX() both do a PARTITION BY ProductID, their sorting requirements are exactly identical, and this query plan only requires a single sort as opposed to two:

DENSE_RANK() and MAX() Query Plan

Final Wrapup

I hope you enjoyed this in-depth look at Aggregations and the OVER Clause. To review what was illustrated in this article…
  • The OVER clause is very convenient for calculating aggregations, but using a JOIN/GROUP BY approach may very well perform more efficiently, especially on source data that has a large number of rows and/or has a wide row size and/or is unsorted.

  • If you want to calculate DISTINCT Aggregates, use the JOIN/GROUP BY approach; however, …

  • If you, for some reason, only want to scan the source data once, assign sequence numbers using ROW_NUMBER() and then aggregate only on those rows that received a sequence number of 1; however, …

  • If you only require COUNT(DISTINCT), then use DENSE_RANK() and MAX(), since that only requires a single sort instead of two.