## Wednesday, August 12, 2009

### The Art of Regressing Linearly Let’s use our Wayback Machine to travel back in time to AdventureWorks headquarters in July of 2004. The company just closed the books for the month of June 2004 and the middle managers are barking at their subordinates because their superiors have been hounding them to come up with some sales projections. The minions of the company are working doggedly and panting from exhaustion. (Okay, I guess I’ve run the canine references into the ground… I suppose it was the WayBack Machine reference that put a picture of Mr Peabody into my mind… sorry ‘bout that).

Anyway, AdventureWorks management wants to know, given their sales statistics history, what sales figures they can expect in 6 months and 12 months and 18 months from now. They also want to know when they can expect to reach \$10 million per month in sales. How about \$11 million per month?

With no qualms about changing history, we interfere and offer our help.

Let’s see what kind of sales statistics we have up through June of 2004:

`select Mth=OrderDate-day(OrderDate)+1      ,Sales=sum(TotalDue)from Sales.SalesOrderHeader where OrderDate<'2004-07-01'group by OrderDate-day(OrderDate)+1order by Mth/*Mth                                     Sales----------------------- ---------------------2001-07-01 00:00:00.000          1172359.42892001-08-01 00:00:00.000          2605514.98092001-09-01 00:00:00.000          2073058.53852001-10-01 00:00:00.000          1688963.2744 . . .2004-03-01 00:00:00.000          5272786.81062004-04-01 00:00:00.000          4722890.73522004-05-01 00:00:00.000          6518825.22622004-06-01 00:00:00.000          6728034.9873*/`
If we were to put the data into Excel and graph those points and have Excel add a trend line, this is what we get: There are lots of up months and down months, but, in general, the trend is certainly up (always a good thing), with sales somewhere in the \$2 million per month range back in 2001 and in the \$6 million per month range by Jun2004.

So how can we make sales projections using SQL?

Like Excel did in generating the trend line, we will have to mathematically calculate the same thing using Linear Regression. You may recall from Algebra class that a line has the formula of y=mx+b, where m is the slope and b is the y-intercept. Given a collection of (x,y) values, the following Linear Regression formulas will calculate the slope and y-intercept values for us:  That looks like complicated stuff, but it’s actually very easy to implement.

Our Sales data has a datetime field (the 1st day of a month) and a Sales figure for that month. Those are our theoretical x and y values, but we can’t really use a datetime for our x value… we have to have an actual numeric value of some kind. So let’s just assume our first chronological month just has a value of 1, and the next month is 2, and so on.

For simplicity’s sake, let’s take our query we used above and put our sales statistics into a temporary file called #MonthlySales:

`select Mth=OrderDate-day(OrderDate)+1      ,Sales=sum(TotalDue) into #MonthlySalesfrom Sales.SalesOrderHeader where OrderDate<'2004-07-01'group by OrderDate-day(OrderDate)+1 `
And now we can get our (x,y) values from that as shown below. Note how the x value was calculated using DATEDIFF and a scalar sub-query to get the month number. This is done just to make it more general without having to make assumptions about the underlying data.

`select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)      ,Y=Salesfrom #MonthlySalesorder by X/*     X                     Y------ ---------------------     1          1172359.4289     2          2605514.9809     3          2073058.5385     4          1688963.2744 . . .    33          5272786.8106    34          4722890.7352    35          6518825.2262    36          6728034.9873*/`
From that, it’s easy to calculate all those nasty sums (Σ’s) needed by the slope and y-intercept formulas:

`;with XYData as(  select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)        ,Y=Sales  from #MonthlySales)select SumX=sum(X)      ,SumY=sum(Y)      ,SumXX=sum(X*X)      ,SumXY=sum(X*Y)      ,N=count(*)from XYData/*  SumX             SumY    SumXX            SumXY      N------ ---------------- -------- ---------------- ------   666   140651405.9023    16206  3050875537.8704     36*/`
The actual values don’t mean anything to us, but that doesn’t matter because we will just use these to plug into our slope and y-intercept formulas below. Note that I use two CROSS APPLYs below because the value of the y-intercept requires that we calculate the value of the slope first. (Those of you who know me already know that I’m a big-time CROSS APPLY fan… click here for a look at previous blog entries about this frequently-misunderstood, underused operator).

Also note that I introduced 1.0 into the Slope and Intercept calculations in order to force decimal/numeric math to come into play as opposed to possible integer math. In other words, if both our x and y values were integers, then all the SUM figures would be integers and therefore when we do division, we could end up with unexpected integer results. For example, SQL will calculate 500/251 as being equal to 1, but it will calculate 1.0*500/251 as 1.992031. We want that more accurate figure.

`;with XYData as(  select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)        ,Y=Sales  from #MonthlySales),CalcSums as(  select SumX=sum(X)        ,SumY=sum(Y)        ,SumXX=sum(X*X)        ,SumXY=sum(X*Y)        ,N=count(*)  from XYData)select Slope      ,Interceptfrom CalcSumscross apply (select Slope=1.0*(N*SumXY-SumX*SumY)/(N*SumXX-SumX*SumX)) CalcSlopecross apply (select Intercept=1.0*(SumY-Slope*SumX)/N) CalcIntercept/*                   Slope         Intercept------------------------ ----------------- 115527.5492092277992277  1769723.83691539*/`
Again, these figures appear pretty meaningless, but the Slope figure does tell us that we generally increase our monthly sales by about \$115,528 per month, and that our sales at Month Number Zero was about \$1.77 million.

Anyway, we now have the values to use in our y=mx+b formula for our trend line. So we can use that formula to make sales projections. The management wanted to know what sales figures we can expect (given the current trend) in Dec2004, Jun2005, and Dec2005.

So we will add another CTE (called ProjMonths) onto the existing query we have to introduce those months, and then finally calculate the projected sales for those months. Note how we use a CROSS APPLY to incorporate the Slope and Intercept values we calculated, then we use another CROSS APPLY to convert our Projected Months into appropriate x values, and then we use yet another CROSS APPLY to plug those x values into the y=mx+b (or Y=Slope*X+Intercept) formula:

`;with XYData as(  select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)        ,Y=Sales  from #MonthlySales),CalcSums as(  select SumX=sum(X)        ,SumY=sum(Y)        ,SumXX=sum(X*X)        ,SumXY=sum(X*Y)        ,N=count(*)  from XYData),SlopeIntercept as(  select Slope        ,Intercept  from CalcSums  cross apply (select Slope=1.0*(N*SumXY-SumX*SumY)/(N*SumXX-SumX*SumX)) CalcSlope  cross apply (select Intercept=1.0*(SumY-Slope*SumX)/N) CalcIntercept),ProjMonths as(  select ProjMth=cast('2004-12-01' as datetime)  union all  select '2005-06-01'  union all  select '2005-12-01')select ProjMth      ,ProjSalesfrom ProjMonthscross apply (select Slope,Intercept from SlopeIntercept) IncludeSlopeInterceptcross apply (select ProjX=1+datediff(month                                    ,(select min(Mth) from #MonthlySales)                                    ,ProjMth)) CalcProjXcross apply (select ProjSales=ProjX*Slope+Intercept) CalcProjSales/*ProjMth                                ProjSales----------------------- ------------------------2004-12-01 00:00:00.000         6621880.903702962005-06-01 00:00:00.000         7315046.198958322005-12-01 00:00:00.000         8008211.49421369*/`
So it looks like we can expect about \$8 million in sales by Dec2005.

Management also wanted to know when we can expect to reach \$10 million and \$11 million per month. So instead of taking an x value and plugging it into y=mx+b to get a y value, we need to do the opposite… we will take a y value and plug it into x=(y-b)/m to get an x value.

So we will use that in our query below. Blindly plugging in a y value of 10,000,000 would give us an x value of (10000000 - 1769723.83691539) / 115527.5492092277992277 = 71.24. That indicates that we’ll reach the \$10 million mark about 24% of the way into Month Number 71. But our “month number” must be an integer, so we use the FLOOR() function to make it an actual integer ordinal number, and then we use the final CROSS APPLY to figure out which actual month that translates into:

`;with XYData as(  select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)        ,Y=Sales  from #MonthlySales),CalcSums as(  select SumX=sum(X)        ,SumY=sum(Y)        ,SumXX=sum(X*X)        ,SumXY=sum(X*Y)        ,N=count(*)  from XYData),SlopeIntercept as(  select Slope        ,Intercept  from CalcSums  cross apply (select Slope=1.0*(N*SumXY-SumX*SumY)/(N*SumXX-SumX*SumX)) CalcSlope  cross apply (select Intercept=1.0*(SumY-Slope*SumX)/N) CalcIntercept),ProjSales as(  select ProjSales=10000000  union all  select 11000000)select ProjSales      ,ProjMthfrom ProjSalescross apply (select Slope,Intercept from SlopeIntercept) IncludeSlopeInterceptcross apply (select ProjX=floor((ProjSales-Intercept)/Slope)) CalcProjXcross apply (select ProjMth=dateadd(month                                   ,ProjX                                   ,(select min(Mth) from #MonthlySales))) CalcProjMth/*  ProjSales ProjMth----------- -----------------------   10000000 2007-06-01 00:00:00.000   11000000 2008-02-01 00:00:00.000*/`
Now we can tell management that they will have to wait until Jun2007 to reach \$10 million per month in sales and it will take another 8 months to reach \$11 million in Feb2008.

So it looks like our time-travelling adventure yielded some good. We were able to give the management of AdventureWorks what they wanted. I’m sure they will use the information to make the company even stronger… perhaps because of our help, they’ll soon become the Amazon.com of bicycle sales. Many people will buy bicycles who hadn’t bought before, and…

Oops… perhaps we did change history. Perhaps we set something in motion that could cause a time paradox of some kind… perhaps something that … I don’t know… might make me just wink out of existence! Naaahhh… that’s impossible… it’s ridiculous! After all, if that happened, then how could I write this blog? How could I type this very sentence? How could I th

1. Nice.

Example is a bit complicated, i may have to look back at this one.

2. Great post Brad. I really enjoyed reading it. I like that you discuss real world problems and their solutions. Keep up the good work.

3. I have written a SQLCLR for this purpose.
See http://www.developerworkshop.net/software.html
and http://regression.developerworkshop.net/

There are a number of samples too!

4. @Peter: Cool... I'll check it out. Thanks!

5. Great, thanks for creating this awesome article. Thank you

Microsoft Server 2016 Installation
Server 2016