Wednesday, August 12, 2009

The Art of Regressing Linearly

Let’s use our Wayback Machine to travel back in time to AdventureWorks headquarters in July of 2004. The company just closed the books for the month of June 2004 and the middle managers are barking at their subordinates because their superiors have been hounding them to come up with some sales projections. The minions of the company are working doggedly and panting from exhaustion. (Okay, I guess I’ve run the canine references into the ground… I suppose it was the WayBack Machine reference that put a picture of Mr Peabody into my mind… sorry ‘bout that).

Anyway, AdventureWorks management wants to know, given their sales statistics history, what sales figures they can expect in 6 months and 12 months and 18 months from now. They also want to know when they can expect to reach $10 million per month in sales. How about $11 million per month?

With no qualms about changing history, we interfere and offer our help.

Let’s see what kind of sales statistics we have up through June of 2004:

select Mth=OrderDate-day(OrderDate)+1
,Sales=sum(TotalDue)
from Sales.SalesOrderHeader
where OrderDate<'2004-07-01'
group by OrderDate-day(OrderDate)+1
order by Mth
/*
Mth Sales
----------------------- ---------------------
2001-07-01 00:00:00.000 1172359.4289
2001-08-01 00:00:00.000 2605514.9809
2001-09-01 00:00:00.000 2073058.5385
2001-10-01 00:00:00.000 1688963.2744
. . .
2004-03-01 00:00:00.000 5272786.8106
2004-04-01 00:00:00.000 4722890.7352
2004-05-01 00:00:00.000 6518825.2262
2004-06-01 00:00:00.000 6728034.9873
*/
If we were to put the data into Excel and graph those points and have Excel add a trend line, this is what we get:



There are lots of up months and down months, but, in general, the trend is certainly up (always a good thing), with sales somewhere in the $2 million per month range back in 2001 and in the $6 million per month range by Jun2004.

So how can we make sales projections using SQL?

Like Excel did in generating the trend line, we will have to mathematically calculate the same thing using Linear Regression. You may recall from Algebra class that a line has the formula of y=mx+b, where m is the slope and b is the y-intercept. Given a collection of (x,y) values, the following Linear Regression formulas will calculate the slope and y-intercept values for us:




That looks like complicated stuff, but it’s actually very easy to implement.

Our Sales data has a datetime field (the 1st day of a month) and a Sales figure for that month. Those are our theoretical x and y values, but we can’t really use a datetime for our x value… we have to have an actual numeric value of some kind. So let’s just assume our first chronological month just has a value of 1, and the next month is 2, and so on.

For simplicity’s sake, let’s take our query we used above and put our sales statistics into a temporary file called #MonthlySales:

select Mth=OrderDate-day(OrderDate)+1
,Sales=sum(TotalDue)
into #MonthlySales
from Sales.SalesOrderHeader
where OrderDate<'2004-07-01'
group by OrderDate-day(OrderDate)+1
And now we can get our (x,y) values from that as shown below. Note how the x value was calculated using DATEDIFF and a scalar sub-query to get the month number. This is done just to make it more general without having to make assumptions about the underlying data.

select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)
,Y=Sales
from #MonthlySales
order by X
/*
X Y
------ ---------------------
1 1172359.4289
2 2605514.9809
3 2073058.5385
4 1688963.2744
. . .
33 5272786.8106
34 4722890.7352
35 6518825.2262
36 6728034.9873
*/
From that, it’s easy to calculate all those nasty sums (Σ’s) needed by the slope and y-intercept formulas:

;with XYData as
(
select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)
,Y=Sales
from #MonthlySales
)
select SumX=sum(X)
,SumY=sum(Y)
,SumXX=sum(X*X)
,SumXY=sum(X*Y)
,N=count(*)
from XYData
/*
SumX SumY SumXX SumXY N
------ ---------------- -------- ---------------- ------
666 140651405.9023 16206 3050875537.8704 36
*/
The actual values don’t mean anything to us, but that doesn’t matter because we will just use these to plug into our slope and y-intercept formulas below. Note that I use two CROSS APPLYs below because the value of the y-intercept requires that we calculate the value of the slope first. (Those of you who know me already know that I’m a big-time CROSS APPLY fan… click here for a look at previous blog entries about this frequently-misunderstood, underused operator).

Also note that I introduced 1.0 into the Slope and Intercept calculations in order to force decimal/numeric math to come into play as opposed to possible integer math. In other words, if both our x and y values were integers, then all the SUM figures would be integers and therefore when we do division, we could end up with unexpected integer results. For example, SQL will calculate 500/251 as being equal to 1, but it will calculate 1.0*500/251 as 1.992031. We want that more accurate figure.

;with XYData as
(
select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)
,Y=Sales
from #MonthlySales
)
,
CalcSums as
(
select SumX=sum(X)
,SumY=sum(Y)
,SumXX=sum(X*X)
,SumXY=sum(X*Y)
,N=count(*)
from XYData
)
select Slope
,Intercept
from CalcSums
cross apply (select Slope=1.0*(N*SumXY-SumX*SumY)/(N*SumXX-SumX*SumX)) CalcSlope
cross apply (select Intercept=1.0*(SumY-Slope*SumX)/N) CalcIntercept
/*
Slope Intercept
------------------------ -----------------
115527.5492092277992277 1769723.83691539
*/
Again, these figures appear pretty meaningless, but the Slope figure does tell us that we generally increase our monthly sales by about $115,528 per month, and that our sales at Month Number Zero was about $1.77 million.

Anyway, we now have the values to use in our y=mx+b formula for our trend line. So we can use that formula to make sales projections. The management wanted to know what sales figures we can expect (given the current trend) in Dec2004, Jun2005, and Dec2005.

So we will add another CTE (called ProjMonths) onto the existing query we have to introduce those months, and then finally calculate the projected sales for those months. Note how we use a CROSS APPLY to incorporate the Slope and Intercept values we calculated, then we use another CROSS APPLY to convert our Projected Months into appropriate x values, and then we use yet another CROSS APPLY to plug those x values into the y=mx+b (or Y=Slope*X+Intercept) formula:

;with XYData as
(
select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)
,Y=Sales
from #MonthlySales
)
,
CalcSums as
(
select SumX=sum(X)
,SumY=sum(Y)
,SumXX=sum(X*X)
,SumXY=sum(X*Y)
,N=count(*)
from XYData
)
,
SlopeIntercept as
(
select Slope
,Intercept
from CalcSums
cross apply (select Slope=1.0*(N*SumXY-SumX*SumY)/(N*SumXX-SumX*SumX)) CalcSlope
cross apply (select Intercept=1.0*(SumY-Slope*SumX)/N) CalcIntercept
)
,
ProjMonths as
(
select ProjMth=cast('2004-12-01' as datetime)
union all
select '2005-06-01'
union all
select '2005-12-01'
)
select ProjMth
,ProjSales
from ProjMonths
cross apply (select Slope,Intercept from SlopeIntercept) IncludeSlopeIntercept
cross apply (select ProjX=1+datediff(month
,(select min(Mth) from #MonthlySales)
,
ProjMth)) CalcProjX
cross apply (select ProjSales=ProjX*Slope+Intercept) CalcProjSales
/*
ProjMth ProjSales
----------------------- ------------------------
2004-12-01 00:00:00.000 6621880.90370296
2005-06-01 00:00:00.000 7315046.19895832
2005-12-01 00:00:00.000 8008211.49421369
*/
So it looks like we can expect about $8 million in sales by Dec2005.

Management also wanted to know when we can expect to reach $10 million and $11 million per month. So instead of taking an x value and plugging it into y=mx+b to get a y value, we need to do the opposite… we will take a y value and plug it into x=(y-b)/m to get an x value.

So we will use that in our query below. Blindly plugging in a y value of 10,000,000 would give us an x value of (10000000 - 1769723.83691539) / 115527.5492092277992277 = 71.24. That indicates that we’ll reach the $10 million mark about 24% of the way into Month Number 71. But our “month number” must be an integer, so we use the FLOOR() function to make it an actual integer ordinal number, and then we use the final CROSS APPLY to figure out which actual month that translates into:

;with XYData as
(
select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)
,Y=Sales
from #MonthlySales
)
,
CalcSums as
(
select SumX=sum(X)
,SumY=sum(Y)
,SumXX=sum(X*X)
,SumXY=sum(X*Y)
,N=count(*)
from XYData
)
,
SlopeIntercept as
(
select Slope
,Intercept
from CalcSums
cross apply (select Slope=1.0*(N*SumXY-SumX*SumY)/(N*SumXX-SumX*SumX)) CalcSlope
cross apply (select Intercept=1.0*(SumY-Slope*SumX)/N) CalcIntercept
)
,
ProjSales as
(
select ProjSales=10000000
union all
select 11000000
)
select ProjSales
,ProjMth
from ProjSales
cross apply (select Slope,Intercept from SlopeIntercept) IncludeSlopeIntercept
cross apply (select ProjX=floor((ProjSales-Intercept)/Slope)) CalcProjX
cross apply (select ProjMth=dateadd(month
,ProjX
,(select min(Mth) from #MonthlySales))) CalcProjMth
/*
ProjSales ProjMth
----------- -----------------------
10000000 2007-06-01 00:00:00.000
11000000 2008-02-01 00:00:00.000
*/
Now we can tell management that they will have to wait until Jun2007 to reach $10 million per month in sales and it will take another 8 months to reach $11 million in Feb2008.

So it looks like our time-travelling adventure yielded some good. We were able to give the management of AdventureWorks what they wanted. I’m sure they will use the information to make the company even stronger… perhaps because of our help, they’ll soon become the Amazon.com of bicycle sales. Many people will buy bicycles who hadn’t bought before, and…

Oops… perhaps we did change history. Perhaps we set something in motion that could cause a time paradox of some kind… perhaps something that … I don’t know… might make me just wink out of existence! Naaahhh… that’s impossible… it’s ridiculous! After all, if that happened, then how could I write this blog? How could I type this very sentence? How could I th

4 comments:

  1. Nice.

    Example is a bit complicated, i may have to look back at this one.

    ReplyDelete
  2. Great post Brad. I really enjoyed reading it. I like that you discuss real world problems and their solutions. Keep up the good work.

    ReplyDelete
  3. I have written a SQLCLR for this purpose.
    See http://www.developerworkshop.net/software.html
    and http://regression.developerworkshop.net/

    There are a number of samples too!

    ReplyDelete
  4. @Peter: Cool... I'll check it out. Thanks!

    ReplyDelete