Tuesday, April 5, 2011

SysJobHistory Outliers

[Image: Bell curve with standard deviations]

In honor of NCAA March Madness, former basketball star (and current SQL MVP and SQLRockStar) Thomas LaRock (b | t) wrote a series of blog posts about SQL Server system tables entitled, appropriately enough, March Madness. They ran from Mar 17 to Apr 4. Check them out… there is sure to be something in there that you didn’t know about before or that you’ll find to be really cool and useful.

Which brings me to this post. I really enjoyed the concept of Tom’s Apr02 post on the sysjobhistory table in msdb. He had a query that found the SQL Agent Jobs that are outliers… in other words, jobs that are atypical in terms of their run duration… to put it still another way, this query found jobs that ran longer than what would seem to be statistically “normal”.

Tom’s query looked at the previous 24 hours’ worth of job history and found those job steps that ran longer than the average duration plus two standard deviations (i.e. the mean plus two sigmas). In the picture of the bell curve above, that is in the green and gray zone at the right.

I thought this was a great concept, but I wanted to expand on it and introduce some additional features.

First of all, as Tom mentioned, the sysjobhistory table stores dates and times as integers, which is a real pain in the posterior. The dates are stored as YYYYMMDD integers and the times (and durations) are stored as HHMMSS integers. So, for example, if I look at a random entry in my sysjobhistory table…

select top 1 run_date, run_time, run_duration
from msdb.dbo.sysjobhistory
/*
run_date    run_time    run_duration
--------    --------    ------------
20110302    74856       106
*/
the run_time of 74856 means that the item started at 07:48:56 and the run_duration of 106 means that the item ran for a total of 00:01:06, or 1 minute and 6 seconds. So my first job was to translate these goofy integer representations into something meaningful.

The run_date is easy, since we are beyond the Middle Ages and all years that we deal with are 4 digits, so I could just directly translate a date by converting the integer to a VARCHAR(8) and then converting that to a DATETIME:

select top 1 convert(datetime,convert(varchar(8),run_date))
from msdb.dbo.sysjobhistory
/* 2011-03-02 00:00:00.000 */
The run_time takes a little more work, because the CONVERT function insists on times being in HH:MM:SS format, with 2 digits for each element, meaning leading zeroes had to be there. So first I would introduce the leading zeroes I need by adding 1000000 to the number and then taking the RIGHT-most 6 characters (T-SQL will implicitly convert the integer to a VARCHAR before doing the RIGHT function):

select top 1 right(1000000+run_time,6)
from msdb.dbo.sysjobhistory
/* 074856 */
Then it’s a matter of introducing the colons to separate the elements:

select top 1 stuff(stuff(right(1000000+run_time,6),3,0,':'),6,0,':')
from msdb.dbo.sysjobhistory
/* 07:48:56 */
Now, when I combine that with my VARCHAR representation of the run_date and put them together in YYYYMMDD HH:MM:SS format, I can CONVERT it into a complete date and time of the item:

select top 1 convert(datetime,convert(varchar(8),run_date)
+' '
+stuff(stuff(right(1000000+run_time,6),3,0,':'),6,0,':'))
from msdb.dbo.sysjobhistory
/* 2011-03-02 07:48:56.000 */
The run_duration requires a little more manipulation to translate the HHMMSS integer representation into just a number of seconds (so the value of 106, representing 00:01:06, is translated into 66 seconds):

select top 1 run_duration/10000*3600   --Hours*3600 = Seconds
+run_duration%10000/100*60 --Minutes*60 = Seconds
+run_duration%100 --Seconds
from msdb.dbo.sysjobhistory
/* 66 */
So now that I have those formulas down, I can put together a query that gives me more meaningful information. This query returns information for successfully completed jobs… entries with a step_id of 0 represent the outcome (the date, time, and duration) of the job as a whole:

select job_id
      ,date_executed=convert(datetime,convert(varchar(8),run_date))
                     +' '
                     +stuff(stuff(right(1000000+run_time,6),3,0,':'),6,0,':')
      ,secs_duration=run_duration/10000*3600
                     +run_duration%10000/100*60
                     +run_duration%100
from msdb.dbo.sysjobhistory
where step_id=0    --Job Outcome
  and run_status=1 --Succeeded
/*
job_id                               date_executed           secs_duration
------------------------------------ ----------------------- -------------
ECA51518-7144-49DE-9153-EB12D42AE0D9 2011-03-02 07:59:17.000 66
ECA51518-7144-49DE-9153-EB12D42AE0D9 2011-03-02 08:23:49.000 74
ECA51518-7144-49DE-9153-EB12D42AE0D9 2011-03-02 13:07:24.000 61
etc, etc, etc
*/
Now that I have that information, I can put that query into a CTE and make good use of it.

In order to find the outliers, I wanted the ability to define a certain “history window” or “baseline window” of a range of dates of job runs from which I could calculate the Average and the Standard Deviation. So the query would accept a @HistoryStartDate and @HistoryEndDate date range (I would convert the dates to full days starting/ending at midnight). Also, for a halfway meaningful sample, I wanted an option to calculate the statistics for jobs that had run a minimum number of times (represented by @MinHistExecutions). And, since we may only want to look at outliers of jobs that are kind of “beefy”, I wanted to introduce the option to only pay attention to jobs that had a minimum average duration of some kind (@MinAvgSecsDuration).

So this query would give us our history/baseline calculations:

declare @HistoryStartDate datetime = '19000101'
       ,@HistoryEndDate datetime = getdate()
       ,@MinHistExecutions int = 10
       ,@MinAvgSecsDuration int = 30
;
with JobHistData as
(
  select job_id
        ,date_executed=convert(datetime,convert(varchar(8),run_date))
                       +' '
                       +stuff(stuff(right(1000000+run_time,6),3,0,':'),6,0,':')
        ,secs_duration=run_duration/10000*3600
                       +run_duration%10000/100*60
                       +run_duration%100
  from msdb.dbo.sysjobhistory
  where step_id=0    --Job Outcome
    and run_status=1 --Succeeded
)
select job_id
      ,AvgDuration=avg(secs_duration*1.)
      ,AvgPlus2StDev=avg(secs_duration*1.)+2*stdevp(secs_duration)
from JobHistData
where date_executed>=dateadd(day,datediff(day,'19000101',@HistoryStartDate),'19000101')
  and date_executed<dateadd(day,1+datediff(day,'19000101',@HistoryEndDate),'19000101')
group by job_id
having count(*)>=@MinHistExecutions
   and avg(secs_duration*1.)>=@MinAvgSecsDuration
/*
job_id                               AvgDuration AvgPlus2StDev
------------------------------------ ----------- ----------------
ECA51518-7144-49DE-9153-EB12D42AE0D9 69.373333   108.739927181991
0BC1CB96-602F-417A-AB42-CAF98E1E39A1 47.266666   82.860423255389
etc, etc, etc
*/
And now that we have that, we can look for jobs in an “analysis window” date range (@AnalysisStartDate and @AnalysisEndDate)… perhaps the last 24 or 48 hours… to see which ones have a duration that exceeds the AvgPlus2StDev of the baseline.

Here is a stored procedure to do just that.

create procedure usp_SQLAgentJobOutliers
  @HistoryStartDate datetime = null
 ,@HistoryEndDate datetime = null
 ,@AnalysisStartDate datetime = null
 ,@AnalysisEndDate datetime = null
 ,@MinHistExecutions int = 10
 ,@MinAvgSecsDuration int = 30
as

if @HistoryStartDate is null set @HistoryStartDate='19000101' ;
if @HistoryEndDate is null set @HistoryEndDate=getdate() ;
if @AnalysisStartDate is null set @AnalysisStartDate='19000101' ;
if @AnalysisEndDate is null set @AnalysisEndDate=getdate() ;

with JobHistData as
(
  select job_id
        ,date_executed=convert(datetime,convert(varchar(8),run_date))
                       +' '
                       +stuff(stuff(right(1000000+run_time,6),3,0,':'),6,0,':')
        ,secs_duration=run_duration/10000*3600
                       +run_duration%10000/100*60
                       +run_duration%100
  from msdb.dbo.sysjobhistory
  where step_id=0    --Job Outcome
    and run_status=1 --Succeeded
)
,JobHistStats as
(
  select job_id
        ,AvgDuration=avg(secs_duration*1.)
        ,AvgPlus2StDev=avg(secs_duration*1.)+2*stdevp(secs_duration)
  from JobHistData
  where date_executed>=dateadd(day,datediff(day,'19000101',@HistoryStartDate),'19000101')
    and date_executed<dateadd(day,1+datediff(day,'19000101',@HistoryEndDate),'19000101')
  group by job_id
  having count(*)>=@MinHistExecutions
     and avg(secs_duration*1.)>=@MinAvgSecsDuration
)
select jd.job_id
      ,JobName=j.name
      ,ExecutionDate=jd.date_executed
      ,[Duration (secs)]=jd.secs_duration
      ,[Historical Avg Duration (secs)]=AvgDuration
      ,[Min Threshold (secs)]=AvgPlus2StDev
from JobHistData jd
join JobHistStats jhs on jd.job_id=jhs.job_id
join msdb.dbo.sysjobs j on jd.job_id=j.job_id
where date_executed>=dateadd(day,datediff(day,'19000101',@AnalysisStartDate),'19000101')
  and date_executed<dateadd(day,1+datediff(day,'19000101',@AnalysisEndDate),'19000101')
  and secs_duration>AvgPlus2StDev
If you don’t specify any parameters, it will use the ENTIRE sysjobhistory table for both the “history/baseline window” and the “analysis window”. Try calling the procedure with various parameters and see what you come up with. See if any jobs of yours have run with an atypically long duration.
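For example, a call that analyzes just the last 24 hours against all available history might look like the following sketch (the @Yesterday variable is purely illustrative, and I'm assuming you created the procedure in a database on the same server as msdb):

```
--Hypothetical call: flag outliers from roughly the last day,
--using the entire job history as the baseline
declare @Yesterday datetime = dateadd(day,-1,getdate());
exec dbo.usp_SQLAgentJobOutliers
     @AnalysisStartDate = @Yesterday;
```

Remember that the procedure rounds the analysis dates to whole days starting at midnight, so this actually covers yesterday and today in full.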

If you want to get more granular and analyze individual Job Steps as opposed to Jobs as a whole, the following procedure will do that for you.

create procedure usp_SQLAgentJobStepOutliers
  @HistoryStartDate datetime = null
 ,@HistoryEndDate datetime = null
 ,@AnalysisStartDate datetime = null
 ,@AnalysisEndDate datetime = null
 ,@MinHistExecutions int = 10
 ,@MinAvgSecsDuration int = 30
as

if @HistoryStartDate is null set @HistoryStartDate='19000101' ;
if @HistoryEndDate is null set @HistoryEndDate=getdate() ;
if @AnalysisStartDate is null set @AnalysisStartDate='19000101' ;
if @AnalysisEndDate is null set @AnalysisEndDate=getdate() ;

with JobHistData as
(
  select job_id
        ,step_id
        ,date_executed=convert(datetime,convert(varchar(8),run_date))
                       +' '
                       +stuff(stuff(right(1000000+run_time,6),3,0,':'),6,0,':')
        ,secs_duration=run_duration/10000*3600
                       +run_duration%10000/100*60
                       +run_duration%100
  from msdb.dbo.sysjobhistory
  where step_id<>0   --Job Steps only (0 = Job Outcome)
    and run_status=1 --Succeeded
)
,JobHistStats as
(
  select job_id
        ,step_id
        ,AvgDuration=avg(secs_duration*1.)
        ,AvgPlus2StDev=avg(secs_duration*1.)+2*stdevp(secs_duration)
  from JobHistData
  where date_executed>=dateadd(day,datediff(day,'19000101',@HistoryStartDate),'19000101')
    and date_executed<dateadd(day,1+datediff(day,'19000101',@HistoryEndDate),'19000101')
  group by job_id
          ,step_id
  having count(*)>=@MinHistExecutions
     and avg(secs_duration*1.)>=@MinAvgSecsDuration
)
select jd.job_id
      ,JobName=j.name
      ,jd.step_id
      ,s.step_name
      ,ExecutionDate=jd.date_executed
      ,[Duration (secs)]=jd.secs_duration
      ,[Historical Avg Duration (secs)]=AvgDuration
      ,[Min Threshold (secs)]=AvgPlus2StDev
from JobHistData jd
join JobHistStats jhs on jd.job_id=jhs.job_id and jd.step_id=jhs.step_id
join msdb.dbo.sysjobs j on jd.job_id=j.job_id
join msdb.dbo.sysjobsteps s on jd.job_id=s.job_id and jd.step_id=s.step_id
where date_executed>=dateadd(day,datediff(day,'19000101',@AnalysisStartDate),'19000101')
  and date_executed<dateadd(day,1+datediff(day,'19000101',@AnalysisEndDate),'19000101')
  and secs_duration>AvgPlus2StDev
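As with the job-level procedure, here's a hedged sketch of a call… the dates and the variable names are just illustrative, and I'm assuming the procedure lives in dbo on the same server as msdb. This one pins the baseline to a fixed quarter and demands a larger sample before a step's statistics count:

```
--Hypothetical call: baseline = Q1, analyze everything from Apr 1 onward,
--requiring at least 20 historical executions per step
declare @BaselineStart datetime = '20110101'
       ,@BaselineEnd datetime = '20110331'
       ,@AnalysisStart datetime = '20110401';
exec dbo.usp_SQLAgentJobStepOutliers
     @HistoryStartDate = @BaselineStart
    ,@HistoryEndDate = @BaselineEnd
    ,@AnalysisStartDate = @AnalysisStart
    ,@MinHistExecutions = 20;
```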
However, a word of warning: I was disappointed to find that the sysjobhistory table stores only the Step_ID and not the Step_UID unique identifier. So a Job’s Step_ID #1 today may not have been its Step_ID #1 a month ago. If you move steps up/down in the step list of the SQL Agent Job dialog, it changes their Step_ID value from that point forward, but it doesn’t change anything in history. So you may be comparing apples and oranges.
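There's no way to recover the historical mapping, but you can at least eyeball the *current* step_id-to-step_uid mapping for your jobs before trusting the step-level numbers. Something like this sketch (sysjobsteps does carry the step_uid column):

```
--Current step numbering for each job; step_uid is stable,
--but step_id changes if steps are reordered in the job dialog
select j.name, s.step_id, s.step_name, s.step_uid
from msdb.dbo.sysjobs j
join msdb.dbo.sysjobsteps s on j.job_id=s.job_id
order by j.name, s.step_id;
```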

Anyway, I hope you find these procedures useful for finding those outliers. Thanks again to Thomas LaRock for the idea.

My next blog post will also be dealing with SQL Agent Jobs… I think you’ll really like that one… Stay tuned!