Tuesday, April 5, 2011

SysJobHistory Outliers

[Image: Bell Curve with Standard Deviations]

In honor of NCAA March Madness, former basketball star (and current SQL MVP and SQLRockStar) Thomas LaRock (b | t) wrote a series of blog posts about SQL Server system tables entitled, appropriately enough, March Madness. They ran from Mar 17 to Apr 04. Check them out… there is sure to be something in there that you didn’t know about before or that you’ll find to be really cool and useful.

Which brings me to this post. I really enjoyed the concept of Tom’s Apr02 post on the sysjobhistory table in msdb. He had a query that found the SQL Agent Jobs that are outliers… in other words, jobs that are atypical in terms of their run duration… to put it still another way, this query found jobs that ran longer than what would seem to be statistically “normal”.

Tom’s query looked at the previous 24 hours’ worth of job history and found those job steps that ran longer than the average duration plus two standard deviations (i.e. the mean plus two sigmas). In the picture of the bell curve above, that is in the green and gray zone at the right.
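The mean-plus-two-sigma rule is easy to sanity-check outside of T-SQL. Here is a minimal Python sketch of the same calculation (the durations are made-up sample data, purely for illustration); note that T-SQL's STDEVP is the *population* standard deviation, which corresponds to `statistics.pstdev` in Python:

```python
import statistics

# Hypothetical run durations (in seconds) for one job -- illustrative data only
durations = [60, 62, 65, 64, 61, 63, 66, 62, 120]

mean = statistics.mean(durations)
# STDEVP in T-SQL is the population standard deviation, i.e. pstdev here
threshold = mean + 2 * statistics.pstdev(durations)

# Any run longer than mean + 2 sigma is flagged as an outlier
outliers = [d for d in durations if d > threshold]
print(round(threshold, 2), outliers)
```

With that sample data, only the 120-second run lands past the threshold; the typical runs in the low 60s all fall within two standard deviations of the mean.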

I thought this was a great concept, but I wanted to expand on it and introduce some additional features.

First of all, as Tom mentioned, the sysjobhistory table stores dates and times as integers, which is a real pain in the posterior. The dates are stored as YYYYMMDD integers and the times (and durations) are stored as HHMMSS integers. So, for example, if I look at a random entry in my sysjobhistory table…

select top 1 run_date, run_time, run_duration
from msdb.dbo.sysjobhistory
/*
run_date run_time run_duration
-------- -------- ------------
20110302 74856 106
*/
the run_time of 74856 means that the item started at 07:48:56 and the run_duration of 106 means that the item ran for a total of 00:01:06, or 1 minute and 6 seconds. So my first job was to translate these goofy integer representations into something meaningful.

The run_date is easy, since we are beyond the Middle Ages and all years that we deal with are 4 digits, so I could just directly translate a date by converting the integer to a VARCHAR(8) and then converting that to a DATETIME:

select top 1 convert(datetime,convert(varchar(8),run_date))
from msdb.dbo.sysjobhistory
/* 2011-03-02 00:00:00.000 */
The run_time takes a little more work, because the CONVERT function insists on times being in HH:MM:SS format, with 2 digits for each element, meaning leading zeroes had to be there. So first I would introduce the leading zeroes I need by adding 1000000 to the number and then taking the RIGHT-most 6 characters (T-SQL will implicitly convert the integer to a VARCHAR before doing the RIGHT function):

select top 1 right(1000000+run_time,6)
from msdb.dbo.sysjobhistory
/* 074856 */
Then it’s a matter of introducing the colons to separate the elements:

select top 1 stuff(stuff(right(1000000+run_time,6),3,0,':'),6,0,':')
from msdb.dbo.sysjobhistory
/* 07:48:56 */
Now, when I combine that with my VARCHAR representation of the run_date and put them together in YYYYMMDD HH:MM:SS format, I can CONVERT it into a complete date and time of the item:

select top 1 convert(datetime,convert(varchar(8),run_date)
+' '
+stuff(stuff(right(1000000+run_time,6),3,0,':'),6,0,':'))
from msdb.dbo.sysjobhistory
/* 2011-03-02 07:48:56.000 */
The run_duration requires a little more manipulation to translate the HHMMSS integer representation into just a number of seconds (so the value of 106, representing 00:01:06, is translated into 66 seconds):

select top 1 run_duration/10000*3600   --Hours*3600 = Seconds
+run_duration%10000/100*60 --Minutes*60 = Seconds
+run_duration%100 --Seconds
from msdb.dbo.sysjobhistory
/* 66 */
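As a cross-check of that integer div/mod arithmetic outside of SQL Server, the same decoding can be sketched in Python (the function names here are mine, not anything shipped in msdb):

```python
def hhmmss_to_seconds(run_duration: int) -> int:
    """Decode sysjobhistory's HHMMSS integer into a total number of seconds."""
    hours = run_duration // 10000        # run_duration/10000 in the T-SQL
    minutes = run_duration % 10000 // 100
    seconds = run_duration % 100
    return hours * 3600 + minutes * 60 + seconds

def decode_run_time(run_time: int) -> str:
    """Format the HHMMSS run_time integer as HH:MM:SS with leading zeroes."""
    # Same trick as the T-SQL: add 1000000 and keep the right-most 6 digits
    six = str(1000000 + run_time)[-6:]
    return f"{six[0:2]}:{six[2:4]}:{six[4:6]}"

print(hhmmss_to_seconds(106))   # the 00:01:06 duration -> 66 seconds
print(decode_run_time(74856))   # -> 07:48:56
```

Running it against the sample row above reproduces the same answers: a run_duration of 106 comes out as 66 seconds, and a run_time of 74856 comes out as 07:48:56.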
So now that I have those formulas down, I can put together a query that gives me more meaningful information. This query returns information for successfully-completed jobs… entries with a Step_ID of 0 record the date/time/duration of the job as a whole:

select job_id
,date_executed=convert(datetime,convert(varchar(8),run_date))
+' '
+stuff(stuff(right(1000000+run_time
,6)
,3,0,':')
,6,0,':')
,secs_duration=run_duration/10000*3600
+run_duration%10000/100*60
+run_duration%100
from msdb.dbo.sysjobhistory
where step_id=0 --Job Outcome
and run_status=1 --Succeeded
/*
job_id date_executed secs_duration
------------------------------------ ----------------------- -------------
ECA51518-7144-49DE-9153-EB12D42AE0D9 2011-03-02 07:59:17.000 66
ECA51518-7144-49DE-9153-EB12D42AE0D9 2011-03-02 08:23:49.000 74
ECA51518-7144-49DE-9153-EB12D42AE0D9 2011-03-02 13:07:24.000 61
etc, etc, etc
*/
Now that I have that information, I can put that query into a CTE and make good use of it.

In order to find the outliers, I wanted the ability to define a certain “history window” or “baseline window” of a range of dates of job runs from which I could calculate the Average and the Standard Deviation. So the query would accept a @HistoryStartDate and @HistoryEndDate date range (I would convert the dates to full days starting/ending at midnight). Also, for a halfway meaningful sample, I wanted an option to calculate the statistics for jobs that had run a minimum number of times (represented by @MinHistExecutions). And, since we may only want to look at outliers of jobs that are kind of “beefy”, I wanted to introduce the option to only pay attention to jobs that had a minimum average duration of some kind (@MinAvgSecsDuration).

So this query would give us our history/baseline calculations:

declare @HistoryStartDate datetime = '19000101'
,@HistoryEndDate datetime = getdate()
,@MinHistExecutions int = 10
,@MinAvgSecsDuration int = 30
;
with JobHistData as
(
select job_id
,date_executed=convert(datetime,convert(varchar(8),run_date))
+' '
+stuff(stuff(right(1000000+run_time
,6)
,3,0,':')
,6,0,':')
,secs_duration=run_duration/10000*3600
+run_duration%10000/100*60
+run_duration%100
from msdb.dbo.sysjobhistory
where step_id=0 --Job Outcome
and run_status=1 --Succeeded
)
select job_id
,AvgDuration=avg(secs_duration*1.)
,AvgPlus2StDev=avg(secs_duration*1.)+2*stdevp(secs_duration)
from JobHistData
where date_executed>=dateadd(day,datediff(day,'19000101',@HistoryStartDate),'19000101')
and date_executed<dateadd(day,1+datediff(day,'19000101',@HistoryEndDate),'19000101')
group by job_id
having count(*)>=@MinHistExecutions
and avg(secs_duration*1.)>=@MinAvgSecsDuration
/*
job_id AvgDuration AvgPlus2StDev
------------------------------------ ----------- ----------------
ECA51518-7144-49DE-9153-EB12D42AE0D9 69.373333 108.739927181991
0BC1CB96-602F-417A-AB42-CAF98E1E39A1 47.266666 82.860423255389
etc, etc, etc
*/
And now that we have that, we can look for jobs in an “analysis window” date range (@AnalysisStartDate and @AnalysisEndDate)… perhaps the last 24 or 48 hours… to see which ones have a duration that exceeds the AvgPlus2StDev of the baseline.

Here is a stored procedure to do just that.

create procedure usp_SQLAgentJobOutliers
@HistoryStartDate datetime = null
,@HistoryEndDate datetime = null
,@AnalysisStartDate datetime = null
,@AnalysisEndDate datetime = null
,@MinHistExecutions int = 10
,@MinAvgSecsDuration int = 30
as

if @HistoryStartDate is null set @HistoryStartDate='19000101' ;
if @HistoryEndDate is null set @HistoryEndDate=getdate() ;
if @AnalysisStartDate is null set @AnalysisStartDate='19000101' ;
if @AnalysisEndDate is null set @AnalysisEndDate=getdate() ;

with JobHistData as
(
select job_id
,date_executed=convert(datetime,convert(varchar(8),run_date))
+' '
+stuff(stuff(right(1000000+run_time
,6)
,3,0,':')
,6,0,':')
,secs_duration=run_duration/10000*3600
+run_duration%10000/100*60
+run_duration%100
from msdb.dbo.sysjobhistory
where step_id=0 --Job Outcome
and run_status=1 --Succeeded
)
,
JobHistStats as
(
select job_id
,AvgDuration=avg(secs_duration*1.)
,AvgPlus2StDev=avg(secs_duration*1.)+2*stdevp(secs_duration)
from JobHistData
where date_executed>=dateadd(day,datediff(day,'19000101',@HistoryStartDate),'19000101')
and date_executed<dateadd(day,1+datediff(day,'19000101',@HistoryEndDate),'19000101')
group by job_id
having count(*)>=@MinHistExecutions
and avg(secs_duration*1.)>=@MinAvgSecsDuration
)
select jd.job_id
,JobName=j.name
,ExecutionDate=jd.date_executed
,[Duration (secs)]=jd.secs_duration
,[Historical Avg Duration (secs)]=AvgDuration
,[Min Threshold (secs)]=AvgPlus2StDev
from JobHistData jd
join JobHistStats jhs on jd.job_id=jhs.job_id
join msdb.dbo.sysjobs j on jd.job_id=j.job_id
where date_executed>=dateadd(day,datediff(day,'19000101',@AnalysisStartDate),'19000101')
and date_executed<dateadd(day,1+datediff(day,'19000101',@AnalysisEndDate),'19000101')
and secs_duration>AvgPlus2StDev
If you don’t specify any parameters, it will use the ENTIRE sysjobhistory table for both the “history/baseline window” and the “analysis window”. Try calling the procedure with various parameters and see what you come up with. See if any jobs of yours have run with an atypically long duration.

If you want to get more granular and analyze individual Job Steps as opposed to Jobs as a whole, the following procedure will do that for you.

create procedure usp_SQLAgentJobStepOutliers
@HistoryStartDate datetime = null
,@HistoryEndDate datetime = null
,@AnalysisStartDate datetime = null
,@AnalysisEndDate datetime = null
,@MinHistExecutions int = 10
,@MinAvgSecsDuration int = 30
as

if @HistoryStartDate is null set @HistoryStartDate='19000101' ;
if @HistoryEndDate is null set @HistoryEndDate=getdate() ;
if @AnalysisStartDate is null set @AnalysisStartDate='19000101' ;
if @AnalysisEndDate is null set @AnalysisEndDate=getdate() ;

with JobHistData as
(
select job_id
,step_id
,date_executed=convert(datetime,convert(varchar(8),run_date))
+' '
+stuff(stuff(right(1000000+run_time
,6)
,3,0,':')
,6,0,':')
,secs_duration=run_duration/10000*3600
+run_duration%10000/100*60
+run_duration%100
from msdb.dbo.sysjobhistory
where step_id<>0 --Individual Job Steps
and run_status=1 --Succeeded
)
,
JobHistStats as
(
select job_id
,step_id
,AvgDuration=avg(secs_duration*1.)
,AvgPlus2StDev=avg(secs_duration*1.)+2*stdevp(secs_duration)
from JobHistData
where date_executed>=dateadd(day,datediff(day,'19000101',@HistoryStartDate),'19000101')
and date_executed<dateadd(day,1+datediff(day,'19000101',@HistoryEndDate),'19000101')
group by job_id
,step_id
having count(*)>=@MinHistExecutions
and avg(secs_duration*1.)>=@MinAvgSecsDuration
)
select jd.job_id
,JobName=j.name
,jd.step_id
,s.step_name
,ExecutionDate=jd.date_executed
,[Duration (secs)]=jd.secs_duration
,[Historical Avg Duration (secs)]=AvgDuration
,[Min Threshold (secs)]=AvgPlus2StDev
from JobHistData jd
join JobHistStats jhs on jd.job_id=jhs.job_id and jd.step_id=jhs.step_id
join msdb.dbo.sysjobs j on jd.job_id=j.job_id
join msdb.dbo.sysjobsteps s on jd.job_id=s.job_id and jd.step_id=s.step_id
where date_executed>=dateadd(day,datediff(day,'19000101',@AnalysisStartDate),'19000101')
and date_executed<dateadd(day,1+datediff(day,'19000101',@AnalysisEndDate),'19000101')
and secs_duration>AvgPlus2StDev
However, a word of warning: I was disappointed to find that the sysjobhistory table stores only the Step_ID and not the Step_UID unique identifier. So a Job’s Step_ID #1 today may not have been its Step_ID #1 a month ago. If you move steps up/down in the step list of the SQL Agent Job dialog, it changes their Step_ID value from that point forward, but it doesn’t change anything in history. So you may be comparing apples and oranges.

Anyway, I hope you find these procedures useful for finding those outliers. Thanks again to Thomas LaRock for the idea.

My next blog post will also be dealing with SQL Agent Jobs… I think you’ll really like that one… Stay tuned!

10 comments:

  1. Very nice, Brad! This is really useful stuff built on top of @SQLRockStar's original work.

    Kudos,

    -Kev

  2. Great post. As always, I'm impressed by what you put into these posts :-)

    I have to admit, though, I'm too lazy for some of the roll-your-own tools like this. I let SQL Sentry do the math and tell me when a job runs over or under a % of the average run time now. :-)

    Looking forward to the next one. Thanks for the comments on my post last week, sir.

    Keep up the AwesomeSauce

  3. @Kevin and @Mike:

    Thanks so much for the great feedback!

    --Brad

  4. Great stuff, thank you and Tom for putting some of this together so I can stop rolling my own solutions ;) One little tidbit I found while cursing the whole integer values for date and time: MS ships a function that reconstitutes the date and time into a datetime value, msdb.dbo.agent_datetime(@date, @time)

    select job_id
    , date_executed = msdb.dbo.agent_datetime(run_date, run_time), run_duration
    ,secs_duration=run_duration/10000*3600
    +run_duration%10000/100*60
    +run_duration%100
    from msdb.dbo.sysjobhistory
    where step_id=0 --Job Outcome
    and run_status=1 --Succeeded

    Posted a blog about it but all credit goes to Lori & Tim Edwards

  5. @Bill:

    Ahhh... It makes sense that there would be a built-in function for that... Thanks for the info!

    --Brad

  6. Hi Brad, I came across your post while looking into usage of the sysjobhistory table. I made a post on my (newly created) blog and thought you may be interested.
    - Ben

  7. @BringMeData:

    Welcome to the blogging world... I'm looking forward to reading your posts.

    --Brad

  8. Brad, I found a missing join condition between JobHistData and JobHistStats for the JobSteps script:
    FROM JobHistData jd
    JOIN JobHistStats jhs ON jd.job_id=jhs.job_id AND jd.step_id=jhs.step_id

    Thanks for publishing this, I really like the use of the CTE!

  9. Oh, and one more bug:
    IF @AnalysisStartDate IS NULL SET @AnalysisStartDate='19000101' ;
    IF @AnalysisEndDate IS NULL SET @AnalysisEndDate=GETDATE() ;

  10. Thanks for finding the missing JOIN condition, Chuck. I corrected the post.

    As I was changing things, I also happened to notice the @AnalysisStartDate and @AnalysisEndDate initialization problem, and when I went back to enter this comment, I see that you caught it too!

    Thanks for catching those, and thanks for the feedback!
