Friday, April 29, 2011

The Index Tuning Detective

A Shocking Revelation!Part of tuning queries is being a detective in figuring out what indexes need to be created… and figuring out what indexes may need to be tweaked a bit.

So to all you Nancy Drews and Hardy Boys out there: Get your magnifying glass and let’s unravel some mysteries!

Our first adventure will be The Case Of The Missing Indexes.

This has been talked about before in other blogs, because SQL Server already provides some tools to find missing indexes, but hopefully I’ll go just a little bit farther with the concept.

But it’s the second adventure, The Case Of The Key Lookup Killer, that I’m looking forward to sharing with you.

It may help you uncover some shocking revelations about your queries.

Anyway, please read on…



The Case of the Missing Indexes

Consider the following query in AdventureWorks:

select SalesOrderID,OrderDate 
from Sales.SalesOrderHeader
where PurchaseOrderNumber is not null
When we look at the Estimated Execution Plan for this query in SQL2008, we get a helpful hint about a missing index. It says that we can cut down the cost of the query by 94.2762% if we were to add the index that it suggests. And we can even right-click on the plan and choose Missing Index Details and it will provide a code window with the code necessary to create that index:

SQL2008 Showing Missing Index

/*
Missing Index Details from SQLQuery6.sql-BRADPC\SQL08.AdventureWorks (BRADPC\Brad (53))
The Query Processor estimates that implementing the following index
could improve the query cost by 94.2762%.
*/

/*
USE [AdventureWorks]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [Sales].[SalesOrderHeader] ([PurchaseOrderNumber])
INCLUDE ([SalesOrderID],[OrderDate])
GO
*/
Of course, the index is not a perfect suggestion because it suggests we INCLUDE the SalesOrderID column, which is ridiculous because SalesOrderID is the Clustered Index Key, so it would be part of the index automatically anyway.

But never mind that… This is still pretty cool stuff.

And it’s not limited to SQL2008 either. Yes, SQL2008 will provide the helpful hint when you look at the Estimated Plan, but SQL2005 still has the same information behind the scenes. If you look at the Execution Plan XML (by right-clicking on the Plan and choosing Show Execution Plan XML), you will find the Missing Index information buried in there… usually towards the top, but not always… there can potentially be many Missing Index sections in the XML if the plan is for a multiple-statement batch (particularly a batch with IF conditions).

/*
<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML ... xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
...
<MissingIndexes>
<MissingIndexGroup Impact="94.2762">
<MissingIndex Database="[AdventureWorks]" Schema="[Sales]" Table="[SalesOrderHeader]">
<ColumnGroup Usage="INEQUALITY">
<Column Name="[PurchaseOrderNumber]" ColumnId="9" />
</ColumnGroup>
<ColumnGroup Usage="INCLUDE">
<Column Name="[SalesOrderID]" ColumnId="1" />
<Column Name="[OrderDate]" ColumnId="3" />
</ColumnGroup>
</MissingIndex>
</MissingIndexGroup>
</MissingIndexes>
...
</ShowPlanXML>
*/
This is great if you happen to be looking at a specific plan, but what if you want to know about missing indexes across many plans?

There is a collection of DMV’s that report missing indexes for queries that have been executed on the server. The following is a query used by many DBA’s in order to find indexes that they can potentially add in order to speed up their queries:

select index_advantage=user_seeks*avg_total_user_cost*(avg_user_impact*0.01)
,migs.last_user_seek
,TableName=mid.statement
,mid.equality_columns
,mid.inequality_columns
,mid.included_columns
,migs.unique_compiles
,migs.user_seeks
,migs.avg_total_user_cost
,migs.avg_user_impact
from sys.dm_db_missing_index_group_stats migs with (nolock)
join
sys.dm_db_missing_index_groups mig with (nolock)
on migs.group_handle=mig.index_group_handle
join sys.dm_db_missing_index_details mid with (nolock)
on mig.index_handle=mid.index_handle
order by index_advantage desc
/*
index_advantage last_user_seek TableName
---------------- ----------------------- -------------------------------------------
3.23713751033778 2011-04-29 08:58:11.960 [AdventureWorks].[Sales].[SalesOrderHeader]

equality_columns inequality_columns included_columns
---------------- -------------------- ---------------------------
NULL [PurchaseOrderNumber [SalesOrderID], [OrderDate]

unique_compiles user_seeks avg_total_user_cost avg_user_impact
--------------- ---------- ------------------- ---------------
1 6 0.572255959259259 94.28
*/
You can see our familiar Impact figure of 94.28% in the last column, but since the DMV’s record the number of seeks that could have potentially been performed (based on the number of times the query was executed on the server), this query calculates a theoretical “advantage” figure using that information, and orders the data in descending order of that “advantage”.

Again, this is really cool that this information is recorded and we can effortlessly find out suggestions for indexes in our databases.

But these indexes shouldn’t be created blindly… they should be created after some consideration of how they will impact the system. If you know your database backwards and forwards, you probably have a good idea of what kind of queries would benefit from the creation of these indexes. But if you’re a consultant coming in cold, you don’t really know much about the queries in the system.

However, all is not lost!

We can look directly inside the query cache (via the sys.dm_exec_cached_plans DMV), hunting for queries that are running right now (or have run recently), and pick out those that have missing index suggestions in their Execution XML data. This can be used in concert with the Missing Index DMV query above to gain more knowledge about the queries that will benefit.

I put together the following query to do just that. The comments within the code should help you to figure out what is going on. Use the XML snippet shown earlier to follow how the code hunts for what it wants.

Note that I only wanted to pay attention to cached queries that had been used at least 5 times. I also wanted to limit the number of columns making up the key to 5 and the number of INCLUDEd columns to 5. I also only wanted to pay attention to Missing Indexes that have an Impact of at least 50%. You can fiddle with these quantities in the WHERE clause to match your needs.

The query provides the Key Column list and the INCLUDE Column List and the command you can use to actually create the index if you wish. It also provides a hyperlink to the code of the batch or procedure that would benefit from the index so you can see the query involved. There is also a hyperlink to the Execution Plan so you can investigate that as well.

The query is ORDERed BY the Impact figure in descending order, so the ones with the biggest benefit will be at the top.

with xmlnamespaces 
(
default 'http://schemas.microsoft.com/sqlserver/2004/07/showplan'
)
select Impact
,TableName=IxDB+'.'+IxSchema+'.'+IxTable
,KeyCols
,IncludeCols
,IndexCommand
,usecounts
,size_in_bytes
,objtype
,BatchCode
,QueryPlan=qp.query_plan
from sys.dm_exec_cached_plans qs
cross apply
--Get the Query Text
sys.dm_exec_sql_text(qs.plan_handle) qt
cross apply
--Get the Query Plan
sys.dm_exec_query_plan(qs.plan_handle) qp
cross apply
--Get the Code for the Batch in Hyperlink Form
(select BatchCode
=(select [processing-instruction(q)]=':'+nchar(13)+qt.text+nchar(13)
for xml path(''),type)
) F_Code
cross apply
--Find the Missing Indexes Group Nodes in the Plan
qp.query_plan.nodes('//MissingIndexes/MissingIndexGroup') F_GrpNodes(GrpNode)
cross
apply
--Pull out the Impact Figure
(select Impact=GrpNode.value('(./@Impact)','float')) F_Impact
cross apply
--Get the Missing Index Nodes from the Group
GrpNode.nodes('(./MissingIndex)') F_IxNodes(IxNode)
cross
apply
--Pull out the Database,Schema,Table of the Missing Index
(select IxDB=IxNode.value('(./@Database)','sysname')
,IxSchema=IxNode.value('(./@Schema)','sysname')
,IxTable=IxNode.value('(./@Table)','sysname')
) F_IxInfo
cross apply
--How many INCLUDE columns are there;
--And how many EQUALITY/INEQUALITY columns are there?
(select NumIncludes
=IxNode.value('count(./ColumnGroup[@Usage="INCLUDE"]/Column)','int')
,NumKeys
=IxNode.value('count(./ColumnGroup[@Usage!="INCLUDE"]/Column)','int')
) F_NumIncl
cross apply
--Pull out the Key Columns and the Include Columns from the various Column Groups
(select EqCols=max(case when Usage='EQUALITY' then ColList end)
,InEqCols=max(case when Usage='INEQUALITY' then ColList end)
,IncludeCols=max(case when Usage='INCLUDE' then ColList end)
from IxNode.nodes('(./ColumnGroup)') F_ColGrp(ColGrpNode)
cross apply
--Pull out the Usage of the Group? (EQUALITY of INEQUALITY or INCLUDE)
(select Usage=ColGrpNode.value('(./@Usage)','varchar(20)')) F_Usage
cross apply
--Get a comma-delimited list of the Column Names in the Group
(select ColList=stuff((select ','+ColNode.value('(./@Name)','sysname')
from ColGrpNode.nodes('(./Column)') F_ColNodes(ColNode)
for xml path(''))
,1,1,'')
) F_ColList
) F_ColGrps
cross apply
--Put together the Equality and InEquality Columns
(select KeyCols=isnull(EqCols,'')
+case
when EqCols is not null and InEqCols is not null
then ','
else ''
end
+isnull(InEqCols,'')
) F_KeyCols
cross apply
--Construct a CREATE INDEX command
(select IndexCommand='create index <InsertNameHere> on '
+IxDB+'.'+IxSchema+'.'+IxTable+' ('
+KeyCols+')'
+isnull(' include ('+IncludeCols+')','')) F_Cmd
where qs.cacheobjtype='Compiled Plan'
and usecounts>=5 --Only interested in those plans used at least 5 times
and NumKeys<=5 --Limit to the #columns we're willing to have in the index
and NumIncludes<=5 --Limit to the #columns we're willing to have in the INCLUDE list
and Impact>=50 --Only indexes that will have a 50% impact
order by Impact desc
Here is the output (without the hyperlinks) for the AdventureWorks query we were discussing:

/*
Impact TableName
------- -------------------------------------------
94.2762 [AdventureWorks].[Sales].[SalesOrderHeader]

KeyCols IncludeCols
--------------------- --------------------------
[PurchaseOrderNumber] [SalesOrderID],[OrderDate]

IndexCommand
----------------------------------------------------------------------------
create index <InsertNameHere> on [AdventureWorks].[Sales].[SalesOrderHeader]
([PurchaseOrderNumber]) include ([SalesOrderID],[OrderDate])

usecounts size_in_bytes objtype
--------- ------------- -------
9 40960 Adhoc
*/



The Case of the Key Lookup Killer

Now on to the part of this article I’m really excited about, because it will help you possibly tweak existing indexes to help improve your queries. This is something that you cannot get from any DMV’s.

Consider the following query:

select h.SalesOrderID
,h.CustomerID
,h.OrderDate
,d.LineTotal
from Sales.SalesOrderHeader h
join Sales.SalesOrderDetail d on h.SalesOrderID=d.SalesOrderID
where h.CustomerID in (117,119,126,196,236,435)
and d.ProductID=942
This is what its query plan looks like (click on the image to see a larger view):

Query with Key Lookups

Note that it is able to use the SalesOrderDetail’s index on ProductID to easily find the rows for ProductID 942. However, for each of those rows, it has to do a Key Lookup into the Clustered Index in order to get the columns for the LineTotal, which is a computed column based on the columns OrderQty, UnitPrice, and UnitPriceDiscount. The LineTotal is calculated by the Compute Scalar operator.

Similarly, the query makes use of SalesOrderHeader’s index on CustomerID to find the rows for the desired CustomerID’s. But again, it has to do Key Lookups to get the OrderDate.

What if we could eliminate those Key Lookups? If we were to INCLUDE the OrderQty and UnitPrice and UnitPriceDiscount columns in the ProductID index and INCLUDE the OrderDate column in the CustomerID index, then the query would be covered completely by those indexes. The query plan would then look like this:

Query Covered by Indexes

The cost of this query is 0.0128915, which is a 92% improvement over the Key Lookup query, which had a cost of 0.163265.

I put together a query to look in the cache for queries containing Key Lookups and then pull out the columns that we could potentially INCLUDE in an index in order to improve performance.

Here’s the section of the Execution Plan XML that involves the Sales.SalesOrderDetail Index Seek and its Key Lookup… I abbreviated it so that only the relevant portions are displayed:

/*
<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML ... xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
...
<RelOp ... PhysicalOp="Nested Loops" ...>
...
<NestedLoops ...>
...
<RelOp ... LogicalOp="Index Seek" ...>
...
<IndexScan ...>
...
<Object Database="[AdventureWorks]"
Schema="[Sales]"
Table="[SalesOrderDetail]"
Index="[IX_SalesOrderDetail_ProductID]"
Alias="[d]" ... />
</IndexScan>
</RelOp>
<RelOp ... LogicalOp="Compute Scalar" ...>
<ComputeScalar>
...
<RelOp ... LogicalOp="Clustered Index Seek" ...>
<OutputList>
<ColumnReference ... Column="OrderQty" />
<ColumnReference ... Column="UnitPrice" />
<ColumnReference ... Column="UnitPriceDiscount" />
</OutputList>
<IndexScan Lookup="true" ...>
...
<Object Database="[AdventureWorks]"
Schema="[Sales]"
Table="[SalesOrderDetail]"
Index="[PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID]"
Alias="[d]" ... />
</IndexScan>
</RelOp>
</ComputeScalar>
</RelOp>
</NestedLoops>
</RelOp>
...
</ShowPlanXML>
*/
You can see the Nested Loops RelOp Node has a node down its hierarchy for the Index Seek into the IX_SalesOrderDetail_ProductID index, and it also has a node down its hierarchy for the Clustered Index Seek.

Here’s what my query does, step by step…

It finds Clustered Index Seek RelOp Nodes with an Index Scan Node with the Lookup attribute equal to True. Those are the Key Lookups. For each one found, it saves the Database, Schema, Table, and (possible) Alias Name. It also gets the list of columns in the OutputList node and also counts how many columns there are.

Then it goes up the hierarchy, looking for a Nested Loops RelOp Node. It may be the immediate RelOp Node, or it could be two RelOp Nodes up the hierarchy. In this case, because of the Compute Scalar, we have to go up two RelOp Nodes.

Once it finds the appropriate Nested Loops RelOp Node, it goes down its hierarchy looking for any Index Seek or Index Scan RelOp Nodes.

For each of those found, it saves the Database, Schema, Table, and (possible) Alias Name. And it gets the name of the Index that was involved in the Seek or Scan.

So now we have Lookup Data, and we have to see if any of the Index Data we found (there could be more than one) match in terms of Database and Schema and Table and (possible) Alias. Once the match is found, we can output it.

It’s a bit complicated, but with a little study in reading the above steps and in reading the comments in the code, you can hopefully figure out what’s going on.

Like the Missing Index query, I only look for cached plans that have been used at least 5 times, and I’m only interested in an INCLUDE list of no more than 5 columns. The output includes the usual hyperlinks to the code and to the plan.

I order the query’s output by TableName and IndexName so you can see similar suggestions clustered together. Note also that the columns in the suggested INCLUDE list are in alphabetical order.

Here is what my query suggested for our AdventureWorks Key Lookup query:

/*
TableName IndexName
------------------------------------------- --------------------------------
[AdventureWorks].[Sales].[SalesOrderDetail] [IX_SalesOrderDetail_ProductID]
[AdventureWorks].[Sales].[SalesOrderHeader] [IX_SalesOrderHeader_CustomerID]

TableAliasInQuery ColumnsToInclude usecounts size_in_bytes objtype
----------------- ------------------------------------ --------- ------------- -------
[d] OrderQty,UnitPrice,UnitPriceDiscount 6 196608 Adhoc
[h] OrderDate 6 196608 Adhoc
*/
So it is suggesting that the SalesOrderDetail index called IX_SalesOrderDetail_ProductID should INCLUDE the columns OrderQty, UnitPrice, and UnitPriceDiscount. And it is making a similar suggestion to INCLUDE the OrderDate column in the IX_SalesOrderHeader_CustomerID index.

Again, as with any kind of tuning exercise, you should not make these changes blindly, but consider the possible implications before going forward.

The code for this query for INCLUDE Column Suggestions is below. I hope you found it as useful as I have. It actually helped me to find an error in a client’s stored procedure… In suggesting a column to INCLUDE, I could see by looking at the query that it was actually using an incorrect WHERE predicate, and so we were able to fix it before it caused a bug. I’ve also used its suggestions to improve some queries considerably.

So here’s the query… Please let me know if it’s been useful to you… Enjoy!

with xmlnamespaces 
(
default 'http://schemas.microsoft.com/sqlserver/2004/07/showplan'
)
select TableName=IxDB+'.'+IxSchema+'.'+IxTable
,IndexName=IxIndex
,TableAliasInQuery=isnull(IxAlias,IxTable)
,ColumnsToInclude=ColList
,usecounts
,size_in_bytes
,objtype
,BatchCode
,QueryPlan=qp.query_plan
from sys.dm_exec_cached_plans qs
cross apply
--Get the Query Text
sys.dm_exec_sql_text(qs.plan_handle) qt
cross apply
--Get the Query Plan
sys.dm_exec_query_plan(qs.plan_handle) qp
cross apply
--Get the Code for the Batch in Hyperlink Form
(select BatchCode
=(select [processing-instruction(q)]=':'+nchar(13)+qt.text+nchar(13)
for xml path(''),type)
) F_Code
cross apply
--Find the Key Lookups in the Plan
qp.query_plan.nodes
(
'//RelOp[@LogicalOp="Clustered Index Seek"]/IndexScan[@Lookup=1]'
) F_Lookup(LookupNode)
cross
apply
--Get the Database,Schema,Table of the Lookup
--Also get the Alias (if it exists) in case the table
-- is used more than once in the query
(select LookupDB=LookupNode.value('(./Object[1]/@Database)','sysname')
,LookupSchema=LookupNode.value('(./Object[1]/@Schema)','sysname')
,LookupTable=LookupNode.value('(./Object[1]/@Table)','sysname')
,LookupAlias=isnull(LookupNode.value('(./Object[1]/@Alias)','sysname'),'')
,ColumnCount=LookupNode.value('count(../OutputList[1]/ColumnReference)','int')
) F_LookupInfo
cross apply
--Get the Output Columns
(select stuff(
(select ','+ColName
from LookupNode.nodes('(../OutputList[1]/ColumnReference)') F_Col(ColNode)
cross apply
(select ColName=ColNode.value('(./@Column)','sysname')) F_ColInfo
order by ColName
for xml path(''),type).value('(./text())[1]','varchar(max)')
,1,1,'')
)
F_ColList(ColList)
outer
apply
--Get the Parent RelOp Node, hoping that it is a Nested Loops operator.
--Use OUTER APPLY because we may not find it
LookupNode.nodes
(
'(./../../..[@PhysicalOp="Nested Loops"])'
) F_ParentLoop(ParentLoopNode)
outer
apply
--Get the GrandParent RelOp Node, hoping that it is a Nested Loops operator.
--Use OUTER APPLY because we may not find it
LookupNode.nodes
(
'(./../../../../..[@PhysicalOp="Nested Loops"])'
) F_GrandParentLoop(GrandParentLoopNode)
cross
apply
--Get the Nested Loop Node... Could be the Parent or the GrandParent
(select LoopNode=isnull(ParentLoopNode.query('.')
,GrandParentLoopNode.query('.'))
) F_LoopNode
cross apply
--Now that we (hopefully) have a Nested Loops Node, let's find a descendant
--of that node that is an Index Seek or Index Scan and acquire its Object Information
LoopNode.nodes
(
'//RelOp[@LogicalOp="Index Scan" or @LogicalOp="Index Seek"]
/IndexScan[1]/Object[1]'
) F_SeekNode(SeekObjNode)
cross
apply
--Get the Database,Schema,Table and Index of the Index Seek/Scan
--Also get the Alias (if it exists) so we can match it up with
-- the Lookup Table
(select IxDB=SeekObjNode.value('(./@Database)','sysname')
,IxSchema=SeekObjNode.value('(./@Schema)','sysname')
,IxTable=SeekObjNode.value('(./@Table)','sysname')
,IxAlias=isnull(SeekObjNode.value('(./@Alias)','sysname'),'')
,IxIndex=SeekObjNode.value('(./@Index)','sysname')
) F_SeekInfo
where qs.cacheobjtype='Compiled Plan'
and usecounts>=5 --Only interested in those plans used at least 5 times
and LookupDB=IxDB --( Lookup and IndexSeek/Scan )
and LookupSchema=IxSchema --( Database,Schema,Table, )
and LookupTable=IxTable --( and [possible] Alias )
and LookupAlias=IxAlias --( must match )
and ColumnCount<=5 --Limit to the #columns we're willing to INCLUDE
order by TableName
,IndexName
,ColumnsToInclude

13 comments:

  1. Very nice! Immediately useful.
    I found some stuff already (even though I thought I knew this db like the back of my hand).

    ReplyDelete
  2. Sweet. The key lookup problem is an issue that has been near and dear to my heart for some time. I greatly appreciate the article; thank you.

    :)

    ReplyDelete
  3. @Michael and @Kent:

    Thanks for the feedback! (And great to hear from you, Kent!). The Key Lookup query has been very enlightening for me in what it finds.

    --Brad

    ReplyDelete
  4. Brad,

    Nice work with the 'missing include columns' query.
    Thanks for sharing, it works a treat!

    Indexes that appear to be used can easily be used as springboards for key lookups.
    Your query nails them down.

    Keep up the good work!

    ReplyDelete
  5. The two XML queries work on my staging server but fail on my QA and Production servers with the following error:

    Msg 6841, Level 16, State 1, Line 5
    FOR XML could not serialize the data for node 'processing-instruction(q)' because it contains a character (0x0000) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive.

    I glanced at BOL but didn't see a mention of the BINARY BASE64 directive for the CONVERT() function. Is the error perhaps referring to a XQuery conversion function?

    ReplyDelete
  6. @Greg:

    It looks like your T-SQL Code has a CHAR(0) in it somewhere, and FOR XML does not like that.

    Remove the processing-instruction stuff from the CROSS APPLY and instead make it say:

    select BatchCode=qt.text

    At least this way you can get the text of the batch, but it won't be in a hyperlink and it won't be nicely formatted.

    You might also want to add a column to the main query that checks for the CHAR(0) within the text:

    PositionOfChar0=CHARINDEX(CHAR(0),BatchCode)

    This may help you find the offending code.

    Let me know how it goes.

    --Brad

    ReplyDelete
  7. Brad,

    I'd love to test out your query which is looking for index suggestions using the xml namespace. SSMS is telling me though that the following namespaces cannot be found 'qp.query_plan.nodes' along with 'GrpNode.nodes' (Invalid object name) and 'IxNode.value' (possibly ambiguous). Could you help me out there?

    Thanks lots.

    P.S. I'm still sort of new to this, but my environment is huge and our developers still develop with SQL2000 in their heads on SQL2008 machines. I'm trying to hunt for suggestions (facts) to them to change their techniques and upgrade their query developing style by adding more/better indexes.

    ReplyDelete
  8. Its opportunity are so fantastic and working style so speedy

    ReplyDelete
  9. I am looking for some good blog sites for studying.

    ReplyDelete
  10. Very great post. I just came across your weblog

    ReplyDelete
  11. I love reading through and I believe this website got some genuinely utilitarian stuff on it!

    ReplyDelete
  12. I’m really enjoying the design and layout of your site.

    ReplyDelete