Monday, October 26, 2009

Query Optimizer: Genius or Bonehead?

Sometimes I marvel at the ingenuity of the Query Optimizer in generating query plans. In past blog entries (here and here), I’ve thrown some things at it that it handled quite cleverly.

On the other hand, sometimes it does some pretty boneheaded things that don’t seem to make much sense.

Let’s take a look at some examples… from one side of the spectrum to the other.

(Important Note: All the boneheaded behavior I talk about in this article is found in SQL2005. SQL2008, on the other hand, does not exhibit any of the boneheaded behavior.)

Take the following query, which we’ll call Query #1:

/* Query #1 */
select distinct CustomerID,ProductID
from AdventureWorks.Sales.SalesOrderDetail sod
join AdventureWorks.Sales.SalesOrderHeader soh on sod.SalesOrderID=soh.SalesOrderID
where 1=2
and power(2,sqrt(0))<>1
and char(-1) is not null
and sin(pi()/2)<>1
and soh.SalesOrderID is null
What goes through the Optimizer’s head when it constructs this query?

SQL ServerHmmm… I can see right away that the first 4 predicates of the WHERE clause will evaluate to False. And since I know that SalesOrderID is a non-nullable column, it’s impossible for it to be NULL, so the 5th predicate is also False. Since the WHERE clause can never be True, this query will produce no rows. So, heck, why waste my time doing any SCANning or SEEKing of any tables? I don’t have to do anything! I can just sit back and relax.

This is the query plan that it puts together for the above query. It just plops out an empty set:

Query #1 Plan

Excellent.

The following query (Query #2) is exactly like Query #1, except all the predicates are negated so that they will all evaluate to True:

/* Query #2 */
select distinct CustomerID,ProductID
from AdventureWorks.Sales.SalesOrderDetail sod
join AdventureWorks.Sales.SalesOrderHeader soh on sod.SalesOrderID=soh.SalesOrderID
where 1<>2
and power(2,sqrt(0))=1
and char(-1) is null
and sin(pi()/2)=1
and soh.SalesOrderID is not null
This is the plan that is generated:

Query #2 Plan

The Optimizer’s thinking is:

SQL ServerHmmm… I can tell ahead of time that the WHERE clause is going to evaluate to True, so I can just throw it out the window… it’s unnecessary. So, it looks like I’ll have to join ALL rows from both tables. Okay, as far as the SalesOrderHeader table is concerned, all I need is the CustomerID and SalesOrderID, and I can just get both of those from the CustomerID nonclustered index and it will be faster to SCAN that than it would be to SCAN the entire clustered index, so I’ll attack that table that way. Similarly, I’ll use the ProductID nonclustered index from the SalesOrderDetail table to pick up the ProductID and SalesOrderID from that table. Then I’ll use a Hash Match to JOIN them together, and finally do a SELECT DISTINCT CustomerID, ProductID.

So it knows the WHERE clause is not needed, and it decides to SCAN the smaller-sized nonclustered indexes to get the data it needs. Brilliant! The Optimizer should be applauded.

What do our judges think of the plans for Query #1 and Query #2?

Two Einsteins

They awarded the plans Two Einsteins! Very impressive.

Now take a look at this query (Query #3). It’s exactly the same as Query #2, only with a different WHERE clause. Again, though, all the predicates evaluate to True… at least you and I can easily see that:

/* Query #3 */
select distinct CustomerID,ProductID
from AdventureWorks.Sales.SalesOrderDetail sod
join AdventureWorks.Sales.SalesOrderHeader soh on sod.SalesOrderID=soh.SalesOrderID
where SalesOrderDetailID=SalesOrderDetailID
and ProductID=ProductID
and SpecialOfferID=SpecialOfferID
and OrderQty=OrderQty
and UnitPrice=UnitPrice
But the Optimizer doesn’t see it that way. Here’s the query plan for that query:

Query #3 Plan

Note the arrows between the operators are a lot thinner than in the previous query, and note that it is doing a SCAN of the SalesOrderDetail clustered index and a SEEK into the SalesOrderHeader table. Here is apparently what’s going on inside the Optimizer’s head as it constructs this plan:

SQL ServerHmmm… All of the predicates in the WHERE clause refer to columns in the SalesOrderDetail table, which has 121,317 rows. If any of the predicates were referring to a constant value (like WHERE ProductID=777 for example), then I could use statistics to calculate the selectivity of each predicate. But all 5 predicates are comparing columns to some unknown value, so I will assume a selectivity of 10% for each one. Therefore my estimate is that I will choose 121,317 * 10% * 10% * 10% * 10% * 10% = 1.21317 rows from SalesOrderDetail. Since that’s such a small number of rows, I’ll SCAN the clustered index of the SalesOrderDetail table (since I have to get to the columns in the WHERE clause to test each one of them), and then, for each row found, I’ll do a Seek into the Clustered Index of the SalesOrderHeader table to get the CustomerID.

You can see the estimate of 1.21317 rows here in the properties of the Clustered Index Scan:

Estimated Rows in Query #3

This doesn’t make much sense. Why does the Optimizer think that it’s comparing columns to some unknown value? The only time something isn’t equal to itself is when it has a value of NULL, but all 5 columns referred to in the WHERE clause are non-nullable columns. It’s surprising that the Optimizer can’t see that each predicate is True.

And the plan that it did come up with performs terribly. Look at a comparison in performance between Query #2 and Query #3:

/*           CPU   #Reads
-------------------------
Query #2: 210ms 299
Query #3: 695ms 365,458
*/
Yikes! Look at the number of Reads! Think about it… in executing Query #3, for each one of the 121,317 rows in SalesOrderDetail, it does a SEEK into SalesOrderHeader. That’s really inefficient.

I don’t think our panel of judges will be thrilled with this, do you? Let’s see what they thought of the Query #3 plan:

Four Curlys

Four Curlys! That’s gotta hurt. But you have to admit... It was a boneheaded plan.

(Important Note: The above phenomenon occurs in SQL2005. It does not behave the same in SQL2008. In SQL2008, Query #2 and Query #3 have identical query plans.)

Here’s another example in inconsistent approaches by the Optimizer. Look at the following query (Query #4), which uses window functions in OVER clauses in a CTE:

/* Query #4 */
with cte as
(
select SalesOrderID,ProductID
,RNum=row_number() over (partition by ProductID order by SalesOrderID)
,Rnk=rank() over (partition by SpecialOfferID order by CarrierTrackingNumber)
,DRnk=dense_rank() over (partition by UnitPrice order by RowGUID)
from AdventureWorks.Sales.SalesOrderDetail
)
select SalesOrderID,ProductID
from cte
What does the Optimizer think when it constructs the plan for that query?

SQL ServerHmmm… Looks like there are a few window functions in there, creating columns called Rnum, Rnk, and DRnk. But look! Those columns aren’t referred to in the SELECT list of the main query at all! So heck, I’m not going to waste my time executing those window functions if they aren’t needed. So I’ll put together a plan that simply spits out SalesOrderID and ProductID from the SalesOrderDetail table. And I’ll minimize my time in acquiring those by doing a SCAN of the ProductID nonclustered index instead of SCANning the entire table’s data in the clustered index.

And that’s what it does, as we can see from the following plan:

Query #4 Plan

That’s a very sensible plan. What do the judges say?

Three Doc Browns

Three Emmett Browns, huh?. Sure, he’s a bit quirky, and he’s not as impressive as Einstein or Plato or Galileo or Leonardo or Mozart, but hey, he invented time travel… that’s pretty cool.

Now look at the following query (Query #5), which is exactly like Query #4, except that it uses aggregates in OVER clauses in the CTE rather than window functions:

/* Query #5 */
with cte as
(
select SalesOrderID,ProductID
,Cnt=count(*) over (partition by ProductID)
,TotQty=sum(OrderQty) over (partition by SpecialOfferID)
,AvgLine=avg(LineTotal) over (partition by UnitPrice)
from AdventureWorks.Sales.SalesOrderDetail
)
select SalesOrderID,ProductID
from cte
Wouldn’t you think that the Optimizer would come up with the exact same plan as Query #4? But nooooo… here’s what it’s thinking:

SQL ServerOh boy! Aggregates! I love aggregates! Any chance I can, I’ll calculate aggregates! I can’t get enough! And this query has 3 of them! Hot dog! I get to do a COUNT(*) and I get to do a SUM and an AVG too! And they each use different PARTITIONs, so I get to do three different SORTs. This is great! Wow! I feel tingly all over! Whoopee! Oh Happy Day!

That kind of sounds like my dog when I pull out the bag of dog food every single night. Sure enough, here’s the plan:

Query #5 Plan

Look at all those operators! All those aggregations and sorts and everything, and it’s all a waste of time because none of the calculations are in the final result set.

Look at the comparison in performance between Query #4 and Query #5:

/*           CPU   #Reads
-------------------------
Query #4: 39ms 236
Query #5: 3059ms 811,834
*/
Incredible. Query #5 had over 3400 times the number of reads as Query #4 and almost 100 times the CPU time.

Judges? Your verdict?

Nine SpongeBobs

Ouch!! Nine SpongeBobs! The judges definitely thought that was an atrocious, putrid, abysmal, boneheaded plan.

(Important Note: The above phenomenon occurs in SQL2005. It does not behave the same in SQL2008. In SQL2008, Query #4 and Query #5 have identical query plans.)

Of course, none of the above queries are necessarily ones you’d actually put together in the real world (though I did stumble upon the behavior of Query #4 and #5 while answering a question in the MSDN T-SQL Forum). And I don’t mean to put down the Optimizer. I guess “bonehead” is kind of a strong word. Perhaps the Optimizer is sometimes just a little bit naïve.

I guess the moral of the story, as always, is to look carefully at your query plans. In the words of another person bordering on genius and naïveté: “Query Plans are like a box of chocolates… You never know what you’re gonna get.”

Friday, October 16, 2009

Un-Making a List (or Shredding of Evidence)

In my previous blog entry, I described in detail how you could use the FOR XML clause to create a column with a comma-separated list of entries acquired from rows in a table somewhere.

In this article, we will do the opposite… We will take a comma-separated list and “shred” the individual items into rows.

Let’s start by creating and populating a test table:

create table #CSVLists 
(
ID int
,Description varchar(30)
,CSVList varchar(200)
)
insert #CSVLists select 1,'Stooges' ,'Moe,Larry,Curly'
union all select 2,'Castaways','Gilligan,Skipper,Thurston,Lovey,'
+'Ginger,Professor,MaryAnne'
union all select 3,'Monkees' ,'Davy,Micky,Peter,Mike'
(Okay, I know you Gilligan’s Island fanatics are chomping at the bit to tell me that the Skipper’s name was Jonas and the Professor’s name was Roy. I know that… Don’t bother e-mailing me about it… save a virtual tree).

There is a handy .nodes() method which you can apply to an XML datatype that will shred it into relational data. So our first step is to convert our CSVList column into a valid XML representation of the data. We do that this way:

select ID
,Description
,XmlList
from #CSVLists
cross apply (select TagsList='<x>'+replace(CSVList,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
/*
ID Description XmlList
-- ----------- -----------------------------------------------------------
1 Stooges <x>Moe</x><x>Larry</x><x>Curly</x>
2 Castaways <x>Gilligan</x><x>Skipper</x><x>Thurston</x><x>Lovey</x>...
3 Monkees <x>Davy</x><x>Micky</x><x>Peter</x><x>Mike</x>
*/
The first CROSS APPLY surrounded each item in the CSVList with the tags <x></x>. And the second CROSS APPLY converted that into an actual XML datatype.

Now that we have an XML column, we can apply the .nodes() method to it. We will pass 'x' to the .nodes() method so that it knows the tag (<x></x>) to shred. And let’s take a look at that result:

select ID
,Description
,XMLNode
from #CSVLists
cross apply (select TagsList='<x>'+replace(CSVList,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
/*
Msg 493, Level 16, State 1, Line 1
The column 'XmlNode' that was returned from the nodes() method cannot be used
directly. It can only be used with one of the four xml data type methods,
exist(), nodes(), query(), and value(), or in IS NULL and IS NOT NULL checks.
*/
Oops… This new XmlNode object we created is not something that we can actually look at or use directly. So, as the error message indicated, we will use the .value() method against that XmlNode object to pull out an actual VARCHAR(20) value:

select ID
,Description
,ListItem
from #CSVLists
cross apply (select TagsList='<x>'+replace(CSVList,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
cross
apply (select ListItem=XmlNode.value('.','varchar(20)')) F4
/*
ID Description ListItem
-- ----------- ---------
1 Stooges Moe
1 Stooges Larry
1 Stooges Curly
2 Castaways Gilligan
2 Castaways Skipper
2 Castaways Thurston
2 Castaways Lovey
2 Castaways Ginger
2 Castaways Professor
2 Castaways MaryAnne
3 Monkees Davy
3 Monkees Micky
3 Monkees Peter
3 Monkees Mike
*/
Wow, it works!

But, as is often the case, you know there’s more. Just like I did in my previous blog entry, I’m going to do something to purposely trip things up. Let’s add a list of TV Networks:

insert #CSVLists
select 4,'Networks','ABC,NBC,CBS,FX,TNT,A&E'
And let’s try the same query:

select ID
,Description
,ListItem
from #CSVLists
cross apply (select TagsList='<x>'+replace(CSVList,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
cross
apply (select ListItem=XmlNode.value('.','varchar(20)')) F4
/*
ID Description ListItem
-- ----------- ---------
1 Stooges Moe
1 Stooges Larry
1 Stooges Curly
2 Castaways Gilligan
2 Castaways Skipper
2 Castaways Thurston
2 Castaways Lovey
2 Castaways Ginger
2 Castaways Professor
2 Castaways MaryAnne
3 Monkees Davy
3 Monkees Micky
3 Monkees Peter
3 Monkees Mike
Msg 9411, Level 16, State 1, Line 1
XML parsing: line 1, character 56, semicolon expected
*/
Ouch! Once again, we’ve run into the problem with the way that XML encodes and decodes special characters like ampersand (&).

We cannot CAST something as XML if it has an ampersand as part of the data, because XML will try to interpret it as an entity that must end in a semicolon:

select Info=cast('<x>A&E</x>' as xml)
/*
Msg 9411, Level 16, State 1, Line 1
XML parsing: line 1, character 7, semicolon expected
*/
In order for it to be accepted, we have to encode that ampersand as follows, so that XML will understand it to represent an actual ampersand character:

select Info=cast('<x>A&amp;E</x>' as xml)
/*
Info
--------------
<x>A&amp;E</x>
*/
And, as we learned in my previous blog entry, we can then take the .value() of that to un-encode it:

select Info=cast('<x>A&amp;E</x>' as xml).value('.','varchar(10)')
/*
Info
----
A&E
*/
So in our shredding query, we have to FIRST encode any special characters in our CSVList column, and THEN we can successfully CAST it as an XML column.

One might be tempted to use multiple REPLACE() functions to convert all ampersands to &amp; and less-thans to &lt; and greater-thans to &gt;, but who knows what others characters are out there that will create problems? We already have a tool at our disposal that will encode the characters for us… the FOR XML clause.

select ID
,Description
,XMLEncoded
from #CSVLists
cross apply (select XMLEncoded=(select [*]=CSVList for xml path(''))) F0
/*
ID Description XMLEncoded
-- ----------- ---------------------------------------------------------
1 Stooges Moe,Larry,Curly
2 Castaways Gilligan,Skipper,Thurston,Lovey,Ginger,Professor,MaryAnne
3 Monkees Davy,Micky,Peter,Mike
4 Networks ABC,NBC,CBS,FX,TNT,A&amp;E
*/
We learned in my previous blog entry that we can eliminate tags being created by explicitly specifying a column a name of [*]. If we hadn’t done that, then we would end up with our lists surrounded by <CSVList></CSVList> tags. And we eliminated yet another tag surrounding our data by passing the empty string to FOR XML PATH. So our lists are really the same as they were originally, except for the special characters being automatically encoded for us.

Now we can incorporate that CROSS APPLY into our query:

select ID
,Description
,ListItem
from #CSVLists
cross apply (select XMLEncoded=(select [*]=CSVList for xml path(''))) F0
cross apply (select TagsList='<x>'+replace(XMLEncoded,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
cross
apply (select ListItem=XmlNode.value('.','varchar(20)')) F4
/*
ID Description ListItem
-- ----------- ---------
1 Stooges Moe
1 Stooges Larry
1 Stooges Curly
2 Castaways Gilligan
2 Castaways Skipper
2 Castaways Thurston
2 Castaways Lovey
2 Castaways Ginger
2 Castaways Professor
2 Castaways MaryAnne
3 Monkees Davy
3 Monkees Micky
3 Monkees Peter
3 Monkees Mike
4 Networks ABC
4 Networks NBC
4 Networks CBS
4 Networks FX
4 Networks TNT
4 Networks A&E
*/
Ahhh… Success!

By the way, you may have noticed that the making and un-making (shredding) of these comma-delimited lists is like PIVOTing and UNPIVOTing. PIVOT converts rows to columns and UNPIVOT converts columns to rows. We have done something similar, except in making a list, we converted rows into a single column, and in shredding a list, we converted a single column into rows.

I’ll leave you with one final demonstration. Our #CSVLists table that we created had some comma-separated lists. What if we want to sort the items in the lists?

We would first shred the items into rows, and then, using the FOR XML logic we put together in my previous blog entry, we will re-assemble the lists and use an ORDER BY to sort them:

;with ShreddedLists as
(
select ID
,ListItem
from #CSVLists
cross apply (select XMLEncoded=(select [*]=CSVList for xml path(''))) F0
cross apply (select TagsList='<x>'+replace(XMLEncoded,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
cross apply (select ListItem=XmlNode.value('.','varchar(20)')) F4
)
select ID
,Description
,SortedCSVList
from #CSVLists
cross apply (select CSVStringXML=(select ','+ListItem
from ShreddedLists
where ID=#CSVLists.ID
order by ListItem
for xml path(''),type)) F1
cross apply (select CSVString=CSVStringXML.value('.','varchar(max)')) F2
cross apply (select SortedCSVList=stuff(CSVString,1,1,'')) F3
/*
ID Description SortedCSVList
-- ----------- ---------------------------------------------------------
1 Stooges Curly,Larry,Moe
2 Castaways Gilligan,Ginger,Lovey,MaryAnne,Professor,Skipper,Thurston
3 Monkees Davy,Micky,Mike,Peter
4 Networks A&E,ABC,CBS,FX,NBC,TNT
*/
I hope you have many hours of happy list-making and shredding activity. Until next time…

UPDATE ON Dec12,2009: Please read an important follow-up to this article in which I discuss some issues with the speed of the XML shredding method.

Wednesday, October 14, 2009

Making a List (and Checking It Twice)

It seems like the most frequently-asked question on the MSDN T-SQL Forum is how to put together a column that contains a list of some kind.

For example, take a look at the Northwind database. There’s a Categories table and a Products table. We’d like to produce a list of all Categories and, for each Category, produce a column containing a comma-delimited (and sorted) list of ProductID’s that are discontinued.

The standard answer that we give to these kinds of questions (providing that the original poster of the question is using SQL2005 or beyond) is something like this:

select CategoryName
,DiscontinuedIDs=
stuff(
(
select ','+convert(varchar,ProductID)
from Products
where CategoryID=Categories.CategoryID
and Discontinued=1
order by ProductID
for xml path('')
)
,1,1,'')
from Categories
order by CategoryName
/*
CategoryName DiscontinuedIDs
--------------- ---------------
Beverages 24
Condiments 5
Confections NULL
Dairy Products NULL
Grains/Cereals 42
Meat/Poultry 9,17,29,53
Produce 28
Seafood NULL
*/
Most times the poster will come back and say, “Er… uh… Thanks.” But occasionally someone will say, “Huh? What’s with all that XML stuff? Could you explain what the heck that complicated mess o’ code is doing please?”

I know when I first started studying T-SQL and started frequenting the forums and saw this kind of answer, I felt kind of intimidated and figured that this stuff was more complex than I had thought.

But it all becomes clear if you go through how this works step by step.

For illustration purposes, let’s create a temporary table #Stooges and populate it:

create table #Stooges (ID int, Name varchar(10))
insert #Stooges select 1,'Moe'
union all select 2,'Larry'
union all select 3,'Curly'
Now if we do a normal query on this table, we get rows and columns:

select * from #Stooges
/*
ID Name
--- ------
1 Moe
2 Larry
3 Curly
*/
But if we add a FOR XML clause to the query, we get an XML representation of the data:

select * from #Stooges for xml path
/*
XML_F52E2B61-18A1-11d1-B105-00805F49916B
----------------------------------------
<row><ID>1</ID><Name>Moe</Name></row>
<row><ID>2</ID><Name>Larry</Name></row>
<row><ID>3</ID><Name>Curly</Name></row>
*/
In reality, it did not return 3 rows of data… I just presented it that way for readability. In fact, it didn’t produce any rows or columns… it actually produced an XML stream. Any query that has the FOR XML clause (without the ,TYPE directive) produces an XML stream. The data is streamed out in (2033-character) chunks as the query is executed, as you can see below, where I query the 120,000-record SalesOrderDetail table in AdventureWorks. The first part of the XML stream is plopped into the SSMS Results window almost immediately, and the query is still grinding away to finish up the rest of the stream.

XML Stream in SSMS

(If you want to read more about how the XML streams are created, read this blog entry from Eugene Kogan of Microsoft).

In case you’re curious about where the mysterious F52E2B61-18A1-11d1-B105-00805F49916B identifier name comes from, see the snapshot of the Registry Editor below.

IXMLDocument

It’s the GUID associated with the IXMLDocument Interface that’s part of the Type Library associated with the MSXML DLL files installed into the Windows System Directory. (If that last sentence sounded like a lot of gibberish to you, I can relate… Don’t worry… It’s not important… As far as I’m concerned, we can pretend that the identifier name came from little magic elves sprinkling pixie dust on pretty unicorns… It makes about as much sense and is easier to understand and relate to).

Anyway, back to the subject at hand…

If you use FOR XML in a scalar subquery it will produce an NVARCHAR(MAX) column, unless you use the ,TYPE directive, which will produce a true XML column. You can see this below, where I do FOR XML subqueries with and without the ,TYPE directive and I put the results INTO a temporary table. If we look at that temporary table’s structure (via INFORMATION_SCHEMA.COLUMNS), we can see the datatypes that got created:

select ForXmlPath=(select * 
from #Stooges
for xml path)
,ForXmlPathType=(select *
from #Stooges
for xml path,type)
into #ThrowMeAway

select Table_Name
,Column_Name
,Data_Type
,Character_Maximum_Length
from tempdb.information_schema.columns
where Table_Name like '#ThrowMeAway%'
/*
Table_Name Column_Name Data_Type Character_Maximum_Length
------------------------------- -------------- --------- ------------------------
#ThrowMeAway__...__000000000018 ForXmlPath nvarchar -1
#ThrowMeAway__...__000000000018 ForXmlPathType xml -1
*/
So, with this knowledge, we can perform any kind of string functions on a FOR XML (non ,TYPE) subquery that we like.

Note that when we do a FOR XML PATH without specifying the actual path, SQL Server assumes a path of ‘row’. (Again, for clarification, the result below is a single row, single column string value… I only show it as 3 lines for readability):

select Info=(select Name 
from #Stooges
for xml path)
/*
Info
-----------------------------
<row><Name>Moe</Name></row>
<row><Name>Larry</Name></row>
<row><Name>Curly</Name></row>
*/
If we specify a path of ‘Stooge’ and change the name of the column to StoogeName, we get the following (again, showing the string column as 3 lines for readability):

select Info=(select StoogeName=Name 
from #Stooges
for xml path('Stooge'))
/*
Info
-----------------------------------------------
<Stooge><StoogeName>Moe</StoogeName></Stooge>
<Stooge><StoogeName>Larry</StoogeName></Stooge>
<Stooge><StoogeName>Curly</StoogeName></Stooge>
*/
Now, note that we can choose to have no name (and therefore no tag in the XML) by either making the column into an expression or by giving the column a deliberate name of [*]:

select Info=(select ''+Name
from #Stooges
for xml path('Stooge'))
/*
Info
----------------------------------------------------------------
<Stooge>Moe</Stooge><Stooge>Larry</Stooge><Stooge>Curly</Stooge>
*/


select Info=(select [*]=Name
from #Stooges
for xml path('Stooge'))
/*
Info
----------------------------------------------------------------
<Stooge>Moe</Stooge><Stooge>Larry</Stooge><Stooge>Curly</Stooge>
*/
We can also choose to specify an empty PATH:

select Info=(select [*]=Name
from #Stooges
for xml path(''))
/*
Info
-------------
MoeLarryCurly
*/
Now you can probably see where this is going. Let’s put commas in front of those names:

select Info=(select ','+Name
from #Stooges
for xml path(''))
/*
Info
----------------
,Moe,Larry,Curly
*/
And now we can use the STUFF function to substitute that first comma with an empty string:

select Info=stuff(
(
select ','+Name
from #Stooges
for xml path('')
)
,1,1,'')
/*
Info
---------------
Moe,Larry,Curly
*/
This is great! But WAIT! We’re not done.

Let’s say some smart-aleck (probably from California) with a somewhat esoteric name of X&Y><Z&T decides to join the gang:

insert #Stooges select 4,'X&Y><Z&T'
What happens now?

select Info=stuff(
(
select ','+Name
from #Stooges
for xml path('')
)
,1,1,'')
/*
Info
--------------------------------------
Moe,Larry,Curly,X&amp;Y&gt;&lt;Z&amp;T
*/
Hmmm… That’s the problem with doing these kinds of tricks by creating XML output… certain characters (like ampersand, less-than, and greater-than) are encoded in a special way, as you can see below:

select Info=(select 'X&Y' for xml path('test'))
/*
Info
--------------------
<test>X&amp;Y</test>
*/
The FOR XML clause encodes those special characters as it constructs the XML representation of the data. In order for us to UN-encode those characters, we have to work backwards… in other words, extract the data back OUT of the XML representation of the data. We do this through the .value() method. But since this method only works on a true XML datatype, we have to introduce the ,TYPE directive into our query to force the subquery to produce an XML datatype:

select Info=(select 'X&Y' for xml path('test'),type).value('.','varchar(10)')
/*
Info
----
X&Y
*/
The .value() method takes two parameters. The first parameter describes the specific “piece” of the XML data that we want to extract. In our case, we just passed it '.', which indicates that we want to extract all the data from the XML as a whole. And, as you noticed, it only pulled out the data without the tags (<test></test>). The second parameter indicates what kind of datatype we want to convert it to as we extract it.

So, incorporating the ,TYPE directive and the .value() method into our #Stooges list, we finally get the expected result:

select Info=stuff(
(
select ','+Name
from #Stooges
for xml path(''),type
).value('.','varchar(max)')
,1,1,'')
/*
Info
------------------------
Moe,Larry,Curly,X&Y><Z&T
*/
If you know for sure that the list you’re creating does NOT have any of these special characters, like the query at the beginning of this article which only constructs lists of numbers, then you don’t need to incorporate the ,TYPE directive and .value() method. It will speed things up a bit if you leave those out.

Now we can put together lists of practically anything. For example, below is a query that pulls out all the orders for the CustomerID of OCEAN from the Northwind database, and shows a list of all the products in those orders (sorted by ProductID), with the quantity and price paid.

select OrderID
,ProductList=
stuff(
(
select '; '+convert(varchar,p.ProductID)
+'-'+p.ProductName
+' ('+convert(varchar,d.Quantity)
+'@$'+convert(varchar,d.UnitPrice)+')'
from [Order Details] d
join Products p on d.ProductId=p.ProductID
where d.OrderID=Orders.OrderID
order by p.ProductID
for xml path(''),type
).value('.','nvarchar(max)')
,1,2,'')
from Orders
where CustomerID='OCEAN'
order by OrderID
/*
OrderID ProductList
------- -------------------------------------------------------------------------
10409 14-Tofu (12@$18.60); 21-Sir Rodney's Scones (12@$8.00)
10531 59-Raclette Courdavault (2@$55.00)
10898 13-Konbu (5@$6.00)
10958 5-Chef Anton's Gumbo Mix (20@$21.35); 7-Uncle Bob's Organic Dried Pear...
10986 11-Queso Cabrales (30@$21.00); 20-Sir Rodney's Marmalade (15@$81.00); ...
*/
You can create lists within lists also, as in the example below, which lists all Customers and, for each Customer, lists all Categories (in order of the CategoryName) of Products that he purchased, and, for each Category, lists all ProductIDs (in numerical order) of that Category that were purchased:

with CustsCatgsProds as
(
select o.CustomerID
,p.CategoryID
,p.ProductID
from Orders o
join [Order Details] d on o.OrderID=d.OrderID
join Products p on d.ProductID=p.ProductID
join Categories c on p.CategoryID=c.CategoryID
group by o.CustomerID
,p.CategoryID
,p.ProductID
)
select CustomerID
,CategorizedProductList=
stuff(
(
select '; '+CategoryName+':'
+stuff(
(
select ','+convert(varchar,ProductID)
from CustsCatgsProds
where CustomerID=Customers.CustomerID
and CategoryID=CCP.CategoryID
order by ProductID
for xml path('')
)
,1,1,'')
from (select distinct CategoryID
from CustsCatgsProds
where CustomerID=Customers.CustomerID) CCP
join Categories on CCP.CategoryID=Categories.CategoryID
order by CategoryName
for xml path(''),type
).value('.','nvarchar(max)')
,1,2,'')
from Customers
order by CustomerID
/*
CustomerID CategorizedProductList
---------- ----------------------------------------------------------------
ALFKI Beverages:39,76; Condiments:3,6,63,77; Dairy Products:59,71; ...
ANATR Beverages:70; Confections:19; Dairy Products:11,32,60,69,72; ...
ANTON Beverages:2,34,43,75; Condiments:66; Confections:26,48; Dairy...
...
WHITC Beverages:2,34,35,38,75; Condiments:4,8,44,61,63,77; Confecti...
WILMK Beverages:1,24,43,76; Condiments:8,61,65; Confections:16,19,2...
WOLZA Beverages:1,2,24,35,75; Condiments:61; Confections:62,68; Dai...
*/
If you’ve read some of my past blog posts, you know I’m a big promoter of the CROSS APPLY operator to improve readability of a query. For example, the very first query in this article could be re-written as follows:

select CategoryName
,DiscontinuedIDs
from Categories
cross apply (select DiscIDString=(select ','+convert(varchar,ProductID)
from Products
where CategoryID=Categories.CategoryID
and Discontinued=1
order by ProductID
for xml path(''))) F1
cross apply (select DiscontinuedIDs=stuff(DiscIDString,1,1,'')) F2
order by CategoryName
/*
CategoryName DiscontinuedIDs
--------------- ---------------
Beverages 24
Condiments 5
Confections NULL
Dairy Products NULL
Grains/Cereals 42
Meat/Poultry 9,17,29,53
Produce 28
Seafood NULL
*/
If you do use the CROSS APPLY approach, be very careful in checking the performance and the plan of the query. Adam Haines wrote a terrific blog post that demonstrated that the CROSS APPLY approach to constructing lists can yield terrible results under certain circumstances because of the way that SQL Server performs the query. You can usually re-formulate the query to eliminate that poor performance.

I hope this article was helpful in explaining how the FOR XML PATH method works in creating lists. In my next blog post, I'll show how we can tear those lists apart (or shred them) back into rows.