Wednesday, October 14, 2009

Making a List (and Checking It Twice)

It seems like the most frequently-asked question on the MSDN T-SQL Forum is how to put together a column that contains a list of some kind.

For example, take a look at the Northwind database. There’s a Categories table and a Products table. We’d like to produce a list of all Categories and, for each Category, produce a column containing a comma-delimited (and sorted) list of ProductID’s that are discontinued.

The standard answer that we give to these kinds of questions (providing that the original poster of the question is using SQL2005 or beyond) is something like this:

select CategoryName
,DiscontinuedIDs=
stuff(
(
select ','+convert(varchar,ProductID)
from Products
where CategoryID=Categories.CategoryID
and Discontinued=1
order by ProductID
for xml path('')
)
,1,1,'')
from Categories
order by CategoryName
/*
CategoryName DiscontinuedIDs
--------------- ---------------
Beverages 24
Condiments 5
Confections NULL
Dairy Products NULL
Grains/Cereals 42
Meat/Poultry 9,17,29,53
Produce 28
Seafood NULL
*/
Most times the poster will come back and say, “Er… uh… Thanks.” But occasionally someone will say, “Huh? What’s with all that XML stuff? Could you explain what the heck that complicated mess o’ code is doing please?”

I know when I first started studying T-SQL and started frequenting the forums and saw this kind of answer, I felt kind of intimidated and figured that this stuff was more complex than I had thought.

But it all becomes clear if you go through how this works step by step.

For illustration purposes, let’s create a temporary table #Stooges and populate it:

create table #Stooges (ID int, Name varchar(10))
insert #Stooges select 1,'Moe'
union all select 2,'Larry'
union all select 3,'Curly'
Now if we do a normal query on this table, we get rows and columns:

select * from #Stooges
/*
ID Name
--- ------
1 Moe
2 Larry
3 Curly
*/
But if we add a FOR XML clause to the query, we get an XML representation of the data:

select * from #Stooges for xml path
/*
XML_F52E2B61-18A1-11d1-B105-00805F49916B
----------------------------------------
<row><ID>1</ID><Name>Moe</Name></row>
<row><ID>2</ID><Name>Larry</Name></row>
<row><ID>3</ID><Name>Curly</Name></row>
*/
In reality, it did not return 3 rows of data… I just presented it that way for readability. In fact, it didn’t produce any rows or columns… it actually produced an XML stream. Any query that has the FOR XML clause (without the ,TYPE directive) produces an XML stream. The data is streamed out in (2033-character) chunks as the query is executed, as you can see below, where I query the 120,000-record SalesOrderDetail table in AdventureWorks. The first part of the XML stream is plopped into the SSMS Results window almost immediately, and the query is still grinding away to finish up the rest of the stream.

XML Stream in SSMS

(If you want to read more about how the XML streams are created, read this blog entry from Eugene Kogan of Microsoft).

In case you’re curious about where the mysterious F52E2B61-18A1-11d1-B105-00805F49916B identifier name comes from, see the snapshot of the Registry Editor below.

IXMLDocument

It’s the GUID associated with the IXMLDocument Interface that’s part of the Type Library associated with the MSXML DLL files installed into the Windows System Directory. (If that last sentence sounded like a lot of gibberish to you, I can relate… Don’t worry… It’s not important… As far as I’m concerned, we can pretend that the identifier name came from little magic elves sprinkling pixie dust on pretty unicorns… It makes about as much sense and is easier to understand and relate to).

Anyway, back to the subject at hand…

If you use FOR XML in a scalar subquery it will produce an NVARCHAR(MAX) column, unless you use the ,TYPE directive, which will produce a true XML column. You can see this below, where I do FOR XML subqueries with and without the ,TYPE directive and I put the results INTO a temporary table. If we look at that temporary table’s structure (via INFORMATION_SCHEMA.COLUMNS), we can see the datatypes that got created:

select ForXmlPath=(select * 
from #Stooges
for xml path)
,ForXmlPathType=(select *
from #Stooges
for xml path,type)
into #ThrowMeAway

select Table_Name
,Column_Name
,Data_Type
,Character_Maximum_Length
from tempdb.information_schema.columns
where Table_Name like '#ThrowMeAway%'
/*
Table_Name Column_Name Data_Type Character_Maximum_Length
------------------------------- -------------- --------- ------------------------
#ThrowMeAway__...__000000000018 ForXmlPath nvarchar -1
#ThrowMeAway__...__000000000018 ForXmlPathType xml -1
*/
So, with this knowledge, we can perform any kind of string functions on a FOR XML (non ,TYPE) subquery that we like.

Note that when we do a FOR XML PATH without specifying the actual path, SQL Server assumes a path of ‘row’. (Again, for clarification, the result below is a single row, single column string value… I only show it as 3 lines for readability):

select Info=(select Name 
from #Stooges
for xml path)
/*
Info
-----------------------------
<row><Name>Moe</Name></row>
<row><Name>Larry</Name></row>
<row><Name>Curly</Name></row>
*/
If we specify a path of ‘Stooge’ and change the name of the column to StoogeName, we get the following (again, showing the string column as 3 lines for readability):

select Info=(select StoogeName=Name 
from #Stooges
for xml path('Stooge'))
/*
Info
-----------------------------------------------
<Stooge><StoogeName>Moe</StoogeName></Stooge>
<Stooge><StoogeName>Larry</StoogeName></Stooge>
<Stooge><StoogeName>Curly</StoogeName></Stooge>
*/
Now, note that we can choose to have no name (and therefore no tag in the XML) by either making the column into an expression or by giving the column a deliberate name of [*]:

select Info=(select ''+Name
from #Stooges
for xml path('Stooge'))
/*
Info
----------------------------------------------------------------
<Stooge>Moe</Stooge><Stooge>Larry</Stooge><Stooge>Curly</Stooge>
*/


select Info=(select [*]=Name
from #Stooges
for xml path('Stooge'))
/*
Info
----------------------------------------------------------------
<Stooge>Moe</Stooge><Stooge>Larry</Stooge><Stooge>Curly</Stooge>
*/
We can also choose to specify an empty PATH:

select Info=(select [*]=Name
from #Stooges
for xml path(''))
/*
Info
-------------
MoeLarryCurly
*/
Now you can probably see where this is going. Let’s put commas in front of those names:

select Info=(select ','+Name
from #Stooges
for xml path(''))
/*
Info
----------------
,Moe,Larry,Curly
*/
And now we can use the STUFF function to substitute that first comma with an empty string:

select Info=stuff(
(
select ','+Name
from #Stooges
for xml path('')
)
,1,1,'')
/*
Info
---------------
Moe,Larry,Curly
*/
This is great! But WAIT! We’re not done.

Let’s say some smart-aleck (probably from California) with a somewhat esoteric name of X&Y><Z&T decides to join the gang:

insert #Stooges select 4,'X&Y><Z&T'
What happens now?

select Info=stuff(
(
select ','+Name
from #Stooges
for xml path('')
)
,1,1,'')
/*
Info
--------------------------------------
Moe,Larry,Curly,X&amp;Y&gt;&lt;Z&amp;T
*/
Hmmm… That’s the problem with doing these kinds of tricks by creating XML output… certain characters (like ampersand, less-than, and greater-than) are encoded in a special way, as you can see below:

select Info=(select 'X&Y' for xml path('test'))
/*
Info
--------------------
<test>X&amp;Y</test>
*/
The FOR XML clause encodes those special characters as it constructs the XML representation of the data. In order for us to UN-encode those characters, we have to work backwards… in other words, extract the data back OUT of the XML representation of the data. We do this through the .value() method. But since this method only works on a true XML datatype, we have to introduce the ,TYPE directive into our query to force the subquery to produce an XML datatype:

select Info=(select 'X&Y' for xml path('test'),type).value('.','varchar(10)')
/*
Info
----
X&Y
*/
The .value() method takes two parameters. The first parameter describes the specific “piece” of the XML data that we want to extract. In our case, we just passed it '.', which indicates that we want to extract all the data from the XML as a whole. And, as you noticed, it only pulled out the data without the tags (<test></test>). The second parameter indicates what kind of datatype we want to convert it to as we extract it.

So, incorporating the ,TYPE directive and the .value() method into our #Stooges list, we finally get the expected result:

select Info=stuff(
(
select ','+Name
from #Stooges
for xml path(''),type
).value('.','varchar(max)')
,1,1,'')
/*
Info
------------------------
Moe,Larry,Curly,X&Y><Z&T
*/
If you know for sure that the list you’re creating does NOT have any of these special characters, like the query at the beginning of this article which only constructs lists of numbers, then you don’t need to incorporate the ,TYPE directive and .value() method. It will speed things up a bit if you leave those out.

Now we can put together lists of practically anything. For example, below is a query that pulls out all the orders for the CustomerID of OCEAN from the Northwind database, and shows a list of all the products in those orders (sorted by ProductID), with the quantity and price paid.

select OrderID
,ProductList=
stuff(
(
select '; '+convert(varchar,p.ProductID)
+'-'+p.ProductName
+' ('+convert(varchar,d.Quantity)
+'@$'+convert(varchar,d.UnitPrice)+')'
from [Order Details] d
join Products p on d.ProductId=p.ProductID
where d.OrderID=Orders.OrderID
order by p.ProductID
for xml path(''),type
).value('.','nvarchar(max)')
,1,2,'')
from Orders
where CustomerID='OCEAN'
order by OrderID
/*
OrderID ProductList
------- -------------------------------------------------------------------------
10409 14-Tofu (12@$18.60); 21-Sir Rodney's Scones (12@$8.00)
10531 59-Raclette Courdavault (2@$55.00)
10898 13-Konbu (5@$6.00)
10958 5-Chef Anton's Gumbo Mix (20@$21.35); 7-Uncle Bob's Organic Dried Pear...
10986 11-Queso Cabrales (30@$21.00); 20-Sir Rodney's Marmalade (15@$81.00); ...
*/
You can create lists within lists also, as in the example below, which lists all Customers and, for each Customer, lists all Categories (in order of the CategoryName) of Products that he purchased, and, for each Category, lists all ProductIDs (in numerical order) of that Category that were purchased:

with CustsCatgsProds as
(
select o.CustomerID
,p.CategoryID
,p.ProductID
from Orders o
join [Order Details] d on o.OrderID=d.OrderID
join Products p on d.ProductID=p.ProductID
join Categories c on p.CategoryID=c.CategoryID
group by o.CustomerID
,p.CategoryID
,p.ProductID
)
select CustomerID
,CategorizedProductList=
stuff(
(
select '; '+CategoryName+':'
+stuff(
(
select ','+convert(varchar,ProductID)
from CustsCatgsProds
where CustomerID=Customers.CustomerID
and CategoryID=CCP.CategoryID
order by ProductID
for xml path('')
)
,1,1,'')
from (select distinct CategoryID
from CustsCatgsProds
where CustomerID=Customers.CustomerID) CCP
join Categories on CCP.CategoryID=Categories.CategoryID
order by CategoryName
for xml path(''),type
).value('.','nvarchar(max)')
,1,2,'')
from Customers
order by CustomerID
/*
CustomerID CategorizedProductList
---------- ----------------------------------------------------------------
ALFKI Beverages:39,76; Condiments:3,6,63,77; Dairy Products:59,71; ...
ANATR Beverages:70; Confections:19; Dairy Products:11,32,60,69,72; ...
ANTON Beverages:2,34,43,75; Condiments:66; Confections:26,48; Dairy...
...
WHITC Beverages:2,34,35,38,75; Condiments:4,8,44,61,63,77; Confecti...
WILMK Beverages:1,24,43,76; Condiments:8,61,65; Confections:16,19,2...
WOLZA Beverages:1,2,24,35,75; Condiments:61; Confections:62,68; Dai...
*/
If you’ve read some of my past blog posts, you know I’m a big promoter of the CROSS APPLY operator to improve readability of a query. For example, the very first query in this article could be re-written as follows:

select CategoryName
,DiscontinuedIDs
from Categories
cross apply (select DiscIDString=(select ','+convert(varchar,ProductID)
from Products
where CategoryID=Categories.CategoryID
and Discontinued=1
order by ProductID
for xml path(''))) F1
cross apply (select DiscontinuedIDs=stuff(DiscIDString,1,1,'')) F2
order by CategoryName
/*
CategoryName DiscontinuedIDs
--------------- ---------------
Beverages 24
Condiments 5
Confections NULL
Dairy Products NULL
Grains/Cereals 42
Meat/Poultry 9,17,29,53
Produce 28
Seafood NULL
*/
If you do use the CROSS APPLY approach, be very careful in checking the performance and the plan of the query. Adam Haines wrote a terrific blog post that demonstrated that the CROSS APPLY approach to constructing lists can yield terrible results under certain circumstances because of the way that SQL Server performs the query. You can usually re-formulate the query to eliminate that poor performance.

I hope this article was helpful in explaining how the FOR XML PATH method works in creating lists. In my next blog post, I'll show how we can tear those lists apart (or shred them) back into rows.

38 comments:

  1. Brilliant article Brad. You did a great job of explaining the inner details of FOR XML (and I know you just scratched the surface), congrats.
    Although I have to confess the fairy dust theory was working pretty well for me until now :-)

    ReplyDelete
  2. Thanx Brad.

    I never played with XML here, even though i have seen it a lot. I bookmarked this for reference when i do next need it.

    ReplyDelete
  3. In some cases, you have to deal with ascii values less than 32 (space).

    This is one way to work around the problem.

    DECLARE @Sample TABLE
    (
    ID INT,
    Data VARCHAR(100)
    )

    INSERT @Sample
    VALUES (1, 'Peso & Performance SQL'),
    (1, 'MVP'),
    (2, 'Need help ->' + CHAR(2) + '<- '),
    (2, 'With XML string concatenation ?')

    ;WITH cteSource(ID, Content)
    AS (
    SELECT i.ID,
    f.Content.value('.', 'NVARCHAR(MAX)')
    FROM (
    SELECT ID
    FROM @Sample
    GROUP BY ID
    ) AS i
    CROSS APPLY (
    SELECT CAST(', ' + w.Data AS VARBINARY(MAX))
    FROM @Sample AS w
    WHERE w.ID = i.ID
    FOR XML PATH(''),
    TYPE
    ) AS f(Content)
    )
    SELECT ID,
    STUFF(Content, 1, 2, '') AS Content
    FROM (
    SELECT ID,
    CAST(N'' AS XML).value('xs:base64Binary(sql:column("Content"))', 'VARBINARY(MAX)') AS Content
    FROM cteSource
    ) AS d

    ReplyDelete
  4. Thanks for the extra info, Peso!

    ReplyDelete
  5. Hi Brad,
    Nice article. I have a problem with '&' character.ProductID contains '&' , for example "pen&pencil", when i tried to update this as "pen&paper" from front end into database.But database doesn't accept this changes..what would be the problem?

    Thanks in advance

    ReplyDelete
  6. @Anonymous:

    What is the front end? How are you doing the update? What is the exact command? In what way does the database not accept the change? Is there an error? Or does the ProductID come out differently from 'pen&paper'?

    --Brad

    ReplyDelete
  7. Dear Brad,

    In my query I am selecting "COUNT(*)" as comma separated
    The problem is when COUNT(*) returns 0 then XML PATH skips it.
    I am not getting why?
    any idea?

    ReplyDelete
  8. Can you be more specific? Can you post some code where you're trying to use COUNT(*)?

    If you like, you can send me an email at brad at stockciphering dot com.

    ReplyDelete
  9. Peso 's solution for lower ascii values has a problem with text entries with an even number of characters.

    Just try:
    ...
    INSERT @Sample
    VALUES (1, 'Peso & Performance SQL'),
    (1, 'MVP'),
    (2, 'Need help ->' + CHAR(2) + '<- '),
    (2, 'With XML string concatenation ?'),
    (2, 'Add new data!'),
    (2, 'ABC'),
    (2, 'ABCD'),
    (2, 'ABCD with even no. of chars is ignored')

    Any solutions?

    ReplyDelete
  10. Excellent tutorial Brad!! Having worked with Sybase's awesomely simple 'list' function, I was struggling to understand XPATH queries. Your tutorial has finally helped me figure out what XPath is all about! Thanks!!

    ReplyDelete
  11. Karthiek... Thanks so much for the positive feedback... I'm so glad it helped you out!

    --Brad

    ReplyDelete
  12. Frans: I'm afraid I don't have a solution to get around the problem that you posed with Peso's code. I'll have to look more closely at it.

    --Brad

    ReplyDelete
    Replies
    1. Brad,
      were you able to look more closely at Fran's code.

      Delete
  13. I've spent the whole day trying to apply this to a query I'm working on. I get the XML Path and the concatenated results. The problem is that every row has the same values. I know it's wrong, I just can't figure out how/where your code applies a "break" so that each row/Order ID has a different set of Products.

    Thanks.

    ReplyDelete
  14. Ooops - In case it helps, here's my code:

    SELECT
    CONVERT(varchar(10), CONVERT(smalldatetime, CAST(sched.next_run_date as varchar), 120), 120)
    AS next_run_date_time
    ,jobname=
    stuff(
    (
    select '; '+job.[name]
    FROM msdb.dbo.sysjobs job
    JOIN msdb.dbo.sysjobschedules sched
    ON sched.job_id = job.job_id
    WHERE job.enabled = 1
    AND job.category_id = 103
    order by job.[name]
    for xml path ('')
    )
    ,1,2,'')
    FROM msdb.dbo.sysjobschedules sched
    WHERE sched.next_run_date > 0
    order by next_run_date_time

    ReplyDelete
  15. When I follow the syntax for the sub-query join:

    select *
    FROM msdb.dbo.sysjobs job
    WHERE job.job_id=msdb.dbo.sysjobschedules.job_id

    I get an error:
    Msg 4104, Level 16, State 1, Line 1
    The multi-part identifier "msdb.dbo.sysjobschedules.job_id" could not be bound.

    Is this syntax only valid in certain version of SQL Server? I have SQL Server 2008 R2 installed.

    Thanks again

    ReplyDelete
  16. @Anonymous:

    I remember working with Frans' code for a while, but I couldn't get it to work. Peso would be the one to ask.

    --Brad

    ReplyDelete
  17. @Opher:

    Sorry I'm late in responding... I didn't see your posts until today.

    Here is how I would accomplish what you're trying to do:

    select nextrundate=convert(varchar(10),convert(smalldatetime,convert(varchar(10),x.next_run_date)),120)
    ,joblist=stuff((select '; '+job.name
    from msdb.dbo.sysjobschedules sched
    join msdb.dbo.sysjobs job on sched.job_id=job.job_id
    where sched.next_run_date=x.next_run_date
    and job.enabled=1
    and job.category_id=103
    order by job.name
    for xml path(''))
    ,1,2,'')
    from (select distinct next_run_date
    from msdb.dbo.sysjobschedules) x

    --Brad

    ReplyDelete
  18. Brad! you are one of the best, God bless you.

    ReplyDelete
  19. Don from CaliforniaMay 19, 2012 at 7:30 AM

    Awesome job - thank you very much for the post.

    ReplyDelete
  20. thanks a lot sir

    ReplyDelete
  21. Outstanding!! I googled for a way to concatenate columnar data, and found the basic solution using FOR XML and STUFF. But I was totally in the dark as to how it worked. Your post has cleared the fog! Thanks a million.

    ReplyDelete
  22. Thank you for the great feedback, John! I'm glad the post was helpful. You made my day!

    ReplyDelete
  23. Great explanation. I had been able to fumble thru getting for xml path to work before, but did not really understand what it was doing. This makes sense, thanks!

    ReplyDelete
  24. brillant piece of information, I had come to know about your web-page from my friend hardkik, chennai,i have read atleast 9 posts of yours by now, and let me tell you, your webpage gives the best and the most interesting information. This is just the kind of information that i had been looking for, i'm already your rss reader now and i would regularly watch out for the new posts, once again hats off to you! Thanx a million once again, Regards, Single Row Function in sql




    ReplyDelete
  25. attractive piece of information, I had come to know about your blog from my friend arjun, ahmedabad,i have read atleast eleven posts of yours by now, and let me tell you, your website gives the best and the most interesting information. This is just the kind of information that i had been looking for, i'm already your rss reader now and i would regularly watch out for the new posts, once again hats off to you! Thanks a lot once again, Regards, Single Row Function

    ReplyDelete
  26. It is outstanding article.I havent seen article like this.

    ReplyDelete
  27. Thanks for the feedback! Glad you found it helpful!

    ReplyDelete
  28. Great post. Like many other readers I found other solutions to this problem online but none as clear.

    Cheers, Max

    ReplyDelete
  29. It doesn't matter how old this post is, I have to admit: firstly that fairy dust is my preferred method, but when the fairies are on holiday I will resort to understanding, and you have helped me with this in such a great way.
    I am very new to SQL but I enjoy a good challenge; having encountered STUFF & FOR XML PATH some times in team scripts, but not fully needing to use it I left it to one or two other individuals to handle. Now I have found not only a reason to need it but a time where no one has been available to assign it to, but what's best is I have understood it from your post here, and been able to apply it. I will be using it again and again, when necessary of course, and I want to thank you for that.

    I enjoy learning and you have made my day on this one. =)

    Cheers, Mike P.

    ReplyDelete
  30. @MikeP:

    Thank you so much for your comments. Like you, I was confused by all this STUFF and XML PATH stuff when I saw code posted on the forums and I set out to understand what's going on. This post was the result... I wanted to make sure others could grasp it as well. I'm glad it did the job!

    ReplyDelete
  31. Dear Brad,

    your code works so great, i really really thank you for that!
    I helped me a lot.
    Just a short question to improve my query:

    What would be the best way to get just the first 5 entries comma separated?
    Maybe just like you have 10 Stooges and you just need to now the first 5 one's :)

    ReplyDelete
    Replies
    1. ahhh sorry Brad, i'm so dumb.

      I found it b myself:

      (select top 5 '|' + ....

      THANK YOU!!

      Delete
  32. Brad,
    Bloody 'ell!
    Took a while to find something that I could use without pivot and then a short time to apply your workings to what I was looking for.
    Cheers!
    Peter

    ReplyDelete
  33. @Peter:

    Thanks for the feedback... So glad that this was a help to you!

    ReplyDelete
  34. IT nearshoring is a whole industry that has been appreciated by many companies from all over the world. And by clicking on the link besuche die Website you can find out more information about him in Poland, because employees from this country will give a tangible increase to your company with the help of first-class skills.

    ReplyDelete