Friday, October 16, 2009

Un-Making a List (or Shredding of Evidence)

In my previous blog entry, I described in detail how you could use the FOR XML clause to create a column with a comma-separated list of entries acquired from rows in a table somewhere.

In this article, we will do the opposite… We will take a comma-separated list and “shred” the individual items into rows.

Let’s start by creating and populating a test table:

create table #CSVLists 
(
ID int
,Description varchar(30)
,CSVList varchar(200)
)
insert #CSVLists select 1,'Stooges' ,'Moe,Larry,Curly'
union all select 2,'Castaways','Gilligan,Skipper,Thurston,Lovey,'
+'Ginger,Professor,MaryAnne'
union all select 3,'Monkees' ,'Davy,Micky,Peter,Mike'
(Okay, I know you Gilligan’s Island fanatics are chomping at the bit to tell me that the Skipper’s name was Jonas and the Professor’s name was Roy. I know that… Don’t bother e-mailing me about it… save a virtual tree).

There is a handy .nodes() method which you can apply to an XML datatype that will shred it into relational data. So our first step is to convert our CSVList column into a valid XML representation of the data. We do that this way:

select ID
,Description
,XmlList
from #CSVLists
cross apply (select TagsList='<x>'+replace(CSVList,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
/*
ID Description XmlList
-- ----------- -----------------------------------------------------------
1 Stooges <x>Moe</x><x>Larry</x><x>Curly</x>
2 Castaways <x>Gilligan</x><x>Skipper</x><x>Thurston</x><x>Lovey</x>...
3 Monkees <x>Davy</x><x>Micky</x><x>Peter</x><x>Mike</x>
*/
The first CROSS APPLY surrounded each item in the CSVList with the tags <x></x>. And the second CROSS APPLY converted that into an actual XML datatype.

Now that we have an XML column, we can apply the .nodes() method to it. We will pass 'x' to the .nodes() method so that it knows the tag (<x></x>) to shred. And let’s take a look at that result:

select ID
,Description
,XMLNode
from #CSVLists
cross apply (select TagsList='<x>'+replace(CSVList,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
/*
Msg 493, Level 16, State 1, Line 1
The column 'XmlNode' that was returned from the nodes() method cannot be used
directly. It can only be used with one of the four xml data type methods,
exist(), nodes(), query(), and value(), or in IS NULL and IS NOT NULL checks.
*/
Oops… This new XmlNode object we created is not something that we can actually look at or use directly. So, as the error message indicated, we will use the .value() method against that XmlNode object to pull out an actual VARCHAR(20) value:

select ID
,Description
,ListItem
from #CSVLists
cross apply (select TagsList='<x>'+replace(CSVList,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
cross
apply (select ListItem=XmlNode.value('.','varchar(20)')) F4
/*
ID Description ListItem
-- ----------- ---------
1 Stooges Moe
1 Stooges Larry
1 Stooges Curly
2 Castaways Gilligan
2 Castaways Skipper
2 Castaways Thurston
2 Castaways Lovey
2 Castaways Ginger
2 Castaways Professor
2 Castaways MaryAnne
3 Monkees Davy
3 Monkees Micky
3 Monkees Peter
3 Monkees Mike
*/
Wow, it works!

But, as is often the case, you know there’s more. Just like I did in my previous blog entry, I’m going to do something to purposely trip things up. Let’s add a list of TV Networks:

insert #CSVLists
select 4,'Networks','ABC,NBC,CBS,FX,TNT,A&E'
And let’s try the same query:

select ID
,Description
,ListItem
from #CSVLists
cross apply (select TagsList='<x>'+replace(CSVList,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
cross
apply (select ListItem=XmlNode.value('.','varchar(20)')) F4
/*
ID Description ListItem
-- ----------- ---------
1 Stooges Moe
1 Stooges Larry
1 Stooges Curly
2 Castaways Gilligan
2 Castaways Skipper
2 Castaways Thurston
2 Castaways Lovey
2 Castaways Ginger
2 Castaways Professor
2 Castaways MaryAnne
3 Monkees Davy
3 Monkees Micky
3 Monkees Peter
3 Monkees Mike
Msg 9411, Level 16, State 1, Line 1
XML parsing: line 1, character 56, semicolon expected
*/
Ouch! Once again, we’ve run into the problem with the way that XML encodes and decodes special characters like ampersand (&).

We cannot CAST something as XML if it has an ampersand as part of the data, because XML will try to interpret it as an entity that must end in a semicolon:

select Info=cast('<x>A&E</x>' as xml)
/*
Msg 9411, Level 16, State 1, Line 1
XML parsing: line 1, character 7, semicolon expected
*/
In order for it to be accepted, we have to encode that ampersand as follows, so that XML will understand it to represent an actual ampersand character:

select Info=cast('<x>A&amp;E</x>' as xml)
/*
Info
--------------
<x>A&amp;E</x>
*/
And, as we learned in my previous blog entry, we can then take the .value() of that to un-encode it:

select Info=cast('<x>A&amp;E</x>' as xml).value('.','varchar(10)')
/*
Info
----
A&E
*/
So in our shredding query, we have to FIRST encode any special characters in our CSVList column, and THEN we can successfully CAST it as an XML column.

One might be tempted to use multiple REPLACE() functions to convert all ampersands to &amp; and less-thans to &lt; and greater-thans to &gt;, but who knows what others characters are out there that will create problems? We already have a tool at our disposal that will encode the characters for us… the FOR XML clause.

select ID
,Description
,XMLEncoded
from #CSVLists
cross apply (select XMLEncoded=(select [*]=CSVList for xml path(''))) F0
/*
ID Description XMLEncoded
-- ----------- ---------------------------------------------------------
1 Stooges Moe,Larry,Curly
2 Castaways Gilligan,Skipper,Thurston,Lovey,Ginger,Professor,MaryAnne
3 Monkees Davy,Micky,Peter,Mike
4 Networks ABC,NBC,CBS,FX,TNT,A&amp;E
*/
We learned in my previous blog entry that we can eliminate tags being created by explicitly specifying a column a name of [*]. If we hadn’t done that, then we would end up with our lists surrounded by <CSVList></CSVList> tags. And we eliminated yet another tag surrounding our data by passing the empty string to FOR XML PATH. So our lists are really the same as they were originally, except for the special characters being automatically encoded for us.

Now we can incorporate that CROSS APPLY into our query:

select ID
,Description
,ListItem
from #CSVLists
cross apply (select XMLEncoded=(select [*]=CSVList for xml path(''))) F0
cross apply (select TagsList='<x>'+replace(XMLEncoded,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
cross
apply (select ListItem=XmlNode.value('.','varchar(20)')) F4
/*
ID Description ListItem
-- ----------- ---------
1 Stooges Moe
1 Stooges Larry
1 Stooges Curly
2 Castaways Gilligan
2 Castaways Skipper
2 Castaways Thurston
2 Castaways Lovey
2 Castaways Ginger
2 Castaways Professor
2 Castaways MaryAnne
3 Monkees Davy
3 Monkees Micky
3 Monkees Peter
3 Monkees Mike
4 Networks ABC
4 Networks NBC
4 Networks CBS
4 Networks FX
4 Networks TNT
4 Networks A&E
*/
Ahhh… Success!

By the way, you may have noticed that the making and un-making (shredding) of these comma-delimited lists is like PIVOTing and UNPIVOTing. PIVOT converts rows to columns and UNPIVOT converts columns to rows. We have done something similar, except in making a list, we converted rows into a single column, and in shredding a list, we converted a single column into rows.

I’ll leave you with one final demonstration. Our #CSVLists table that we created had some comma-separated lists. What if we want to sort the items in the lists?

We would first shred the items into rows, and then, using the FOR XML logic we put together in my previous blog entry, we will re-assemble the lists and use an ORDER BY to sort them:

;with ShreddedLists as
(
select ID
,ListItem
from #CSVLists
cross apply (select XMLEncoded=(select [*]=CSVList for xml path(''))) F0
cross apply (select TagsList='<x>'+replace(XMLEncoded,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
cross apply (select ListItem=XmlNode.value('.','varchar(20)')) F4
)
select ID
,Description
,SortedCSVList
from #CSVLists
cross apply (select CSVStringXML=(select ','+ListItem
from ShreddedLists
where ID=#CSVLists.ID
order by ListItem
for xml path(''),type)) F1
cross apply (select CSVString=CSVStringXML.value('.','varchar(max)')) F2
cross apply (select SortedCSVList=stuff(CSVString,1,1,'')) F3
/*
ID Description SortedCSVList
-- ----------- ---------------------------------------------------------
1 Stooges Curly,Larry,Moe
2 Castaways Gilligan,Ginger,Lovey,MaryAnne,Professor,Skipper,Thurston
3 Monkees Davy,Micky,Mike,Peter
4 Networks A&E,ABC,CBS,FX,NBC,TNT
*/
I hope you have many hours of happy list-making and shredding activity. Until next time…

UPDATE ON Dec12,2009: Please read an important follow-up to this article in which I discuss some issues with the speed of the XML shredding method.

7 comments:

  1. Good post Brad! This is a great way to unpack data oppose a lot of other solutions out there. I wonder how this stacks against the numbers table solution though. Also, you probably know this but you can consolidate the number of cross apply operators.

    DECLARE @t TABLE(
    id INT,
    col VARCHAR(50)
    );

    INSERT INTO @t VALUES (1,'A&E,AdamHaines');
    INSERT INTO @t VALUES (2,'Lauren,Addison');

    SELECT ID
    ,x.i.value('.','varchar(50)')
    FROM @t t
    CROSS APPLY (SELECT XMLEncoded=(SELECT col AS [*] FOR XML PATH(''))) EncodeXML
    CROSS APPLY (SELECT NewXML=CAST(''+REPLACE(XMLEncoded,',','')+'' AS XML)) CastXML
    CROSS APPLY NewXML.nodes('/i') x(i)

    ReplyDelete
  2. Thanks!

    Yes, I considered consolidating, but I wanted to show each step in detail. (Also, I think if I did combine the CAST and REPLACE in one line, it would have made the line too long and I can't stand it when code wraps). In reality, whether it's consolidated or not, the underlying code (when you look at the query plan) is exactly the same... SQL Server will combine them all into one monster expression.

    As far as the Numbers table, I'll bet that it's faster... as long as the comma-delimited list is not really long (remember my blog post from a week or two ago).

    I mainly wanted to demonstrate in these two articles how the XML method works... so many people ask about how to make lists... I wanted to put together an "ultimate" guide, so to speak, so I could just point them in that direction.

    Thanks for the feedback!

    ReplyDelete
  3. Follow-up post from http://bradsruminations.blogspot.com/2010/01/delimited-string-tennis-again-final.html.

    I have some data that includes invalid XML characters that I am trying to shred using the method from this article. Love the technique! The character I am encountering issues with is char(8). "For XML" does escape this character as " ". However, line "cross apply (select XmlList=cast(TagsList as xml)) F3" throws the following error when attempting to cast this escaped character back to an XML data type:

    Msg 9420, Level 16, State 1, Line 1 - XML parsing: line 1, character 200, illegal xml character

    Unfortunately, we do not scrub the data prior to insertion into our production environment. Tons of nested replace statements seems like overkill. I would love to know if there are any recommended techniques to handle these illegal characters and still utilize this article's shredding technique.

    Thanks.

    ReplyDelete
  4. According to the W3C Recommendations on XML, only certain characters are allowed inside XML:

    http://www.w3.org/TR/xml/#charsets

    So illegal characters are CHAR(1) thru CHAR(8), CHAR(11), CHAR(12), CHAR(14) thru CHAR(31), as well as a handful of others above CHAR(126).

    I'm afraid your only alternative is to clean your data first.

    --Brad

    ReplyDelete
  5. This comment has been removed by a blog administrator.

    ReplyDelete