SELECT Blog FROM Brad.Schulz CROSS APPLY SQL.Server(): Un-Making a List (or Shredding of Evidence)

In my previous blog entry, I described in detail how you could use the FOR XML clause to create a column with a comma-separated list of entries acquired from rows in a table somewhere.

In this article, we will do the opposite… We will take a comma-separated list and “shred” the individual items into rows.

Let’s start by creating and populating a test table:

create table #CSVLists 
(
   ID          int
  ,Description varchar(30)
  ,CSVList     varchar(200)
)
insert #CSVLists select 1,'Stooges'  ,'Moe,Larry,Curly'
       union all select 2,'Castaways','Gilligan,Skipper,Thurston,Lovey,'
                                     +'Ginger,Professor,MaryAnne'
       union all select 3,'Monkees'  ,'Davy,Micky,Peter,Mike'

(Okay, I know you Gilligan’s Island fanatics are chomping at the bit to tell me that the Skipper’s name was Jonas and the Professor’s name was Roy. I know that… Don’t bother e-mailing me about it… save a virtual tree).

There is a handy .nodes() method which you can apply to an XML datatype that will shred it into relational data. So our first step is to convert our CSVList column into a valid XML representation of the data. We do that this way:

select ID
      ,Description
      ,XmlList
from #CSVLists
cross apply (select TagsList='<x>'+replace(CSVList,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
/*
ID Description XmlList
-- ----------- -----------------------------------------------------------
 1 Stooges     <x>Moe</x><x>Larry</x><x>Curly</x>
 2 Castaways   <x>Gilligan</x><x>Skipper</x><x>Thurston</x><x>Lovey</x>...
 3 Monkees     <x>Davy</x><x>Micky</x><x>Peter</x><x>Mike</x>
*/

The first CROSS APPLY surrounded each item in the CSVList with the tags <x></x>. And the second CROSS APPLY converted that into an actual XML datatype.

Now that we have an XML column, we can apply the .nodes() method to it. We will pass 'x' to the .nodes() method so that it knows the tag (<x></x>) to shred. And let’s take a look at that result:

select ID
      ,Description
      ,XMLNode
from #CSVLists
cross apply (select TagsList='<x>'+replace(CSVList,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
/*
Msg 493, Level 16, State 1, Line 1
The column 'XmlNode' that was returned from the nodes() method cannot be used 
directly. It can only be used with one of the four xml data type methods, 
exist(), nodes(), query(), and value(), or in IS NULL and IS NOT NULL checks.
*/

Oops… This new XmlNode object we created is not something that we can actually look at or use directly. So, as the error message indicated, we will use the .value() method against that XmlNode object to pull out an actual VARCHAR(20) value:

select ID
      ,Description
      ,ListItem
from #CSVLists
cross apply (select TagsList='<x>'+replace(CSVList,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
cross apply (select ListItem=XmlNode.value('.','varchar(20)')) F4
/*
ID Description ListItem
-- ----------- ---------
 1 Stooges     Moe
 1 Stooges     Larry
 1 Stooges     Curly
 2 Castaways   Gilligan
 2 Castaways   Skipper
 2 Castaways   Thurston
 2 Castaways   Lovey
 2 Castaways   Ginger
 2 Castaways   Professor
 2 Castaways   MaryAnne
 3 Monkees     Davy
 3 Monkees     Micky
 3 Monkees     Peter
 3 Monkees     Mike
*/

Wow, it works!

But, as is often the case, you know there’s more. Just like I did in my previous blog entry, I’m going to do something to purposely trip things up. Let’s add a list of TV Networks:

insert #CSVLists
select 4,'Networks','ABC,NBC,CBS,FX,TNT,A&E'

And let’s try the same query:

select ID
      ,Description
      ,ListItem
from #CSVLists
cross apply (select TagsList='<x>'+replace(CSVList,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
cross apply (select ListItem=XmlNode.value('.','varchar(20)')) F4
/*
ID Description ListItem
-- ----------- ---------
 1 Stooges     Moe
 1 Stooges     Larry
 1 Stooges     Curly
 2 Castaways   Gilligan
 2 Castaways   Skipper
 2 Castaways   Thurston
 2 Castaways   Lovey
 2 Castaways   Ginger
 2 Castaways   Professor
 2 Castaways   MaryAnne
 3 Monkees     Davy
 3 Monkees     Micky
 3 Monkees     Peter
 3 Monkees     Mike
Msg 9411, Level 16, State 1, Line 1
XML parsing: line 1, character 56, semicolon expected
*/

Ouch! Once again, we’ve run into the problem with the way that XML encodes and decodes special characters like ampersand (&).

We cannot CAST something as XML if it has an ampersand as part of the data, because XML will try to interpret it as an entity that must end in a semicolon:

select Info=cast('<x>A&E</x>' as xml)
/*
Msg 9411, Level 16, State 1, Line 1
XML parsing: line 1, character 7, semicolon expected
*/

In order for it to be accepted, we have to encode that ampersand as follows, so that XML will understand it to represent an actual ampersand character:

select Info=cast('<x>A&amp;E</x>' as xml)
/*
Info
--------------
<x>A&amp;E</x>
*/

And, as we learned in my previous blog entry, we can then take the .value() of that to un-encode it:

select Info=cast('<x>A&amp;E</x>' as xml).value('.','varchar(10)')
/*
Info
----
A&E
*/

So in our shredding query, we have to FIRST encode any special characters in our CSVList column, and THEN we can successfully CAST it as an XML column.

One might be tempted to use multiple REPLACE() functions to convert all ampersands to & and less-thans to < and greater-thans to >, but who knows what others characters are out there that will create problems? We already have a tool at our disposal that will encode the characters for us… the FOR XML clause.

select ID
      ,Description
      ,XMLEncoded
from #CSVLists
cross apply (select XMLEncoded=(select [*]=CSVList for xml path(''))) F0
/*
ID Description XMLEncoded
-- ----------- ---------------------------------------------------------
 1 Stooges     Moe,Larry,Curly
 2 Castaways   Gilligan,Skipper,Thurston,Lovey,Ginger,Professor,MaryAnne
 3 Monkees     Davy,Micky,Peter,Mike
 4 Networks    ABC,NBC,CBS,FX,TNT,A&amp;E
*/

We learned in my previous blog entry that we can eliminate tags being created by explicitly specifying a column a name of [*]. If we hadn’t done that, then we would end up with our lists surrounded by <CSVList></CSVList> tags. And we eliminated yet another tag surrounding our data by passing the empty string to FOR XML PATH. So our lists are really the same as they were originally, except for the special characters being automatically encoded for us.

Now we can incorporate that CROSS APPLY into our query:

select ID
      ,Description
      ,ListItem
from #CSVLists
cross apply (select XMLEncoded=(select [*]=CSVList for xml path(''))) F0
cross apply (select TagsList='<x>'+replace(XMLEncoded,',','</x><x>')+'</x>') F1
cross apply (select XmlList=cast(TagsList as xml)) F2
cross apply XmlList.nodes('x') F3(XmlNode)
cross apply (select ListItem=XmlNode.value('.','varchar(20)')) F4
/*
ID Description ListItem
-- ----------- ---------
 1 Stooges     Moe
 1 Stooges     Larry
 1 Stooges     Curly
 2 Castaways   Gilligan
 2 Castaways   Skipper
 2 Castaways   Thurston
 2 Castaways   Lovey
 2 Castaways   Ginger
 2 Castaways   Professor
 2 Castaways   MaryAnne
 3 Monkees     Davy
 3 Monkees     Micky
 3 Monkees     Peter
 3 Monkees     Mike
 4 Networks    ABC
 4 Networks    NBC
 4 Networks    CBS
 4 Networks    FX
 4 Networks    TNT
 4 Networks    A&E
*/

Ahhh… Success!

By the way, you may have noticed that the making and un-making (shredding) of these comma-delimited lists is like PIVOTing and UNPIVOTing. PIVOT converts rows to columns and UNPIVOT converts columns to rows. We have done something similar, except in making a list, we converted rows into a single column, and in shredding a list, we converted a single column into rows.

I’ll leave you with one final demonstration. Our #CSVLists table that we created had some comma-separated lists. What if we want to sort the items in the lists?

We would first shred the items into rows, and then, using the FOR XML logic we put together in my previous blog entry, we will re-assemble the lists and use an ORDER BY to sort them:

;with ShreddedLists as
(
  select ID
        ,ListItem
  from #CSVLists
  cross apply (select XMLEncoded=(select [*]=CSVList for xml path(''))) F0
  cross apply (select TagsList='<x>'+replace(XMLEncoded,',','</x><x>')+'</x>') F1
  cross apply (select XmlList=cast(TagsList as xml)) F2
  cross apply XmlList.nodes('x') F3(XmlNode)
  cross apply (select ListItem=XmlNode.value('.','varchar(20)')) F4
)
select ID
      ,Description
      ,SortedCSVList
from #CSVLists
cross apply (select CSVStringXML=(select ','+ListItem
                                  from ShreddedLists
                                  where ID=#CSVLists.ID
                                  order by ListItem
                                  for xml path(''),type)) F1
cross apply (select CSVString=CSVStringXML.value('.','varchar(max)')) F2
cross apply (select SortedCSVList=stuff(CSVString,1,1,'')) F3
/*
ID Description SortedCSVList
-- ----------- ---------------------------------------------------------
 1 Stooges     Curly,Larry,Moe
 2 Castaways   Gilligan,Ginger,Lovey,MaryAnne,Professor,Skipper,Thurston
 3 Monkees     Davy,Micky,Mike,Peter
 4 Networks    A&E,ABC,CBS,FX,NBC,TNT
*/

I hope you have many hours of happy list-making and shredding activity. Until next time…

UPDATE ON Dec12,2009: Please read an important follow-up to this article in which I discuss some issues with the speed of the XML shredding method.

8 comments:

Adam HainesOctober 16, 2009 at 9:59 PM
Good post Brad! This is a great way to unpack data oppose a lot of other solutions out there. I wonder how this stacks against the numbers table solution though. Also, you probably know this but you can consolidate the number of cross apply operators.

DECLARE @t TABLE(
id INT,
col VARCHAR(50)
);

INSERT INTO @t VALUES (1,'A&E,AdamHaines');
INSERT INTO @t VALUES (2,'Lauren,Addison');

SELECT ID
,x.i.value('.','varchar(50)')
FROM @t t
CROSS APPLY (SELECT XMLEncoded=(SELECT col AS [*] FOR XML PATH(''))) EncodeXML
CROSS APPLY (SELECT NewXML=CAST(''+REPLACE(XMLEncoded,',','')+'' AS XML)) CastXML
CROSS APPLY NewXML.nodes('/i') x(i)
Brad SchulzOctober 16, 2009 at 10:20 PM
Thanks!

Yes, I considered consolidating, but I wanted to show each step in detail. (Also, I think if I did combine the CAST and REPLACE in one line, it would have made the line too long and I can't stand it when code wraps). In reality, whether it's consolidated or not, the underlying code (when you look at the query plan) is exactly the same... SQL Server will combine them all into one monster expression.

As far as the Numbers table, I'll bet that it's faster... as long as the comma-delimited list is not really long (remember my blog post from a week or two ago).

I mainly wanted to demonstrate in these two articles how the XML method works... so many people ask about how to make lists... I wanted to put together an "ultimate" guide, so to speak, so I could just point them in that direction.

Thanks for the feedback!
AnonymousMay 14, 2010 at 7:37 AM
Follow-up post from http://bradsruminations.blogspot.com/2010/01/delimited-string-tennis-again-final.html.

I have some data that includes invalid XML characters that I am trying to shred using the method from this article. Love the technique! The character I am encountering issues with is char(8). "For XML" does escape this character as " ". However, line "cross apply (select XmlList=cast(TagsList as xml)) F3" throws the following error when attempting to cast this escaped character back to an XML data type:

Msg 9420, Level 16, State 1, Line 1 - XML parsing: line 1, character 200, illegal xml character

Unfortunately, we do not scrub the data prior to insertion into our production environment. Tons of nested replace statements seems like overkill. I would love to know if there are any recommended techniques to handle these illegal characters and still utilize this article's shredding technique.

Thanks.
Brad SchulzMay 14, 2010 at 11:19 AM
According to the W3C Recommendations on XML, only certain characters are allowed inside XML:

http://www.w3.org/TR/xml/#charsets

So illegal characters are CHAR(1) thru CHAR(8), CHAR(11), CHAR(12), CHAR(14) thru CHAR(31), as well as a handful of others above CHAR(126).

I'm afraid your only alternative is to clean your data first.

--Brad
AnonymousJune 1, 2010 at 10:51 AM
ww
buy viagraJune 1, 2010 at 10:53 AM
This comment has been removed by a blog administrator.
Devesh Borana(DevB)August 4, 2011 at 10:39 PM
Good Post as usual Brad
Generic viagrabuyMay 29, 2014 at 4:45 AM
such a good way you explain such a informative things thanks for sharing
Generic Viagra
Sildenafil citrate

SELECT Blog FROM Brad.Schulz CROSS APPLY SQL.Server()

Friday, October 16, 2009

Un-Making a List (or Shredding of Evidence)

8 comments:

Search This Blog

Subscribe

About Me

Keywords

Blog Archive

Followers

# # #

SELECT Blog FROM Brad.Schulz CROSS APPLY SQL.Server()

Friday, October 16, 2009

Un-Making a List (or Shredding of Evidence)

8 comments:

Search This Blog

Subscribe

Subscribe To...

About Me

Keywords

Blog Archive

Followers

# # #