SELECT Blog FROM Brad.Schulz CROSS APPLY SQL.Server(): Delimited String Tennis (Again)… The Final Volley?

Wednesday, January 6, 2010

Delimited String Tennis (Again)… The Final Volley?

This is a continuation of my previous blog post, Delimited String Tennis Anyone?, which playfully talked about a virtual T-SQL Tennis match… a friendly rally back and forth between SQL Server MVP Adam Haines and myself in discussing methods of shredding comma-delimited strings in our blogs.

In my previous post, I expressed my disappointment with the following bit of code, which shreds elements from a comma-delimited string consisting of 5000 elements:

/* 
Note that I have to CAST() the first parameter of REPLICATE
in order to force it to produce more than 8000 bytes
*/
declare @CommaList varchar(max)
set @CommaList='ABC'+replicate(cast(',ABC' as varchar(max)),4999)
 
select x.i.value('.','varchar(3)')
from (select XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml)) a
cross apply XMLList.nodes('i') x(i)

This code, which I will call the “Derived Table Method”, took over 6 painful minutes to execute!

It had to do with how the XML column in the derived table was actually streamed and processed in small chunks as opposed to just immediately instantiating a normal, vanilla XML column.

My solution to this problem was to just create an intermediate XML variable and process it directly:

declare @XMLList xml
set @XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml)
 
select x.i.value('.','varchar(3)')
from @XMLList.nodes('i') x(i)

This code, which I will call the “XML Variable Method”, was more like it… it took less than a second to execute… an improvement of over 2000%!

Since I wrote that blog entry, I received a few comments and e-mails that suggested some other approaches.

A person named “Mister Magoo” (no relation to the cartoon character voiced by Jim Backus in the 1950’s and 1960’s I’m sure) commented that the “Derived Table Method” could be improved dramatically just by adding a .query() to the XML column created in the derived table:

select x.i.value('.','varchar(3)')
from (select XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml).query('.')) a
cross apply XMLList.nodes('i') x(i)

He was correct! This also took less than a second! I suppose the theory here is that adding the .query() method forces the XML column to be instantiated immediately instead of bringing about the whole streaming nonsense.

This is excellent… it eliminates the step of having to create an intermediate XML variable… everything can be done in one single statement. It’s not quite as fast as the “XML Variable Method”, but we’re only talking about a small difference in milliseconds… it certainly performs like lightning.

The results of the above 3 methods come out as follows:

/*
Method                            CPU   Duration(ms)
----------------------------------------------------
Derived Table                 369,187     398,791
Derived Table with .query()       156         168
XML Variable                       99         117
*/

I also got an e-mail from my friend SQL Server MVP Alejandro Mesa (some of you may know him as Hunchback on various SQL Forums), who mentioned that things could be sped up even more by changing the XQuery expression passed as the first parameter to the .value() method.

Instead of the simple ‘.’ that I passed…

select x.i.value('.','varchar(3)')

…Alejandro suggested using the .text() accessor and an ordinal [1] predicate, like so:

select x.i.value('(./text())[1]','varchar(3)')

True enough, this did improve the performance of the query. With this XQuery expression, we are indicating that we want the first instance ([1]) of the text-only node values (i.e. not any XML sub-elements) in our x.i node. This is more specific than our original ‘.’ expression, which indicated that we wanted to process the entire x.i node.

You can get a clearer idea of how these two XQuery expressions work in the following examples:

declare @XMLVar xml
set @XMLVar='<a>abc<b>xyz</b>123</a>'
 
select x.i.value('.','varchar(10)')
from @XMLVar.nodes('a') x(i)
/* Returns 'abcxyz123' */
 
select x.i.value('(./text())[1]','varchar(10)')
from @XMLVar.nodes('a') x(i)
/* Returns 'abc' */
 
select x.i.value('(./text())[2]','varchar(10)')
from @XMLVar.nodes('a') x(i)
/* Returns '123' */

The @XMLVar variable above had two text nodes within the <a></a> element. In our original queries at the beginning of this article, each of the <i></i> elements contained only a single text node (and that’s why the expressions of ‘.’ and ‘(./text())[1]’ both produced the same result), and yet the expression ‘(./text())[1]’ was much more efficient and therefore faster because it was so much more specific in indicating what we wanted to process.

When I applied Alejandro’s suggestion to the queries, you can see how the CPU and Duration decreased even more:

/*
Method                                        CPU   Duration(ms)
----------------------------------------------------------------
Derived Table                             369,187     398,791
Derived Table with .query()                   156         168
Derived Table with .query() and text()[1]     130         159
XML Variable                                   99         117
XML Variable with text()[1]                    76          77
*/

Alejandro also pointed me towards a very detailed technical article at Microsoft TechNet entitled Performance Optimizations for the XML Data Type in SQL Server 2005. If you do a lot of XML manipulation, this article looks like a must-read.

Thank you again to Mister Magoo and Alejandro for their contributions! With all of this knowledge, we can all now be the speed demons of delimited string shredding.

18 comments:

Kent WaldropJanuary 6, 2010 at 11:36 AM
Well done, Brad; me like.

:-)

Kent Waldrop
ReplyDelete
Replies
AnonymousJanuary 6, 2010 at 11:44 PM
I especially like the derived table method with query() and text(), because it can be written as an inline function.
ReplyDelete
Replies
Michael J. SwartJanuary 7, 2010 at 5:57 AM
Quality post
ReplyDelete
Replies
Brad SchulzJanuary 7, 2010 at 10:10 AM
@Peso: Agreed... that's the beauty of it. Kudos to Mister Magoo.

@Kent and @Michael: Thanks! Me like too! 8^)
ReplyDelete
Replies
Mister MagooJanuary 7, 2010 at 3:41 PM
Thanks for the mention Brad, Peso et al.

I guess these are all going to be considered "hacks" by some, but to me they are just totally exceptional tweaks!

Nicely documented by the way - I expect to see this turning up in solutions to a few sql challenges...
ReplyDelete
Replies
AnonymousMay 13, 2010 at 3:44 PM
Was wondering if anyone has a recommended technique to handle invalid XML data/characters which throws errors when trying to cast to an XML data type when shredding delimited strings.
ReplyDelete
Replies
Brad SchulzMay 13, 2010 at 3:53 PM
@Anonymous:

Is this what you're looking for?:

http://bradsruminations.blogspot.com/2009/10/un-making-list-or-shredding-of-evidence.html

It talks about how to handle delimited strings that have ampersands or less-than or greater-than signs, which can "break" the usual shredding method.

--Brad
ReplyDelete
Replies
Jeff ModenSeptember 18, 2010 at 7:12 PM
Howdy folks. Would any of you be interested in participating in a little experiment, please? Thanks either way. It's located at the following URL...
http://www.sqlservercentral.com/Forums/Topic988784-391-1.aspx
ReplyDelete
Replies
Paul WhiteSeptember 18, 2010 at 8:59 PM
Brad,

An epic improvement to the usual XML string-splitting method - extremely well done!

That said, I am still uncomfortable using XML for string concatenation or splitting. I continue to recommend Adam Machanic's SQLCLR string splitter - which uses half the CPU of even the fastest XML method shown here, with no restrictions on the string to be split.

I'll follow Jeff's challenge on SSC - I fully expect your XML code to be faster than the numbers table (even after adding the [*] trick to handle entitization).

My sincere congratulations again.

Paul
ReplyDelete
Replies
AnonymousMarch 30, 2011 at 10:02 AM
I've used the above XML Variable method in SQL Server 2005 to shred 15 million nodes in just over 20 minutes.

We upgraded to SQL Server 2008 R2 and now the same shred takes over 19 hours.

Our SysAdmin & DBA have looked over all of the configuration issues with SS 2008 R2 and everything was done according to best practices.

Here's my code

--SHRED THE XML--------
WHILE @RowCNT <= @MaxRowCNT
BEGIN
INSERT INTO [dbo].[MAIDEN_POLICY_SHRED](POLICY_ID, fName, fValue)
SELECT
@RowCNT,
x.header.value('string[1]','nvarchar(255)') as fName,
x.header.value('string[2]','nvarchar(4000)') as fValue
FROM @x.nodes('/java/object/void') AS x(header)
SELECT @RowCNT = @RowCNT+1
SELECT @x = (SELECT CAST(POLICY_CLOB as XML) FROM [dbo].[POLICY_VALUE_SET] WHERE POLICY_ID=@RowCNT)
END
GO
ReplyDelete
Replies
Brad SchulzMarch 30, 2011 at 10:25 AM
@Anonymous:

Instead of a loop, could you try it with a single statement like so? (hope you can read this):

select policy_id
,fName=StringNode.value('((./string)[1]/text())[1]','nvarchar(255)')
,fValue=StringNode.value('((./string)[2]/text())[1]','nvarchar(4000)')
from policy_value_set
cross apply (select policy_xml=cast(policy_clob as xml).query('.')) F_XML
cross apply policy_xml.nodes('/java/object/void') F_Nodes(StringNode)
ReplyDelete
Replies
AnonymousMarch 30, 2011 at 12:07 PM
Ok issue is resolved.

Explaination:
SQL Server 2008 has an improved XML Kernal that creates optimistic plans when the data being shred and the shred datatype are the same.

In my example, the xml fValue was defined as VARCHAR, forcing 2008 to run very pessimistic plan due to having to convert, inline the UTF-16 characters into a VARCHAR. After changing to NVARCHAR(4000) I was able to shred 15million nodes in 21.3 minutes.

Changing the script to your soution Brad actually decreased performance similarly to where I was at before.

I like what you posted because I was not aware of StringNode as an XML shredding function.

I'm getting a feeling that SQL Server 2008 XML Kernal changes need to be fully published. Any pointers?

mike
ReplyDelete
Replies
Brad SchulzMarch 30, 2011 at 12:18 PM
I'm sorry, Mike, I'm not sure what else to add.

I'm kind of surprised and disappointed that the set-based solution I posted didn't perform as well. (And just so there's no confusion, StringNode is not a shredding function... it was just the name I gave to the .nodes() entry in the second CROSS APPLY in the query).

I'd post your question on one of the forums out there to see what other advice you can get. Perhaps the SQL XML Forum at MSDN: http://social.msdn.microsoft.com/Forums/en-US/sqlxml/threads. There's a guy there named "wBob" who is an XML wizard.

--Brad
ReplyDelete
Replies
AnonymousMarch 30, 2011 at 12:42 PM
Thank you for all of your suggestions Brad.
I too was very surprised that using a single in-line statement did not perform as well as the loop. It doesn't make sense but rest assured, I will persevere with your recommendation in an effort to write the most streamlined and effecient Shredding script as possible, including contacting wBob.

Thanks again!
Mike
ReplyDelete
Replies
CialisJanuary 13, 2012 at 11:27 AM
That was a great tutorial!
ReplyDelete
Replies
Logan airport hotelsMarch 7, 2012 at 12:12 AM
nice tags :) and nice post :) The way you explained it is really awesome and makes every one to read till the end. keep posting.. I must appreciate your article writing skills.
ReplyDelete
Replies
Wimpernserum mit BimatoprostMay 31, 2019 at 12:25 AM
I don't understand how it works. I failed to implement it.
ReplyDelete
Replies
GabapentinJune 18, 2022 at 6:30 AM
Great website, continue the Excellent work!
ReplyDelete
Replies

Add comment

SELECT Blog FROM Brad.Schulz CROSS APPLY SQL.Server()

Wednesday, January 6, 2010

Delimited String Tennis (Again)… The Final Volley?

18 comments:

Search This Blog

Subscribe

About Me

Keywords

Blog Archive

Followers

# # #

SELECT Blog FROM Brad.Schulz CROSS APPLY SQL.Server()

Wednesday, January 6, 2010

Delimited String Tennis (Again)… The Final Volley?

18 comments:

Search This Blog

Subscribe

Subscribe To...

About Me

Keywords

Blog Archive

Followers

# # #