Wednesday, January 6, 2010

Delimited String Tennis (Again)… The Final Volley?

Delimited String Tennis Anyone?This is a continuation of my previous blog post, Delimited String Tennis Anyone?, which playfully talked about a virtual T-SQL Tennis match… a friendly rally back and forth between SQL Server MVP Adam Haines and myself in discussing methods of shredding comma-delimited strings in our blogs.

In my previous post, I expressed my disappointment with the following bit of code, which shreds elements from a comma-delimited string consisting of 5000 elements:

/* 
Note that I have to CAST() the first parameter of REPLICATE
in order to force it to produce more than 8000 bytes
*/
declare @CommaList varchar(max)
set @CommaList='ABC'+replicate(cast(',ABC' as varchar(max)),4999)

select x.i.value('.','varchar(3)')
from (select XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml)) a
cross apply XMLList.nodes('i') x(i)
This code, which I will call the “Derived Table Method”, took over 6 painful minutes to execute!

It had to do with how the XML column in the derived table was actually streamed and processed in small chunks as opposed to just immediately instantiating a normal, vanilla XML column.

My solution to this problem was to just create an intermediate XML variable and process it directly:

declare @XMLList xml
set
@XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml)

select x.i.value('.','varchar(3)')
from @XMLList.nodes('i') x(i)
This code, which I will call the “XML Variable Method”, was more like it… it took less than a second to execute… an improvement of over 2000%!

Since I wrote that blog entry, I received a few comments and e-mails that suggested some other approaches.

A person named “Mister Magoo” (no relation to the cartoon character voiced by Jim Backus in the 1950’s and 1960’s I’m sure) commented that the “Derived Table Method” could be improved dramatically just by adding a .query() to the XML column created in the derived table:

select x.i.value('.','varchar(3)')
from (select XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml).query('.')) a
cross apply XMLList.nodes('i') x(i)
He was correct! This also took less than a second! I suppose the theory here is that adding the .query() method forces the XML column to be instantiated immediately instead of bringing about the whole streaming nonsense.

This is excellent… it eliminates the step of having to create an intermediate XML variable… everything can be done in one single statement. It’s not quite as fast as the “XML Variable Method”, but we’re only talking about a small difference in milliseconds… it certainly performs like lightning.

The results of the above 3 methods come out as follows:

/*
Method CPU Duration(ms)
----------------------------------------------------
Derived Table 369,187 398,791
Derived Table with .query() 156 168
XML Variable 99 117
*/
I also got an e-mail from my friend SQL Server MVP Alejandro Mesa (some of you may know him as Hunchback on various SQL Forums), who mentioned that things could be sped up even more by changing the XQuery expression passed as the first parameter to the .value() method.

Instead of the simple ‘.’ that I passed…

select x.i.value('.','varchar(3)')
…Alejandro suggested using the .text() accessor and an ordinal [1] predicate, like so:

select x.i.value('(./text())[1]','varchar(3)')
True enough, this did improve the performance of the query. With this XQuery expression, we are indicating that we want the first instance ([1]) of the text-only node values (i.e. not any XML sub-elements) in our x.i node. This is more specific than our original ‘.’ expression, which indicated that we wanted to process the entire x.i node.

You can get a clearer idea of how these two XQuery expressions work in the following examples:

declare @XMLVar xml
set
@XMLVar='<a>abc<b>xyz</b>123</a>'

select x.i.value('.','varchar(10)')
from @XMLVar.nodes('a') x(i)
/* Returns 'abcxyz123' */

select x.i.value('(./text())[1]','varchar(10)')
from @XMLVar.nodes('a') x(i)
/* Returns 'abc' */

select x.i.value('(./text())[2]','varchar(10)')
from @XMLVar.nodes('a') x(i)
/* Returns '123' */
The @XMLVar variable above had two text nodes within the <a></a> element. In our original queries at the beginning of this article, each of the <i></i> elements contained only a single text node (and that’s why the expressions of ‘.’ and ‘(./text())[1]’ both produced the same result), and yet the expression ‘(./text())[1]’ was much more efficient and therefore faster because it was so much more specific in indicating what we wanted to process.

When I applied Alejandro’s suggestion to the queries, you can see how the CPU and Duration decreased even more:

/*
Method CPU Duration(ms)
----------------------------------------------------------------
Derived Table 369,187 398,791
Derived Table with .query() 156 168
Derived Table with .query() and text()[1] 130 159
XML Variable 99 117
XML Variable with text()[1] 76 77
*/
Alejandro also pointed me towards a very detailed technical article at Microsoft TechNet entitled Performance Optimizations for the XML Data Type in SQL Server 2005. If you do a lot of XML manipulation, this article looks like a must-read.

Thank you again to Mister Magoo and Alejandro for their contributions! With all of this knowledge, we can all now be the speed demons of delimited string shredding.

18 comments:

  1. Well done, Brad; me like.

    :-)

    Kent Waldrop

    ReplyDelete
  2. I especially like the derived table method with query() and text(), because it can be written as an inline function.

    ReplyDelete
  3. @Peso: Agreed... that's the beauty of it. Kudos to Mister Magoo.

    @Kent and @Michael: Thanks! Me like too! 8^)

    ReplyDelete
  4. Thanks for the mention Brad, Peso et al.

    I guess these are all going to be considered "hacks" by some, but to me they are just totally exceptional tweaks!

    Nicely documented by the way - I expect to see this turning up in solutions to a few sql challenges...

    ReplyDelete
  5. Was wondering if anyone has a recommended technique to handle invalid XML data/characters which throws errors when trying to cast to an XML data type when shredding delimited strings.

    ReplyDelete
  6. @Anonymous:

    Is this what you're looking for?:

    http://bradsruminations.blogspot.com/2009/10/un-making-list-or-shredding-of-evidence.html

    It talks about how to handle delimited strings that have ampersands or less-than or greater-than signs, which can "break" the usual shredding method.

    --Brad

    ReplyDelete
  7. Howdy folks. Would any of you be interested in participating in a little experiment, please? Thanks either way. It's located at the following URL...
    http://www.sqlservercentral.com/Forums/Topic988784-391-1.aspx

    ReplyDelete
  8. Brad,

    An epic improvement to the usual XML string-splitting method - extremely well done!

    That said, I am still uncomfortable using XML for string concatenation or splitting. I continue to recommend Adam Machanic's SQLCLR string splitter - which uses half the CPU of even the fastest XML method shown here, with no restrictions on the string to be split.

    I'll follow Jeff's challenge on SSC - I fully expect your XML code to be faster than the numbers table (even after adding the [*] trick to handle entitization).

    My sincere congratulations again.

    Paul

    ReplyDelete
  9. I've used the above XML Variable method in SQL Server 2005 to shred 15 million nodes in just over 20 minutes.

    We upgraded to SQL Server 2008 R2 and now the same shred takes over 19 hours.

    Our SysAdmin & DBA have looked over all of the configuration issues with SS 2008 R2 and everything was done according to best practices.

    Here's my code

    --SHRED THE XML--------
    WHILE @RowCNT <= @MaxRowCNT
    BEGIN
    INSERT INTO [dbo].[MAIDEN_POLICY_SHRED](POLICY_ID, fName, fValue)
    SELECT
    @RowCNT,
    x.header.value('string[1]','nvarchar(255)') as fName,
    x.header.value('string[2]','nvarchar(4000)') as fValue
    FROM @x.nodes('/java/object/void') AS x(header)
    SELECT @RowCNT = @RowCNT+1
    SELECT @x = (SELECT CAST(POLICY_CLOB as XML) FROM [dbo].[POLICY_VALUE_SET] WHERE POLICY_ID=@RowCNT)
    END
    GO

    ReplyDelete
  10. @Anonymous:

    Instead of a loop, could you try it with a single statement like so? (hope you can read this):

    select policy_id
    ,fName=StringNode.value('((./string)[1]/text())[1]','nvarchar(255)')
    ,fValue=StringNode.value('((./string)[2]/text())[1]','nvarchar(4000)')
    from policy_value_set
    cross apply (select policy_xml=cast(policy_clob as xml).query('.')) F_XML
    cross apply policy_xml.nodes('/java/object/void') F_Nodes(StringNode)

    ReplyDelete
  11. Ok issue is resolved.

    Explaination:
    SQL Server 2008 has an improved XML Kernal that creates optimistic plans when the data being shred and the shred datatype are the same.

    In my example, the xml fValue was defined as VARCHAR, forcing 2008 to run very pessimistic plan due to having to convert, inline the UTF-16 characters into a VARCHAR. After changing to NVARCHAR(4000) I was able to shred 15million nodes in 21.3 minutes.

    Changing the script to your soution Brad actually decreased performance similarly to where I was at before.

    I like what you posted because I was not aware of StringNode as an XML shredding function.

    I'm getting a feeling that SQL Server 2008 XML Kernal changes need to be fully published. Any pointers?


    mike

    ReplyDelete
  12. I'm sorry, Mike, I'm not sure what else to add.

    I'm kind of surprised and disappointed that the set-based solution I posted didn't perform as well. (And just so there's no confusion, StringNode is not a shredding function... it was just the name I gave to the .nodes() entry in the second CROSS APPLY in the query).

    I'd post your question on one of the forums out there to see what other advice you can get. Perhaps the SQL XML Forum at MSDN: http://social.msdn.microsoft.com/Forums/en-US/sqlxml/threads. There's a guy there named "wBob" who is an XML wizard.

    --Brad

    ReplyDelete
  13. Thank you for all of your suggestions Brad.
    I too was very surprised that using a single in-line statement did not perform as well as the loop. It doesn't make sense but rest assured, I will persevere with your recommendation in an effort to write the most streamlined and effecient Shredding script as possible, including contacting wBob.

    Thanks again!
    Mike

    ReplyDelete
  14. nice tags :) and nice post :) The way you explained it is really awesome and makes every one to read till the end. keep posting.. I must appreciate your article writing skills.

    ReplyDelete
  15. I don't understand how it works. I failed to implement it.

    ReplyDelete
  16. Great website, continue the Excellent work!

    ReplyDelete