In my previous post, I expressed my disappointment with the following bit of code, which shreds elements from a comma-delimited string consisting of 5000 elements:
/*This code, which I will call the “Derived Table Method”, took over 6 painful minutes to execute!
Note that I have to CAST() the first parameter of REPLICATE
in order to force it to produce more than 8000 bytes
*/
declare @CommaList varchar(max)
set @CommaList='ABC'+replicate(cast(',ABC' as varchar(max)),4999)
select x.i.value('.','varchar(3)')
from (select XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml)) a
cross apply XMLList.nodes('i') x(i)
It had to do with how the XML column in the derived table was actually streamed and processed in small chunks as opposed to just immediately instantiating a normal, vanilla XML column.
My solution to this problem was to just create an intermediate XML variable and process it directly:
declare @XMLList xmlThis code, which I will call the “XML Variable Method”, was more like it… it took less than a second to execute… an improvement of over 2000%!
set @XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml)
select x.i.value('.','varchar(3)')
from @XMLList.nodes('i') x(i)
Since I wrote that blog entry, I received a few comments and e-mails that suggested some other approaches.
A person named “Mister Magoo” (no relation to the cartoon character voiced by Jim Backus in the 1950’s and 1960’s I’m sure) commented that the “Derived Table Method” could be improved dramatically just by adding a .query() to the XML column created in the derived table:
select x.i.value('.','varchar(3)')He was correct! This also took less than a second! I suppose the theory here is that adding the .query() method forces the XML column to be instantiated immediately instead of bringing about the whole streaming nonsense.
from (select XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml).query('.')) a
cross apply XMLList.nodes('i') x(i)
This is excellent… it eliminates the step of having to create an intermediate XML variable… everything can be done in one single statement. It’s not quite as fast as the “XML Variable Method”, but we’re only talking about a small difference in milliseconds… it certainly performs like lightning.
The results of the above 3 methods come out as follows:
/*
Method CPU Duration(ms)
----------------------------------------------------
Derived Table 369,187 398,791
Derived Table with .query() 156 168
XML Variable 99 117
*/
I also got an e-mail from my friend SQL Server MVP Alejandro Mesa (some of you may know him as Hunchback on various SQL Forums), who mentioned that things could be sped up even more by changing the XQuery expression passed as the first parameter to the .value() method. Instead of the simple ‘.’ that I passed…
select x.i.value('.','varchar(3)')…Alejandro suggested using the .text() accessor and an ordinal [1] predicate, like so:
select x.i.value('(./text())[1]','varchar(3)')True enough, this did improve the performance of the query. With this XQuery expression, we are indicating that we want the first instance ([1]) of the text-only node values (i.e. not any XML sub-elements) in our x.i node. This is more specific than our original ‘.’ expression, which indicated that we wanted to process the entire x.i node.
You can get a clearer idea of how these two XQuery expressions work in the following examples:
declare @XMLVar xmlThe @XMLVar variable above had two text nodes within the <a></a> element. In our original queries at the beginning of this article, each of the <i></i> elements contained only a single text node (and that’s why the expressions of ‘.’ and ‘(./text())[1]’ both produced the same result), and yet the expression ‘(./text())[1]’ was much more efficient and therefore faster because it was so much more specific in indicating what we wanted to process.
set @XMLVar='<a>abc<b>xyz</b>123</a>'
select x.i.value('.','varchar(10)')
from @XMLVar.nodes('a') x(i)
/* Returns 'abcxyz123' */
select x.i.value('(./text())[1]','varchar(10)')
from @XMLVar.nodes('a') x(i)
/* Returns 'abc' */
select x.i.value('(./text())[2]','varchar(10)')
from @XMLVar.nodes('a') x(i)
/* Returns '123' */
When I applied Alejandro’s suggestion to the queries, you can see how the CPU and Duration decreased even more:
/*
Method CPU Duration(ms)
----------------------------------------------------------------
Derived Table 369,187 398,791
Derived Table with .query() 156 168
Derived Table with .query() and text()[1] 130 159
XML Variable 99 117
XML Variable with text()[1] 76 77
*/
Alejandro also pointed me towards a very detailed technical article at Microsoft TechNet entitled Performance Optimizations for the XML Data Type in SQL Server 2005. If you do a lot of XML manipulation, this article looks like a must-read.Thank you again to Mister Magoo and Alejandro for their contributions! With all of this knowledge, we can all now be the speed demons of delimited string shredding.
Well done, Brad; me like.
ReplyDelete:-)
Kent Waldrop
I especially like the derived table method with query() and text(), because it can be written as an inline function.
ReplyDeleteQuality post
ReplyDelete@Peso: Agreed... that's the beauty of it. Kudos to Mister Magoo.
ReplyDelete@Kent and @Michael: Thanks! Me like too! 8^)
Thanks for the mention Brad, Peso et al.
ReplyDeleteI guess these are all going to be considered "hacks" by some, but to me they are just totally exceptional tweaks!
Nicely documented by the way - I expect to see this turning up in solutions to a few sql challenges...
Was wondering if anyone has a recommended technique to handle invalid XML data/characters which throws errors when trying to cast to an XML data type when shredding delimited strings.
ReplyDelete@Anonymous:
ReplyDeleteIs this what you're looking for?:
http://bradsruminations.blogspot.com/2009/10/un-making-list-or-shredding-of-evidence.html
It talks about how to handle delimited strings that have ampersands or less-than or greater-than signs, which can "break" the usual shredding method.
--Brad
Howdy folks. Would any of you be interested in participating in a little experiment, please? Thanks either way. It's located at the following URL...
ReplyDeletehttp://www.sqlservercentral.com/Forums/Topic988784-391-1.aspx
Brad,
ReplyDeleteAn epic improvement to the usual XML string-splitting method - extremely well done!
That said, I am still uncomfortable using XML for string concatenation or splitting. I continue to recommend Adam Machanic's SQLCLR string splitter - which uses half the CPU of even the fastest XML method shown here, with no restrictions on the string to be split.
I'll follow Jeff's challenge on SSC - I fully expect your XML code to be faster than the numbers table (even after adding the [*] trick to handle entitization).
My sincere congratulations again.
Paul
I've used the above XML Variable method in SQL Server 2005 to shred 15 million nodes in just over 20 minutes.
ReplyDeleteWe upgraded to SQL Server 2008 R2 and now the same shred takes over 19 hours.
Our SysAdmin & DBA have looked over all of the configuration issues with SS 2008 R2 and everything was done according to best practices.
Here's my code
--SHRED THE XML--------
WHILE @RowCNT <= @MaxRowCNT
BEGIN
INSERT INTO [dbo].[MAIDEN_POLICY_SHRED](POLICY_ID, fName, fValue)
SELECT
@RowCNT,
x.header.value('string[1]','nvarchar(255)') as fName,
x.header.value('string[2]','nvarchar(4000)') as fValue
FROM @x.nodes('/java/object/void') AS x(header)
SELECT @RowCNT = @RowCNT+1
SELECT @x = (SELECT CAST(POLICY_CLOB as XML) FROM [dbo].[POLICY_VALUE_SET] WHERE POLICY_ID=@RowCNT)
END
GO
@Anonymous:
ReplyDeleteInstead of a loop, could you try it with a single statement like so? (hope you can read this):
select policy_id
,fName=StringNode.value('((./string)[1]/text())[1]','nvarchar(255)')
,fValue=StringNode.value('((./string)[2]/text())[1]','nvarchar(4000)')
from policy_value_set
cross apply (select policy_xml=cast(policy_clob as xml).query('.')) F_XML
cross apply policy_xml.nodes('/java/object/void') F_Nodes(StringNode)
Ok issue is resolved.
ReplyDeleteExplaination:
SQL Server 2008 has an improved XML Kernal that creates optimistic plans when the data being shred and the shred datatype are the same.
In my example, the xml fValue was defined as VARCHAR, forcing 2008 to run very pessimistic plan due to having to convert, inline the UTF-16 characters into a VARCHAR. After changing to NVARCHAR(4000) I was able to shred 15million nodes in 21.3 minutes.
Changing the script to your soution Brad actually decreased performance similarly to where I was at before.
I like what you posted because I was not aware of StringNode as an XML shredding function.
I'm getting a feeling that SQL Server 2008 XML Kernal changes need to be fully published. Any pointers?
mike
I'm sorry, Mike, I'm not sure what else to add.
ReplyDeleteI'm kind of surprised and disappointed that the set-based solution I posted didn't perform as well. (And just so there's no confusion, StringNode is not a shredding function... it was just the name I gave to the .nodes() entry in the second CROSS APPLY in the query).
I'd post your question on one of the forums out there to see what other advice you can get. Perhaps the SQL XML Forum at MSDN: http://social.msdn.microsoft.com/Forums/en-US/sqlxml/threads. There's a guy there named "wBob" who is an XML wizard.
--Brad
Thank you for all of your suggestions Brad.
ReplyDeleteI too was very surprised that using a single in-line statement did not perform as well as the loop. It doesn't make sense but rest assured, I will persevere with your recommendation in an effort to write the most streamlined and effecient Shredding script as possible, including contacting wBob.
Thanks again!
Mike
That was a great tutorial!
ReplyDeletenice tags :) and nice post :) The way you explained it is really awesome and makes every one to read till the end. keep posting.. I must appreciate your article writing skills.
ReplyDeleteI don't understand how it works. I failed to implement it.
ReplyDeleteGreat website, continue the Excellent work!
ReplyDelete