Saturday, December 12, 2009

Delimited String Tennis Anyone?

Tennis Anyone?SQL Server MVP Adam Haines and I have been going back and forth for months in our blogs on the subject of creating and splitting a delimited string… it’s been kind of a virtual T-SQL Tennis match… not a competition, but rather a friendly rally, knocking the subject back and forth, practicing our skills. Between the two of us, we’ve just about beaten the subject to death; however, I like to think that our collective research has been a win-win situation for the SQL Server community.

(If you’d like to look at these past articles, read Adam’s posts here, here, here, and here and read my posts and comments here, here, here, and here.)

In Adam’s latest volley a few days ago (which you can read here), he did some in-depth time tests on various methods of splitting a delimited string…. the XML approach using XPath and the .nodes() and .value() methods, the permanent Numbers table approach, and the approach using a TVF with a virtual table of Numbers.

His verdict was that the XML approach was a biiiiig disappointment when handling a comma-delimited string of many items, in one case performing over 200 times slower (!!) than the Numbers table approaches. Like Adam, I was a bit crestfallen. The XML method is the usual method suggested on the T-SQL forums for doing this sort of thing.

It is literally shocking how badly it performs.

Take a look at the following example, in which I create a variable that contains a comma-delimited list of 5000 items. Then it uses the XML approach to shred the list into a result set of 5000 rows:

/* 
Note that I have to CAST() the first parameter of REPLICATE
in order to force it to produce more than 8000 bytes
*/
declare @CommaList varchar(max)
set @CommaList='ABC'+replicate(cast(',ABC' as varchar(max)),4999)

select x.i.value('.','varchar(3)')
from (select XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml)) a
cross apply XMLList.nodes('i') x(i)
The task that we’re asking SQL Server to do is not that complicated, but do you want to guess how long this takes to fully execute on my system? Are you sitting down?

It takes over 6 minutes!

Really disappointing…

I had a lot of trouble accepting this. After doing several experiments, I finally figured out what was going on.

The problem lies in the derived table in the FROM clause, where we create a derived XML column via the CAST function.

It seems we are not really creating a complete column or value at all. Apparently we are inadvertently creating an XML stream, which sends small chunks of data to the .nodes() method and then the chunks get handed over to the .value() method to come up with our result… one chunk at a time.

On my system, when I execute the above code, I see nothing at all happen for about almost 50 seconds. Then the first 676 rows get spit out into the results window. After another 50 seconds or so, another similar-sized chunk of rows get spit out. This happens like clockwork until it spits out the final 5000th row over 6 minutes after I started the query.

You may recall in a previous post of mine where I said that an XML stream is processed in 2033-byte chunks. Well, 676 rows of 3 bytes (‘ABC’) each equals 2028 bytes. (I’m assuming that there’s a little tiny bit of overhead involved and that’s why we can’t squeeze in the 677th row). My comma-delimited string consists of 5000 items at 3 bytes each, coming to 15000 bytes. Divide that by 2033 bytes and you get 7.38, which is the number of chunks of data that get processed. Multiply that by 50 seconds per chunk, and you get… voila… 6.15 minutes.

Well, this must be all really exciting for you math nerds out there, but how does it help us?

It does help to explain what’s going on, and it does show us that streaming the XML is reeeeeaaalllly sloooooow.

So is there hope for the XML method?

YES! YES! YES!

This slow streaming only occurs when we are creating the XML to be processed via a derived table or CTE. What If we instead transfer our comma-delimited string into an actual XML variable and then process that variable, like so?:

/* 
Note that I have to CAST() the first parameter of REPLICATE
in order to force it to produce more than 8000 bytes
*/
declare @CommaList varchar(max)
set @CommaList='ABC'+replicate(cast(',ABC' as varchar(max)),4999)

declare @XMLList xml
set
@XMLList=cast('<i>'+replace(@CommaList,',','</i><i>')+'</i>' as xml)

select x.i.value('.','varchar(3)')
from @XMLList.nodes('i') x(i)
Believe it or not, this takes only 0.174 seconds!

In comparing the two pieces of code outlined in this article, here are the results:

/*            Derived     XML
Table Variable Percent
Measurement Method Method Improvement
--------------------------------------------
CPU 386,172 125 2989%
Duration(ms) 368,747 174 2019%
*/
So I’m happy to report that all is not lost for the XML method. It’s still a contender in the list-splitting competition, giving the Numbers table methods a run for their money… as long as it is implemented in a certain way.

Back to you, Adam…

UPDATE ON Jan06,2010: Please read the follow-up to this article in which I talk about some additional methods to speed the process up even further.

10 comments:

  1. Great followup Brad!!! This a very interesting developement. Did you capture the results via profiler or statistics time? Also, did you flush the cache between runs, just to make there was no query play reuse?

    If this method can put the CPU and duration in the ballpark of the numbers table solution, I will gladly go back to recommending it :^) . When I get some time I will run this test along with my tests from my blog post, to see how it compares; however, I have a feeling you may have done that already.

    Again great job!

    ReplyDelete
  2. Those were profiler statistics, and, yes, I did the usual two DBCC commands between each run, and I set the query options to "Discard results".

    I was going to run the time tests on all methods, but I was too impatient and excited about getting this new finding up to the blog.

    --Brad

    ReplyDelete
  3. Brad,

    I just did the tests and the results rival those of the numbers table both inline and permanent. This is a great find and definitely good knowledge! XML is back baby!!!!!!!!!!!!!!!

    ReplyDelete
  4. Good findings gentle man. and great work and follow-up to both of you!

    ReplyDelete
  5. Awesome post. *and* I have a new favorite method of stressing my cpu.

    ReplyDelete
  6. See, it helps to take things personally. You put it more effort that way. :)

    Bradam, this is great stuff.

    ReplyDelete
  7. And without a variable....

    declare @CommaList varchar(max)
    set @CommaList='ABC'+replicate(cast(',ABC' as varchar(max)),4999)

    select x.i.value('.','varchar(3)')
    from (select cast(''+replace(@CommaList,',','')+'' as xml).query('.') d) XMLList
    cross apply XMLList.d.nodes('i') x(i)

    ...works quickly too

    ReplyDelete
  8. Hi Brad,

    I have a general comment for your blogs.

    1. Please allow an easy search functionality on your website

    2. Please have a list of your blogs with titles somewhere in each blog page

    Say, I need to find your blog now and I'm having troubles from this page finding it.

    Thanks in advance.

    ReplyDelete
  9. Hi Naomi...

    In the upper right-hand corner of each page of this blog is a Google search box to find posts in this blog.

    Underneath my profile on the right-hand side of each page is a list of keywords that will take you to blog entries related to that word.

    And underneath that is an archive of all blog posts by date (with the blog titles).

    I think among all those tools you should be able to find anything you need.

    --Brad

    ReplyDelete
  10. This comment has been removed by a blog administrator.

    ReplyDelete