Tuesday, August 17, 2010

Integer List Splitting: A SQL Fable

Once Upon A Time...nce upon a time, not too long ago, there was a company called AdventureWorks. And working in this company was a diligent database developer named Dan Druff. Dan was responsible for developing and maintaining the AdventureWorks SQL Server database.

One day, Dan got a request to put together a query. The specs called for accepting a string of comma-delimited ProductID’s and to produce a result set representing orders that used those ProductID’s. The result set was to consist of the ProductID, the Product Name, the SalesOrderID, the Order Date, and the name of the Territory associated with the order. The final output was to be ordered by ProductID and SalesOrderID.

This sounded easy enough. Dan already had a user-defined function to split a comma-delimited list of integers. He had written it himself:

create function dbo.ufn_SplitIntArray
(
@List varchar(max)
,@Delimiter char(1)
)
returns @Items table (Item int)
as
begin
declare @Item varchar(12)
,@Pos int
while len(@List)>0
begin
set @Pos=charindex(@Delimiter,@List)
if @Pos=0 set @Pos=len(@List)+1
set @Item=left(@List,@Pos-1)
insert @Items select convert(int,ltrim(rtrim(@Item)))
set @List=substring(@List,@Pos+len(@Delimiter),len(@List))
end
return
end
He put together the requested query, incorporating his function, and he tested it out:

declare @ProductList varchar(max) = '897,911,942'
select d.ProductID
,ProductName=p.Name
,h.SalesOrderID
,h.OrderDate
,TerritoryName=t.Name
from dbo.ufn_SplitIntArray(@ProductList,',') a
join Sales.SalesOrderDetail d on a.Item=d.ProductID
join Production.Product p on d.ProductID=p.ProductID
join Sales.SalesOrderHeader h on d.SalesOrderID=h.SalesOrderID
join Sales.SalesTerritory t on h.TerritoryID=t.TerritoryID
order by d.ProductID,h.SalesOrderID
/*
ProductID ProductName SalesOrderID OrderDate TerritoryName
--------- -------------------------------- ------------ ---------- --------------
897 LL Touring Frame - Blue, 58 51823 2003-08-01 United Kingdom
897 LL Touring Frame - Blue, 58 51875 2003-08-01 Australia
911 LL Road Seat/Saddle 51140 2003-07-01 Central
911 LL Road Seat/Saddle 53472 2003-09-01 United Kingdom
911 LL Road Seat/Saddle 53488 2003-09-01 Canada
911 LL Road Seat/Saddle 53495 2003-09-01 United Kingdom
911 LL Road Seat/Saddle 53501 2003-09-01 Southeast
911 LL Road Seat/Saddle 53529 2003-09-01 Canada
942 ML Mountain Frame-W - Silver, 38 51120 2003-07-01 United Kingdom
942 ML Mountain Frame-W - Silver, 38 51711 2003-08-01 Northwest
942 ML Mountain Frame-W - Silver, 38 51758 2003-08-01 Southeast
942 ML Mountain Frame-W - Silver, 38 51799 2003-08-01 Southeast
942 ML Mountain Frame-W - Silver, 38 51856 2003-08-01 Southwest
*/
It seemed to work just fine… it spit out the result quickly and efficiently.

But something nagged at Dan. His function had serviced him well in the past, but the function was so “yesterday”… so “1970’s”… it just felt kind of old-fashioned and tired. Dan wanted a function that was more “hip” and more “with it”.

And then Dan had an idea.

He recalled reading an excellent blog post by MVP Aaron Bertrand from last July that did a roundup of various string-splitting techniques. One of the techniques was just like Dan’s, and Aaron had given the approach a nickname of RBAR, meaning “Row By Agonizing Row”. The roundup showed that this technique was one of the poorest performers.

So Dan made a decision right then and there that he was going to move into the 21st century… he was going to upgrade his function to one of the better performers that Aaron had outlined in his blog.

The fastest performer in Aaron’s roundup by far was the CLR approach. But that kind of scared Dan… he didn’t want to get into something that he didn’t fully understand.

So he gave one of the other techniques a try.



In general, the Numbers Table technique seemed to be the next-best performer after CLR in Aaron’s roundup, so Dan went about creating a table of numbers from 1 to 1,000,000 in his AdventureWorks database, and created a primary key for it:

;with 
L0
(c) as (select 0 from (values (0),(0),(0)) x(c)) --3 Rows
,L1(c) as (select 0 from L0 a, L0 b, L0 c) --27 Rows (3x3x3)
,L2(c) as (select 0 from L1 a, L1 b, L1 c) --19683 Rows (27x27x27)
,L3(c) as (select 0 from L2 a, L2 b) --387,420,489 Rows (19683x19683)
,NN(n) as (select row_number() over (order by (select 0)) from L3)
select Number=isnull(convert(int,n),0) --Force it to be a "not null" column
into dbo.Numbers
from NN
where n<=1000000
go
alter table dbo.Numbers
add constraint PK_Numbers
primary key clustered (Number)
with (fillfactor=100)
And then he put together a function to split a string of integers using this auxiliary table of numbers:

create function dbo.ufn_SplitIntArrayNum
(
@List varchar(max)
,@Delimiter char(1)
)
returns table
as
return
select
Item=convert(int,String)
from dbo.Numbers with (nolock)
cross
apply (select ItemPos=Number) F1
cross apply (select DelimPos=charindex(@Delimiter,@List+@Delimiter,ItemPos)) F2
cross apply (select String=rtrim(ltrim(substring(@List,ItemPos,DelimPos-ItemPos)))) F3
where ItemPos<=convert(int,len(@List))
and substring(@Delimiter+@List,ItemPos,1)=@Delimiter
He tested it out…

select * 
from dbo.ufn_SplitIntArrayNum('123,456,789',',')
/*
Item
----
123
456
789
*/
…and it worked just fine. So he incorporated it into his query and executed it:

declare @ProductList varchar(max) = '897,911,942'
select d.ProductID
,ProductName=p.Name
,h.SalesOrderID
,h.OrderDate
,TerritoryName=t.Name
from dbo.ufn_SplitIntArrayNum(@ProductList,',') a
join Sales.SalesOrderDetail d on a.Item=d.ProductID
join Production.Product p on d.ProductID=p.ProductID
join Sales.SalesOrderHeader h on d.SalesOrderID=h.SalesOrderID
join Sales.SalesTerritory t on h.TerritoryID=t.TerritoryID
order by d.ProductID,h.SalesOrderID
/*
ProductID ProductName SalesOrderID OrderDate TerritoryName
--------- -------------------------------- ------------ ---------- --------------
897 LL Touring Frame - Blue, 58 51823 2003-08-01 United Kingdom
897 LL Touring Frame - Blue, 58 51875 2003-08-01 Australia
911 LL Road Seat/Saddle 51140 2003-07-01 Central
911 LL Road Seat/Saddle 53472 2003-09-01 United Kingdom
911 LL Road Seat/Saddle 53488 2003-09-01 Canada
911 LL Road Seat/Saddle 53495 2003-09-01 United Kingdom
911 LL Road Seat/Saddle 53501 2003-09-01 Southeast
911 LL Road Seat/Saddle 53529 2003-09-01 Canada
942 ML Mountain Frame-W - Silver, 38 51120 2003-07-01 United Kingdom
942 ML Mountain Frame-W - Silver, 38 51711 2003-08-01 Northwest
942 ML Mountain Frame-W - Silver, 38 51758 2003-08-01 Southeast
942 ML Mountain Frame-W - Silver, 38 51799 2003-08-01 Southeast
942 ML Mountain Frame-W - Silver, 38 51856 2003-08-01 Southwest
*/
Well, it gave the correct result, but something seemed strange. The query seemed sluggish. It didn’t seem as quick as his query that used his tired old “RBAR” function. But that couldn’t be possible… after all, Aaron’s blog had clearly demonstrated the superiority of the Numbers approach.

So Dan fired up Profiler and did a comparison between the two queries. And he looked at the Actual Execution Plans for the two queries and he was shocked at what he discovered:

/*
Estimated Estimated
Description CPU Reads Duration QueryCost Number of Rows MemoryGrant
----------------------------------------------------------------------------------------
Original (RBAR) 62ms 832 304ms 0.790177 359.935 2,192KB
Numbers Technique 5961ms 244,199 2229ms 3556.620000 56,629,000.000 37,984KB
*/
What!!??What!!?? This couldn’t be! Dan’s RBAR approach completely annihilated the Numbers approach by a mile! Heck, it beat it by more than a mile… it beat it by a light-year!

The CPU and Reads of the Numbers Technique was huge! The memory grant was almost 38MB! And the query took over 2 full seconds to run! This was ridiculous!

Dan was very puzzled and a little scared. Maybe he was doing something wrong. Maybe something was going on that he didn’t understand.

He decided to try another technique.



The approach that Aaron’s blog named “Inline 2” looked promising. So Dan created that function in his database:

create function dbo.ufn_SplitIntArrayInline
(
@List varchar(max)
,@Delimiter char(1)
)
returns table
as
return
select
Item=convert(int,(substring(@Delimiter+@List+@Delimiter
,w.n+1
,charindex(@Delimiter
,@Delimiter+@List+@Delimiter
,w.n+1) - w.n - 1
)))
from (select n=v0.n+v1.n+v2.n+v3.n
from (select n = 0
union all select 1 union all select 2 union all select 3
union all select 4 union all select 5 union all select 6
union all select 7 union all select 8 union all select 9
union all select 10 union all select 11 union all select 12
union all select 13 union all select 14 union all select 15) v0
,(select n = 0
union all select 16 union all select 32 union all select 48
union all select 64 union all select 80 union all select 96
union all select 112 union all select 128 union all select 144
union all select 160 union all select 176 union all select 192
union all select 208 union all select 224 union all select 240) v1
,(select n = 0
union all select 256 union all select 512 union all select 768
union all select 1024 union all select 1280 union all select 1536
union all select 1792 union all select 2048 union all select 2304
union all select 2560 union all select 2816 union all select 3072
union all select 3328 union all select 3584 union all select 3840) v2
,(select n = 0
union all select 4096 union all select 8192 union all select 12288
union all select 16384 union all select 20480 union all select 24576
union all select 28672 union all select 32768 union all select 36864
union all select 40960 union all select 45056 union all select 49152
union all select 53248 union all select 57344 union all select 61440
union all select 65536 union all select 69632 union all select 73728
union all select 77824 union all select 81920 union all select 86016
union all select 90112 union all select 94208 union all select 98304
union all select 102400 union all select 106496 union all select 110592
union all select 114688 union all select 118784 union all select 122880
union all select 126976 union all select 131072 union all select 135168
union all select 139264 union all select 143360 union all select 147456) v3
) w
where w.n=charindex(@Delimiter,@Delimiter+@List+@Delimiter,w.n)
and w.n<len(@Delimiter+@List)
He tested it out…

select * 
from dbo.ufn_SplitIntArrayInline('123,456,789',',')
/*
Item
----
123
456
789
*/
…and it worked just fine. So he incorporated it into his query and executed it:

declare @ProductList varchar(max) = '897,911,942'
select d.ProductID
,ProductName=p.Name
,h.SalesOrderID
,h.OrderDate
,TerritoryName=t.Name
from dbo.ufn_SplitIntArrayInline(@ProductList,',') a
join Sales.SalesOrderDetail d on a.Item=d.ProductID
join Production.Product p on d.ProductID=p.ProductID
join Sales.SalesOrderHeader h on d.SalesOrderID=h.SalesOrderID
join Sales.SalesTerritory t on h.TerritoryID=t.TerritoryID
order by d.ProductID,h.SalesOrderID
/*
Msg 537, Level 16, State 5, Line 2
Invalid length parameter passed to the LEFT or SUBSTRING function.
*/
What!!??What!!?? What was going on? The query completely bombed! This didn’t make sense. The function worked when run by itself, but once it was JOINed with other tables, it turned over and died.

Dan looked at the estimated execution plan and recorded some information to compare it with his previous two approaches:

/*
Estimated Estimated
Description CPU Reads Duration QueryCost Number of Rows MemoryGrant
----------------------------------------------------------------------------------------
Original (RBAR) 62ms 832 304ms 0.790177 359.935 2,192KB
Numbers Technique 5961ms 244,199 2229ms 3556.620000 56,629,000.000 37,984KB
Inline Technique N/A N/A N/A 116532.000000 34,912,200.000 N/A
*/
Omigosh! Look at the Estimated Query Cost! The optimizer estimated this query was going to take 116532/3600 = 32 Hours to run! No wonder it choked!

Dan was really confused at this point. But he decided to bravely trudge onward.



He figured he would give the XML approach a try… it certainly seemed to hold its own in Aaron’s tests. This XML technique was an approach that some joker named Brad Schulz had talked about several times in his blog. Dan didn’t always read Brad’s blog posts… Sometimes Brad had a weird sense of humor and that didn’t always sit well with Dan… Dan was kind of a serious fellow.

But, despite his feelings about Brad, Dan went ahead and created a function using the XML string-splitting function in his database:

create function dbo.ufn_SplitIntArrayXML
(
@List varchar(max)
,@Delimiter char(1)
)
returns table
as
return
select
Item
from (select XMLString='<x>'+replace(@List,@Delimiter,'</x><x>')+'</x>') F1
cross apply (select XMLList=cast(XMLString as xml).query('.')) F2
cross apply XMLList.nodes('/x') F3(XMLNode)
cross
apply (select Item=XMLNode.value('(./text())[1]','int')) F4
He tested it out…

select * 
from dbo.ufn_SplitIntArrayXML('123,456,789',',')
/*
Item
----
123
456
789
*/
…and it worked just fine. So he incorporated it into his query and executed it:

declare @ProductList varchar(max) = '897,911,942'
select d.ProductID
,ProductName=p.Name
,h.SalesOrderID
,h.OrderDate
,TerritoryName=t.Name
from dbo.ufn_SplitIntArrayXML(@ProductList,',') a
join Sales.SalesOrderDetail d on a.Item=d.ProductID
join Production.Product p on d.ProductID=p.ProductID
join Sales.SalesOrderHeader h on d.SalesOrderID=h.SalesOrderID
join Sales.SalesTerritory t on h.TerritoryID=t.TerritoryID
order by d.ProductID,h.SalesOrderID
/*
ProductID ProductName SalesOrderID OrderDate TerritoryName
--------- -------------------------------- ------------ ---------- --------------
897 LL Touring Frame - Blue, 58 51823 2003-08-01 United Kingdom
897 LL Touring Frame - Blue, 58 51875 2003-08-01 Australia
911 LL Road Seat/Saddle 51140 2003-07-01 Central
911 LL Road Seat/Saddle 53472 2003-09-01 United Kingdom
911 LL Road Seat/Saddle 53488 2003-09-01 Canada
911 LL Road Seat/Saddle 53495 2003-09-01 United Kingdom
911 LL Road Seat/Saddle 53501 2003-09-01 Southeast
911 LL Road Seat/Saddle 53529 2003-09-01 Canada
942 ML Mountain Frame-W - Silver, 38 51120 2003-07-01 United Kingdom
942 ML Mountain Frame-W - Silver, 38 51711 2003-08-01 Northwest
942 ML Mountain Frame-W - Silver, 38 51758 2003-08-01 Southeast
942 ML Mountain Frame-W - Silver, 38 51799 2003-08-01 Southeast
942 ML Mountain Frame-W - Silver, 38 51856 2003-08-01 Southwest
*/
Hmmm… well at least this approach didn’t croak like the Inline technique did. The result was correct. And it seemed pretty snappy in execution. Dan investigated the Profiler and Actual Execution Plan data and added data to his table of figures:

/*
Estimated Estimated
Description CPU Reads Duration QueryCost Number of Rows MemoryGrant
----------------------------------------------------------------------------------------
Original (RBAR) 62ms 832 304ms 0.790177 359.935 2,192KB
Numbers Technique 5961ms 244,199 2229ms 3556.620000 56,629,000.000 37,984KB
Inline Technique N/A N/A N/A 116532.000000 34,912,200.000 N/A
XML Technique 237ms 843 449ms 26.408000 35,993.500 25,856KB
*/
What!!??This was really weird! Dan’s original RBAR approach was still the clear winner in all these tests. The XML approach came pretty darn close… MUCH closer than the other methods… but it wanted a memory grant of 26MB!

Dan’s world had turned upside-down. He was expecting to hear news bulletins any minute talking about pigs flying and snowballs discovered in Hell. This just wasn’t right!

Dan figured he’d have to go the final step… There was only one clear path. He took a deep breath, let it out slowly, and decided he would have to brave the CLR technique.



Aaron Bertrand’s roundup had given a link to Adam Machanic’s blog that had the C# source code for doing the string splitting. But, as little as Dan knew about C#, he could tell that Adam’s code returned a table of NVARCHAR vales, not INT values.

So Dan gave a call to an old friend from college named Biff Wellington who was a C# consultant and asked if he could help. Biff was thrilled to hear from Dan and said he’d be happy to assist… anything for an old pal. Dan referred him to the source code at Adam Machanic’s blog and, an hour later, Biff emailed Dan the revised code… and a bill for $150. Dan swore under his breath.

Here was the code that Biff had sent, which amended Adam Machanic’s code to include a FillRowInts function and SplitStringOfInts function to return a table of integers:

Click here to download the C# Code

Now Dan had to jump through all the hoops to get this C# code to work. He downloaded a copy of C# 2008 Express Edition from Microsoft (at least that was free, unlike Biff’s services… the jerk). He created a project using the Class Library template, calling the project BiffSucks. He pasted in Biff’s code and hit the F6 key to Build the Solution and it built successfully! Now he was ready to get into SSMS to create his function.

First he made sure that the server was enabled for CLR:

sp_configure 'CLR Enabled',1
go
reconfigure
go
Then he created an assembly, which referenced the DLL he had built:

create assembly StringHelper
from 'c:\Users\DanDruff\Documents\Visual Studio 2008\Projects\'
+'BiffSucks\BiffSucks\obj\Release\BiffSucks.dll'
And then he created the function to use that assembly and its SplitStringOfInts function:

create function dbo.ufn_SplitIntArrayCLR
(
@List nvarchar(max)
,@Delimiter nchar(1)
)
returns table (Item int)
external name StringHelper.UserDefinedFunctions.SplitStringOfInts
Gee, this wasn’t so hard after all!

He tested it out…

select * 
from dbo.ufn_SplitIntArrayCLR(N'123,456,789',N',')
/*
Item
----
123
456
789
*/
…and it worked just fine. So he incorporated it into his query and executed it:

declare @ProductList nvarchar(max) = N'897,911,942'
select d.ProductID
,ProductName=p.Name
,h.SalesOrderID
,h.OrderDate
,TerritoryName=t.Name
from dbo.ufn_SplitIntArrayCLR(@ProductList,N',') a
join Sales.SalesOrderDetail d on a.Item=d.ProductID
join Production.Product p on d.ProductID=p.ProductID
join Sales.SalesOrderHeader h on d.SalesOrderID=h.SalesOrderID
join Sales.SalesTerritory t on h.TerritoryID=t.TerritoryID
order by d.ProductID,h.SalesOrderID
/*
ProductID ProductName SalesOrderID OrderDate TerritoryName
--------- -------------------------------- ------------ ---------- --------------
897 LL Touring Frame - Blue, 58 51823 2003-08-01 United Kingdom
897 LL Touring Frame - Blue, 58 51875 2003-08-01 Australia
911 LL Road Seat/Saddle 51140 2003-07-01 Central
911 LL Road Seat/Saddle 53472 2003-09-01 United Kingdom
911 LL Road Seat/Saddle 53488 2003-09-01 Canada
911 LL Road Seat/Saddle 53495 2003-09-01 United Kingdom
911 LL Road Seat/Saddle 53501 2003-09-01 Southeast
911 LL Road Seat/Saddle 53529 2003-09-01 Canada
942 ML Mountain Frame-W - Silver, 38 51120 2003-07-01 United Kingdom
942 ML Mountain Frame-W - Silver, 38 51711 2003-08-01 Northwest
942 ML Mountain Frame-W - Silver, 38 51758 2003-08-01 Southeast
942 ML Mountain Frame-W - Silver, 38 51799 2003-08-01 Southeast
942 ML Mountain Frame-W - Silver, 38 51856 2003-08-01 Southwest
*/
It worked! And it seemed pretty quick! Dan was hopeful as he started to record the Profiler and Actual Execution Plan figures into his table:

/*
Estimated Estimated
Description CPU Reads Duration QueryCost Number of Rows MemoryGrant
----------------------------------------------------------------------------------------
Original (RBAR) 62ms 832 304ms 0.790177 359.935 2,192KB
Numbers Technique 5961ms 244,199 2229ms 3556.620000 56,629,000.000 37,984KB
Inline Technique N/A N/A N/A 116532.000000 34,912,200.000 N/A
XML Technique 237ms 843 449ms 26.408000 35,993.500 25,856KB
CLR Technique 443ms 1541 526ms 22.483400 359,935.000 99,744KB
*/
What!!??Huh? The CLR approach performed in under a second, but it was still slower than the RBAR and XML approach! And it had almost twice the reads! And it wanted almost 100MB of memory to do its job!!

Dan was feeling faint. This must be a dream, he thought. He pinched himself. But he didn’t wake up… it was not a dream.

Dan was at wit’s end… The world no longer made any sense.



Not knowing what else to do, he decided to go to Brad Schulz’s blog to read more about the XML technique. Even though it underperformed compared to Dan’s boring old RBAR approach, at least the whole idea of XML was kind of cool and modern.

And when he navigated to the blog, he came upon a post that Brad had written on Aug13,2010 about Query Hints. Brad was describing a situation frighteningly similar to what Dan was experiencing. By adding query hints of FORCE ORDER, MAXDOP 1, and LOOP JOIN to the query, Brad demonstrated how the execution plan could be changed to give better performance.

So Dan decided to add these query hints to each of the 5 techniques and see what would happen.

The new query plans created by the hints certainly made a lot more sense… no more parallelism and weird JOIN configurations and such… and the Inline technique actually worked this time!

Here is what he recorded:

/*
Estimated Estimated
Description CPU Reads Duration QueryCost Number of Rows MemoryGrant
----------------------------------------------------------------------------------------
NO HINTS:
Original (RBAR) 62ms 832 304ms 0.790177 359.935 2,192KB
Numbers Technique 5961ms 244,199 2229ms 3556.620000 56,629,000.000 37,984KB
Inline Technique N/A N/A N/A 116532.000000 34,912,200.000 N/A
XML Technique 237ms 843 449ms 26.408000 35,993.500 25,856KB
CLR Technique 443ms 1541 526ms 22.483400 359,935.000 99,744KB
----------------------------------------------------------------------------------------
WITH HINTS:
Original (RBAR) 18ms 257 67ms 1.13386 288.875 1,024KB
Numbers Technique 23ms 212 178ms 42676.40000 20,968,100.000 383,976KB
Inline Technique 130ms 202 157ms 30394.90000 10,049,000.000 383,976KB
XML Technique 34ms 203 41ms 45.78860 28,887.500 6,768KB
CLR Technique 32ms 203 32ms 220.80600 288,875.000 63,000KB
----------------------------------------------------------------------------------------
*/
With the hints incorporated into the queries, all of the approaches were nice and quick, with a Duration of under 200ms. The Numbers and Inline techniques were clearly the worst performers… and they both wanted an ENORMOUS memory grant of about 384MB!!!… This was because the optimizer made such a gross over-estimate of how many records would result, that it figured it was going to have to sort tens of millions of rows. Dan eliminated those techniques outright.

Oddly enough, his RBAR technique still had the lowest CPU, but among the three remaining techniques, it had the highest Reads and longest Duration. On the other hand, RBAR had the lowest memory grant by far. The CLR, which was the fastest performer, wanted 60 times the memory that Dan’s RBAR technique wanted. At least the XML approach was not as much of a memory hog, since it had a more reasonable estimate of the number of resulting rows than CLR.

Ultimately, Dan made a decision. He was happy with his RBAR function. Sure, it wasn’t “sexy” like the other approaches, but it had a lot going for it. It was readable. It was simple. The optimizer made reasonable estimates of rows when it was incorporated into a query. It did not demand a lot of memory. And, if in the future, Dan forgot to include the FORCE ORDER and MAXDOP 1 and LOOP JOIN hints in a query that used his function, it would still perform just fine with very little memory requirements (unlike the other approaches).

However, Dan also vowed to make sure that in the future, he would at least test out future queries to make sure that his RBAR function was still a reasonable performer. There may be other, more complicated queries in his future that may work better with the other techniques. He now knew, better than ever, the golden rule of SQL Server: “It depends.”

Dan sat back in his chair, pleased with his decisions.

And he lived happily ever after.



The morals of the story are:

What works well in theory may not work well in practice… It depends… Test test test.

and (forgive me)…

If you're itching to get out of a hairy splitting situation, apply the Dan Druff treatment:
  • Gather data in testing various approaches

  • Hints if necessary

  • Repeat for future queries

17 comments:

  1. Nice one, Brad
    Did you try to run some tests for 30 and 300 integers in the list? Results for 3 parameters are indeed revolting :)

    Piotr

    ReplyDelete
  2. I took the hint and read now the other post. It's amazing in how many ways you can shoot your foot, isn't it? :)
    I think that it should be said that hints should be used prescribed by a doctor, because of side effects they can cause.

    Thanks

    Piotr

    ReplyDelete
  3. @Piotr:

    I most likely will do a follow-up to this post, because I also wanted to demonstrate that the performance figures are similar when attacking the query with a ProductID IN (SELECT Item FROM MyFunction) approach.

    I should also do a test for more integers as you suggest. I would think as the number of items increases, the CLR method would come through as a better performer compared to the others.

    But in reality, how many times is a list of thousands of integers passed to a routine like this as opposed to just a small handful?

    The fact that only 3 lousy integers causes this much grief is indeed interesting.

    --Brad

    ReplyDelete
  4. Ha! This actually made me laugh out loud getting looks from colleagues forcing me to send the link to this site to them.

    ReplyDelete
  5. WTF? The guy could have left me a comment on my blog and I would have shown him how to convert it FOR FREE. $150? WTF!? It's a 3-minute conversion, if that.

    ReplyDelete
  6. Sorry, did I take this "fable" too seriously? I skipped the title the first time and just did a search for my name (found the post via a Google alert) :-)

    ReplyDelete
  7. Okay, after reading the ENTIRE post I'm finding the whole thing to be a bit misleading. The real moral here is that "Dan" was focused on the wrong thing the whole time. The performance of splitting the strings has nothing to do with the rest of the performance picture in this case--just as you've shown. So this is not a post about integer splitting at all, it's a post about the impact of bad estimates.

    ReplyDelete
  8. @Adam: In a way, you have a point about the fable being misleading.

    The performance of the various string-splitting techniques, I'm sure, is just what Aaron demonstrated in his blog... when they are run by themselves in a vacuum.

    But the only reason we put these integer-splitting techniques together is so they can be incorporated into queries, right?

    It's frustrating that a simple query on a table with only 100,000 rows (Sales.SalesOrderDetail) that JOINs with a couple other tables causes this much of a fiasco with a list of only 3 integers.

    I just want to demonstrate that one should not blindly accept a particular technique as being the "best" or "fastest".

    --Brad

    ReplyDelete
  9. Well, we need to tune for the actual bottleneck. If you're splitting large strings, then yes, you do want to use the "fastest" method for doing that. Otherwise, you want to use the "fastest" method for whatever else you're doing.

    If you want the best of both worlds for this particular case, you can get it. Notice that the "RBAR" method uses a multistatement TVF? It might be interesting for "Dan" to wrap the call to the CLR version (or any of the others, for that matter) in a multistatement TVF and try these tests again...

    ReplyDelete
  10. Interesting idea regarding the multi-statement suggestion... I'll definitely check that out... sounds promising.

    Thanks for the input!

    ReplyDelete
  11. A part of the problem is what CHARINDEX does when it sees a MAX datatype and what the execution plan does (full cartesian join, usually) when there's a datatype mismatch between the result of the CHARINDEX and the integers in the numbers table.

    I just happen to be doing some work in this area and I'll try to remember to post back with some test info and troubleshooting results.

    And, you're correct... don't believe everything you see on the internet even when there's test code. ;-)

    ReplyDelete
  12. I'm not sure what I'm doing "wrong". I'm getting "only" a 70ms difference in performance between the Loop and Numbers table functions.

    Of course, for 10,000 "hits" a second, that 70ms will make all the difference in the world, but I'm not seeing the huge difference that you're seeing, Brad. In fact, I suspect there's something wrong with your machine because my 8 year old desktop (1.8GHZ single P4) runs the While Loop version of the code in only 31ms/31ms (CPU AND DURATION) instead of the 62ms/304ms that you reported for the same code.

    It will be interesting when I compare the two methods across a whole gambit of data. Thanks for taking the time to post this.

    ReplyDelete
  13. Just an update... haven't tested everything yet but it looks like the addition of the Cross Apply's that you made for reader clarity also cause the code to run in about a third of the time on my humble desktop. I see that Paul White ("Mr. Cross Apply") visited this thread and, hopefully, he might be able to explain why. It IS interesting that the Cross Apply's caused that percentage of a speed gain for lower width splits. Like I said, 70ms can make a huge difference if the code is running 10,000 times a second.

    As you and I may have guessed (but still had to check for), the datatype mismatch doesn't affect the While Loop noticeably and I can't get the numbers table solution to misbehave as badly as you seem to have.

    Also, thanks for the XML splitter. I have an active post that you responded to looking for some of these to test with.

    ReplyDelete
  14. Brad, any feedback on why there's so much difference between your machine and mine? Do you, perhaps, have a different collation or something setup?

    ReplyDelete
  15. Nice article. What if you use varchar(8000) or nvarchar(4000) instead of (max)? Also what if you serialize your results to a temp table instead of joining directly to functions? How does this affect performance?

    ReplyDelete