SELECT Blog FROM Brad.Schulz CROSS APPLY SQL.Server(): Know Your Data!

Tuesday, September 29, 2009

Know Your Data!

Back in July, I wrote a blog entry that highlighted the CROSS APPLY operator. A problem was presented that required us to pull word pairs out of a column called Message in a temporary table @T and find out how many times the word pairs occurred in the table.

Two different approaches were used to solve the problem.

The first approach involved creating a multi-line function that accepted a string and returned a table of word pairs. The final query that made use of this function looked like this (please refer back to the July blog entry for the definition of the ufn_GetWordPairs function):

select WordPair,Occurrences=count(*)
from @t
cross apply dbo.ufn_GetWordPairs(Message) f
group by WordPair
order by count(*) desc

It was a nice and simple easy-to-read query, but it took over 25 seconds to process a table with 100,000 rows.

The second approach to the problem was using a Numbers table. The final query looked like this:

select WordPair,Occurrences=count(*)
from @t
cross apply (select TrimMessage=ltrim(rtrim(Message))) f1
join Numbers on substring(' '+TrimMessage,N,1)=' '
                and substring(' '+TrimMessage,N+1,1)<>' '                
                 and N<len(TrimMessage)+1
cross apply (select WorkString=substring(TrimMessage+' ',N,len(TrimMessage)+1)) f2
cross apply (select p1=charindex(' ',WorkString)) f3
cross apply (select Word1=left(WorkString,p1-1)) f4
cross apply (select RestOfString=ltrim(substring(WorkString,p1,len(WorkString)))) f5
cross apply (select p2=charindex(' ',RestOfString)) f6
cross apply (select Word2=case when p2>0 then left(RestOfString,p2-1) end) f7
cross apply (select WordPair=Word1+' '+Word2) f8
where WordPair is not null
group by WordPair
order by count(*) desc

It looks very intimidating, but the blog entry went through the development of the query step-by-step. This query took about 12 seconds to process the same 100,000 rows… an improvement of over 50%.

So it seemed like the Numbers table approach was the clear winner.

Well… yes and no.

In the tests that I had done in July, I had 100,000 rows, but each Message column value in the @T table had only about 5-6 words in it.

Recently, I decided to do a series of tests to see if the number of words made a difference, and it certainly did!

First I changed the udf_GetWordPairs function so that it accepted and acted upon a VARCHAR(MAX). Then I created a test table of 10,000 rows. That remained consistent. What I changed was the number of words in the Message column, and you can see my results below:

/*                   (ms) 
#Words     (ms)     Numbers      (%)               Numbers 
  Per    Function    Table    Function  Function    Table 
Per Row  Approach  Approach   /Numbers   ms/Word   ms/Word 
---------------------------------------------------------- 
   25       9241      6495      142%      369.6     259.8 
   50      17529     14744      119%      350.6     294.9 
   75      26590     27447       97%      354.5     366.0 
  100      35225     39434       89%      352.3     394.3 
  150      53139     69016       77%      354.3     460.1 
  200      71435    101267       71%      357.2     506.3 
  250      87203    141602       62%      348.8     566.4 
  300     101267    184704       55%      337.6     615.7 
  400     142282    291286       49%      355.7     728.2 
  500     173923    415866       42%      347.8     831.7 
*/

Here is a graph that shows the differences in times (i.e. the first 3 columns in the data table above):

The Numbers Table approach performs better until the number of words per row is in the neighborhood of about 75. After that, it’s no contest… the Function approach wins by more and more of a margin as you increase the number of words.

If you look at the raw data, you can see that the Function approach takes about 350 milliseconds per word, and that remains constant. However, the Numbers Table approach takes longer and longer as the words increase… from 260 ms/word when processing 25 words/row up to 830 ms/word when processing 500 words/row. This exponential increase in time as the number of words increases is quite evident in the graph.

The lesson here is that you shouldn’t just blindly take an approach in solving a problem without intimately knowing the data that you’re eventually going to act upon.

7 comments:

Adam HainesSeptember 30, 2009 at 6:33 AM
Interesting.. Did you do any performance metrics on IO? I would imagine the performance degradation scales with additional IO costs. Also, did you have a clustered index, with fillfactor 100 on your numbers table? Having a good index may help reduce IO.
ReplyDelete
Replies
Brad SchulzSeptember 30, 2009 at 7:06 AM
I was going to add something to the post regarding the IO, but I mainly wanted to talk about time.

Interestinly, the Function Approach had huge IO, while the Numbers Table Approach remained low. For example, for 100 words/row, the Function recorded 1081536 reads and Numbers recorded 30730. When we got up to 500 words/row, the Function recorded 5416181 reads and Numbers recorded 70008.

The Numbers table did have a clustered index. I didn't specify a fill factor when I created the table... according to the documentation, I think if I don't specify anything, it defaults to 100%. I'll have to look further.
ReplyDelete
Replies
Adam HainesSeptember 30, 2009 at 8:35 AM
That is huge difference in IO, in favor of the numbers table. I am curious what the % batch cost is for each method.
ReplyDelete
Replies
Brad SchulzSeptember 30, 2009 at 5:37 PM
The query plan shows 99%:1% Numbers:Function. It essentially sees the function call as a "black box" and so it can't realistically associate any cost with it.
ReplyDelete
Replies
AnonymousJanuary 24, 2012 at 1:35 PM
Is posssible to get only the first two word of the sentece?
ReplyDelete
Replies
Brad SchulzJanuary 24, 2012 at 1:49 PM
@Anonymous:

This is an example of getting the first two words of all the messages in a table:

declare @t table (Message varchar(100))
insert @t select 'cross apply is really cool'
union all select 'and ntile is really cool too'
union all select ' and iced tea is really cool too '

select Word1,Word2
from @t
cross apply (select TrimMessage=ltrim(rtrim(Message))) f1
cross apply (select WorkString=TrimMessage+' ') f2
cross apply (select p1=charindex(' ',WorkString)) f3
cross apply (select Word1=left(WorkString,p1-1)) f4
cross apply (select RestOfString=ltrim(substring(WorkString,p1,len(WorkString)))) f5
cross apply (select p2=charindex(' ',RestOfString)) f6
cross apply (select Word2=case when p2>0 then left(RestOfString,p2-1) end) f7

--Brad
ReplyDelete
Replies
파워볼사이트March 7, 2025 at 1:17 AM
Publicación de artículo realmente informativa. Tengo muchas ganas de leer más. Quieren más.
ReplyDelete
Replies

Add comment

SELECT Blog FROM Brad.Schulz CROSS APPLY SQL.Server()

Tuesday, September 29, 2009

Know Your Data!

7 comments:

Search This Blog

Subscribe

About Me

Keywords

Blog Archive

Followers

# # #

SELECT Blog FROM Brad.Schulz CROSS APPLY SQL.Server()

Tuesday, September 29, 2009

Know Your Data!

7 comments:

Search This Blog

Subscribe

Subscribe To...

About Me

Keywords

Blog Archive

Followers

# # #