Tuesday, September 29, 2009

Know Your Data!

Back in July, I wrote a blog entry that highlighted the CROSS APPLY operator. A problem was presented that required us to pull word pairs out of a column called Message in a temporary table @T and find out how many times the word pairs occurred in the table.

Two different approaches were used to solve the problem.

The first approach involved creating a multi-line function that accepted a string and returned a table of word pairs. The final query that made use of this function looked like this (please refer back to the July blog entry for the definition of the ufn_GetWordPairs function):

select WordPair,Occurrences=count(*)
from @t
cross apply dbo.ufn_GetWordPairs(Message) f
group by WordPair
order by count(*) desc
It was a nice and simple easy-to-read query, but it took over 25 seconds to process a table with 100,000 rows.

The second approach to the problem was using a Numbers table. The final query looked like this:

select WordPair,Occurrences=count(*)
from @t
cross apply (select TrimMessage=ltrim(rtrim(Message))) f1
join Numbers on substring(' '+TrimMessage,N,1)=' '
and substring(' '+TrimMessage,N+1,1)<>' '
and N<len(TrimMessage)+1
cross apply (select WorkString=substring(TrimMessage+' ',N,len(TrimMessage)+1)) f2
cross apply (select p1=charindex(' ',WorkString)) f3
cross apply (select Word1=left(WorkString,p1-1)) f4
cross apply (select RestOfString=ltrim(substring(WorkString,p1,len(WorkString)))) f5
cross apply (select p2=charindex(' ',RestOfString)) f6
cross apply (select Word2=case when p2>0 then left(RestOfString,p2-1) end) f7
cross apply (select WordPair=Word1+' '+Word2) f8
where WordPair is not null
group by WordPair
order by count(*) desc
It looks very intimidating, but the blog entry went through the development of the query step-by-step. This query took about 12 seconds to process the same 100,000 rows… an improvement of over 50%.

So it seemed like the Numbers table approach was the clear winner.

Well… yes and no.

In the tests that I had done in July, I had 100,000 rows, but each Message column value in the @T table had only about 5-6 words in it.

Recently, I decided to do a series of tests to see if the number of words made a difference, and it certainly did!

First I changed the udf_GetWordPairs function so that it accepted and acted upon a VARCHAR(MAX). Then I created a test table of 10,000 rows. That remained consistent. What I changed was the number of words in the Message column, and you can see my results below:

/*                   (ms) 
#Words (ms) Numbers (%) Numbers
Per Function Table Function Function Table
Per Row Approach Approach /Numbers ms/Word ms/Word
----------------------------------------------------------
25 9241 6495 142% 369.6 259.8
50 17529 14744 119% 350.6 294.9
75 26590 27447 97% 354.5 366.0
100 35225 39434 89% 352.3 394.3
150 53139 69016 77% 354.3 460.1
200 71435 101267 71% 357.2 506.3
250 87203 141602 62% 348.8 566.4
300 101267 184704 55% 337.6 615.7
400 142282 291286 49% 355.7 728.2
500 173923 415866 42% 347.8 831.7
*/
Here is a graph that shows the differences in times (i.e. the first 3 columns in the data table above):

Word Pair Query Durations

The Numbers Table approach performs better until the number of words per row is in the neighborhood of about 75. After that, it’s no contest… the Function approach wins by more and more of a margin as you increase the number of words.

If you look at the raw data, you can see that the Function approach takes about 350 milliseconds per word, and that remains constant. However, the Numbers Table approach takes longer and longer as the words increase… from 260 ms/word when processing 25 words/row up to 830 ms/word when processing 500 words/row. This exponential increase in time as the number of words increases is quite evident in the graph.

The lesson here is that you shouldn’t just blindly take an approach in solving a problem without intimately knowing the data that you’re eventually going to act upon.

6 comments:

  1. Interesting.. Did you do any performance metrics on IO? I would imagine the performance degradation scales with additional IO costs. Also, did you have a clustered index, with fillfactor 100 on your numbers table? Having a good index may help reduce IO.

    ReplyDelete
  2. I was going to add something to the post regarding the IO, but I mainly wanted to talk about time.

    Interestinly, the Function Approach had huge IO, while the Numbers Table Approach remained low. For example, for 100 words/row, the Function recorded 1081536 reads and Numbers recorded 30730. When we got up to 500 words/row, the Function recorded 5416181 reads and Numbers recorded 70008.

    The Numbers table did have a clustered index. I didn't specify a fill factor when I created the table... according to the documentation, I think if I don't specify anything, it defaults to 100%. I'll have to look further.

    ReplyDelete
  3. That is huge difference in IO, in favor of the numbers table. I am curious what the % batch cost is for each method.

    ReplyDelete
  4. The query plan shows 99%:1% Numbers:Function. It essentially sees the function call as a "black box" and so it can't realistically associate any cost with it.

    ReplyDelete
  5. Is posssible to get only the first two word of the sentece?

    ReplyDelete
  6. @Anonymous:

    This is an example of getting the first two words of all the messages in a table:

    declare @t table (Message varchar(100))
    insert @t select 'cross apply is really cool'
    union all select 'and ntile is really cool too'
    union all select ' and iced tea is really cool too '

    select Word1,Word2
    from @t
    cross apply (select TrimMessage=ltrim(rtrim(Message))) f1
    cross apply (select WorkString=TrimMessage+' ') f2
    cross apply (select p1=charindex(' ',WorkString)) f3
    cross apply (select Word1=left(WorkString,p1-1)) f4
    cross apply (select RestOfString=ltrim(substring(WorkString,p1,len(WorkString)))) f5
    cross apply (select p2=charindex(' ',RestOfString)) f6
    cross apply (select Word2=case when p2>0 then left(RestOfString,p2-1) end) f7

    --Brad

    ReplyDelete