Saturday, July 11, 2009

Word Pairs Revisited (Cool CROSS APPLY Tricks, Part 2 ½)

In my last blog entry, I was so filled with zeal in demonstrating the CROSS APPLY operator that I threw in a final example at the end of the entry that I hurriedly explained in a small throwaway paragraph, and I didn’t give it the attention that it deserved.

So that’s what I’m going to do here, going over the thought process step by step, and we’re going to improve on it as well.

Just to review, we had this table of messages:

declare @t table (Message varchar(100))
insert @t select 'cross apply is really cool'
union all select 'and ntile is really cool too'
union all select 'and iced tea is really cool too'
And our job was to pull out word pairs and find out how many times they occur in the table. For example, the first row has the message “cross apply is really cool”. The word pairs in this message are “cross apply”, “apply is”, “is really” and “really cool”.

Let’s set out to write a query that will not make any assumptions about the messages in the table. For example, perhaps they might have leading and/or trailing spaces in the messages. Perhaps there is more than one space between words. So let’s mess up the third row in our table a little bit:

declare @t table (Message varchar(100))
insert @t select 'cross apply is really cool'
union all select 'and ntile is really cool too'
union all select ' and iced tea is really cool too '
Okay, let’s proceed with our task…

One traditional approach to pulling the word pairs out would be to write a multi-line table function that accepts a string and returns a table of word pairs.

Here’s an example of such a function, and the comments within should explain what’s going on:

if object_id('ufn_GetWordPairs') is not null drop function dbo.ufn_GetWordPairs
go
create function dbo.ufn_GetWordPairs (@String varchar(500))
returns @WordPairs table (WordPair varchar(100))
as
begin

--Declare variables
declare @CleanString varchar(500) --A cleaned-up copy of the string passed to the function
declare @p int --A position pointer to a character in the CleanString
declare @FirstSpacePos int --The location of the first space character
declare @SecondSpacePos int --The location of the second space character

--Get a copy of the string that was passed to the function
--And clean it up by removing leading and trailing spaces
-- and also get rid of any double spaces in the string.
--In other words, we only want single spaces between words.
set @CleanString=ltrim(rtrim(@String))
while charindex(' ',@CleanString)>0 set @CleanString=replace(@CleanString,' ',' ')

--Set our pointer to position#1 of the string, which is
--the beginning of our first word
set @p=1

--Let's loop through the string, looking for word pairs
while @p<=len(@CleanString)
begin

--Find the first space, starting at our position
set @FirstSpacePos=charindex(' ',@CleanString,@p)

--If we didn't find one, then that must mean that
--our position pointer is pointing to the last word
--in the string, so we're done. Let's get outta the loop.
if @FirstSpacePos=0 break

--Okay, we found the first space, so let's find the second one
set @SecondSpacePos=charindex(' ',@CleanString,@FirstSpacePos+1)

--If we didn't find one, then that must mean that
--we're on the last word pair of the string
--So we'll "pretend" that a space is after the end of the string
if @SecondSpacePos=0 set @SecondSpacePos=len(@CleanString)+1

--We now have a word pair, so let's put it in the table
insert @WordPairs select substring(@CleanString,@p,@SecondSpacePos-@p)

--And now let's set our new position to be the character just
--after the first space, which is the beginning of the next word
set @p=@FirstSpacePos+1

end

--We're finished
return

end
Let’s test it out:

select * from dbo.ufn_GetWordPairs('  one   two  three four  five   ')
/*
WordPair
----------
one two
two three
three four
four five
*/
Looks good! Now we can use that function in a CROSS APPLY to get our word pairs and their occurrences:

select WordPair,Occurrences=count(*)
from @t
cross apply dbo.ufn_GetWordPairs(Message) f
group by WordPair
order by count(*) desc
/*
WordPair Occurrences
----------- -----------
is really 3
really cool 3
cool too 2
cross apply 1
iced tea 1
and iced 1
and ntile 1
apply is 1
tea is 1
ntile is 1
*/
So, for each row in the table, we pass the Message to the function, and it returns (as a multi-row table with the column WordPair) all the word pairs it can find. And we GROUP BY those pairs and present the number of occurrences.

Using the function is very attractive. The whole messy word-pair gathering is encapsulated in a nice “black box”, and our CROSS APPLY query looks ridiculously simple.

But the bad news is that calling that function is very expensive. I created a test message table with 100,000 messages in it and when I ran the above query that used the function, it took over 25 seconds on average to run.

Calling a multi-line function and doing all of that looping and calculating 100,000 times is not really SQL’s strength. Its strength is in working with sets of data… that’s where it really shines. So let’s approach the problem that way.

The way to do this is with a table of Numbers. Some people routinely keep a table of numbers (integers) from one to a million in each of their databases, because it can be used for many clever solutions to problems.

Here’s an ingenious way, using a series of CTE’s (Common Table Expressions), to create a permanent table of all the numbers from 1 to 1,000,000:

create table Numbers (N int primary key);
go
with
L0
as (select 1 as C union all select 1) --2 rows
,L1 as (select 1 as C from L0 as A, L0 as B) --4 rows (2x2)
,L2 as (select 1 as C from L1 as A, L1 as B) --16 rows (4x4)
,L3 as (select 1 as C from L2 as A, L2 as B) --256 rows (16x16)
,L4 as (select 1 as C from L3 as A, L3 as B) --65536 rows (256x256)
,L5 as (select 1 as C from L4 as A, L4 as B) --4,294,967,296 rows (65536x65536)
,Nums as (select row_number() over (order by (select 0)) as N from L5)
insert Numbers
select N from Nums
where N<=1000000;
Note that we created a primary key on N, therefore creating a clustered index on the table. This way SQL can take advantage of the index when we perform a JOIN or a WHERE on the table.

Now let’s see how we can put this to use.

Let’s take another look at our message table:

declare @t table (Message varchar(100))
insert @t select 'cross apply is really cool'
union all select 'and ntile is really cool too'
union all select ' and iced tea is really cool too '
Our first step is to get rid of the leading and trailing spaces. So we'll use a CROSS APPLY to introduce a new column that is a trimmed version of our original message:

select Message,TrimMessage
from @t
cross apply (select TrimMessage=ltrim(rtrim(Message))) f1
/*
Message TrimMessage
---------------------------------------------- ----------------------------------------
cross apply is really cool cross apply is really cool
and ntile is really cool too and ntile is really cool too
and iced tea is really cool too and iced tea is really cool too
*/
We can’t really easily get rid of the multiple spaces between words, because we have no idea how many there could be. You may remember we handled it with a WHILE loop in our function… there’s no equivalent to that in a standalone query. We’ll just handle that multi-space problem in a different way later.

In order to find the word pairs in the messages, we have to know what a word is. A word is something that starts with a non-space character and the character preceding it must be a space… or the word could also be at the very beginning of the message with no space before it. Let’s be more general and just artificially put a space in front of the whole trimmed message and now our rule is that a word starts just after a space. So in essence, we need to find all instances in a message where there is a non-space character following a space character.

That’s where our Numbers table comes into play. We can JOIN our table to the Numbers table (in a one-to-many fashion) based on the positions of the words. This next step illustrates that (remember the column named N is our Numbers table column):

select TrimMessage,N
from @t
cross apply (select TrimMessage=ltrim(rtrim(Message))) f1
join Numbers on substring(' '+TrimMessage,N,1)=' '
and substring(' '+TrimMessage,N+1,1)<>' '
and N<len(TrimMessage)+1
/*
TrimMessage N
---------------------------------------- -----------
cross apply is really cool 1
cross apply is really cool 7
cross apply is really cool 13
cross apply is really cool 16
cross apply is really cool 23
and ntile is really cool too 1
and ntile is really cool too 5
and ntile is really cool too 11
and ntile is really cool too 14
and ntile is really cool too 21
and ntile is really cool too 26
and iced tea is really cool too 1
and iced tea is really cool too 6
and iced tea is really cool too 11
and iced tea is really cool too 18
and iced tea is really cool too 25
and iced tea is really cool too 33
and iced tea is really cool too 38
*/
So we’re finding (via the ON condition of the JOIN) instances where N points to a character that is a space and the character after it is not a space. We’re also limiting N to being less than LEN(TrimMessage)+1. The reason for the +1 is because we were artificially adding a space at the beginning of our trimmed message. SQL Server can take advantage of that predicate and do a clustered index seek into the Numbers table. This makes the JOIN very fast. Also note that, in adding an artificial space at the beginning of our trimmed message, we end up with N having the value of the beginning position of the words as you can see from the results above.

Now that we know the starting positions of the words, let’s use SUBSTRING() to create a work string that starts at those positions. Remember, our general rule about words was that they start after a space. Let’s also make a general rule that words end just before a space. So we will artificially add a space at the END of our work string. Let’s see how that looks:

select TrimMessage,N,WorkString
from @t
cross apply (select TrimMessage=ltrim(rtrim(Message))) f1
join Numbers on substring(' '+TrimMessage,N,1)=' '
and substring(' '+TrimMessage,N+1,1)<>' '
and N<len(TrimMessage)+1
cross apply (select WorkString=substring(TrimMessage+' ',N,len(TrimMessage)+1)) f2
/*
TrimMessage N WorkString
---------------------------------------- ----------- ----------------------------------------
cross apply is really cool 1 cross apply is really cool
cross apply is really cool 7 apply is really cool
cross apply is really cool 13 is really cool
cross apply is really cool 16 really cool
cross apply is really cool 23 cool
and ntile is really cool too 1 and ntile is really cool too
and ntile is really cool too 5 ntile is really cool too
and ntile is really cool too 11 is really cool too
and ntile is really cool too 14 really cool too
and ntile is really cool too 21 cool too
and ntile is really cool too 26 too
and iced tea is really cool too 1 and iced tea is really cool too
and iced tea is really cool too 6 iced tea is really cool too
and iced tea is really cool too 11 tea is really cool too
and iced tea is really cool too 18 is really cool too
and iced tea is really cool too 25 really cool too
and iced tea is really cool too 33 cool too
and iced tea is really cool too 38 too
*/
Great. Now let’s find the position of the first space of our work string. That will define the ending boundary of our word that begins the work string. And, using that, let’s also add the extraction of our first word of the pair:

select WorkString,p1,Word1
from @t
cross apply (select TrimMessage=ltrim(rtrim(Message))) f1
join Numbers on substring(' '+TrimMessage,N,1)=' '
and substring(' '+TrimMessage,N+1,1)<>' '
and N<len(TrimMessage)+1
cross apply (select WorkString=substring(TrimMessage+' ',N,len(TrimMessage)+1)) f2
cross apply (select p1=charindex(' ',WorkString)) f3
cross apply (select Word1=left(WorkString,p1-1)) f4
/*
WorkString p1 Word1
---------------------------------------- ----------- ------
cross apply is really cool 6 cross
apply is really cool 6 apply
is really cool 3 is
really cool 7 really
cool 5 cool
and ntile is really cool too 4 and
ntile is really cool too 6 ntile
is really cool too 3 is
really cool too 7 really
cool too 5 cool
too 4 too
and iced tea is really cool too 4 and
iced tea is really cool too 5 iced
tea is really cool too 4 tea
is really cool too 3 is
really cool too 7 really
cool too 5 cool
too 4 too
*/
In order to find the second word, we’ll need to look at the rest of our work string just after that first word. Note how we use LTRIM() to get rid of the “multiple space between words” problem:

select WorkString,p1,Word1,RestOfString
from @t
cross apply (select TrimMessage=ltrim(rtrim(Message))) f1
join Numbers on substring(' '+TrimMessage,N,1)=' '
and substring(' '+TrimMessage,N+1,1)<>' '
and N<len(TrimMessage)+1
cross apply (select WorkString=substring(TrimMessage+' ',N,len(TrimMessage)+1)) f2
cross apply (select p1=charindex(' ',WorkString)) f3
cross apply (select Word1=left(WorkString,p1-1)) f4
cross apply (select RestOfString=ltrim(substring(WorkString,p1,len(WorkString)))) f5
/*
WorkString p1 Word1 RestOfString
---------------------------------------- ----------- ------ ------------------------------------
cross apply is really cool 6 cross apply is really cool
apply is really cool 6 apply is really cool
is really cool 3 is really cool
really cool 7 really cool
cool 5 cool
and ntile is really cool too 4 and ntile is really cool too
ntile is really cool too 6 ntile is really cool too
is really cool too 3 is really cool too
really cool too 7 really cool too
cool too 5 cool too
too 4 too
and iced tea is really cool too 4 and iced tea is really cool too
iced tea is really cool too 5 iced tea is really cool too
tea is really cool too 4 tea is really cool too
is really cool too 3 is really cool too
really cool too 7 really cool too
cool too 5 cool too
too 4 too
*/
So now our RestOfString column begins with the second word, so let’s find the space that indicates the end of that word:

select Word1,RestOfString,p2
from @t
cross apply (select TrimMessage=ltrim(rtrim(Message))) f1
join Numbers on substring(' '+TrimMessage,N,1)=' '
and substring(' '+TrimMessage,N+1,1)<>' '
and N<len(TrimMessage)+1
cross apply (select WorkString=substring(TrimMessage+' ',N,len(TrimMessage)+1)) f2
cross apply (select p1=charindex(' ',WorkString)) f3
cross apply (select Word1=left(WorkString,p1-1)) f4
cross apply (select RestOfString=ltrim(substring(WorkString,p1,len(WorkString)))) f5
cross apply (select p2=charindex(' ',RestOfString)) f6
/*
Word1 RestOfString p2
------ ------------------------------------ -----------
cross apply is really cool 6
apply is really cool 3
is really cool 7
really cool 5
cool 0
and ntile is really cool too 6
ntile is really cool too 3
is really cool too 7
really cool too 5
cool too 4
too 0
and iced tea is really cool too 5
iced tea is really cool too 4
tea is really cool too 3
is really cool too 7
really cool too 5
cool too 4
too 0
*/
Note that there are instances where RestOfString is an empty string and contains no spaces, and so p2=0. There is no second word in those cases. So we will handle pulling out the second word via a CASE function… if p2>0 then there’s a word and we’ll pull it out. I don’t put an ELSE in the CASE function, so if p2<=0 the CASE will just return a NULL:

select Word1,RestOfString,p2,Word2
from @t
cross apply (select TrimMessage=ltrim(rtrim(Message))) f1
join Numbers on substring(' '+TrimMessage,N,1)=' '
and substring(' '+TrimMessage,N+1,1)<>' '
and N<len(TrimMessage)+1
cross apply (select WorkString=substring(TrimMessage+' ',N,len(TrimMessage)+1)) f2
cross apply (select p1=charindex(' ',WorkString)) f3
cross apply (select Word1=left(WorkString,p1-1)) f4
cross apply (select RestOfString=ltrim(substring(WorkString,p1,len(WorkString)))) f5
cross apply (select p2=charindex(' ',RestOfString)) f6
cross apply (select Word2=case when p2>0 then left(RestOfString,p2-1) end) f7
/*
Word1 RestOfString p2 Word2
------ ------------------------------------ ----------- ------
cross apply is really cool 6 apply
apply is really cool 3 is
is really cool 7 really
really cool 5 cool
cool 0 NULL
and ntile is really cool too 6 ntile
ntile is really cool too 3 is
is really cool too 7 really
really cool too 5 cool
cool too 4 too
too 0 NULL
and iced tea is really cool too 5 iced
iced tea is really cool too 4 tea
tea is really cool too 3 is
is really cool too 7 really
really cool too 5 cool
cool too 4 too
too 0 NULL
*/
Now it’s just a matter of putting together Word1 and Word2 to come up with our WordPair. (For those with a Word2 of NULL, the WordPair just comes out to NULL, since anything + NULL is just NULL):

select Word1,RestOfString,p2,Word2,WordPair
from @t
cross apply (select TrimMessage=ltrim(rtrim(Message))) f1
join Numbers on substring(' '+TrimMessage,N,1)=' '
and substring(' '+TrimMessage,N+1,1)<>' '
and N<len(TrimMessage)+1
cross apply (select WorkString=substring(TrimMessage+' ',N,len(TrimMessage)+1)) f2
cross apply (select p1=charindex(' ',WorkString)) f3
cross apply (select Word1=left(WorkString,p1-1)) f4
cross apply (select RestOfString=ltrim(substring(WorkString,p1,len(WorkString)))) f5
cross apply (select p2=charindex(' ',RestOfString)) f6
cross apply (select Word2=case when p2>0 then left(RestOfString,p2-1) end) f7
cross apply (select WordPair=Word1+' '+Word2) f8
/*
Word1 RestOfString p2 Word2 WordPair
------ ------------------------------------ ----------- ------ -----------
cross apply is really cool 6 apply cross apply
apply is really cool 3 is apply is
is really cool 7 really is really
really cool 5 cool really cool
cool 0 NULL NULL
and ntile is really cool too 6 ntile and ntile
ntile is really cool too 3 is ntile is
is really cool too 7 really is really
really cool too 5 cool really cool
cool too 4 too cool too
too 0 NULL NULL
and iced tea is really cool too 5 iced and iced
iced tea is really cool too 4 tea iced tea
tea is really cool too 3 is tea is
is really cool too 7 really is really
really cool too 5 cool really cool
cool too 4 too cool too
too 0 NULL NULL
*/
We now have our pairs, so now we can filter out the NULL word pairs, and finally do our GROUP BY in order to figure out the number of occurrences of each pair:

select WordPair,Occurrences=count(*)
from @t
cross apply (select TrimMessage=ltrim(rtrim(Message))) f1
join Numbers on substring(' '+TrimMessage,N,1)=' '
and substring(' '+TrimMessage,N+1,1)<>' '
and N<len(TrimMessage)+1
cross apply (select WorkString=substring(TrimMessage+' ',N,len(TrimMessage)+1)) f2
cross apply (select p1=charindex(' ',WorkString)) f3
cross apply (select Word1=left(WorkString,p1-1)) f4
cross apply (select RestOfString=ltrim(substring(WorkString,p1,len(WorkString)))) f5
cross apply (select p2=charindex(' ',RestOfString)) f6
cross apply (select Word2=case when p2>0 then left(RestOfString,p2-1) end) f7
cross apply (select WordPair=Word1+' '+Word2) f8
where WordPair is not null
group by WordPair
order by count(*) desc
/*
WordPair Occurrences
----------- -----------
is really 3
really cool 3
cool too 2
cross apply 1
iced tea 1
and iced 1
and ntile 1
apply is 1
tea is 1
ntile is 1
*/
Whew! And you know the best part? Remember how it took over 25 seconds to process 100,000 messages with our multi-line table function? Well our new-and-improved set-based method takes a little less than half that time… about 12 seconds. Still not blazingly fast, but an enormous improvement.

Just for your amusement, here’s the query that SQL Server is actually running “under the hood”. You can see these monstrous expressions in the query plan in the properties of the Compute Scalar icons.

select WordPair=
(substring(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1),1,
charindex(' ',substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1))-1)+' ')+
case when charindex(' ',ltrim(substring(substring(ltrim(rtrim(Message))+' ',N,
len(ltrim(rtrim(Message)))+1),charindex(' ',substring(ltrim(rtrim(Message))+' ',N,
len(ltrim(rtrim(Message)))+1)),len(substring(ltrim(rtrim(Message))+' ',N,
len(ltrim(rtrim(Message)))+1)))))>0
then substring(ltrim(substring(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1),
charindex(' ',substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1)),
len(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1)))),1,
charindex(' ',ltrim(substring(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1),
charindex(' ',substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1)),
len(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1)))))-1) else null end
,Occurrences=count(*)
from @t
join Numbers on substring(' '+ltrim(rtrim(Message)),N,1)=' '
and substring(' '+ltrim(rtrim(Message)),N+1,1)<>' '
and N<len(ltrim(rtrim(Message)))+1
where (substring(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1),1,
charindex(' ',substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1))-1)+' ')+
case when charindex(' ',ltrim(substring(substring(ltrim(rtrim(Message))+' ',N,
len(ltrim(rtrim(Message)))+1),charindex(' ',substring(ltrim(rtrim(Message))+' ',N,
len(ltrim(rtrim(Message)))+1)),len(substring(ltrim(rtrim(Message))+' ',N,
len(ltrim(rtrim(Message)))+1)))))>0
then substring(ltrim(substring(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1),
charindex(' ',substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1)),
len(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1)))),1,
charindex(' ',ltrim(substring(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1),
charindex(' ',substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1)),
len(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1)))))-1) else null end is not null
group by (substring(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1),1,
charindex(' ',substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1))-1)+' ')+
case when charindex(' ',ltrim(substring(substring(ltrim(rtrim(Message))+' ',N,
len(ltrim(rtrim(Message)))+1),charindex(' ',substring(ltrim(rtrim(Message))+' ',N,
len(ltrim(rtrim(Message)))+1)),len(substring(ltrim(rtrim(Message))+' ',N,
len(ltrim(rtrim(Message)))+1)))))>0
then substring(ltrim(substring(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1),
charindex(' ',substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1)),
len(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1)))),1,
charindex(' ',ltrim(substring(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1),
charindex(' ',substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1)),
len(substring(ltrim(rtrim(Message))+' ',N,len(ltrim(rtrim(Message)))+1)))))-1) else null end
order
by count(*) desc
So I hope this gives you a good appreciation of the imaginative uses of both a Numbers table and the CROSS APPLY operator.

2 comments:

  1. An excellent blog.

    Perfectly described and step by step demonstration. Excellent.

    This is exactly what I needed in my project.

    ReplyDelete