Thursday, July 9, 2009

Cool CROSS APPLY Tricks, Part 2

In Part 1, we looked at some fairly typical examples of using CROSS APPLY.

But the best is yet to come.

Consider this problem that Adam Haines brought up in his blog entry entitled Converting a Delimited String of Values into Columns. He presents the following table.

declare @t table
(
ProductId int
,ProductName varchar(25)
,SupplierId int
,Descr varchar(50)
)

insert @t select 1,'Product1',1,'A1,10in,30in,2lbs'
union all select 2,'Product2',2,'T6,15in,30in'
union all select 3,'Product3',1,'A2,1in,,0.5lbs'
union all select 4,'Product4',1,'X5,3in'
Note the column called Descr. That column contains 4 attributes of a product (Type, Length, Height, and Width) in a comma-delimited format. Your job: extract those values. Adam uses a very nice XML method to extract the values from the Descr field. Unfortunately the problem with using XML is that one of the attributes might contain a character that would break it, like an ampersand (&) or a less-than sign (<) for example.

So CROSS APPLY to the rescue!

The Descr field may not have all the commas in place (see row 4 in the table), so we will artifically add 4 of them to the end, creating a string to work with:

select ProductID
,SupplierID
,string
from @t
cross apply (select string=Descr+',,,,') f1
/*
ProductID SupplierID string
----------- ----------- ---------------------
1 1 A1,10in,30in,2lbs,,,,
2 2 T6,15in,30in,,,,
3 1 A2,1in,,0.5lbs,,,,
4 1 X5,3in,,,,
*/
You see how I used CROSS APPLY to introduce a new column called string? Now that we have that string to work with, we need to find where the commas are. The first comma is easy. We just use CHARINDEX. But the second and third and fourth commas all depend on the location of the previous comma. So we must use CROSS APPLY 4 times:

select ProductID
,SupplierID
,string
,p1,p2,p3,p4
from @t
cross apply (select string=Descr+',,,,') f1
cross apply (select p1=charindex(',',string)) f2
cross apply (select p2=charindex(',',string,p1+1)) f3
cross apply (select p3=charindex(',',string,p2+1)) f4
cross apply (select p4=charindex(',',string,p3+1)) f5
/*
ProductID SupplierID string p1 p2 p3 p4
----------- ----------- --------------------- -- -- -- --
1 1 A1,10in,30in,2lbs,,,, 3 8 13 18
2 2 T6,15in,30in,,,, 3 8 13 14
3 1 A2,1in,,0.5lbs,,,, 3 7 8 15
4 1 X5,3in,,,, 3 7 8 9
*/
Now that we have all the locations of the commas, it’s easy to pull out the attributes that we want:

select ProductID
,SupplierID
,[Type]
,Length
,Height
,Weight
from @t
cross apply (select string=Descr+',,,,') f1
cross apply (select p1=charindex(',',string)) f2
cross apply (select p2=charindex(',',string,p1+1)) f3
cross apply (select p3=charindex(',',string,p2+1)) f4
cross apply (select p4=charindex(',',string,p3+1)) f5
cross apply (select [Type]=substring(string,1,p1-1)
,Length=substring(string,p1+1,p2-p1-1)
,Height=substring(string,p2+1,p3-p2-1)
,Weight=substring(string,p3+1,p4-p3-1)) f6
/*
ProductID SupplierID Type Length Height Weight
----------- ----------- ---- ------ ------ ------
1 1 A1 10in 30in 2lbs
2 2 T6 15in 30in
3 1 A2 1in 0.5lbs
4 1 X5 3in
*/
The terrific thing about this is that it doesn’t involve anything in terms of cost. If you look at a query plan, all it consists of is a Table Scan followed by a Compute Scalar, the latter of which has an estimated operator cost of 0.0000001… in other words, pretty much no cost at all! The Query Optimizer combined all the CROSS APPLY stuff I introduced and essentially does the following query, (which you can see if you look at the properties of the Compute Scalar icon in the plan):

select ProductID
,SupplierID
,[Type]=substring([Descr]+',,,,',1,charindex(',',[Descr]+',,,,')-1)
,Length=substring([Descr]+',,,,',charindex(',',[Descr]+',,,,')+1,
(charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,')+1)-
charindex(',',[Descr]+',,,,'))-1)
,Height=substring([Descr]+',,,,',charindex(',',[Descr]+',,,,',
charindex(',',[Descr]+',,,,')+1)+1,(charindex(',',[Descr]+',,,,',
charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,')+1)+1)-
charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,')+1))-1)
,Weight=substring([Descr]+',,,,',charindex(',',[Descr]+',,,,',
charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,')+1)+1)+1,
(charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,',
charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,')+1)+1)+1)-
charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,',
charindex(',',[Descr]+',,,,')+1)+1))-1)
from @t
How would you like to parse through and troubleshoot that? What a nightmare! But you don’t have to… our CROSS APPLY query is very readable and easy to understand, and it's fast.

Here’s one more example…

Here’s a table of messages:

declare @t table (Message varchar(100))
insert @t select 'cross apply is really cool'
union all select 'and ntile is really cool too'
union all select 'and iced tea is really cool too'
Your job is to pull out all the word pairs and report on how often they occur throughout the table. Here’s how to do it with our new pal CROSS APPLY:

;with Numbers as
(
select N=Number
from master..spt_values
where Type='P' and Number>0
)
select WordPair,Occurrences=count(*)
from @t
join Numbers on substring(' '+Message,N,1)=' ' and N<len(Message)+1
cross apply (select string=substring(' '+Message+' ',N+1,len(Message)+1)) f1
cross apply (select p1=charindex(' ',string)) f2
cross apply (select p2=charindex(' ',string,p1+1)) f3
cross apply (select WordPair=case when p1<p2 then left(string,p2-1) end) f4
where WordPair is not null
group by WordPair
order by count(*) desc
/*
WordPair Occurrences
----------- -----------
is really 3
really cool 3
cool too 2
cross apply 1
iced tea 1
and iced 1
and ntile 1
apply is 1
tea is 1
ntile is 1
*/
I use a table of numbers (I “cheated” and just used the numbers in master..spt_values) and JOIN that to the Message column based on the location of its space characters. The first CROSS APPLY (f1) creates a string to work with, which is a SUBSTRING of the Message, starting at a word boundary (and ending with a space I artificially put at the end). The second and third CROSS APPLYs (f2 and f3) find the location of the first and second space characters in that string. The final CROSS APPLY creates the word pair, if one was found. If no word pair is found, then WordPair is set to NULL (because there is no ELSE in the CASE). That is filtered out by the WHERE clause.

(Edit on July11,2009: For a more in-depth discussion of processing the above word-pair query, see my blog post entitled Word Pairs Revisited (Cool CROSS APPLY Tricks, Part 2 ½)).

Note that all these new columns that we introduced (string, p1, p2, WordPair) can be used in a WHERE clause and a GROUP BY clause or pretty much anywhere within the query. That’s the best part.

The query below is what is really calculated behind the scenes:

;with Numbers as
(
select N=Number
from master..spt_values
where Type='P' and Number>0
)
select WordPair=case when charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))<
charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1),
charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))+1)
then left(substring(' '+Message+' ',N+1,len(Message)+1),
charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1),
charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))+1)-1)
end
,count(*)
from @t
join Numbers on substring(' '+Message,N,1)=' ' and N<len(Message)+1
where case when charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))<
charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1),
charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))+1)
then left(substring(' '+Message+' ',N+1,len(Message)+1),
charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1),
charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))+1)-1)
end is not null
group by case when charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))<
charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1),
charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))+1)
then left(substring(' '+Message+' ',N+1,len(Message)+1),
charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1),
charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))+1)-1)
end
order
by count(*) desc
Yeesh!

So after all this, I hope I have converted you to the beauty of the APPLY operator. Go spread the word!

8 comments:

  1. Great post Brad.

    Brad, perhaps you can slow it down and show each step individually. It's a lot to absorb all at once!

    First explain the issue. Then show the "traditional" way to do it. Then show the creation of the solution step by step. The last query could have been part 3 all by itself.

    ReplyDelete
  2. Good post Brad :). You are correct that there are certain characters that would cause the XML method to break. In this scenario, one would have to replace the invalid characters, with temporary holders. You can then replace the temporary holders back to their original form before returning results; however, This is not the poster child of simple and efficient coding. Before using the XML method one should really know and understand their data. The cross apply is a great alternative! Thanks for the blog post, it is defintely a good read.

    ReplyDelete
  3. Thanks Brian...

    I debated whether I should go through that last example step by step... I was afraid of it getting too long. But I think you're right, I should have made it a Part 3... and I think I'll just go ahead and do that.

    I appreciate the feedback.

    ReplyDelete
  4. Hi Adam...

    I love all the XML methods that one can take advantage of. It was Plamen Ratchev who brought up the "invalid character" problem, and that kind of burst my bubble about using XML too much... or at least it now makes me think it through a little more carefully first.

    ReplyDelete
  5. @Brad

    Don't worry about a post getting to long. It's better too long than too short!

    After writing an entry, read it again (waiting a day or two helps a lot). Making spelling correction, punctuation, enhance the flavor, and, decide if it needs to be split in two. Most likely you'll get a rule of how many (non-example) words force a split, but don't ever short the info so to keep it as one.

    ReplyDelete
  6. Thanks, Brad. Read this blog again and I think I'm convinced.

    ReplyDelete
  7. Hi Brad,
    Really very interesting use of APPLY! Thanks for this.
    But by this technique we can split just a limited item. So my recommend in this case is using of PARSNAME for splitting the string.

    ;with c as
    (select productID, supplierID,
    replace(
    replace(replace(descr,'.','~!@|'),',','.')+
    replicate('.', 3-(len(descr)-len(replace(descr,',',''))))+' '
    ,'..', '. .') as descr
    from @t)
    select productID, supplierID,
    replace(parsename(descr,4),'~!@|','.') as Type,
    replace(parsename(descr,3),'~!@|','.') as Length,
    replace(parsename(descr,2),'~!@|','.') as Height,
    replace(parsename(descr,1),'~!@|','.') as weight
    from c

    ReplyDelete
  8. Hi Brad,
    No need APPLY instead of that we can use multiple CTE like this:

    ;with f1 as
    (select string=Descr+',,,,',productid, supplierid
    from @t),
    f2 as (select p1=charindex(',',string),* from f1),
    f3 as (select p2=charindex(',',string,p1+1),* from f2),
    f4 as (select p3=charindex(',',string,p2+1),* from f3),
    f5 as (select p4=charindex(',',string,p3+1),* from f4)

    select productid,
    supplierid,
    [Type]=substring(string,1,p1-1)
    ,Length=substring(string,p1+1,p2-p1-1)
    ,Height=substring(string,p2+1,p3-p2-1)
    ,Weight=substring(string,p3+1,p4-p3-1)
    from f5

    ReplyDelete