Tuesday, July 7, 2009

Oh Yeah? Collate This!

So I decide to start a blog, and I figure my first blog entry will be a really cool and helpful stored procedure that people can use to produce T-SQL code that creates and populates sample tables, thus enabling them to paste this code into forum messages when asking for help. (How's that for a long bloated first sentence?) The concept of putting together this procedure was easy enough, but the reality ended up being something different.

I wanted to create something that would be as non-invasive as possible. I didn’t want to force a person to create my procedure in his/her own database. And then I noticed SQL will allow you to create a temporary stored procedure, just by putting a ‘#’ in front of the procedure’s name. Great! Looked like the answer.

But then I ran up against, of all things, collation problems. I was using AdventureWorks as a database to test out my procedure. Its collation is Latin1_General_CI_AS. However the collation of my server is SQL_Latin1_General_CP1_CI_AS. This created some unexpected bumps in the road.

You can see some of these potholes for yourself. First find out tempdb’s collation (which is inherited from the server’s collation when tempdb is created):

select databasepropertyex('tempdb','collation')
--My system returns SQL_Latin1_General_CP1_CI_AS
Now create a test database with a collation different from tempdb (I’m creating one below with a case-sensitive accent-insensitive collation just to be really contrarian) and (this is important!) make this new test database your current database:

if db_id('BSTest') is not null drop database BSTest
go
create database BSTest collate Latin1_General_CS_AI
go
use BSTest
go
Now, with BSTest as our current database context, try creating a laughably simple temporary procedure:

create procedure #TempProc1
as
declare
@var nchar(1)
set @var='X'
if @var='X' print 'Duh!'
You will get the following error: Cannot resolve the collation conflict between "Latin1_General_CS_AI" and "SQL_Latin1_General_CP1_CI_AS" in the equal to operation. (Your collation names may differ).

Apparently there’s a problem comparing the variable @var to the constant of 'X'. It seems that @var inherits the collation of tempdb (since we’re creating the procedure there), but the constant inherits the collation of the current database context that we’re in (BSTest). They are incompatible.

Well, I guess that kinda sorta in a weird way makes a teensy-weensy bit of sense. I could certainly re-code the IF statement like so and it would work:

if @var = '' collate SQL_Latin1_General_CP1_CI_AS print 'Duh!'
But I can only do that because I know exactly what tempdb’s collation is. Alternately I can use database_default to convert @var to the collation of the current database’s (BSTest’s) collation:

if @var collate database_default = 'X' print 'Duh!'
Ultimately, for whatever reason, I decided to take care of it by doing something like the following instead:

create procedure #TempProc1
as
declare
@var nchar(1)
,
@emptystring nchar(1)
set @var=''
set @emptystring=''
if @var=@emptystring print 'Duh!'
Okay, that worked. But then I ran up against something like this:

create procedure #TempProc2
as
declare
@var nchar(1)
if @var is null print 'Duh!'
The same type of collation error occurs! It appears that not all NULLs are created equal. The NULL in the code above is apparently a NULL with BSTest’s collation.

Swell. Now what? Well, how about the following? Seems straightforward enough:

create procedure #TempProc2
as
declare
@var nchar(1)
,
@emptystring nchar(1)
set @emptystring=''
if coalesce(@var,@emptystring)=@emptystring print 'Duh!'
Another error? Jeez! Now come on! What’s wrong with that? I’m comparing apples to apples, aren’t I? I get the same kind of collation error message, but now it says: Cannot resolve the collation conflict between ... in the is not operation. Huh? What IS NOT operation are they talking about?

I find out (thanks to my pal Hunchback at the MSDN T-SQL Forum) that SQL converts COALESCE to a CASE function behind the scenes. You can see this if you look at a query plan for the following:

select coalesce(Name,'') from sys.databases
If you look at the Properties of the Compute Scalar icon in the plan, you will see that it is essentially computing the following:

CASE WHEN [Name] IS NOT NULL THEN [Name] ELSE '' END
So that explains why the collation error message has a problem with an IS NOT operation. Luckily, if I use ISNULL instead of COALESCE, everything finally works fine. Whew!

Finally I had problems mixing variables and constants in SELECT commands:

create procedure #TempProc3
as
declare
@var nchar(1)
set @var='X'
select 'Prefix'+@var
select 'Row#1' union all select @var
Again, the above throws an error on both SELECT commands, citing a collation conflict. I ended up solving this kind of problem by adding COLLATE database_default to the SELECT column:

create procedure #TempProc3
as
declare
@var nchar(1)
set @var='X'
select 'Prefix'+@var collate database_default
select 'Row#1' collate database_default union all select @var
Note that I can specify the COLLATE clause in any of the SELECTs of a UNIONed query. It doesn’t matter which one. Just being specific about the collation in one SELECT controls the collation of the entire UNIONed query. Again, note that the database_default here is the database that has context (i.e. BSTest).

You can see all of this demonstrated here:

create procedure #TempProc4
as
declare
@var nchar(1)
set @var='X'
--Create 4 temp files
select StringConstant='Whatever' into #Junk1
select StringVar=@var into #Junk2
select StringCombo='Whatever'+@var collate database_default into #Junk3
select StringUnion='Whatever' collate database_default
into #Junk4
union all
select @var
--Now let's look at what collation was created for the columns
--and compare them to the collations of our databases
select Name,Collation_Name
from tempdb.sys.columns
where Name like 'String%'
union all
select 'tempdb',databasepropertyex('tempdb','collation')
union all
select db_name(),databasepropertyex(db_name(),'collation')
go
exec #TempProc4
/*
Name Collation_Name
-------------- ----------------------------
StringCombo Latin1_General_CS_AI
StringConstant Latin1_General_CS_AI
StringUnion Latin1_General_CS_AI
StringVar SQL_Latin1_General_CP1_CI_AS
tempdb SQL_Latin1_General_CP1_CI_AS
bstest Latin1_General_CS_AI
*/
So I think the lesson here is to use temporary procedures with caution… or use them if you’re a glutton for punishment. Are they worth it? They seem to be a big pain in the posterior, but this whole experience did make me learn something. I hope you learned something too.

13 comments:

  1. Great start Brad! Keep the good stuff coming!

    --Plamen

    ReplyDelete
  2. Thanks, Plamen! I'll do my best to keep it interesting.

    ReplyDelete
  3. Brad, a couple comments:

    1) Only post once per day. Queue the rest. This makes it more likely to post for longer, spend more time on each post (making a higher quality post), not burning you out, and focuses. I just read both CROSS APPLY posts as one because they were posted so closely together. Ideally, each should be appreciated on its own.

    2) Can we submit challenges to you? Or is that what the MS forums are for? :)

    ReplyDelete
  4. Thanks for the tips, Brian... most appreciated.

    Do you have a challenge? I'm game.

    ReplyDelete
  5. Thanx Brad!

    I like your posts, so i want this blog to be very successful. Selfish reasons of course. :)

    My challenge was posted in the T-SQL forums. http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/2dd0a0bc-54b5-4a69-bbef-50418a05ec4d

    After going through it in my head a few times, i've come to the conclusion that it is not possible. As it is confusing, though, i'm not sure i explained the question well enough.

    If you'd like to look at it and give me your opinion, i'd much appreciate it. Though, as it stands right now, i'm going to be implementing it a stored PROCEDURE, with multiple queries. *sniff*, i like my all-in-one VIEW solution (used elsewhere).

    If you're willing to take a look at it, i could help explain it further, or i'd email you the actual query in our application (which would require a bit more explanation). Or whatever for of communication your like.

    ReplyDelete
  6. Ouch... my head hurts.

    I'm not sure this can be accomplished in one single query, recursive or not.

    But I think you can do it with a recursive FUNCTION (kind of like a factorial works, if you know what I mean).

    I'd have to think more on it.

    --Brad

    ReplyDelete
  7. @Brad

    Heh. I keep having to remind myself what i am doing. Worst part is, i designed those TABLEs. :)

    Hmmm recursive FUNCTION, i didn't think of that at all. Currently, i'm most of the way through a stored PROCEDURE that uses a temp TABLE to store the results of calculations that don't rely on others (or rely on those already calculated). Unfortunately, every time i concentrate on it for too long something comes up and i lose my train of thought. But, i'm almost done.

    Using a recursive FUNCTION, if possible, would allow a VIEW to be based upon it, which is something i'd really like to do. I'll have to think about to see if that makes sense. Well, if i can make sense of it.

    If you are interested enough and you'd like me to explain it further, please let me know.

    ReplyDelete
  8. @Brad

    Just wanted to add.

    The confusing part of the problem is that it is a two-level grouping, where the second group can include the first group. Thus, the top-to-bottom approach does not work.

    The obvious solution is to go bottom-to-top. That is, have a recursive CTE create a calculation-chain for each calculation, identifying each a level. So, if there is a level 2, level 1 relies on it, and so on. Then, process each calculation/level from the highest to lowest level.

    That ought to work, but, if the second level includes more than one calculation, there is no way to have a CTE reference a record other than the one prior to it. For example, if calculation 1 has a set that uses the results of calculation 2 and calculation 3, even if processed 3 then 2 then 1, calculation 1 can get to the results of calculation 2 because it is the prior record. But calculation 3 is out of bounds. The branches are not handled well.

    That is why i am using a temp TABLE.

    ReplyDelete
  9. Brad, bored of posting already?

    I have an idea. CTEs can be UPDATEd, DELETEd, and so on. Confusing topic, i say. Want to try to tackle it?

    ReplyDelete
  10. No I'm not bored... our family is on vacation for 2 weeks... and yes, I would like to tackle that subject when I get back... you read my mind... that was something I was considering.

    ReplyDelete
  11. Excellent Brad.

    Enjoy the summer. :)

    ReplyDelete
  12. ALTER DATABASE test2 -- put your database name here
    COLLATE Latin1_General_CS_AS -- replace with whatever collation you need

    ReplyDelete
  13. @Anonymous:

    Sure, you can change the collation of the database... but are you going to blindly do that on a Production database? Or on TempDB?

    The point of this blog post was to get around having to do that... and the fact that TEMPDB can have a collation different from the USE database... I had a need to write generalized code to handle that collation difference without hard-coding a SPECIFIC collation.

    ReplyDelete