Thursday, February 25, 2010

Spotlight on UNPIVOT, Part 2

UNPIVOT Those Columns!In my last blog entry, we looked at the UNPIVOT command and how it works in unpivoting columns into rows.

We learned that T-SQL simply translates the UNPIVOT operator into an OUTER APPLY, coupled with a filter to get rid of NULL values. With that knowledge, we were able to unpivot data ourselves using an APPLY operator directly. This gave us more flexibility… We could include NULLs if we wanted to and we could very easily unpivot multiple sets of columns.

In this entry, we’ll go over some scenarios for when unpivoting data is helpful (more likely essential), and we’ll present many examples.

Before we begin, though, I have to say one thing. Most of these examples employ unpivoting because the tables are horribly designed and not normalized. I urge you, in designing your tables, to not emulate some of these table structures I am going to present.

Aggregations Over Columns

Let’s say you’re a school teacher and you have given 12 tests (exams) to your students over the course of the semester. Each test score is a percentage value from 0 to 100. And here’s the interesting part… you gave your students the option to skip any test of their choice during the semester, so one of the 12 tests might have a NULL value. Some students might take advantage of that… others (the teacher’s pets no doubt) will take every exam.

Here’s the (poorly-designed) table:

if object_id('tempdb..#Scores','U') is not null drop table #Scores
go
create
table #Scores
(
StudentID int primary key
,Test1 int
,Test2 int
,Test3 int
,Test4 int
,Test5 int
,Test6 int
,Test7 int
,Test8 int
,Test9 int
,Test10 int
,Test11 int
,Test12 int
)
go
insert
#Scores select 1, 96, 87,null, 100, 99, 91, 73, 90, 90, 91, 93, 92
union all select 2, 83, 84, 89, 79, 88,null, 93, 82, 85, 86, 88, 82
union all select 3, 100, 100, 98, 91, 100, 95, 96, 96, 99, 100, 98, 100

select * from #Scores
/*
StudentID Test1 Test2 Test3 Test4 Test5 Test6 Test7 Test8 Test9 Test10 Test11 Test12
--------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ------ ------ ------
1 96 87 NULL 100 99 91 73 90 90 91 93 92
2 83 84 89 79 88 NULL 93 82 85 86 88 82
3 100 100 98 91 100 95 96 96 99 100 98 100
*/

Your job is to get an average test score for each student. Well, that seems easy enough. We just add the scores together and divide by 12, right?

Oh wait…

There’s that darn NULL value we have to deal with… some students might have it, and some may not. And we don’t know which of the 12 test scores have the NULL value.

So we have to come up with a really awful query like this:

select StudentID
,AverageScore=( coalesce(Test1,0)+coalesce(Test2,0)+coalesce(Test3,0)
+coalesce(Test4,0)+coalesce(Test5,0)+coalesce(Test6,0)
+coalesce(Test7,0)+coalesce(Test8,0)+coalesce(Test9,0)
+coalesce(Test10,0)+coalesce(Test11,0)+coalesce(Test12,0) )
* 1.0 /
( case when Test1 is not null then 1 else 0 end
+case when Test2 is not null then 1 else 0 end
+case when Test3 is not null then 1 else 0 end
+case when Test4 is not null then 1 else 0 end
+case when Test5 is not null then 1 else 0 end
+case when Test6 is not null then 1 else 0 end
+case when Test7 is not null then 1 else 0 end
+case when Test8 is not null then 1 else 0 end
+case when Test9 is not null then 1 else 0 end
+case when Test10 is not null then 1 else 0 end
+case when Test11 is not null then 1 else 0 end
+case when Test12 is not null then 1 else 0 end )
from #Scores
/*
StudentID AverageScore
--------- ---------------
1 91.090909090909
2 85.363636363636
3 97.750000000000
*/
You have to admit… That is really ugly.

But it gets uglier…

Even worse, you have to find the minimum and maximum test score for each student. Take a look at this monster (try not to laugh… I dare you), which finds the maximum score:

select StudentID
,MaxScore=case
when Test1>=coalesce(Test2,0) and Test1>=coalesce(Test3,0)
and Test1>=coalesce(Test4,0) and Test1>=coalesce(Test5,0)
and Test1>=coalesce(Test6,0) and Test1>=coalesce(Test7,0)
and Test1>=coalesce(Test8,0) and Test1>=coalesce(Test9,0)
and Test1>=coalesce(Test10,0) and Test1>=coalesce(Test11,0)
and Test1>=coalesce(Test12,0) then Test1
when Test2>=coalesce(Test1,0) and Test2>=coalesce(Test3,0)
and Test2>=coalesce(Test4,0) and Test2>=coalesce(Test5,0)
and Test2>=coalesce(Test6,0) and Test2>=coalesce(Test7,0)
and Test2>=coalesce(Test8,0) and Test2>=coalesce(Test9,0)
and Test2>=coalesce(Test10,0) and Test2>=coalesce(Test11,0)
and Test2>=coalesce(Test12,0) then Test2
when Test3>=coalesce(Test1,0) and Test3>=coalesce(Test2,0)
and Test3>=coalesce(Test4,0) and Test3>=coalesce(Test5,0)
and Test3>=coalesce(Test6,0) and Test3>=coalesce(Test7,0)
and Test3>=coalesce(Test8,0) and Test3>=coalesce(Test9,0)
and Test3>=coalesce(Test10,0) and Test3>=coalesce(Test11,0)
and Test3>=coalesce(Test12,0) then Test3
when Test4>=coalesce(Test1,0) and Test4>=coalesce(Test2,0)
and Test4>=coalesce(Test3,0) and Test4>=coalesce(Test5,0)
and Test4>=coalesce(Test6,0) and Test4>=coalesce(Test7,0)
and Test4>=coalesce(Test8,0) and Test4>=coalesce(Test9,0)
and Test4>=coalesce(Test10,0) and Test4>=coalesce(Test11,0)
and Test4>=coalesce(Test12,0) then Test4
when Test5>=coalesce(Test1,0) and Test5>=coalesce(Test2,0)
and Test5>=coalesce(Test3,0) and Test5>=coalesce(Test4,0)
and Test5>=coalesce(Test6,0) and Test5>=coalesce(Test7,0)
and Test5>=coalesce(Test8,0) and Test5>=coalesce(Test9,0)
and Test5>=coalesce(Test10,0) and Test5>=coalesce(Test11,0)
and Test5>=coalesce(Test12,0) then Test5
when Test6>=coalesce(Test1,0) and Test6>=coalesce(Test2,0)
and Test6>=coalesce(Test3,0) and Test6>=coalesce(Test4,0)
and Test6>=coalesce(Test5,0) and Test6>=coalesce(Test7,0)
and Test6>=coalesce(Test8,0) and Test6>=coalesce(Test9,0)
and Test6>=coalesce(Test10,0) and Test6>=coalesce(Test11,0)
and Test6>=coalesce(Test12,0) then Test6
when Test7>=coalesce(Test1,0) and Test7>=coalesce(Test2,0)
and Test7>=coalesce(Test3,0) and Test7>=coalesce(Test4,0)
and Test7>=coalesce(Test5,0) and Test7>=coalesce(Test6,0)
and Test7>=coalesce(Test8,0) and Test7>=coalesce(Test9,0)
and Test7>=coalesce(Test10,0) and Test7>=coalesce(Test11,0)
and Test7>=coalesce(Test12,0) then Test7
when Test8>=coalesce(Test1,0) and Test8>=coalesce(Test2,0)
and Test8>=coalesce(Test3,0) and Test8>=coalesce(Test4,0)
and Test8>=coalesce(Test5,0) and Test8>=coalesce(Test6,0)
and Test8>=coalesce(Test7,0) and Test8>=coalesce(Test9,0)
and Test8>=coalesce(Test10,0) and Test8>=coalesce(Test11,0)
and Test8>=coalesce(Test12,0) then Test8
when Test9>=coalesce(Test1,0) and Test9>=coalesce(Test2,0)
and Test9>=coalesce(Test3,0) and Test9>=coalesce(Test4,0)
and Test9>=coalesce(Test5,0) and Test9>=coalesce(Test6,0)
and Test9>=coalesce(Test7,0) and Test9>=coalesce(Test8,0)
and Test9>=coalesce(Test10,0) and Test9>=coalesce(Test11,0)
and Test9>=coalesce(Test12,0) then Test9
when Test10>=coalesce(Test1,0) and Test10>=coalesce(Test2,0)
and Test10>=coalesce(Test3,0) and Test10>=coalesce(Test4,0)
and Test10>=coalesce(Test5,0) and Test10>=coalesce(Test6,0)
and Test10>=coalesce(Test7,0) and Test10>=coalesce(Test8,0)
and Test10>=coalesce(Test9,0) and Test10>=coalesce(Test11,0)
and Test10>=coalesce(Test12,0) then Test10
when Test11>=coalesce(Test1,0) and Test11>=coalesce(Test2,0)
and Test11>=coalesce(Test3,0) and Test11>=coalesce(Test4,0)
and Test11>=coalesce(Test5,0) and Test11>=coalesce(Test6,0)
and Test11>=coalesce(Test7,0) and Test11>=coalesce(Test8,0)
and Test11>=coalesce(Test9,0) and Test11>=coalesce(Test10,0)
and Test11>=coalesce(Test12,0) then Test11
when Test12>=coalesce(Test1,0) and Test12>=coalesce(Test2,0)
and Test12>=coalesce(Test3,0) and Test12>=coalesce(Test4,0)
and Test12>=coalesce(Test5,0) and Test12>=coalesce(Test6,0)
and Test12>=coalesce(Test7,0) and Test12>=coalesce(Test8,0)
and Test12>=coalesce(Test9,0) and Test12>=coalesce(Test10,0)
and Test12>=coalesce(Test11,0) then Test12
end
from
#Scores
/*
StudentID MaxScore
--------- --------
1 100
2 93
3 100
*/

(Sheesh… The things I do to demonstrate stupidity… I can’t believe I actually typed that whole thing out… I must be insane).

Well, the UNPIVOT clause can come to our rescue. It can unpivot the columns into rows and we can use T-SQL’s ability to do aggregations. So we can easily calculate the Average, Minimum, Maximum, and even the Standard Deviation of those test scores:

select StudentID
,AverageScore=avg(TestScore*1.0)
,MinScore=min(TestScore)
,MaxScore=max(TestScore)
,StdDev=stdev(TestScore)
from #Scores
unpivot (TestScore for Descript in (Test1,Test2,Test3,Test4,Test5,Test6
,Test7,Test8,Test9,Test10,Test11,Test12)) P
group by StudentID
/*
StudentID AverageScore MinScore MaxScore StdDev
--------- ------------ -------- -------- ----------------
1 91.090909 73 100 7.18963900977711
2 85.363636 79 93 3.95658254741963
3 97.750000 91 100 2.80016233295663
*/
Now isn’t that a lot easier by a factor of about a thousand?

Transpositions

This section is just for fun, really… however it might give you some ideas for other things.

Let’s take our table of test scores and turn the columns into rows and rows into columns… in other words, we’ll unpivot and then re-pivot.

;with UnpivotedScores as
(
select StudentColumn='Student'+convert(varchar(10),StudentID)
,TestScore
,TestID=convert(int,substring(TestColumnName,5,len(TestColumnName)))
from #Scores
unpivot (TestScore for TestColumnName in (Test1,Test2,Test3,Test4,Test5,Test6
,Test7,Test8,Test9,Test10,Test11,Test12)) U
)
select *
from UnpivotedScores
pivot (max(TestScore) for StudentColumn in (Student1,Student2,Student3)) P
/*
TestID Student1 Student2 Student3
------ -------- -------- --------
1 96 83 100
2 87 84 100
3 NULL 89 98
4 100 79 91
5 99 88 100
6 91 NULL 95
7 73 93 96
8 90 82 96
9 90 85 99
10 91 86 100
11 93 88 98
12 92 82 100
*/
So in this query, after unpivoting the data, I took the StudentID’s of 1 and 2 and 3 and created a column called StudentColumn containing the strings ‘Student1’, ‘Student2’, and ‘Student3’. I also created a column called TestID, extracting its value out of TestColumnName, a column created by the UNPIVOT. Finally, I PIVOTed the data across the three students.

Of course, I was only able to do this because I knew that I had only 3 studentsin my data. If I had more, then I would have to list more student columns in the PIVOT.

Here’s another fun and crazy exercise.

Create a table of 20,000 rows, with each row containing 8 random values from 1 to 999999.

if object_id('tempdb..#BunchesOfNums','U') is not null drop table #BunchesOfNums
go
create
table #BunchesOfNums
(
Num1 int
,Num2 int
,Num3 int
,Num4 int
,Num5 int
,Num6 int
,Num7 int
,Num8 int
)
go
set
nocount on
insert
#BunchesOfNums
select abs(checksum(newid()))%999999+1
,abs(checksum(newid()))%999999+1
,abs(checksum(newid()))%999999+1
,abs(checksum(newid()))%999999+1
,abs(checksum(newid()))%999999+1
,abs(checksum(newid()))%999999+1
,abs(checksum(newid()))%999999+1
,abs(checksum(newid()))%999999+1
go 20000

select * from #BunchesOfNums
/*
Num1 Num2 Num3 Num4 Num5 Num6 Num7 Num8
------ ------ ------ ------ ------ ------ ------ ------
430583 396264 325292 396579 294463 540881 827428 332591
819493 634199 705003 711098 387911 146887 931888 705560
967582 19768 391719 243203 716007 378208 471666 908523
206660 382012 753810 289947 725329 309939 268837 770277
651409 353534 54753 799079 317000 589131 77782 136318
4796 481041 870133 282436 243489 890761 307494 510633
682449 686105 311957 626728 781477 902128 974246 256178
933380 499938 653796 33348 536858 359010 875253 414892
... And so on for 20,000 rows
*/
Let’s reorganize all of this data, sorting it across the columns (i.e. horizontally):

;with NumbersUnpivot as
(
select Num
,NewRow=(row_number() over (order by Num)-1)/8+1
,NewCol=(row_number() over (order by Num)-1)%8+1
from #BunchesOfNums
unpivot (Num for NumColumn in (Num1,Num2,Num3,Num4,Num5,Num6,Num7,Num8)) P
)
select Num1,Num2,Num3,Num4,Num5,Num6,Num7,Num8
from (select Num
,NewRow
,ColName='Num'+convert(varchar(10),NewCol)
from NumbersUnpivot) A
pivot (max(Num) for ColName in (Num1,Num2,Num3,Num4,Num5,Num6,Num7,Num8)) P
order by NewRow
/*
Num1 Num2 Num3 Num4 Num5 Num6 Num7 Num8
------ ------ ------ ------ ------ ------ ------ ------
17 40 63 67 68 71 79 90
94 97 103 109 120 122 134 140
166 173 174 177 183 192 194 195
204 206 207 208 216 220 222 229
238 247 263 273 277 278 290 300
. . .
999772 999772 999773 999780 999781 999783 999783 999787
999794 999804 999806 999808 999808 999822 999830 999831
999832 999840 999841 999855 999868 999875 999887 999905
999905 999910 999930 999938 999940 999949 999963 999966
999966 999972 999986 999988 999991 999991 999992 999996
*/
Again, this was an UNPIVOT followed by a PIVOT. The original table was unpivoted, creating a derived table of 1 column (called Num) containing all 160,000 of our numbers. I created new columns called NewRow and NewCol, which were calculated using the ROW_NUMBER() function based on the sorting of the Num column. NewRow was calculated by dividing by 8 and NewCol was calculated by using doing a modulo 8 (i.e. the remainder in dividing by 8). Then I re-PIVOTed the numbers based on these new Row/Column positions.

We could also reorganize the numbers down across the rows (i.e. vertically). This time, we have to use 20,000 (the number of rows) for our divisor. But let’s make it more general and assume we don’t know up front how many rows there are. Let’s use the COUNT() aggregate to figure that number out and use that as the divisor:

;with NumbersUnpivot as
(
select Num
,NewRow=(row_number() over (order by Num)-1)%(count(Num) over () / 8)+1
,NewCol=(row_number() over (order by Num)-1)/(count(Num) over () / 8)+1
from #BunchesOfNums
unpivot (Num for NumColumn in (Num1,Num2,Num3,Num4,Num5,Num6,Num7,Num8)) P
)
select Num1,Num2,Num3,Num4,Num5,Num6,Num7,Num8
from (select Num
,NewRow
,ColName='Num'+convert(varchar(10),NewCol)
from NumbersUnpivot) A
pivot (max(Num) for ColName in (Num1,Num2,Num3,Num4,Num5,Num6,Num7,Num8)) P
order by NewRow
/*
Num1 Num2 Num3 Num4 Num5 Num6 Num7 Num8
------ ------ ------ ------ ------ ------ ------ ------
17 125498 250005 374416 499463 623675 749265 875013
40 125501 250008 374421 499471 623680 749268 875025
63 125512 250010 374429 499472 623687 749269 875029
67 125517 250018 374431 499480 623691 749278 875031
68 125533 250028 374433 499485 623691 749289 875033
. . .
125479 249986 374402 499427 623659 749216 874991 999988
125480 249987 374409 499433 623661 749242 874994 999991
125484 249999 374411 499446 623662 749243 874997 999991
125488 250001 374413 499453 623663 749251 875003 999992
125493 250003 374416 499455 623675 749254 875004 999996
*/
This certainly works, but the query is not really that efficient. Because of the COUNT() aggregate and the way it’s used, the query employs a whopping 384159 reads and 738 writes, because it needs to spool information to tempdb to do its work.

The following is more efficient and only employs 650 reads and 0 writes… an enormous difference!

;with NumbersUnpivot as
(
select Num
,NewRow=(row_number() over (order by Num)-1)%NumRows+1
,NewCol=(row_number() over (order by Num)-1)/NumRows+1
from #BunchesOfNums
cross join (select NumRows=count(*) from #BunchesOfNums) C
unpivot (Num for NumColumn in (Num1,Num2,Num3,Num4,Num5,Num6,Num7,Num8)) P
)
select Num1,Num2,Num3,Num4,Num5,Num6,Num7,Num8
from (select Num
,NewRow
,ColName='Num'+convert(varchar(10),NewCol)
from NumbersUnpivot) A
pivot (max(Num) for ColName in (Num1,Num2,Num3,Num4,Num5,Num6,Num7,Num8)) P
order by NewRow
/*
Num1 Num2 Num3 Num4 Num5 Num6 Num7 Num8
------ ------ ------ ------ ------ ------ ------ ------
17 125498 250005 374416 499463 623675 749265 875013
40 125501 250008 374421 499471 623680 749268 875025
63 125512 250010 374429 499472 623687 749269 875029
67 125517 250018 374431 499480 623691 749278 875031
68 125533 250028 374433 499485 623691 749289 875033
. . .
125479 249986 374402 499427 623659 749216 874991 999988
125480 249987 374409 499433 623661 749242 874994 999991
125484 249999 374411 499446 623662 749243 874997 999991
125488 250001 374413 499453 623663 749251 875003 999992
125493 250003 374416 499455 623675 749254 875004 999996
*/
As you can see, I calculated the number of rows up front (into a column called NumRows) and I CROSS JOINed that with the table itself so that I could use it as the divisor in calculating NewRow and NewCol.

I encourage you to examine the plans of the two queries above to see the differences.

Column Data Repair

How often have you seen a scenario like this?: A table has 3 or 4 or 5 or more columns that contain the same kind of information. An address is a good example. A customer might have 1 or 2 or 3 lines of address and there are 3 columns for that information to be stored.

Another example is credit card information. A table has 5 columns to store up to 5 possible credit card numbers that we have on file for a person.

Again, this is very bad design… the credit card information should be put into a child table of its own, with one row for each credit card. Then many credit cards could be recorded and old ones can be deleted as much as desired.

But in the bad table design with 5 credit card columns, people will just erase outdated credit card numbers over time and not reorganize them and you end up with lots of possible “holes” as in the following example:

use TempDB
go
if
object_id('CredCardInfo','U') is not null drop table CredCardInfo
go
create
table CredCardInfo
(
ID int primary key
,Card1 varchar(16)
,Card2 varchar(16)
,Card3 varchar(16)
,Card4 varchar(16)
,Card5 varchar(16)
)
go

insert CredCardInfo values
(1,'','4898198371982987','5379110787298721','','')
,(2,'6798172988177792','','','5798116090379113','4798116444218977')
,(3,'','','','','4980926667212798')
,(4,'','','6877172983719871','','4079267155432178')
,(5,'4076372911736532','6888327238009837','5368200841237897','','')

select * from CredCardInfo
/*
ID Card1 Card2 Card3 Card4 Card5
-- ---------------- ---------------- ---------------- ---------------- ----------------
1 4898198371982987 5379110787298721
2 6798172988177792 5798116090379113 4798116444218977
3 4980926667212798
4 6877172983719871 4079267155432178
5 4076372911736532 6888327238009837 5368200841237897
*/
You can see that the first 4 rows have empty columns where credit card numbers used to be. We want to repair this information so that all the credit card numbers are like the 5th row, where all the card numbers are in the 1st and 2nd and 3rd slots and so on with no “holes”.

So here is a stored procedure that does just that:

use TempDb
go
if
object_id('RepairCardInfo','P') is not null drop procedure RepairCardInfo
go
create
procedure RepairCardInfo
as
with
CardsUnpivot as
(
select ID
,Card
,OrigColName
,NewColName='Card'
+convert(char(1),row_number() over (partition by ID
order by OrigColName))
from CredCardInfo
unpivot (Card for OrigColName in (Card1,Card2,Card3,Card4,Card5)) P
where Card<>''
)
,
IDsThatNeedChange as
(
select ID
from CardsUnpivot
where OrigColName<>NewColName
)
,
NewCardValues as
(
select ID,Card1,Card2,Card3,Card4,Card5
from (select ID,Card,NewColName
from CardsUnpivot
where ID in (select ID from IDsThatNeedChange)) A
pivot (max(Card) for NewColName in (Card1,Card2,Card3,Card4,Card5)) P
)
update C
set Card1=coalesce(n.Card1,'')
,Card2=coalesce(n.Card2,'')
,Card3=coalesce(n.Card3,'')
,Card4=coalesce(n.Card4,'')
,Card5=coalesce(n.Card5,'')
from CredCardInfo C
join NewCardValues N on C.ID=N.ID
Like other examples we’ve seen in this article, we UNPIVOT the data and then re-PIVOT it.

Take a closer look at the CardsUnpivot CTE… here’s what it produces:

select ID,Card,OrigColName
,NewColName='Card'
+convert(char(1),row_number() over (partition by ID
order by OrigColName))
from CredCardInfo
unpivot (Card for OrigColName in (Card1,Card2,Card3,Card4,Card5)) P
where Card<>''
/*
ID Card OrigColName NewColName
-- ---------------- ----------- ----------
1 4898198371982987 Card2 Card1
1 5379110787298721 Card3 Card2
2 6798172988177792 Card1 Card1
2 5798116090379113 Card4 Card2
2 4798116444218977 Card5 Card3
3 4980926667212798 Card5 Card1
4 6877172983719871 Card3 Card1
4 4079267155432178 Card5 Card2
5 4076372911736532 Card1 Card1
5 6888327238009837 Card2 Card2
5 5368200841237897 Card3 Card3
*/
Notice that after the UNPIVOTing, we create a NewColName, based on the ROW_NUMBER() ordered by OrigColName (within each ID). And that NewColName is sequenced as ‘Card1’, ‘Card2’, etc for each ID.

The next CTE, IDdsThatNeedChange, finds all IDs where OrigColName and NewColName are different. Note that ID Number 5 has them all matching because its data is already in the format we want, with all the numbers in slot 1 and 2 and 3.

The final NewCardValues CTE rePIVOTs the data back into the 5 credit card columns… but only for those IDs that need the change.

And that CTE is used to UPDATE the original table… updating only those rows that need change.

If we execute this procedure, and then look at the data again, we see that all is repaired:

set nocount off
exec
RepairCardInfo
/* (4 row(s) affected) */

select
* from CredCardInfo
/*
ID Card1 Card2 Card3 Card4 Card5
-- ---------------- ---------------- ---------------- ---------------- ----------------
1 4898198371982987 5379110787298721
2 6798172988177792 5798116090379113 4798116444218977
3 4980926667212798
4 6877172983719871 4079267155432178
5 4076372911736532 6888327238009837 5368200841237897
*/

Note that only 4 of 5 rows were actually UPDATEd. If we run the procedure again, nothing happens, because all of our data is already repaired:

set nocount off
exec
RepairCardInfo
/* (0 row(s) affected) */
Vertical Presentations of Data

Sometimes it’s nice to see data presented in a vertical format rather than in columns.

In AdventureWorks, there’s a view called Sales.vStoreWithDemographics, which JOINs many tables to come up with Contact information, Name and Address information, and Demographic information for Stores in the database.

Let’s say we want to create an inline table-valued function which accepts a @CustomerID parameter to present data for the Main Office of a Customer Store. Since there can potentially be more than one contact for a store, we will use ROW_NUMBER() to only look at the first instance.

If we attempt to use the UNPIVOT operator to do this, we run into trouble:

if object_id('VerticalMainOfficeData','IF') is not null 
drop function VerticalMainOfficeData
go
create
function VerticalMainOfficeData
(
@CustomerID int
)
returns table
as
return
with
BaseData as
(
select SeqNo=row_number() over (order by ContactType)
,Name,AddressLine1,AddressLine2,City,StateProvinceName
,PostalCode,CountryRegionName
,YearOpened,NumberEmployees,Specialty
,SquareFeet,Brands,AnnualSales
from AdventureWorks.Sales.vStoreWithDemographics
where CustomerID=@CustomerID
and AddressType='Main Office'
)
select Property,Value
from BaseData
unpivot (Value for Property in (Name
,AddressLine1
,AddressLine2
,City
,StateProvinceName
,PostalCode
,CountryRegionName
,YearOpened
,NumberEmployees
,Specialty
,SquareFeet
,Brands
,AnnualSales)) P
where SeqNo=1
/*
Msg 8167, Level 16, State 1, Procedure VerticalMainOfficeData, Line 22
The type of column "AddressLine1" conflicts with the type of other columns
specified in the UNPIVOT list.
*/
That’s yet another problem with the UNPIVOT operator. Every single column name specified in the list must be exactly the same datatype and have exactly the same length.

But, as we now know, we can just use an APPLY operator instead of UNPIVOT, and we get truckloads of flexibility, as you can see by this function definition, which returns our information in a nice fancy way:

if object_id('VerticalMainOfficeData','IF') is not null 
drop function VerticalMainOfficeData
go
create
function VerticalMainOfficeData
(
@CustomerID int
)
returns table
as
return
with
BaseData as
(
select SeqNo=row_number() over (order by ContactType)
,Name,AddressLine1,AddressLine2,City,StateProvinceName
,PostalCode,CountryRegionName
,YearOpened,NumberEmployees,Specialty
,SquareFeet,Brands,AnnualSales
from AdventureWorks.Sales.vStoreWithDemographics
where CustomerID=@CustomerID
and AddressType='Main Office'
)
select Property,Value
from BaseData
cross apply
(values ('NAME AND ADDRESS:','')
,(' Name',Name)
,(' Address',AddressLine1)
,(' ',AddressLine2)
,(' City',City)
,(' State/Province',StateProvinceName)
,(' Postal Code',PostalCode)
,(' Country/Region',CountryRegionName)
,('','')
,('DEMOGRAPHIC INFO:','')
,(' Year Opened',str(YearOpened,4))
,(' Number of Employees',convert(varchar(10),NumberEmployees))
,(' Specialty',Specialty)
,(' Square Feet',convert(varchar(10),SquareFeet))
,(' Brands',Brands)
,(' Annual Sales','$'+convert(varchar(20),AnnualSales,1))) P(Property,Value)
where SeqNo=1
and Value is not null
Since we’re effectively UNIONing the data in the APPLY, we of course had to CONVERT any numerical data to a VARCHAR value, but we didn’t have to care about the various lengths of the various VARCHAR values. Here are a couple of examples of the output:

select * from dbo.VerticalMainOfficeData(34)
/*
Property Value
--------------------- -----------------------------
NAME AND ADDRESS:
Name Cycles Wholesaler & Mfg.
Address Science Park South, Birchwood
Stanford House
City Warrington
State/Province England
Postal Code WA3 7BH
Country/Region United Kingdom

DEMOGRAPHIC INFO:
Year Opened 1999
Number of Employees 15
Specialty Touring
Square Feet 21000
Brands 4+
Annual Sales $800,000.00
*/

select * from dbo.VerticalMainOfficeData(388)
/*
Property Value
--------------------- ---------------------------
NAME AND ADDRESS:
Name Road Way Sales and Rental
Address 2900 - 25055 Dunning Street
City Vancouver
State/Province British Columbia
Postal Code V7L 4J4
Country/Region Canada

DEMOGRAPHIC INFO:
Year Opened 1979
Number of Employees 6
Specialty Mountain
Square Feet 10000
Brands 3
Annual Sales $300,000.00
*/

Notice in the first example how the second address line is displayed… with a blank string in the Property column… so it looks like it’s just a second line of the address like it should be. The second example had no second line of address (it was NULL in the original table)… it was eliminated by the WHERE clause that eliminated any NULL values.

Finally, I decided to go wild and create a more extensive function that included contact information (up to 2 contacts… no Main Office of any Store in AdventureWorks has more than 2 contacts anyway).

Here is the function I created to do that.

if object_id('VerticalMainOfficeDataWithContacts','IF') is not null 
drop function VerticalMainOfficeDataWithContacts
go
create
function VerticalMainOfficeDataWithContacts
(
@CustomerID int
)
returns table
as
return
with
BaseData as
(
select SeqNo=row_number() over (order by ContactType)
,Name,AddressLine1,AddressLine2,City,StateProvinceName
,PostalCode,CountryRegionName
,YearOpened,NumberEmployees,Specialty
,SquareFeet,Brands,AnnualSales
,ContactType,Title,FirstName,MiddleName
,LastName,Suffix,Phone,EmailAddress
from AdventureWorks.Sales.vStoreWithDemographics
where CustomerID=@CustomerID
and AddressType='Main Office'
)
,
ContactInfo as
(
select SeqNo
,ContactType
,ContactName=coalesce(Title+' ','')
+FirstName+' '
+coalesce(MiddleName+' ','')
+LastName
+coalesce(' '+Suffix,'')
,Phone
,EmailAddress
from BaseData
)
,
ContactUnpivot as
(
select Property=Property+str(SeqNo,1)
,Value
from ContactInfo
cross apply (values ('ContactType',ContactType)
,('ContactName',ContactName)
,('Phone',Phone)
,('EmailAddress',EmailAddress)) C(Property,Value)
)
,
ContactColumns as
(
select *
from ContactUnpivot
pivot (max(Value) for Property in (ContactType1,ContactName1,Phone1,EmailAddress1
,ContactType2,ContactName2,Phone2,EmailAddress2)) P
)
,
UnpivotInput as
(
select *
from BaseData
cross join ContactColumns
where SeqNo=1
)
select Property,Value
from UnpivotInput
cross apply
(values ('NAME AND ADDRESS:','')
,(' Name',Name)
,(' Address',AddressLine1)
,(' ',AddressLine2)
,(' City',City)
,(' State/Province',StateProvinceName)
,(' Postal Code',PostalCode)
,(' Country/Region',CountryRegionName)
,('','')
,('DEMOGRAPHIC INFO:','')
,(' Year Opened',str(YearOpened,4))
,(' Number of Employees',convert(varchar(10),NumberEmployees))
,(' Specialty',Specialty)
,(' Square Feet',convert(varchar(10),SquareFeet))
,(' Brands',Brands)
,(' Annual Sales','$'+convert(varchar(20),AnnualSales,1))
,('','')
,('CONTACT'+case when ContactType2 is not null then ' #1' else '' end+':','')
,(' Type',ContactType1)
,(' Name',ContactName1)
,(' Phone',Phone1)
,(' Email',EmailAddress1)
,('',case when ContactType2 is not null then '' end)
,('CONTACT #2:',case when ContactType2 is not null then '' end)
,(' Type',ContactType2)
,(' Name',ContactName2)
,(' Phone',Phone2)
,(' Email',EmailAddress2)) P(Property,Value)
where Value is not null
I’ll leave it up to you to examine how it works as an exercise. In a nutshell, it unpivots and re-pivots the contact information into a single-row derived table, and then that’s JOINed back to the original base data and then the whole thing is unpivoted.

Here are a couple examples of the output. Note how the first one has only 1 contact (with the heading ‘CONTACT’) and the second one has 2 contacts (with the headings ‘CONTACT #1’and ‘CONTACT #2’:

select * from dbo.VerticalMainOfficeDataWithContacts(34)
/*
Property Value
--------------------- -----------------------------
NAME AND ADDRESS:
Name Cycles Wholesaler & Mfg.
Address Science Park South, Birchwood
Stanford House
City Warrington
State/Province England
Postal Code WA3 7BH
Country/Region United Kingdom

DEMOGRAPHIC INFO:
Year Opened 1999
Number of Employees 15
Specialty Touring
Square Feet 21000
Brands 4+
Annual Sales $800,000.00

CONTACT:
Type Owner
Name Ms. Barbara J. German
Phone 1 (11) 500 555-0181
Email barbara4@adventure-works.com
*/


select * from dbo.VerticalMainOfficeDataWithContacts(50)
/*
Property Value
--------------------- ------------------------
NAME AND ADDRESS:
Name Hometown Riding Supplies
Address Bundesallee 9571
Rechnungsstelle - C 035
City Berlin
State/Province Brandenburg
Postal Code 14197
Country/Region Germany

DEMOGRAPHIC INFO:
Year Opened 1991
Number of Employees 46
Specialty Mountain
Square Feet 38000
Brands 4+
Annual Sales $1,500,000.00

CONTACT #1:
Type Owner
Name Mr. William Vong
Phone 809-555-0100
Email william6@adventure-works.com

CONTACT #2:
Type Purchasing Manager
Name Mr. Greg Harrison
Phone 1 (11) 500 555-0112
Email greg2@adventure-works.com
*/
I hope this article gave you some interesting examples of using UNPIVOT (and yes, PIVOT also). I also hope it stimulated your imagination on creative things you can do with T-SQL.

13 comments:

  1. Brad,

    Wow. We were anticipating a blog post not an entire chapter. You have the most creative way of presenting & narrating ideas. I am pretty sure you make an exceptional speaker and can't wait to listen. Have you talked to Mark?

    ReplyDelete
  2. I just used UNPIVOT on a TABLE that has Monthly and Year-to-date numbers. I need to do a set of calculations on both of them. In order to not repeat each calculation, the answer was SELECT Id, Monthly UNION ALL SELECT Id, YTD or UNPIVOT. UNPIVOT has non-intuitive syntax, but it does make a query look a lot nicer.

    ReplyDelete
  3. That was awesome Brad,
    It's cool to see you flex your UNPIVOT, PIVOT muscles.
    I can imagine an application for these PIVOT/UNPIVOT where it might be easier to arrange and sort things in the database so that Reporting Services (or any other reporting tool) has an easier time of things. Your credit card example comes to mind.

    ReplyDelete
  4. @Sankar: Thanks! I've considered talking to Mark about giving a presentation... but I don't really know our SIG audience that well... my gut feeling is that they are mostly Administrators rather than Developers... but I'll have to confirm that.

    @Brian: UNPIVOT's syntax is (perhaps) easier to understand than PIVOT's, and yes, I agree that it does make a query look nicer/compact. BTW, in reality, UNIONing is faster than an UNPIVOT/APPLY in my testing on large data sets. I was considering posting a "Part 2-1/2" to talk about this and other little tidbits that came up in writing these articles.

    @Michael: Thanks! I agree regarding preparing data for reporting... good point.

    ReplyDelete
  5. Is SQL Server smart enough to rule out an UNPIVOT. Well, i think it's easier to say in code.

    Case A:

    WITH
    A(Id, A, B)
    AS
    (
    SELECT 1, 1, 2 UNION ALL
    SELECT 2, 3, 4
    ),
    B
    AS
    (
    SELECT
    Id,
    D,
    C
    FROM
    A UNPIVOT(C FOR D IN(A, B)) A
    )
    SELECT
    Id,
    C
    FROM
    B
    WHERE
    D = 'A';


    Case B:

    WITH
    A(Id, A, B)
    AS
    (
    SELECT 1, 1, 2 UNION ALL
    SELECT 2, 3, 4
    ),
    B
    AS
    (
    SELECT
    Id,
    'A' D,
    A
    FROM
    A
    UNION ALL
    SELECT
    Id,
    'B' D,
    B
    FROM
    A
    )
    SELECT
    Id,
    A
    FROM
    B
    WHERE
    D = 'A';


    The first case doesn't change if the WHERE clause is there or not. The second case, however, does.

    Just wondering which is better when writing a VIEW that is expected to be used with just such a WHERE clause.

    ReplyDelete
  6. @Brian: At first I didn't understand... but now I get it... Yes, the absence/presence of the WHERE clause in the first query DOES NOT change the underlying Query Plan... but the Query Plan DOES change on the second query depending on whether the WHERE is there or not.

    Yes, if you want a view to have more potential for flexibility, then the UNION choice would be better. The UNPIVOT will force that APPLY operator to be in the plan no matter what.

    I found out (from Rob Farley) that UNPIVOT is translated to an OUTER APPLY (rather than a CROSS APPLY) because it has the POTENTIAL to make the query more efficient... but UNPIVOT always throws in that filter on NULL, so it kind of ruins the whole thing anyway.

    Even if we hard-code OUTER APPLY (rather than use UNPIVOT), we still can't take advantage of any short-circuiting just because of the nature of how APPLY works.

    Hope this helps answer your question.

    ReplyDelete
  7. @Brad

    Thanx! Not only does it answer the question, i'm going to change my code because of it. There's no reason to block out a good (chance for an) optimization later on.

    Sorry about the confusion. I forgot the format would get messed up. I also figured that explaining what i meant would have taken longer (and might have added confusion rather than removing it) than to give you the basics.

    So, i pretty much agree with you now. No reason to use PIVOT/UNPIVOT that have very specific cases, they can both be achieved manually, and the manual method allows optimization.

    Bra, thanx for the great posts.

    ReplyDelete
  8. Great post Brad! For some reason my "blogs that I follow" did not show, you had a new entry...?

    Anyway, I loved the article. I found it be a little long, but I enjoyed reading it. You have presented a lot of great content.

    Note: In the example, you stated that all columns must have the same data type to be UNPIVOTED. This is true, but the easy workaround for this is to convert each column to VARBINARY(N) (making sure it is large enough). At this point you can use a simple cast to varchar in the outer select list.

    Again another great post Brad! :^)

    ReplyDelete
  9. @Adam:

    Thanks for your comments... Sorry about the length, but there was so much stuff to show... and I didn't want to create another article in the series.

    Great point about VARBINARY(N)... I notice I could also just CAST each column to a SQL_VARIANT, and therefore not worry about the length (except in the VARCHAR CASTing in the outer SELECT list).

    ReplyDelete
  10. This didn't show up on my aggregator (google reader) either. I saw that the unpivot 1 post had an active link to this post and was surprised (because I hadn't seen this part 2), but I clicked it anyways (expecting it to fail). Much to my surprise, it did not. Just figured I'd re-iterate this because that's a lot of work put into an article that many will miss because it didn't properly RSS for some reason.

    ReplyDelete
  11. @Garadin:

    I had a problem with this post... because of its length, it made my feed too long (and therefore invalid), and I didn't realize it for several hours. I fixed the feed so that it only feeds 4 blog posts maximum and that seemed to help, but I guess it was probably too late.

    It's unfortunate... You're apparently not the only person to have the problem (see Adam's post above).

    ReplyDelete
  12. >> Aggregations over Columns <<
    Hi Brad,
    In CASE method you can simplify the query.
    Your logic for 3 columns is:
    Case when col1 > col2 and col1 > col3 then col1
    when col2 > col1 and col2 > col3 then col2
    when col3 > col1 and col3 > col2 then col3
    end as maxium_value

    But the simpler logic is:
    Case when col1 > col2 and col1 > col2 then col1
    when col2 > col3 then col2
    else col3
    end as maximum_value

    Also in this case we can use mixture of simple logic and ALL() predicate like this:
    Case when col1 >ALL (select col2 union select col3) then col1
    when col2 > (select col3) then col2
    else col3
    end as maximum_value

    ReplyDelete