Wednesday, July 8, 2009

It's the Natural Order of Things... NOT!

In moving from the Visual FoxPro (VFP) world to the SQL Server world, the biggest thing that I had to un-learn was the concept of a “natural order” of a table. In VFP, when you add rows to a table, they ALWAYS get appended to the end of the table. No exceptions. So when you do a SELECT FROM the table, the rows will ALWAYS appear in the same order no matter what. You can create as many indexes as you like on the table, and it still won’t make any difference, because VFP will ALWAYS use the “natural order” un-indexed rows when you do a SELECT FROM the table.

All of that gets thrown out the window when it comes to SQL Server. I see many posts on the MSDN T-SQL Forum from people who present some sample data and want some kind of solution that assumes a certain order in the table. For example, here’s a typical example of a post I’ve seen many times:

My data looks like this:

Code  Quantity
---- --------
A 23
A 15
A 11
B 14
B 18
A 43
C 12
And I want to sum up the quantities by Code and get this result:

Code  TotQuantity
---- -----------
A 49
B 32
A 43
C 12
The inevitable and immediate question from others on the forum trying to help is: “You can’t do what you want unless you order on some other column. Is there a datetime column or something?” And many times, the answer is no, there is no order. This is just what they see when they do a SELECT * and they assume that this is a “natural” order of the rows as they were chronologically INSERTed into the table.

Wrong-o.

I can certainly understand that this is a natural assumption to make. Coming from the VFP world, it’s certainly the assumption I would have made… until I learned more about how SQL Server works.

Let’s see just how wrong this assumption is.

Create a table and insert 10 rows into it and let’s take a look at what we get when we do a SELECT * from it:

create table #test 
(
col1 varchar(900) not null
,col2 char(1) not null
)

insert #test values ('B'+replicate('.',899),'N')
insert #test values ('R'+replicate('.',899),'A')
insert #test values ('A'+replicate('.',899),'T')
insert #test values ('D'+replicate('.',899),'U')
insert #test values ('S'+replicate('.',899),'R')
insert #test values ('C'+replicate('.',899),'A')
insert #test values ('H'+replicate('.',899),'L')
insert #test values ('U'+replicate('.',899),'O')
insert #test values ('L'+replicate('.',899),'R')
insert #test values ('Z'+replicate('.',899),'D')

select
* from #test
/*
col1 col2
--------- ----
B........ N
R........ A
A........ T
D........ U
S........ R
C........ A
H........ L
U........ O
L........ R
Z........ D
*/
Great. That makes sense. It’s the “natural” chronologically-entered order of the rows. You can see my name (Brad Schulz) spelled down vertically in the first column and “Natural Ord” spelled down vertically in the second column. That’s the order we put them in, and so it’s not surprising that’s the way they come out.

Now let’s insert two more rows and do another SELECT *:

insert #test values ('X','X')
insert #test values ('Y','Y')
select * from #test
/*
col1 col2
--------- ----
B........ N
R........ A
A........ T
D........ U
S........ R
C........ A
H........ L
U........ O
X X
Y Y
L........ R
Z........ D
*/
Huh? Why didn’t it append them at the end? Well, because SQL Server just doesn’t do an append. Table data is stored as a collection of 8-KB (8192-byte) pages. If you subtract the space for page header information and other overhead, that leaves 8060 bytes for the data on each page. And here’s the key: a row of data can not span multiple pages.

The first 10 rows I created were 901 bytes each. The first 8 rows took up 8 * 901 = 7208 bytes, leaving 8060 - 7208 = 852 bytes left in the page. The ninth row of 901 bytes was too big to fit in that page, so it was placed on a whole new page, along with the tenth row (since it, too, was too big to fit on the first page).

But when I added an 11th and 12th row, I purposely made those rows only 2 bytes each. And SQL Server plopped them on the first data page of the table because there was easily enough room for them. So now when we do a SELECT *, we are getting the first page’s worth of data followed by the second page.

You can read more at Books Online about SQL’s data storage in pages.

Now let's stir things up and create a clustered index on col1:

create clustered index ix_test1 on #test (col1)
select * from #test
/*
col1 col2
--------- ----
A........ T
B........ N
C........ A
D........ U
H........ L
L........ R
R........ A
S........ R
U........ O
X X
Y Y
Z........ D
*/
The table is no longer a heap of unordered rows… the table is now ordered on the column we called col1. So a SELECT is now going to retrieve data from the table in that order. If you look at the query plan, you can see that it’s doing a Clustered Index Scan.

But wait, that’s not all…

Now create a nonclustered index on col2:

create nonclustered index ix_test2 on #test (col2)
select * from #test
/*
col1 col2
--------- ----
C........ A
R........ A
Z........ D
H........ L
B........ N
U........ O
L........ R
S........ R
A........ T
D........ U
X X
Y Y
*/
Hmmm… Why does it now come out in col2 order? After all, we did a SELECT on ALL columns of the table, and all the data is stored in the clustered index. So why does the query plan show that an Index Scan was done on the nonclustered index instead of doing a Clustered Index Scan?

Our nonclustered index on col2 (and any nonclustered index we create) automatically includes col1 in the index as well, because it needs that to do a lookup into the clustered index in case it needs to acquire any further columns. So our nonclustered index has both col1 and col2 in it… all the columns of our table are in it!

And why does SQL prefer the NonClustered Index Scan over the Clustered Index Scan, even though they both have exactly the same amount of data? Well, in normal real-life situations, a nonclustered index will have less data in its pages than a clustered one (since the clustered index contains ALL columns). The size of a nonclustered index will be either equal to or less than the size of a clustered index… it will never be bigger… so for that reason, SQL builds an execution plan that will always favor a covering nonclustered index.

Now let’s remove the two indexes… no ordering at all. What happens?

drop index #test.ix_test1
drop index #test.ix_test2
select * from #test
/*
col1 col2
--------- ----
A........ T
B........ N
C........ A
D........ U
H........ L
L........ R
R........ A
S........ R
U........ O
X X
Y Y
Z........ D
*/
Wait a minute… We got rid of the clustered index. Why is the data in col1 order now? I assume that when we removed the clustered index, SQL scanned the clustered index (in col1 order) and re-wrote the data into a heap (a bunch o’ data without a clustered index).

I find it interesting, though, that it didn’t follow the rules of putting a new row into the first page it can find that has room. In other words, it wrote out the first 8 rows (the ones where col1 starts with A, B, C, D, H, L, R, and S), taking up 7208 bytes. Then row number 9’s 901 bytes was too big to fit, so it started a new page. But then row numbers 10 and 11 were our 2-byte rows, and they were apparently APPENDED AFTER row 9 in the second page and WERE NOT put into the first page where there was ample room for them. Perhaps, in order to remove the clustered index as quickly as possible, it’s just faster to spit out the rows onto successive pages rather than trying to do a “first page that fits” algorithm for every single row. (If anyone has any insight into this, please chime in).

If we add a new row, we can see that it follows the “first page that fits” rule, just as we had seen the last time we manually inserted records, putting the row into the first page:

insert #test values ('Z','Z')
select * from #test
/*
col1 col2
--------- ----
A........ T
B........ N
C........ A
D........ U
H........ L
L........ R
R........ A
S........ R
Z Z
U........ O
X X
Y Y
Z........ D
*/
So, after all this, I think we can throw the “natural order” concept out the window, don't you? There’s just no such thing.

4 comments:

  1. Hi Brad,

    The "are tables ordered" is a battle we fight all the time :-). Another viewpoint is to not look at the physical level (pages etc), but at the logical level. The relational model clearly states that a relation (what we call table) is not ordered, and the maths that consitutes the relational model is based on that. This is the foundation for the SQL language. Old news, perhaps, but I find it worth mentioning.

    But what I really wanted to commen what the behavior after dropping the clustered index. When you drop a clustered index, the table is of course converted from a clustered table to a heap table. But SQL Server do not have to re-build the actual data pages (formely the leaf level of the clustered index). They stay the same.

    ReplyDelete
  2. Ahhh, thanks Tibor, that makes sense now that I think about it.

    ReplyDelete
  3. Hi Brad, got a link to the same content over on SQL Serverpedia and did a post about it on my blog showing how VistaDB differs a bit (works more like Foxpro in this one respect). Didn't see this article, added a link to it after another poster pointed me to it.

    Great job, very clear explanation and it is entirely accurate.

    ReplyDelete
  4. Brad

    Here is an ORM that works with VistaDB
    https://www.kellermansoftware.com/p-47-net-data-access-layer.aspx

    ReplyDelete