Wednesday, September 16, 2009

The Age-Old SELECT * vs. SELECT 1 Debate

There is a certain look that T-SQL Developers get when they see SELECT * used in production code. In the span of a couple seconds, you’ll see a slight tic develop in their left eye, and you’ll hear little whimpering sounds coming through their nose, which is quickly followed by a not-so-subtle twitching of their lips, almost like they’re holding in bubbling lava before a volcanic explosion. And then there’s a sharp intake of breath and…

Omigosh! A SELECT *!!

It’s not a pretty picture.

But there is a place where a SELECT * is considered okay by some (though not all) developers. And that’s in an EXISTS subquery.

Consider the following query, which finds all Customers in the NorthWind database who do not have any orders:

select CustomerID
from NorthWind.dbo.Customers
where not exists (select *
from NorthWind.dbo.Orders
where CustomerID=Customers.CustomerID)
/*
CustomerID
----------
FISSA
PARIS
*/
Even though a SELECT * is used within the subquery, SQL doesn’t do anything at all with the columns in the SELECT list when it actually runs the query. It only cares about the existence of rows and doesn’t care about the specific attributes of the rows. The only thing that’s really important is the WHERE clause.

In order to firmly demonstrate that the SELECT list is not used, I wrote the following tongue-in-cheek query, which would ordinarily bring about truckloads of runtime errors, but it runs correctly and without any error whatsoever:

select CustomerID
from NorthWind.dbo.Customers
where not exists (select DivideByZero=1/0
,BadConversion=convert(int,'xxx')
,InvalidParameter=left('abc',-ShipVia)
,DateOverflow=dateadd(year,9999,getdate())
,ArithmeticOverflow=cast(1e308 as tinyint)
,InvalidCursorRef=cursor_status('junk','junk')
,BadSubquery=(select top (-OrderID) EmployeeID
from NorthWind.dbo.Employees)
from NorthWind.dbo.Orders
where CustomerID=Customers.CustomerID)
/*
CustomerID
----------
FISSA
PARIS
*/
The query plan for both of the above queries is exactly identical. It only shows an Index Scan of the Orders.CustomersOrders non-clustered index. And the properties of that Index Scan operator shows that only the CustomerID column is in its output list (because it was specified in the WHERE clause):



Let me show you one other thing I discovered.

Books Online states in the SELECT Clause topic that “the maximum number of expressions that can be specified in the select list is 4096.”

Oh yeah?

This is true of a normal query. Consider the following, where I populate a variable with a comma-separated list of 4097 zeroes, and I incorporate that into a regular SELECT statement. When this is executed, it bombs with the compile-time error message 1056, stating that the SELECT list had over 4096 elements.

declare @columnlist nvarchar(max)
,@sql nvarchar(max)

--I have to CAST() the first parameter of REPLICATE
--in order to force it to produce more than 8000 bytes
select @columnlist='0'+replicate(cast(',0' as nvarchar(max)),4096)

select @sql='select '+@columnlist+' from NorthWind.dbo.Orders'
exec sp_executesql @sql
--Bombs with Msg 1056: The number of elements in the select
-- list exceeds the maximum allowed number of 4096 elements.
But if I insert that same 4097-element SELECT list into an EXISTS subquery, it runs with no problem whatsoever:

select @sql='select CustomerID
from NorthWind.dbo.Customers
where not exists (select '
+@columnlist+'
from NorthWind.dbo.Orders
where CustomerID=Customers.CustomerID)'
exec sp_executesql @sql
/*
CustomerID
----------
FISSA
PARIS
*/
Incredibly, you can give it a SELECT list of 10,000 items and it will compile and execute without error… the compile time takes a while, as it parses the statement and makes sure the syntax is correct, but a compile-time error does not occur, because it just tosses out the SELECT list altogether. And, once again, the query plan is unchanged.

Hey, I can even introduce a GROUP BY and aggregates, and still, they have no impact whatsoever and the same final query plan is generated:

select CustomerID
from NorthWind.dbo.Customers
where not exists (select ShipVia
,count(*)
,sum(Freight)
from NorthWind.dbo.Orders
where CustomerID=Customers.CustomerID
group by ShipVia)
/*
CustomerID
----------
FISSA
PARIS
*/
In fact, I can even do an illegal SELECT DISTINCT and it doesn’t even blink an eye. As you can see below, I add a new XML column to the Orders table and then do a SELECT DISTINCT * in the EXISTS subquery. Ordinarily, this would be an illegal operation, because you can’t do a SELECT DISTINCT on non-comparable columns (i.e. XML, TEXT, NTEXT and IMAGE). But heck, it doesn’t matter when it’s in an EXISTS subquery!

alter table NorthWind.dbo.Orders 
add NonComparableColumn xml
go
select CustomerID
from NorthWind.dbo.Customers
where not exists (select distinct *
from NorthWind.dbo.Orders
where CustomerID=Customers.CustomerID)
/*
CustomerID
----------
FISSA
PARIS
*/
And again, the query plan for every single one of all of the above EXISTS queries is exactly the same.

So if the SELECT list doesn’t make any difference in an EXISTS subquery, why do many developers insist on doing a SELECT 1 instead of a SELECT *? Part of it is just a religious purity kind of thing… The * is the Anti-Christ and should be exorcised at all costs. But part of it is also the argument that doing a SELECT * will add unnecessary nanoseconds of compile time to the query because SQL will expand the * into columns and then will turn right around and remove them because they’re unnecessary anyway.

To be absolutely honest, I thought that was nonsense, especially in light of all the illustrative demos I put together above (particularly the SELECT DISTINCT * example). And it just seemed like extra unnecessary work for the compiler to do.

However, I certainly can’t argue with Conor Cunningham, SQL Server Query Optimization Development Lead at Microsoft. He stated back in November 2005 that it may be ever-so-slightly faster to do a SELECT 1. But in a February 2008 blog entry, he stated categorically that the * expansion does happen and that SELECT 1 will avoid the examination of table metadata. You conspiracy theorists may find it interesting that the direct link to this blog entry is no longer functional, but you may be able to see it at this link.

So, I’ll leave it up to you whether to do a SELECT * or a SELECT 1 in your EXISTS subqueries. As for me, I always used to do a SELECT *, but perhaps, to be different, I’ll just do a SELECT 1/0 in all of mine from now on. Sure, it will take a couple of extra femtoseconds to compile (compared to the boring SELECT 1), but the opportunity to do a kind of mischievous “wink wink” will be worth it.

12 comments:

  1. SELECT 1 WHERE EXISTS(SELECT 1/0)

    That is awesome.

    I think SELECT * is clearer, stating, i don;t care about the COLUMNs. I kind of wish they'd remove the SELECT clause itself from the EXISTS subquery, and start from FROM.

    And thanx for the link.

    ReplyDelete
  2. Brad,

    Good post, as always. I use SELECT 1 out of habit and because I dont see a need to use a column list, for the exists clause. For the distinct clause in the exists or IN clause, I recommend that it is avoided. You should avoid distinct in the IN and Exists clause because it is pointless. Both of these clause are essentially looking for whether or not the statement is true. Once true, a value of 1 is returned, so it doesnt check anything else. This is why it is pointless because it is either true or not, so there is no need to remove duplicates from the list. An equivelent clause would look like this: select * where something IN('a','b','a'). In this example, it does not matter that a is in the list twice because it will evaluate to true or false.

    I would like to shed more light on the exists clause select list.
    The select list is not relevant, in the exists clause, because the predicate drives it. The optimizer uses the outer column in predicate to lookup an existing value in the inner table (usually a nested loop join). If a match is found then a 1 is returned. The exists clause only cares that the statement evaluates to true. Nothing more, Nothing less.

    PS: Sorry for being long winded.

    Take this example into consideration:

    DECLARE @t1 TABLE(
    id INT PRIMARY KEY CLUSTERED
    )
    INSERT INTO @t1 VALUES (1);
    INSERT INTO @t1 VALUES (2);

    DECLARE @t2 TABLE(
    id INT PRIMARY KEY CLUSTERED
    )

    INSERT INTO @t2 VALUES (1);

    --one row is returned because it was found in @t1 and not @t2.
    --The exists row evaluates to true for id=1, so the not of true is false
    --the exists row evaluates to false for id=2, so the not of false is true
    SELECT *
    FROM @t1 t1
    WHERE NOT EXISTS(SELECT * FROM @t2 t2 WHERE t1.id = t2.id)

    --no results are returned because exists always evaluates to true
    --because a record does exist in @t2 and the not of true is false.
    SELECT *
    FROM @t1 t1
    WHERE NOT EXISTS(SELECT 2 FROM @t2 t2)

    --all rows are retuned because the predicate is always true
    --because the exists evaluate to false and the not of false is true
    SELECT *
    FROM @t1 t1
    WHERE NOT EXISTS(SELECT * FROM @t2 t2 WHERE t2.id = 2)

    ReplyDelete
  3. Excellent. Good explanation. I'll back pocket this for the next zealot that tries to argue for one or the other, hehe.

    ReplyDelete
  4. You might find this Connect Item of interest too https://connect.microsoft.com/SQLServer/feedback/details/671475/select-test-where-exists-select

    BTW: Regarding Conor's point I looked at this (in SQL Server 2008) and was not able to find any difference whatsoever between "SELECT 1" and "SELECT *" even on a table with 1024 columns. I did find that both queries were significantly and equivalently speeded up against a table with 1 column though (both tables were empty so the only difference is the column metadata) so I wonder if internally the "SELECT 1" gets rewritten as "SELECT *" then goes through the column expansion described by Conor.

    The script I used is here http://stackoverflow.com/questions/1597442/subquery-using-exists-1-or-exists/6140367#6140367

    ReplyDelete
  5. Hi Brad,

    Good stuff as usual.

    Two contributions:

    SELECT 'Test' WHERE EXISTS (SELECT LOG(0));

    and SELECT * is still illegal syntax in schema-bound objects, so I find myself using SELECT {constant} more and more. SELECT PI() is a good one.

    Cheers,

    Paul

    ReplyDelete
  6. Speechless with the "SELECT 1 WHERE EXISTS(SELECT 1/0)"

    ReplyDelete
  7. Awesome and fun summary!

    I am soo tempted to start using 1/0, although u'll never know when something like that will change and your stored procs will start failing all over the shop :)

    />L

    ReplyDelete
  8. At the moment, this appears to be a working link to Conor's blog post: http://www.sqlskills.com/blogs/conor/exists-subqueries-select-1-vs-select/

    ReplyDelete
  9. A good reason to use if exists (select 1... is to develop the habit. If you ever end up writing an oracle query it does make a difference.

    ReplyDelete
  10. Select * will throw when the user doesn't have permission to all columns, no??

    ReplyDelete