Execution Plans Don’t Have the Yellow Bang They Really Need.

When SQL Server is purely making an estimate up out of thin air, with no freakin’ idea what the real answer is, it doesn’t give you any kind of warning. It just produces an estimate that looks scientifically accurate – until you know how it’s actually getting calculated.

Let’s ask SQL Server how many users have cast more than ten million votes. You can use any Stack Overflow database for this:

FROM dbo.Users u
WHERE u.UpVotes + u.DownVotes > 10000000;

To give SQL Server the best possible chance to calculate it, I’ll even create a couple of indexes, which will also create perfectly accurate fully scanned statistics on the relevant columns:

CREATE INDEX UpVotes_DownVotes ON dbo.Users(UpVotes, DownVotes);
CREATE INDEX DownVotes_UpVotes ON dbo.Users(DownVotes, UpVotes);

When I run the query, it doesn’t actually produce any results:

But if I look at the execution plan, the estimates are a little off:

SQL Server estimated that precisely 2,675,250 rows would come back – but none did.

So where does that 2,675,250 number come from? It would sound like it was the result of some really scientific calculations, but in reality, it’s just a hard-coded 30% of the number of rows in the table.

There’s absolutely no logic behind that estimate.

30% is a purely made-up number, and it’s hard-coded: you get the same answer if you use greater than OR less than, and with ANY NUMBER YOU COULD PICK. In my case, I picked ten million rows, but you can put any number in that query, and you still get a 30% estimate.

When SQL Server makes an estimate up out of thin air based on hard-coded rules like 30%, it needs to warn us with a yellow bang so that we can jump to fixing that specific problem.

The bigger your query gets,
the more important this becomes.

I started with a really simple example with a small blast radius: that estimation error doesn’t really harm anything. However, in real-world queries, you start to layer in joins and ordering, like finding all of the users and their badges:

FROM dbo.Users u
INNER JOIN dbo.Badges b ON u.Id = b.UserId
WHERE u.UpVotes + u.DownVotes > 10000000
ORDER BY (u.UpVotes + u.DownVotes) DESC;

You would hope that SQL Server would go find the users first, and then short-circuit out of the rest of the query if it didn’t find any matching users. You would have also hoped that we didn’t have a hard-coded estimate of 30% here too, though, and I’ve got bad news about your hopes. I’m here to turn your hopes to nopes, because here’s the live query plan:

SQL Server scans both tables at the same time because it assumes so many users will match – might as well start work on the Badges table, right? No sense in waiting around to short-circuit – we’re going to be working with millions of Users rows! Better get this party started.

After you learn about this, your next steps should probably be: