SQLComparison of NULL values by use of set operators in SQL Server
In this article we will go through a common task of searching a table for a filtering value using various approaches, ending up with the most elegant solution using set operators. This way of thinking will then be explored in other common usages, such as joins and MERGE statements.
Searching through a table using a SARG
Being able to search a table for a filtering value using an index can significantly improve performance in SQL Server.
A search without an index results in a so-called "Table Scan" and that implies that the whole table (maybe millions of rows) is read and compared to the filtering value, and this we want to avoid for performance reasons. A better performing search using an index can result in a so-called "Seek" and that implies that only the relevant part of the index (maybe just 5 or 10 rows) is read and compared to the filtering value, and this is generally much faster because of the smaller amount of IO traffic (disk reads).
To make a Seek happen requires of course that the table has an index on the column we are searching, but it also has some additional requirements to the expression in the WHERE clause: You can not apply any manipulation to the filtered column, like a function on the column, and your operator must be one that represents a consecutive range within the order that is defined by the index key. If these requirements are fulfilled, then the expression in the WHERE clause is considered a Search ARGument (also known as a SARG), and there is the potential for the SQL Server query optimizer to perform a seek instead of a table scan.
Our example: The Customer table
This calls for an example. Assume we have a table called Customer with a clustered index on the column CustomerId:
CREATE TABLE dbo.Customer ( CustomerId INT IDENTITY(1,1) NOT NULL, Firstname NVARCHAR(30) NULL, Lastname NVARCHAR(30) NULL, Street NVARCHAR(60) NULL, Zip NVARCHAR(10) NULL, City NVARCHAR(30) NULL, Phone NVARCHAR(24) NULL, CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED ( CustomerId ) );
And we have an additional index on the Phone column:
CREATE NONCLUSTERED INDEX Customer_Phone ON dbo.Customer ( Phone );
Let us say we want to write a stored procedure that takes the phone number as a parameter and returns that (or those) customers with the given phone number.
CREATE PROCEDURE FetchCustomer @Phone NVARCHAR(24) AS BEGIN SELECT CustomerId, Phone FROM dbo.Customer WHERE Phone = @Phone; END;
We can then execute the stored procedure to find customers with a phone number of 55512345 like this:
EXEC dbo.FetchCustomer @Phone = '55512345';
If we in SQL Server Management Studio first press Ctrl-M and then F5, then we will see the actual query plan when we execute the stored procedure. The query plan shows a Seek that takes advantage of our index
Customer_Phone. And of course, the query correctly returns the customer with the phone number specified.
For the rest of this article I want to focus on the query part, not the stored procedure part. So to make the example a bit simpler I remove the stored procedure and the execution of it, and I just have this simple script that does the same thing:
DECLARE @Phone NVARCHAR(24) = '55512345'; SELECT CustomerId, Phone FROM dbo.Customer WHERE Phone = @Phone;
Some customers do not have a phone number in the table and the value is then stored as NULL in the Phone column. To search for customers with no phone number one might think that this is the way to do it:
DECLARE @Phone NVARCHAR(24) = NULL; SELECT CustomerId, Phone FROM dbo.Customer WHERE Phone = @Phone;
However, this will not work because in SQL one NULL is not considered equal to another NULL. The query returns no customers even though several customers are missing the phone number.
Using the IsNull function
Most people do the following to fix it: They use the
IsNull function to replace the NULL with a value that can not normally appear in the data. For phone numbers the value
'XYZ' is not expected and can be used:
DECLARE @Phone NVARCHAR(24) = NULL; SELECT CustomerId, Phone FROM dbo.Customer WHERE IsNull(Phone, 'XYZ') = IsNull(@Phone, 'XYZ');
Now we are getting the right result. The query will return the customers without phone number.
But the expression in the WHERE clause is not a SARG. The query plan shows an "Index Scan". So we are scanning all rows in the index. Note here that we are in fact using the index
Customer_Phone, but not in the best way. What happens is that the SQL Server query optimizer sees that we are only interested in the columns that are covered by our index
Customer_Phone so it does not have to use the table itself. Hereby it reduces the amount of data to read because the total size of the index is smaller than the table. But since we are applying a function (IsNull) to the Phone column then the expression is no longer a SARG and we can not benefit from a Seek to precisely pinpoint the customers without phone numbers, and we have to read all of the millions of rows in the index. That is what is happening when we see an "Index Scan" in the query plan. It is still better than a "Table Scan", but we would much prefer to see an "Index Seek" instead.
Using an OR clause
Here is the normally recommended way to make the expression in the WHERE clause into a SARG. We add an extra line with "OR" taking into account the special case when a NULL is supplied as the filter:
DECLARE @Phone NVARCHAR(24) = NULL; SELECT CustomerId, Phone FROM dbo.Customer WHERE Phone = @Phone OR (Phone IS NULL AND @Phone IS NULL);
Now the filter is a SARG, and we are getting the right result. The query plan shows an "Index Seek". With just 11000 rows in my test database the query runs 24 times faster than the previous query. With millions of rows in the table the performance gain would be even more dramatic. So we are happy.
Well, are we totally happy? This is fine with one filter/parameter, but what if you have more filters? Then the query will get more ugly and more messy, and consequently difficult to maintain.
Using set operators for comparison
There is a shorter and more elegant form that many people are not familiar with, but it does the job with less code. This form was shown to me by the great SQL Server expert Itzik Ben-Gan at the SQL Bits 2019 conference.
There is a group of SQL operators which instead of using equality based comparison use distinctness based comparison when making comparisons. This group is called set operators, and it includes the operators
EXCEPT. These SQL operators correspond to the well-known operators from the mathematical field of Set theory. In Set theory the operators go under the names of Union, Intersection, and Set Difference (the last one is also known as Relative Complement).
The solution is to use a distinctness based comparison by using a combination of a subquery and a set operator. Write
WHERE EXISTS and then form two sets each made of one row with a column for each parameter. With just one parameter it looks like this:
DECLARE @Phone NVARCHAR(24) = NULL; SELECT CustomerId, Phone FROM dbo.Customer WHERE EXISTS (SELECT Phone INTERSECT SELECT @Phone);
EXISTS returns True if the intersection contains any rows. The intersection contains rows if the queries to the left and right of the
INTERSECT keyword have any matching rows in common. This takes care of NULL values just like we want, and the query plan shows an "Index Seek".
You can extend this method to any number of parameters you want, provided you have the right indexes created. Here is an example with a search on three parameters:
SELECT CustomerId, Phone FROM dbo.Customer WHERE EXISTS (SELECT Phone, Firstname, Lastname INTERSECT SELECT @Phone, @Firstname, @Lastname);
Joins with set operators
Distinctness based comparison can also be used in other contexts, such as for joins with columns that can be NULL. Let us assume that the customer's first names can also be NULL, and we want to find matching customers from two different customer tables. Then we can do it like this, using the set operator
SELECT c.* FROM dbo.CoffeeShopCustomer AS c INNER JOIN dbo.PizzaPlaceCustomer AS p ON EXISTS ( SELECT c.Phone, c.FirstName, c.LastName, c.City INTERSECT SELECT p.Phone, p.FirstName, p.LastName, p.City );
Merge statements with set operators
Assume we have just loaded some customers into a staging table called
CustomerStage and this is going to be our source for a merge into our target table
Customer. We will match rows in the two tables by their column CustomerId. When there is a match (i.e. in both tables there is a row with the same CustomerId) then we perform an update of the target table so that all the other columns (besides CustomerId) will get a new value coming from the source table. When there is no match (i.e. there is a row in the source table with a CustomerId that could not be found in the target table) then we perform an insertion into the target table so that the row will be copied with all it's values from the source table.
This is normally written like this using a MERGE statement:
MERGE INTO dbo.Customer AS tgt USING dbo.CustomerStage AS src ON src.CustomerId = tgt.CustomerId WHEN MATCHED THEN UPDATE SET tgt.Firstname = src.Firstname , tgt.Lastname = src.Lastname , tgt.Street = src.Street , tgt.Zip = src.Zip , tgt.City = src.City , tgt.Phone = src.Phone WHEN NOT MATCHED THEN INSERT ( CustomerId, Firstname, Lastname, Street, Zip, City ) VALUES ( src.CustomerId, src.Firstname, src.Lastname, src.Street, src.Zip, src.City );
Often though, we want to make sure that if the source row is in fact identical to the target row (i.e. the name is the same, the address is the same, the phone is the same, etc.) then we do not want to apply the update, for these two reasons:
- We do not want to spend time updating the table and all indexes on the table.
- We do not want to execute any triggers or other auditing on the table.
The normal solution is then to add a predicate to the above line
WHEN MATCHED THEN so that this line will be replaced by something like:
WHEN MATCHED AND (src.Firstname <> tgt.Firstname OR src.Lastname <> tgt.Lastname) THEN
(Of course, all the other fields should also be mentioned here).
(In SQL, the symbol
<> means "not equal to", similar to
!= in other languages).
But this gets even more tricky when some of the columns support NULL values. Suppose for example that the Firstname and Lastname columns support NULL. Then the line will look like this:
WHEN MATCHED AND (src.Firstname <> tgt.Firstname OR src.Firstname IS NULL AND tgt.Firstname IS NOT NULL OR src.Firstname IS NOT NULL AND tgt.Firstname IS NULL OR src.Lastname <> tgt.Lastname OR src.Lastname IS NULL AND tgt.Lastname IS NOT NULL OR src.Lastname IS NOT NULL AND tgt.Lastname IS NULL) THEN
Imagine how this will look if the table has 20 more columns in addition to Firstname and Lastname!
Fortunately there is a much cleaner way of doing this using distinctness based comparison, taking advantage of the set operator
EXCEPT. The line can simply be written like this:
WHEN MATCHED AND EXISTS (SELECT src.* EXCEPT SELECT tgt.*) THEN
This says "take all the rows from the source table except those that precisely match any rows in the target table, and if any rows are left then perform an update of those rows", and it also handles NULL values without problems.
If the structures (i.e. column order and data types) of the source and target tables do not match perfectly then this method can still be used by defining a view on top of the source table and then using that view instead of the source table in the MERGE statement.
Besides the occasional usage of the
UNION operator, the set operators are generally not much used in SQL. In this article I have argued that there are other ways of benefitting from the set operators. I have done so by demonstrating how to perform distinctness based comparison, in the contexts of searching, joining and merging. Have a go at trying this out for yourself. And maybe you can come up with even more ways to make good use of the powerful set operators.