Tuesday, May 4, 2010

Performance tuning Case Expressions With Correlated Subqueries

Today I wanted to talk about some potential pitfalls that a developer may encounter when using correlated subqueries, in a case expression.  As you may recall, I have done a post on the potential performance pitfalls, in using correlated subqueries, before  http://jahaines.blogspot.com/2009/06/correlated-sub-queries-for-good-or-evil.html.  In this post, I will be focusing on case expressions that use  correlated subqueries.

I will start by creating a sample table.

USE [tempdb]
GO

IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'TestData')
BEGIN
    DROP TABLE dbo.[TestData];
END
GO

CREATE TABLE dbo.TestData(
RowNum INT PRIMARY KEY ,
SomeChar TINYINT
);
GO

INSERT INTO dbo.TestData
SELECT TOP 1000 
    ROW_NUMBER() OVER (ORDER BY t1.NAME) AS RowNumber,
    ABS(CHECKSUM(NEWID())%3+1)
FROM 
    Master.dbo.SysColumns t1,
    Master.dbo.SysColumns t2
GO

CREATE NONCLUSTERED INDEX ncl_idx_SomeChar ON dbo.TestData(SomeChar);
GO

IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'TestData2')
BEGIN
    DROP TABLE dbo.[TestData2];
END
GO

CREATE TABLE dbo.TestData2(
Id INT IDENTITY(1,1) PRIMARY KEY,
RowNum INT unique,
SomeChar TINYINT
);
GO

INSERT INTO dbo.TestData2
SELECT TOP 500
    ROW_NUMBER() OVER (ORDER BY t1.NAME) AS RowNumber,
    ABS(CHECKSUM(NEWID())%3+1)
FROM 
    Master.dbo.SysColumns t1,
    Master.dbo.SysColumns t2
GO

CREATE NONCLUSTERED INDEX ncl_idx_SomeChar ON dbo.TestData2(SomeChar);
GO

A typical correlated subquery in a case expression may look something like this:

SELECT 
    RowNum,SomeChar,
    CASE (SELECT SomeChar FROM dbo.TestData2 t2 WHERE t2.RowNum = t1.RowNum) 
        WHEN 1 THEN 'Type1' 
        WHEN 2 THEN 'Type2'
        WHEN 3 THEN 'Type3'
    END
FROM dbo.TestData t1
WHERE [RowNum] <= 500

Let’s have a look at the execution plan to see what is going on underneath the hood

image

IO:

Table 'Worktable'. Scan count 442, logical reads 2886, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData2'. Scan count 3, logical reads 12, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Whoa!!! This query is extremely inefficient.  As you can see the TestData2 table was scanned 3 times and a worker table was created and scanned 442 times.  The problem here is the optimizer chooses to spool the data from dbo.TestData2 twice.  The even bigger problem with this method is scalability.  This code does not scale well at all.  In the case of this query, the optimizer creates a relational number of index spools to the number of elements in the case expression.  The relationship can be defined as Number Of Spools = Number of Case Elements – 1.  What does this mean? It means that if your case expression has 4 elements you get 3 spools… if you case expression has 5 elements you get 4 spools and so on.  Simply put…..query performance decreases as the number of elements in the case expression increase.  Take a look at the example below.

image

IO:

Table 'Worktable'. Scan count 539, logical reads 4081, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData2'. Scan count 4, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

So how should we change the query to help the optimizer make a better decision? 

The best solution is to allow the optimizer to get the computed value, while it is joining the TestData2 table, as shown below.

SELECT 
    RowNum,SomeChar,
    (SELECT CASE SomeChar WHEN 1 THEN 'Type1' WHEN 2 THEN 'Type2' WHEN 3 THEN 'Type3' END FROM dbo.TestData2 t2 WHERE t2.RowNum = t1.RowNum) 
FROM dbo.TestData t1
WHERE [RowNum] <= 500

image

IO:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData2'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

As you can see, this is a much better query plan.  The key here is the optimizer is able to use a compute scalar operator upstream, while joining the tables. Because the computed value is joined to the TestData table, we do not have to worry about spooling the data.

Conclusion

For me, correlated subqueries can have inconsistent behavior and often bear performance problems, such as this one.  Do not get me wrong, correlated subqueries are not all bad, but they should be thoroughly tested.  In my opinion, the best way to write this query is to LEFT OUTER JOIN dbo.TestData2.  An outer join will provide more consistent performance.

SELECT 
    t1.RowNum,t1.SomeChar,
    CASE t2.SomeChar 
        WHEN 1 THEN 'Type1' 
        WHEN 2 THEN 'Type2'
        WHEN 3 THEN 'Type3'
    END
FROM dbo.TestData t1
LEFT JOIN dbo.TestData2 t2  ON t1.RowNum = t2.RowNum
WHERE t1.[RowNum] <= 500

Until next time, happy coding.

4 comments:

Paul said...

Great blog post. Did you see Hugo Kornelis' related Connect item: https://connect.microsoft.com/SQLServer/feedback/details/546437/coalesce-subquery-1-may-return-null ...?

Adam Haines said...

Hi Paul,

Thanks for the comment. I hadn't seen Hugo's connect item. I am glad to see that this is being looked into, as this type of behavior, is the exact reason I dislike correlated subqueries.

I decided to blog about this after seeing a few similar threads on the MSDN forums. I figured it would be good to expose this type of behavior.

Thanks for the feedback.

Fabiano Amorim said...

Nice work...

Thanks

Paul White NZ said...

Another way to get a good plan:

SELECT t1.RowNum, t1.SomeChar,
CASE CA.SomeChar
WHEN 1 THEN 'Type1'
WHEN 2 THEN 'Type2'
WHEN 3 THEN 'Type3'
END
FROM dbo.TestData t1
OUTER
APPLY (
SELECT t2.SomeChar
FROM dbo.TestData2 t2
WHERE t2.RowNum = t1.RowNum
) CA
WHERE t1.[RowNum] <= 500;

This happens to be transformed to a right outer join by the optimiser. It's quite good at de-correlating queries in general - but I agree CASE + correlated subquery can be a problematic construction.

Paul