Today I wanted to talk about some potential pitfalls that a developer may encounter when using correlated subqueries, in a case expression. As you may recall, I have done a post on the potential performance pitfalls, in using correlated subqueries, before http://jahaines.blogspot.com/2009/06/correlated-sub-queries-for-good-or-evil.html. In this post, I will be focusing on case expressions that use correlated subqueries.
I will start by creating a sample table.
USE [tempdb] GO IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'TestData') BEGIN DROP TABLE dbo.[TestData]; END GO CREATE TABLE dbo.TestData( RowNum INT PRIMARY KEY , SomeChar TINYINT ); GO INSERT INTO dbo.TestData SELECT TOP 1000 ROW_NUMBER() OVER (ORDER BY t1.NAME) AS RowNumber, ABS(CHECKSUM(NEWID())%3+1) FROM Master.dbo.SysColumns t1, Master.dbo.SysColumns t2 GO CREATE NONCLUSTERED INDEX ncl_idx_SomeChar ON dbo.TestData(SomeChar); GO IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'TestData2') BEGIN DROP TABLE dbo.[TestData2]; END GO CREATE TABLE dbo.TestData2( Id INT IDENTITY(1,1) PRIMARY KEY, RowNum INT unique, SomeChar TINYINT ); GO INSERT INTO dbo.TestData2 SELECT TOP 500 ROW_NUMBER() OVER (ORDER BY t1.NAME) AS RowNumber, ABS(CHECKSUM(NEWID())%3+1) FROM Master.dbo.SysColumns t1, Master.dbo.SysColumns t2 GO CREATE NONCLUSTERED INDEX ncl_idx_SomeChar ON dbo.TestData2(SomeChar); GO
A typical correlated subquery in a case expression may look something like this:
SELECT RowNum,SomeChar, CASE (SELECT SomeChar FROM dbo.TestData2 t2 WHERE t2.RowNum = t1.RowNum) WHEN 1 THEN 'Type1' WHEN 2 THEN 'Type2' WHEN 3 THEN 'Type3' END FROM dbo.TestData t1 WHERE [RowNum] <= 500
Let’s have a look at the execution plan to see what is going on underneath the hood
IO:
Table 'Worktable'. Scan count 442, logical reads 2886, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'TestData2'. Scan count 3, logical reads 12, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Whoa!!! This query is extremely inefficient. As you can see the TestData2 table was scanned 3 times and a worker table was created and scanned 442 times. The problem here is the optimizer chooses to spool the data from dbo.TestData2 twice. The even bigger problem with this method is scalability. This code does not scale well at all. In the case of this query, the optimizer creates a relational number of index spools to the number of elements in the case expression. The relationship can be defined as Number Of Spools = Number of Case Elements – 1. What does this mean? It means that if your case expression has 4 elements you get 3 spools… if you case expression has 5 elements you get 4 spools and so on. Simply put…..query performance decreases as the number of elements in the case expression increase. Take a look at the example below.
IO:
Table 'Worktable'. Scan count 539, logical reads 4081, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'TestData2'. Scan count 4, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
So how should we change the query to help the optimizer make a better decision?
The best solution is to allow the optimizer to get the computed value, while it is joining the TestData2 table, as shown below.
SELECT RowNum,SomeChar, (SELECT CASE SomeChar WHEN 1 THEN 'Type1' WHEN 2 THEN 'Type2' WHEN 3 THEN 'Type3' END FROM dbo.TestData2 t2 WHERE t2.RowNum = t1.RowNum) FROM dbo.TestData t1 WHERE [RowNum] <= 500
IO:
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'TestData2'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
As you can see, this is a much better query plan. The key here is the optimizer is able to use a compute scalar operator upstream, while joining the tables. Because the computed value is joined to the TestData table, we do not have to worry about spooling the data.
Conclusion
For me, correlated subqueries can have inconsistent behavior and often bear performance problems, such as this one. Do not get me wrong, correlated subqueries are not all bad, but they should be thoroughly tested. In my opinion, the best way to write this query is to LEFT OUTER JOIN dbo.TestData2. An outer join will provide more consistent performance.
SELECT t1.RowNum,t1.SomeChar, CASE t2.SomeChar WHEN 1 THEN 'Type1' WHEN 2 THEN 'Type2' WHEN 3 THEN 'Type3' END FROM dbo.TestData t1 LEFT JOIN dbo.TestData2 t2 ON t1.RowNum = t2.RowNum WHERE t1.[RowNum] <= 500
Until next time, happy coding.
6 comments:
Great blog post. Did you see Hugo Kornelis' related Connect item: https://connect.microsoft.com/SQLServer/feedback/details/546437/coalesce-subquery-1-may-return-null ...?
Hi Paul,
Thanks for the comment. I hadn't seen Hugo's connect item. I am glad to see that this is being looked into, as this type of behavior, is the exact reason I dislike correlated subqueries.
I decided to blog about this after seeing a few similar threads on the MSDN forums. I figured it would be good to expose this type of behavior.
Thanks for the feedback.
Nice work...
Thanks
Another way to get a good plan:
SELECT t1.RowNum, t1.SomeChar,
CASE CA.SomeChar
WHEN 1 THEN 'Type1'
WHEN 2 THEN 'Type2'
WHEN 3 THEN 'Type3'
END
FROM dbo.TestData t1
OUTER
APPLY (
SELECT t2.SomeChar
FROM dbo.TestData2 t2
WHERE t2.RowNum = t1.RowNum
) CA
WHERE t1.[RowNum] <= 500;
This happens to be transformed to a right outer join by the optimiser. It's quite good at de-correlating queries in general - but I agree CASE + correlated subquery can be a problematic construction.
Paul
What a fabulous post. Would like to thank the admin for sharing this post in our vision.
Spoken English Class in Anna Nagar
Spoken English Class in Porur
Spoken English Class in T Nagar
Spoken English Class in Adyar
Spoken English Classes in Chennai
Best Spoken English Classes in Chennai
IELTS Coaching in Chennai
IELTS Coaching Centre in Chennai
English Speaking Classes in Mumbai
IELTS Classes in Mumbai
Great Information. Thanks for the post. Acronis True Image 2022 Crack
Windows Movie Maker 2022 Crack
Microsoft Office 2022 Crack
Adobe Photoshop 2022 Crack
Post a Comment