Monday, June 22, 2009

Correlated sub queries for good or evil?

Today on the MSDN forums I engaged a good discussion, on the innards of a correlated sub query http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/0a4e384c-f8c6-45b7-a4db-c14827500fe3/. I will start off by saying that all correlated sub queries are not evil, just most :-). All kidding aside it is not that correlated sub queries are bad, it is just there are better methods to hanlde these types of queries. Here is the BOL link to correlated sub queries, http://msdn.microsoft.com/en-us/library/ms187638.aspx. Let’s get started.

You are probably asking yourself, “How can a correlated sub query be bad?” The answer is because of a nested loop join. A nested loop join takes all the results from the outer query and searches for the a resulting value in the inner query. This type of join can be very expensive, especially for queries with large outer results. For example say that the outer query returns 1 million rows and the inner query has 500 hundred. In this example, each row returned by the outer table has to queried against the inner table, which means that inner table is queried 1 million times! Now that is a serious performance problem. Let’s have a look at some sample code to see how this really works.

First we will create the sample table: (Sample data compliments of Michael Asher)

use [tempdb]
GO
 
IF OBJECT_ID('dbo.item') IS NOT NULL
BEGIN
    DROP TABLE dbo.item;
END
GO
 
create table dbo.Item (
 [ItemID] int not null,
 [name] varchar(50) not null);
 
IF OBJECT_ID('dbo.Child') IS NOT NULL
BEGIN
    DROP TABLE dbo.Child;
END
GO
create table dbo.Child (
 Child1ID int identity(1,1) not null,
 ItemID int not null,
 txt varchar(100) not null );
 
INSERT INTO Item 
SELECT TOP(1000000) rn, 'Row: ' + CONVERT(VARCHAR,rn)
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS rn
FROM sys.columns a
CROSS JOIN sys.columns b
CROSS JOIN sys.columns c
) AS A
CREATE CLUSTERED INDEX idx_Item ON Item(ItemID)

Correlated sub query example:

SELECT 
    i.[name],
    (SELECT c.[txt] FROM child c WHERE c.ItemID = i.ItemID)
FROM dbo.Item i
WHERE
    i.ItemId < 1000

Here is the execution plan and IO statistics:

image

Table 'Child'. Scan count 999, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Item'. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

As you can see a nested loop join was used to get the data via the correlated sub query and the table “child” was scanned 999 times. Obviously, the increased IO and CPU translates into performance degradation.

Note: You can mouse over the item table show plan operator to view the actual number of rows.

Now let’s change the query to use a join.

SELECT 
    i.[name],
    c.[txt]
FROM dbo.Item i
LEFT JOIN child c
    ON c.ItemID = i.ItemID
WHERE
    i.ItemId < 1000

Execution plan and IO statistics:

image

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Item'. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Child'. Scan count 1, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The difference between the queries is night and day. The number of scans decreased to 1, which is a heck of a lot better than 999. Moving a correlated sub query into the join makes the query much more performant and was an easy to implement solution. This is one of the most common performance problems that I have encountered over the years. The solution is easy to fix, but occurs a lot because developers do not understand the impact this type of query can have. The bottom line is RDMS are optimized for joins and best practice suggests the use of joins, to relate data and not correlated sub queries. This does not mean that there are not special cases where using a correlated sub query may be faster , but these cases are the exception and not the rule.

Extra Content:

Note: There are many instances where the query optimizer can transform the correlated sub query into a join. The catch is this usually occurs when a correlated sub query is used in conjunction with Exists/In, in the where clause. A correlated sub query in the select list will more than likely use a nested loop join. Regardless of how the optimizer transforms the query text, I would rather code the TSQL correctly the first time and not have to worry about the optimizer fixing the query for me.

Here is a sample of how the optimizer changes the correlated sub query to an inner join.

SELECT 
    i.[name]
FROM dbo.Item i
WHERE
    i.ItemId < 1000
    AND EXISTS(
        (SELECT c.[txt] FROM child c WHERE c.ItemID = i.ItemID)
    )

6 comments:

Kent Waldrop said...

Sorry I gave up on the thread; It was interesing but it became a bit high strung. Thanks for blogging on this one.

Kent Waldrop

Adam Haines said...

Yeah, I kinda did too. The post started off well, but ended up so far off tangent that it became frustrating. I posted the blog because I know that these types of queries impact performance. I see it day in and day out, which may not be known to everyone.

Deepak said...

Adam,

Thank you for writing this post. Very interesting.

Anonymous said...

Excellent post! very helpful

Anonymous said...

Thank you for the clear explanation, i was looking for this kind of explanation and it was perfect for my situation

Anonymous said...

How you will handle a situation where you want only one record from another table as joins replicates the row.