Tuesday, June 2, 2009

To Update or not to update

I was browsing the MSDN forums, which is where I spend a hefty chunk of my time, and came across a post that uses a select query and an update statement, with a from clause. Here is the link: http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/71787ae0-0b27-4033-a761-123e1ffa444c/. The update statement uses the exact query as the select statement; however, the significance about this post is the SELECT statement returned x rows, but the UPDATE affected a lesser number of rows.  I thought I would go ahead and demonstrate why this occurs, as it can be surprising, if you do not know to look for it.  One solution said this was due to duplicate rows in the joining table, but this is not entirely true. This behavior occurs when the join relationship is represented as a one-to-many or a many-to-many relationship, where the table being updated has 1 or more rows and the referencing table has 2 or more rows.  The UPDATE affects less rows because the optimizer aggregates/removes (depending on the query plan) duplicate matching keys (columns used to define the relationship) before performing the UPDATE.  The end result is a many-to-one update. As you can see, it does not matter if the rows in the joining table are distinct or not, the only thing that matters is the columns used to create the relationship.

So what does this all mean? It means that when you SELECT the data all rows are returned, but when you UPDATE you only can affect the number of rows affected in the table being updated. The catch is the value of the column being updated will be the first value found in the joining table. 

I know that is a lot to take in, so let’s create some code to visually see what is going on.

Here is a small scale sample.

set nocount on;

IF OBJECT_ID('tempdb..#t1') IS NOT NULL
BEGIN
    DROP TABLE [#t1];
END
GO

CREATE TABLE #t1(
id INT PRIMARY KEY CLUSTERED,
col char(1)
)

insert into #t1 values (1,'a');
insert into #t1 values (2,'b');
insert into #t1 values (3,'c');

IF OBJECT_ID('tempdb..#t2') IS NOT NULL
BEGIN
    DROP TABLE [#t2];
END
GO

CREATE TABLE #t2(
id1 int,
col2 char(1)
)

insert into #t2 values (1,'x');
insert into #t2 values (1,'y');
insert into #t2 values (1,'z');
insert into #t2 values (2,'c');

set nocount off;

select *
from #t1 t1
INNER join #t2 t2
    ON    t1.id = t2.id1
WHERE
    t1.id = 1

update t1
set col = t2.col2
from #t1 t1 
INNER join #t2 t2
    ON    t1.id = t2.id1
WHERE
    t1.id = 1

SET NOCOUNT ON;

select *
from #t1

If you look at the messages tab, you will see the SELECT statement returned 3 rows, but the UPDATE affected 1 row. 

image

The next thing we should look at is the query plans for the above query.  Run the query and mouse over each show plan operator prior to the stream aggregate (Note: you have to read a query plan from top-to-bottom-right to left) to look at the actual number of rows. Now mouse over the stream aggregate show plan operator. This operator aggregates all the data in the referencing table via the join columns.  If you look at the actual number of rows, for this operator you will note it is now 1.  This is where the optimizer groups the data and keeps the first value returned. 

image

Note: You may see different show plan operators, depending on the indices you have in your environment and the way your query is built.  A non-clustered index or clustered index query plan may not show a stream aggregate, but it does use a distinct sort operator that will help limit the number of rows returned based on the joining key column. The end result will be the same regardless.

No comments: