In this post, I will address a common misconception that IO statistics are always reliable. The truth of the matter is IO stats can sometimes yield incorrect information, which in turn may influence bad coding habits. I have had developers tell me that the scalar UDF query is better because it has less IO than the set based TVF query. In situations like these it is important to express the message that you must account for more than IO when implementing optimization techniques, but in some cases IO can be misinterpreted. The question on the table is, “How can IO statistics be wrong?” The short answer is IO stats are mostly correct and only under certain circumstances IO statistics are misrepresented in SSMS. So what are these magical circumstances? The IO statistics become invalid anytime a scalar UDF is used. The optimizer only accounts for the base table and not any of the IO encountered inside the scalar UDF, which misconstrues the query IO. Let’s look at an example.
First I will create the tables, with data.
IF OBJECT_ID('tempdb.dbo.t1') IS NOT NULL
DROP TABLE tempdb.dbo.t1;
CREATE TABLE t1(
INSERT INTO t1 VALUES (1,'a');
INSERT INTO t1 VALUES (2,'b');
IF OBJECT_ID('tempdb.dbo.t2') IS NOT NULL
DROP TABLE tempdb.dbo.t2;
CREATE TABLE t2(
t2_id INT IDENTITY(1,1),
INSERT INTO t2 VALUES (1,'c');
INSERT INTO t2 VALUES (1,'d');
INSERT INTO t2 VALUES (1,'e');
INSERT INTO t2 VALUES (1,'f');
INSERT INTO t2 VALUES (2,'d');
INSERT INTO t2 VALUES (2,'g');
The next step is to create our scalar UDF.
CREATE FUNCTION dbo.fn_ConcatenateCols(@id INT)
DECLARE @rtn varchar(8000)
SELECT @rtn = COALESCE(@rtn + ',','') + t2.col
WHERE t2.t1_id = @id
Now that I have all my sample DDL in place, we can run a simple test to measure our IO.
SET NOCOUNT ON
SET STATISTICS IO ON
SET STATISTICS IO OFF
Table 't1'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
So what is missing from the results above? If you look closely you will see that t2 is nowhere in the IO stats. Let’s start a profiler trace and run the same query again. Open SQL Server profiler and use the standard template. Once the profile is tracing, run the same query again. If you want the most accurate number of reads make sure to turn off query results Tools –> Options –> Query Results –> SQL Server –> Results To Grid –> Discard Results After Execution. This time around you will see the number of reads is 9 , as shown below.
As you can see IO statistics are a lot different than the actual number of logical reads, using IO Statistics. This behavior can be a huge surprise to many unsuspecting victims. Yes I did use the word victim :). I say victim because this usually occurs to an individual that expects IO to be presented correctly and trusts Microsoft enough to not question their information. This “victim” never thinks twice about questioning the information returned, which can be a huge performance problem.
The take away is developers should always be careful when using scalar functions because they can really degrade performance and never trust anyone’s word, not even Microsoft’s or mine. Always test yourself. If you have not done so, I recommend reading my post on correlated subqueries, as it does apply to functions as well, http://jahaines.blogspot.com/2009/06/correlated-sub-queries-for-good-or-evil.html. In my next post, I will show you how to get rid scalar functions and use Inline TVFs to optimize performance, while encapsulating code logic.
Until next time, happy coding.