Tuesday, October 20, 2009

IO Stats – What Are You Missing?

In this post, I will address a common misconception that IO statistics are always reliable.  The truth of the matter is IO stats can sometimes yield incorrect information, which in turn may influence bad coding habits. I have had developers tell me that the scalar UDF query is better because it has less IO than the set based TVF query.  In situations like these it is important to express the message that you must account for more than IO when implementing optimization techniques, but in some cases IO can be misinterpreted. The question on the table is, “How can IO statistics be wrong?”  The short answer is IO stats are mostly correct and only under certain circumstances IO statistics are misrepresented in SSMS.  So what are these magical circumstances? The IO statistics become invalid anytime a scalar UDF is used.  The optimizer only accounts for the base table and not any of the IO encountered inside the scalar UDF, which misconstrues the query IO.  Let’s look at an example.

First I will create the tables, with data.

USE [tempdb]
GO
 
IF OBJECT_ID('tempdb.dbo.t1') IS NOT NULL
BEGIN
    DROP TABLE tempdb.dbo.t1;
END
GO
 
CREATE TABLE t1(
id INT,
col CHAR(1)
);
GO
 
INSERT INTO t1 VALUES (1,'a');
INSERT INTO t1 VALUES (2,'b');
GO
 
IF OBJECT_ID('tempdb.dbo.t2') IS NOT NULL
BEGIN
    DROP TABLE tempdb.dbo.t2;
END
GO
 
CREATE TABLE t2(
t2_id INT IDENTITY(1,1),
t1_id INT,
col CHAR(1)
);
GO
 
INSERT INTO t2 VALUES (1,'c');
INSERT INTO t2 VALUES (1,'d');
INSERT INTO t2 VALUES (1,'e');
INSERT INTO t2 VALUES (1,'f');
INSERT INTO t2 VALUES (2,'d');
INSERT INTO t2 VALUES (2,'g');
GO

The next step is to create our scalar UDF.

CREATE FUNCTION dbo.fn_ConcatenateCols(@id INT)
RETURNS VARCHAR(8000)
AS
BEGIN
    DECLARE @rtn varchar(8000)
    
    SELECT @rtn = COALESCE(@rtn + ',','') + t2.col
    FROM dbo.t2
    WHERE t2.t1_id = @id
    
    RETURN @rtn
END
GO

Now that I have all my sample DDL in place, we can run a simple test to measure our IO.

SET NOCOUNT ON 
GO
SET STATISTICS IO ON
GO
 
--Missing I/O
SELECT id,dbo.fn_ConcatenateCols(id)
FROM [dbo].[t1];
 
SET STATISTICS IO OFF
GO
/*
id          
----------- --------
1           c,d,e,f
2           d,g
 
Table 't1'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
*/

So what is missing from the results above?  If you look closely you will see that t2 is nowhere in the IO stats.  Let’s start a profiler trace and run the same query again.  Open SQL Server profiler and use the standard template. Once the profile is tracing, run the same query again. If you want the most accurate number of reads make sure to turn off query results Tools –> Options –> Query Results –> SQL Server –> Results To Grid –> Discard Results After Execution. This time around you will see the number of reads is 9 , as shown below.

image

As you can see IO statistics are a lot different than the actual number of logical reads, using IO Statistics.  This behavior can be a huge surprise to many unsuspecting victims.  Yes I did use the word victim :).  I say victim because this usually occurs to an individual that expects IO to be presented correctly and trusts Microsoft enough to not question their information.  This “victim” never thinks twice about questioning the information returned, which can be a huge performance problem. 

The take away is developers should always be careful when using scalar functions because they can really degrade performance and never trust anyone’s word,  not even Microsoft’s or mine. Always test yourself.  If you have not done so, I recommend reading my post on correlated subqueries, as it does apply to functions as well, http://jahaines.blogspot.com/2009/06/correlated-sub-queries-for-good-or-evil.html.  In my next post, I will show you how to get rid scalar functions and use Inline TVFs to optimize performance, while encapsulating code logic.

Until next time, happy coding.

Friday, October 16, 2009

Exporting Binary Files To The File System

In my last post I demonstrated how to use SSIS to load binary files into a SQL Server 2005 VARBINARY(MAX) column, http://jahaines.blogspot.com/2009/10/ssis-importing-binary-files-into.html.  This post will focus on recreating the binary documents on the file system.  I will be using a combination of TSQL and the BCP utility to perform the export, http://msdn.microsoft.com/en-us/library/ms162802.aspx.  I will be using the same table and data from the last post.  I will start by creating the TSQL to dynamically create a BCP command.

Below is the stored procedure I will use to export the data.  You will see that I have decided to use a cursor to process all documents.  The procedure also accepts a DocID, which limits the export to a single document.  A cursor is fine here because we are limited to executing a single BCP command; however, you could create an SSIS package that executes the stored procedure across multiple streams, if you need parallel processing.

CREATE PROCEDURE usp_ExportBinaryFiles(
    @DocID INT = NULL,
    @OutputFilePath VARCHAR(500) = 'C:\'
)
AS 
BEGIN
 
DECLARE @sql VARCHAR(8000)
 
IF @DocID IS NULL --Open Cursor to export all images
BEGIN
 
    DECLARE curExportBinaryDocs CURSOR FAST_FORWARD FOR
    SELECT 'BCP "SELECT Doc FROM [tempdb].[dbo].[Documents] WHERE DocId =' 
        + CAST(DocId AS VARCHAR(5)) + '" queryout ' + @OutputFilePath 
        + DocName + '.' + DocType + ' -S A70195\Dev -T -fC:\Documents\Documents.fmt'
    FROM dbo.Documents
 
    OPEN curExportBinaryDocs
    FETCH NEXT FROM curExportBinaryDocs INTO @sql
 
    WHILE @@FETCH_STATUS = 0
        BEGIN
            --PRINT @sql
            EXEC xp_cmdshell @sql,NO_OUTPUT
            
            FETCH NEXT FROM curExportBinaryDocs INTO @sql
        END
 
    CLOSE curExportBinaryDocs
    DEALLOCATE curExportBinaryDocs
END
ELSE --Export a single image
BEGIN
    SELECT @sql = 'BCP "SELECT Doc FROM [tempdb].[dbo].[Documents] WHERE DocId =' 
        + CAST(DocId AS VARCHAR(5)) + '" queryout ' + @OutputFilePath 
        + DocName + '.' + DocType + ' -S A70195\Dev -T -fC:\Documents\Documents.fmt'
    FROM dbo.Documents
    WHERE DocID = @DocID
    
    --PRINT @sql
    EXEC xp_cmdshell @sql,NO_OUTPUT
END
 
END
GO

The above stored procedure dynamically builds a BCP command.  Currently the stored procedure has two main parameters, DocID and OutputFilePath.  DocID is used when you want to export a single row, if no value is supplied the stored procedure will export all documents.  The outputFilePath is the directory where the files will be exported.  You will note that I have hard coded my server name and format file path.  You can add additional parameters if you need these attributes to be dynamic.  The next item on my list is to actually create the format file.  Create a .fmt file somewhere on your file system.   I put mine in the same folder as my other documents.  Copy the code below into the format file.

9.0   
1   
1       SQLBINARY     0       0       ""   1     Doc                                       ""  

You may be asking yourself, why the format file looks skimpy, or is lacking content.  The format file is lacking content because it only has what we need.  All we need in the format file is the VARBINARY(MAX) column and its data type. You can find more info on format files here, http://msdn.microsoft.com/en-us/library/ms191516.aspx.

Once the format file is in place, we can execute our stored procedure to process all documents.

EXEC dbo.usp_ExportBinaryFiles @OutputFilePath = 'C:\Documents\BCP_Out\'

That’s it!  It is that easy to output VARBINARY data onto the file system.  SQL Server 2005 has a slue of tools that make working with binary data very simplistic.  I hope that you have learned something new. Please stay tuned, as I plan to focus more on TSQL concepts and performance considerations.

Until next time happy coding.

Wednesday, October 14, 2009

SSIS – Importing Binary Files Into A VARBINARY(MAX) Column

Have you every had the need to import images, Word documents, Excel documents, or any other type of file into a SQL Server table?  If yes, you are in luck.  SQL Server 2005 gives us a powerful and scalable method to import binary files or documents, into a relational table.  I will be using the term binary file and document interchangeably throughout this post to describe a file on the file system.  I will be utilizing SSIS and the VARBINAARY(MAX) data type to import the document. Let us start by creating the sample DDL.  In this example, I will need a staging table and a table to house our binary data. 

USE [tempdb]
GO
 
IF OBJECT_ID('tempdb.dbo.[Doc_Stage]') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[Doc_Stage];
END
GO
 
CREATE TABLE [dbo].[Doc_Stage](
DocId INT IDENTITY(1,1) PRIMARY KEY,
DocName VARCHAR(50) NOT NULL,
DocPath VARCHAR(1000) NOT NULL,
DocType VARCHAR(4) NOT NULL
);
GO
 
IF OBJECT_ID('tempdb.dbo.[Documents]') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[Documents];
END
GO
 
CREATE TABLE [dbo].[Documents](
DocId INT IDENTITY(1,1) PRIMARY KEY,
DocName VARCHAR(50) NOT NULL,
Doc VARBINARY(MAX) NULL,
DocType VARCHAR(4)
);
GO

Next, I need to create the SSIS package.  The first step is to add an Execute SQL Task to the designer canvas.  Configure the Execute SQL Task, as shown below.

Note: You will need to create the source connection to the database where you created the DDL

image 

The SQL Statement that I used is below.

TRUNCATE TABLE [dbo].[Doc_Stage] ;

Next I will create a variable called FilePath that is of the string data type.  Now I am ready to add a ForEach Loop Container to the canvas.  Add the container to the canvas and configure it as shown below.

Note: I am grabbing all file types. If you only want a specific type, change the File to include the extension that you want.  E.g. *.jpg

image image

Next, I will have to create another Execute SQL Task, but this time, I have to drag the task into the ForEach Loop Container.  Configure the task as shown below.

image

The SQL Statement is presented below:

INSERT INTO [dbo].[Doc_Stage](DocName,DocPath,DocType) VALUES ('DocName','DocPath','jpg');

Now for the tricky part.  In this step, I have to build an expression to dynamically build an insert statement.  The insert will capture the document name, path, and type.  Click the expressions tab and create an expression, on the property SQLStatementSource (near the bottom).  Below is the code for the expression.

"INSERT INTO [dbo].[Doc_Stage](DocName,DocPath,DocType) VALUES ('" +
REPLACE(RIGHT(@[User::FilePath],FINDSTRING(Reverse(@[User::FilePath] ) ,"\\", 1)-1),RIGHT(@[User::FilePath],4),"")
+ "','" + @[User::FilePath] 
+ "','" + RIGHT( @[User::FilePath] ,FINDSTRING( REVERSE(@[User::FilePath] ),".",1)-1)
+ "');"
 

With that out of the way, we can press on.  The last task I will need is a Data Flow Task.  Drag the data flow task to the canvas.  Here is what my canvas looks like at present:

image

Open the Data Flow task and drag an OLE DB Source to the canvas and configure it as shown below.

image 

Make sure to click the Columns tab to set the column mappings.  Next, I will drag a Import Column Transformation to the canvas and configure it as shown below.

image image

Note: Make sure to take note of the output column’s LineageID.  You will need to take this ID and add it to the Input Column’s FileDataColumnID.

image image

We are almost there!!! The last step is to add the OLE DB Destination.  I will add the OLE DB Destination and configure it as shown below. Do not forget to click the Mappings tab to map the columns.

image

That’s it!!!! Click the debug button and all of the components should light up green. I have successfully implemented a document library solution that allows the insertion of any document type into SQL Server 2005.  Stay tuned because I will show you how to export the images to the file system using BCP and TSQL.

Until next time happy coding.

Friday, October 9, 2009

Missing Committed Rows, In The Read Committed Isolation Level

Today, I am going to talk about the possibility of a SELECT statement that misses committed rows, in the read committed isolation level.  You may be asking your self, what!!!! How can committed rows be missed in the READ COMMITTTED isolation level?  This is the perfect example of an oxy-moron in my opinion, but it can and does happen.  Perhaps you have noticed this behavior or perhaps has not been that apparent to you or your users.  Granted this only occurs under certain circumstances, but it can really confuse a lot of people.  The problem exists because of the method SQL Server uses to scan a table.  When SQL Server scans a given table it takes a shared lock one row at a time.  The problem occurs when a transaction obtains an  exclusive lock, which prohibits the table from proceeding further with the scan.  What do you think will happen if the inserted value occurs before the currently scanned row?  You got it… the row does not appear in the result set.  Take a look at the graphic below to see what is actually happening.

Drawing1

Now it is time to see this behavior in action.

In a new query window, run the below code. I will be referring to this window as Query1.

create table t (a int primary key, b int)
insert t values (1, 1)
insert t values (2, 2)
insert t values (3, 3)
 
begin tran
update t set b = 2 where a = 3

Next, create a new query window, which I will refer to as Query2, and paste and execute the following code.

select * from t ORDER BY a ASC

Note: I use an order by clause to guarantee the sort.  Otherwise, the data will be returned as it is found on disk

Now open a third window.  Paste and execute the code below.

INSERT INTO t VALUES (0,10)

Return to Query1 (The query with the UPDATE statement), and execute the code below.

INSERT INTO t VALUES (4,10)
select * from t
commit tran

After jumping through all these hoops, here are the results:

Query1:

image

Query 2:

image

Query 3:

image

Query 3 does not really tell us much other than our insert was committed to the database.  The query that tells us the story is Query 2.  As you can see, Query 2 is missing the row we inserted and committed in Query3.  If you repeat the steps above but change the sort order to DESC you will see that all rows are returned, unless of course you decide to insert a row after the current MAX a value.  Another thing of note is the clustered index.  If you change the clustered index to NONCLUSTERED, you will get all rows.  Why does this happen?  The answer is when a CLUSTERED index is scanned thus it returns the data in the order of the index sort; however, when a heap is scanned it returns data as it finds it on disk, unless an order by is specified.  Please realize that no sort is ever guaranteed, without an ORDER BY clause.  As you can see the direction of the scan impacts which data will be missing and which data is displayed.  How do you resolve this issue?  The only answer is to choose an isolation level with more consistency and less concurrency, like serializable or snapshot.  Here is the BOL entry for transactional isolation level, http://msdn.microsoft.com/en-us/library/ms173763.aspx.

There you have it.  I have demonstrated how committed rows can be skipped, in the read committed isolation level.  I wonder what else can happen in the read committed isolation level?  If you want a hint, try updating a row that has not been scanned yet and an row that has already been scanned, in the final commit part.  I think you will be surprised that your query will return invalid data and will be missing data. 

Happy coding.

Monday, October 5, 2009

Determine Which Indexes Are Not Being Used

Have you ever had the need to find out which indexes are actually being used?  If you are using SQL 2005 and greater, you are in luck.  SQL Server 2005 introduced a new dynamic management view that gives all the usage statistics, sys.dm_db_index_usage_stats (http://msdn.microsoft.com/en-us/library/ms188755.aspx).  What does this mean?  Well it means we can capture which indexes are being used and which ones are an administrative burden.  What’s the catch?  The catch is the usage statistics do not persist a SQL Server restart. The statistics will not be of much value after a server restart, so an option may be to insert the results of the view into a permanent table, for later analysis.  Let’s look at an example.

SELECT 
    QUOTENAME(s.name) + '.' + QUOTENAME(t.name) AS TableName,
    i.name AS IdxName,
    i.type_desc AS IdxType,
    ius.user_seeks,
    (ius.user_seeks*1.) / NULLIF((ius.user_seeks+ius.user_scans+ius.user_lookups),0) AS [%Seek],
    ius.user_scans,
    (ius.user_scans*1.) / NULLIF((ius.user_seeks+ius.user_scans+ius.user_lookups),0) AS [%Scan],
    ius.user_lookups,
    (ius.user_lookups*1.) / NULLIF((ius.user_seeks+ius.user_scans+ius.user_lookups),0) AS [%Lookup],
    ius.user_updates
FROM sys.indexes i
LEFT JOIN sys.dm_db_index_usage_stats ius
    ON  ius.object_id = i.object_id
        AND ius.index_id = i.index_id
        AND database_id = DB_ID()--Current DB
INNER JOIN sys.tables t
    ON t.object_id = i.object_id
INNER JOIN sys.schemas s
    ON t.schema_id = s.schema_id
WHERE
    t.type = 'U'
    AND t.is_ms_shipped = 0
ORDER BY ius.user_seeks + ius.user_scans + ius.user_lookups DESC
 

Note: I have used a left join.  If an index has never been used, it will not have an entry in sys.dm_db_index_usage_stats.

The columns that really drive this view are user_seeks, user_scans, user_lookups, and user_updates.  User_seeks represents the number of seeks on a given index, since SQL Server last started.  Likewise, user_scans and user_lookups represent the number of scans and the number of lookups respectively.  One of the most important columns is user_updates.  User_updates represents the number of inserts, updates, and deletes to the index. If you find your index in a scenario where the number of user_updates is greater than the number of seeks, lookups, or scans, you should consider dropping the index.  There is no magic number to dictate when an index should be dropped because of the maintenance overhead.  You should use your best judgment when dropping an index. 

There you have it a simplistic method to get the index usage details.  How else can this dynamic management view be used?  I leave this to you to find out, but I will leave you with one other use.

The other use for this view is to identify the most queried tables.  By looking at the index usage stats, we can deduce how often the table is queried. 

--Most Accessed Tables
SELECT 
    DB_NAME(ius.database_id) AS DBName,
    OBJECT_NAME(ius.object_id) AS TableName,
    SUM(ius.user_seeks + ius.user_scans + ius.user_lookups) AS TimesAccessed    
FROM sys.indexes i
INNER JOIN sys.dm_db_index_usage_stats ius
    ON  ius.object_id = i.object_id
        AND ius.index_id = i.index_id
WHERE
    ius.database_id = DB_ID()
GROUP BY 
    DB_NAME(ius.database_id),
    OBJECT_NAME(ius.object_id)
ORDER BY SUM(ius.user_seeks + ius.user_scans + ius.user_lookups) DESC

That is it.  I have provided two uses of sys.dm_db_index_usage_stats.  I hope this post will help you identify poorly performing indexes or you most active tables. 

Happy Coding.

Friday, October 2, 2009

SQL Server MVP – October 2009

I am very honored and pleased to announce that I have been awarded and recognized, as a SQL Server MVP.  An MVP is not defined as the most technically savvy professional in the world, it is defined as a community member dedicated to sharing knowledge, professional growth,  mentorship and community.  This award is so prestigious that I feel blessed for having been nominated and even more blessed to receive it. I feel a little insignificant in receiving the award, as I know the caliber of past awardees.  To me, I feel like I am comparing myself to giants, but I guess we all have to start somewhere :).  I believe the recipients of this award are what give it merit.  All recipients of this award go above and beyond to help the SQL Server community. This is exactly what I plan to do.  Now that I have been awarded, I feel even more enthused and privileged to help such a great community.  No other online  community has ever made me feel more at home than the SQL Server community.  I can truly say that I have found my home and calling.

Wednesday, September 30, 2009

SSIS – Performing An UPSERT

For those of you who may be wondering what the heck an UPSERT is, an UPSERT is an UPDATE and INSERT into a given table. Typically, an UPSERT operation requires two separate transactions. Usually, the first transaction issued is the UPDATE and then the INSERT is issued shortly after. In this post I will be extending the package I created, in my previous post, http://jahaines.blogspot.com/2009/09/ssis-only-inserting-rows-that-do-not.html. Once your SSIS package resembles the package in the link, you can proceed with the next step. If you want to skip the prior post, you can download the package at the bottom on this post.

The first item of interest is the Lookup transformation. First I will drag an OLE DB Command onto the canvas and connect it to the Lookup transformation. A dialog box should appear. When the dialog box appears choose the Lookup Match Output, in the Output dropdown list. An OLE DB Command, is a parameterized TSQL statement that can issued against a given data set. Before I start configuring the transformation, I would like to point out that this method is extremely easy to setup, but has a major drawback. The OLE DB Command has to be issued for each row returned by the Input, so this type of process works recursively or iteratively. I will demonstrate another method later in this post. With that out of the way, I will start configuring the transformation. First, I will build an UPDATE command using parameters for each column and the predicate.

My canvas currently looks like the screenshot below.

image

The next step is to configure the OLE DB Command transformation. Double-click the OLE DB Command transformation. In the Connection Manager dropdown list, choose the database where you created dbo.SSIS_Import. Click the “Component Properties” tab. Within the “Component Properties” tab, you will need to click the ellipses next to the “SQL Command” property. Paste the code presented below into the box and click ok.

UPDATE dbo.SSIS_Import
SET 
     SomeInt = ?,
     SomeChar = ?
FROM dbo.SSIS_Import import
WHERE SomeId = ?

The code above is a pretty simplistic UPDATE statement. The key thing to note is the “?”. Each of the ‘?” will be given a parameter value in SSIS. The parameters are dynamically named in the order they appear in the command. Next, I will be mapping these parameters to my input columns. Click the “Column Mappings” tab. Align the input columns to each parameter. My column mapping is shown below. Once complete, click “Ok.”

image

That it!!! We have successfully created an UPSERT operation using SSIS… but wait what is the catch? The catch is the UPDATE will operate like a cursor processing one UPDATE command at a time, which can be a nightmare from a performance standpoint. You may be wondering how we can make this process set based. There is not much you can do in SSIS alone, so I will need to find other means. The best way to make this process more scalable is to leverage SQL Server and SSIS.

The first step is to drop the OLE DB command I just created. I will then drag a OLE DB Destination to the canvas and connect it to the Lookup Match Output. My canvas looks like below.

image

Next, I will switch gears and write some TSQL code. I will need to open SSMS and connect to the database where dbo.SSIS_Import exists. Firstly, I have to create a view. I will not use the view to select data, but will use the view as a intermediate object to insert data.

CREATE VIEW vw_SSIS_Import
AS
SELECT [SomeId],[SomeInt],[SomeChar]
FROM dbo.[SSIS_Import]
GO

Now that my view is created, I am going to create an INSTEAD of trigger, on my view. The instead of trigger will allow me to use my OLE DB Destination to bulk insert from SSIS. The bulk insert from SSIS gives me a mechanism to pass the rows for UPDATE, as a set of data. Once I have the rows in a set, the trigger can efficiently UPDATE the data.

CREATE TRIGGER trg_UPdate_SSIS_Import
ON dbo.vw_SSIS_Import
INSTEAD OF INSERT
AS 
BEGIN
 
    UPDATE import
    SET 
        SomeInt = i.SomeInt,
        SomeChar = i.SomeChar
    FROM dbo.SSIS_Import import
    INNER JOIN inserted i
        ON i.SomeId = import.SomeId
 
END
GO

In the code above, I am using efficient TSQL to update all rows that are inserted into the view, from the SSIS package. I am now ready to configure the OLE DB Destination. Double-click the OLE DB Destination to launch the configure dialog box. Make sure your package is using the right connection manager and choose the view. Click the “Column Mappings” tab and then click '”Ok.” Right-click the OLE DB Destination and choose properties. In the “Fast Load Options”, type “FIRE_TRIGGERS.” This allows the SSIS insert to fire the INSTEAD OF trigger, on the view. That’s it we are done!!! The view/INSTEAD OF TRIGGER method is extremely easy to configure and implement; however, it does require that additional database objects to be created.

There you have it. I have demonstrated two methods to UPSERT data, using SSIS. The first method relies on the OLE DB Command, which has the limitation of having to run for each and every row. The latter option is a more scalable solution that requires the creation of a view and trigger. Both methods have pros and cons and I leave it to you to determine which is best for your environment.

Until next time, Happy Coding.

PS: If you want to do all of the work in SQL, you can allow the trigger to perform the UPDATE and INSERT. This method reduces the complexity of the SSIS package because you only need a single source and destination, with no Lookup.

Download SSIS Package:

Note: The package was created in BIDS 2008, which is not compatible with BIDS 2005.