Sunday, September 20, 2009

Sequencing Data Using SQL Server 2000+ (Part 1)

Do to the  overwhelming amount of data sequencing questions asked on the MSDN forums, I will demonstrate how to implement the most common solutions for sequencing data, using SQL Server 2000 and greater.  Before starting I would like to give a disclaimer.  I believe that data sequencing should be handled in the application/presentation layer.  SQL Server is not really designed to present data in a “pretty” format.  I am not saying data sequencing cannot be done in SQL Server or SQL Server cannot provide a highly performant data sequencing solution. I believe it is best to handle sequencing, in the application/presentation layer.  With the disclaimer out of the way, let’s push forward.  This post is part one of a two part series.  This post will primarily focus on basic data sequencing.  Let us start by creating our sample table.

SET NOCOUNT ON
GO
 
USE [tempdb]
GO
 
IF OBJECT_ID('tempdb.dbo.Orders') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[Orders]
END;
GO
 
CREATE TABLE dbo.[Orders](
OrderId INT IDENTITY(1,1),
CustId INT,
OrderDate DateTIME
);
GO
 
INSERT INTO [dbo].[Orders] (CustId,OrderDate)
SELECT 1,'1/1/09 08:00 AM' UNION ALL 
SELECT 1,'1/1/09 10:00 AM' UNION ALL 
SELECT 1,'2/1/09 02:00 PM' UNION ALL
SELECT 1,'3/1/09 03:15 PM' UNION ALL
SELECT 1,'3/8/09 10:00 PM' UNION ALL
SELECT 1,'3/8/09 10:00 PM' UNION ALL --Show that ties do not affect the output
SELECT 2,'1/1/09 07:30 AM' UNION ALL
SELECT 2,'1/1/09 05:45 PM' UNION ALL
SELECT 2,'2/2/09 11:38 PM';
GO

The first method I will demonstrate is the Row_Number, http://msdn.microsoft.com/en-us/library/ms186734.aspx ,  method.  Row_Number is a function that was introduced in SQL Server 2005, which means it is only available in SQL Server 2005 and 2008.  The Row_Number function allows developers to sequence/rank a resultset.  For this sample, I will not be using the partition clause, within the Row_Number function.  The partition clause creates a sequence based on a partition or grouping of data.  I will discus partitioning sequences further in part two of this series.  Now to the first sample.

--Row Sequence Number- Row_Number() Function (2005+)
SELECT o1.OrderId,o1.CustId,o1.OrderDate,ROW_NUMBER() OVER(ORDER BY o1.OrderId ASC) AS seq
FROM [dbo].[Orders] o1
ORDER BY [seq]
 
/*
OrderId     CustId      OrderDate               seq
----------- ----------- ----------------------- --------------------
1           1           2009-01-01 08:00:00.000 1
2           1           2009-01-01 10:00:00.000 2
3           1           2009-02-01 14:00:00.000 3
4           1           2009-03-01 15:15:00.000 4
5           1           2009-03-08 22:00:00.000 5
6           1           2009-03-08 22:00:00.000 6
7           2           2009-01-01 07:30:00.000 7
8           2           2009-01-01 17:45:00.000 8
9           2           2009-02-02 23:38:00.000 9
*/

In the above sample, I used the Row_Number function and specified the order of the sequencing to be based on the OrderId ascending.  The query above is really straight forward and easy to implement.  The Row_Number function makes sequencing data really easy and is quite useful for advanced queries.  The next sample I will demonstrate uses a correlated subquery to count the number of rows less than or equal to the current value.  The problem with this method is the subquery.  A subquery can really degrade performance and makes a solution less scalable. You can get more information on the problems with correlated subqueries here,  http://jahaines.blogspot.com/2009/06/correlated-sub-queries-for-good-or-evil.html.

--Row Sequence Number- Correlated Subquery (2000+)
SELECT o1.OrderId,o1.CustId,o1.OrderDate,(SELECT COUNT(*) FROM [dbo].[Orders]o2 WHERE o2.OrderId <= o1.OrderId) AS seq
FROM [dbo].[Orders]o1
ORDER BY [seq]
 
/*
OrderId     CustId      OrderDate               seq
----------- ----------- ----------------------- -----------
1           1           2009-01-01 08:00:00.000 1
2           1           2009-01-01 10:00:00.000 2
3           1           2009-02-01 14:00:00.000 3
4           1           2009-03-01 15:15:00.000 4
5           1           2009-03-08 22:00:00.000 5
6           1           2009-03-08 22:00:00.000 6
7           2           2009-01-01 07:30:00.000 7
8           2           2009-01-01 17:45:00.000 8
9           2           2009-02-02 23:38:00.000 9
*/

The next method I will be demonstrating uses the same logic as the subquery, but uses a self join with an aggregate function. I like this query a lot more than the subquery. This solution can significantly reduce IO, therefore, allowing it perform a lot better than the subquery method. The problem with this query is the triangular join. A triangular join occurs when you join a table with another table (this can be the same table), but uses a greater than or less than operator, instead of an equality join. It is called a triangular join because the shape the data makes when this type of join is used. If you want more info on triangular joins, Jeff Moden has written a great article, http://www.sqlservercentral.com/articles/T-SQL/61539/.

--Row Sequence- Aggregate (2000+)
SELECT o1.OrderId,o1.CustId,o1.OrderDate,COUNT(*) AS seq
FROM [dbo].[Orders]o1
INNER JOIN [dbo].[Orders]o2 ON o2.OrderId <=o1.OrderId
GROUP BY o1.OrderId,o1.CustId, o1.OrderDate
ORDER BY [seq]
 
/*
OrderId     CustId      OrderDate               seq
----------- ----------- ----------------------- -----------
1           1           2009-01-01 08:00:00.000 1
2           1           2009-01-01 10:00:00.000 2
3           1           2009-02-01 14:00:00.000 3
4           1           2009-03-01 15:15:00.000 4
5           1           2009-03-08 22:00:00.000 5
6           1           2009-03-08 22:00:00.000 6
7           2           2009-01-01 07:30:00.000 7
8           2           2009-01-01 17:45:00.000 8
9           2           2009-02-02 23:38:00.000 9
*/

The final method I will demonstrate is the temp table method.  This method can be highly performant and is can sometimes be the best solution, especially in versions of SQL Sever prior to 2005.  The key to the temp table method is to use INSERT INTO/SELECT with an ORDER BY clause.  I also like to create a clustered index, on the ORDER BY column. This solution scales really well, especially when working with lots of data.  When dealing with large quantities of data, this is often the best performing solution because it does not use triangular join, does not have extra IO from subqueries, and can take advantage of bulk insertion and minimally logged transactions.  The negative aspect is you will be using a large temporary object, so this could hinder other operations that use tempdb.

--Row Sequence- Temp Table (2000+)
IF OBJECT_ID('tempdb..#Row_Sequencing') IS NOT NULL
BEGIN
    DROP TABLE #Row_Sequencing;
END;
 
CREATE TABLE #Row_Sequencing(
seq INT IDENTITY(1,1),
OrderId INT PRIMARY KEY CLUSTERED, --Used to create a clustered index, which sorts the data,
CustId INT,
OrderDate DATETIME
);
 
INSERT INTO #Row_Sequencing ([OrderId],[CustId],[OrderDate])
SELECT o.OrderId,o.CustId,o.OrderDate
FROM [dbo].[Orders]o
ORDER BY o.OrderId ASC; --You must issue an order by to guarantee the data insertion order
 
SELECT * FROM [#Row_Sequencing] ORDER BY [seq];
 
/*
seq         OrderId     CustId      OrderDate
----------- ----------- ----------- -----------------------
1           1           1           2009-01-01 08:00:00.000
2           2           1           2009-01-01 10:00:00.000
3           3           1           2009-02-01 14:00:00.000
4           4           1           2009-03-01 15:15:00.000
5           5           1           2009-03-08 22:00:00.000
6           6           1           2009-03-08 22:00:00.000
7           7           2           2009-01-01 07:30:00.000
8           8           2           2009-01-01 17:45:00.000
9           9           2           2009-02-02 23:38:00.000
*/

There you have it.  I have demonstrated the most common solutions, for Sequencing data.  Stay tuned for part two of this series. In part two, I will be focusing on creating sequence partitions, using SQL Server 2000 and greater.  I hope that you have learned something new and can make practical use of the samples  provided.

Happy Coding.

No comments: