Demystifying SQL Server: Sequencing Data Using SQL Server 2000+ (Part 2)

This is part two of a two part series where I will be demonstrating partitioned ranking or what I call partitioned sequences. In part one of this series, we looked at sequencing data, http://jahaines.blogspot.com/2009/09/sequencing-data-in-sql-server-part-1.html. A sequence number is often attributed as a row number given to each row of a result set. A partitioned sequence is a sequence of numbers that is distinct across a grouping of data. In my opinion, I believe it to be easier to partition sequence data, if you have a somewhat sequential surrogate key, like a date column or even better an identity column. It is a lot easier to handle sequenced partitioning when you have a unique sequential key, but not required. Let’s get started by creating our same sample table.

SET NOCOUNT ON

GO

USE [tempdb]

GO

IF OBJECT_ID('tempdb.dbo.Orders') IS NOT NULL

BEGIN

    DROP TABLE [dbo].[Orders]

END;

GO

CREATE TABLE dbo.[Orders](

OrderId INT IDENTITY(1,1),

CustId INT,

OrderDate DateTIME

);

GO

INSERT INTO [dbo].[Orders] (CustId,OrderDate)

SELECT 1,'1/1/09 08:00 AM' UNION ALL

SELECT 1,'1/1/09 10:00 AM' UNION ALL

SELECT 1,'2/1/09 02:00 PM' UNION ALL

SELECT 1,'3/1/09 03:15 PM' UNION ALL

SELECT 1,'3/8/09 10:00 PM' UNION ALL

SELECT 1,'3/8/09 10:00 PM' UNION ALL --Show that ties do not affect the output

SELECT 2,'1/1/09 07:30 AM' UNION ALL

SELECT 2,'1/1/09 05:45 PM' UNION ALL

SELECT 2,'2/2/09 11:38 PM';

GO

Now that we have our sample table, we can begin our first partitioning sequence, using the row_number function (SQL Server 2005+).

--Row Sequence Partitioning: 2005+

SELECT *, ROW_NUMBER() OVER(PARTITION BY o1.CustId ORDER BY o1.OrderId ASC) AS seq

FROM [dbo].[Orders]o1

ORDER BY o1.OrderId, o1.CustId

In the above example, the Row_Number, http://msdn.microsoft.com/en-us/library/ms186734.aspx , function is driving the results. As you can see, I am using the partition by clause to “group” the sequence by o.CustId. The partitioning clause creates a new sequence for each grouping. The order by clause of the Row_Number function dictates the sequence order. I have chosen ASC order, so the sequence will be generated on each CustId starting with the smallest OrderId. until it reaches the largest OrderId.

Results:

Now, we will move onto to solutions for SQL Server 2000. The next example will use a correlated subquery, which can really degrade query performance. I do not really like correlated subqueries, but I will list this as a possible method, but advise you to use another method, http://jahaines.blogspot.com/2009/06/correlated-sub-queries-for-good-or-evil.html

--Partitioning: 2000 slow

SELECT *,(SELECT COUNT(*) FROM [dbo].[Orders]o2 WHERE o2.OrderId <= o1.OrderId AND o1.CustId=o2.[CustId]) AS seq

FROM [dbo].[Orders]o1

ORDER BY o1.OrderId, o1.CustId

Results:

There really is not a whole lot of magic to the correlated subquery solution. All we are doing is counting the number of rows less than the current OrderId, which effectively generates our partitioned sequence. The next method is the derived table method.

--Partitioning: 2000 much faster

SELECT o1.OrderId,o1.CustId,o1.OrderDate,COUNT(*) AS seq

FROM [dbo].[Orders]o1

INNER JOIN [dbo].[Orders]o2 ON o2.OrderId <=o1.OrderId AND o1.[CustId]=o2.[CustId]

GROUP BY o1.OrderId,o1.CustId, o1.OrderDate

ORDER BY o1.OrderId, o1.CustId, o1.OrderDate

Again, there is no magic formula here. We are using the same logic, as the correlated subquery, yet we using a derived table. The derived table method is a better solution than the correlated subquery method and CAN be more performant than the row_number method. Whether or not the derived table method is better performing than the Row_Number function depends on several factors, but the biggest factor is the number of sort operators. the derived table method will mostly use one sort operator (Order by Clause) and a stream aggregate, whereas, the Row_Number solution can use two sort operators, if the ORDER BY clause does not align with the ORDER BY clause of the Row_Number function. In this scenario, the Row_Number solution can be more expensive than the derived table solution. In most cases the derived table method is more costly than the row_number function because it requires another join, which translates into higher IO, while the Row_Number function reads from a single table , in this scenario.

Results:

There you have it. We have successfully implemented sequenced partitioning, in SQL 2000 and greater versions. The main thing to remember for solutions prior to SQL Server 2005 is that in some cases you may have to handle ties. I find it best to handle sequence partitioning using a ranking function, such as Row_Number. If you are using SQL 2000 and do not have a sequential key like an identity, you may have a little heartache, especially if you are trying to handle ties. It is not impossible, but it is more tricky and you have to be more creative with your solution. I hope that you have learned something that you can use in your environment and as always, happy coding.

3 comments:

Unknown said...: I have also heard about the way of microsoft outlook repair; January 21, 2011 at 6:47 AM
Anonymous said...: Sir,
Thank You for your Row Number implementation.It is working in SQL Server-2000 as i wanted.; March 2, 2013 at 3:56 AM
James Zicrov said...: Hello Sir,I just want to know that does SSIS Upsert has only role to play in data sequencing.; April 11, 2019 at 5:23 AM

Demystifying SQL Server

Wednesday, September 23, 2009

Sequencing Data Using SQL Server 2000+ (Part 2)

3 comments:

Subscribe

Search This Blog

About Me

Twitter

Twitter

Blogs I follow

Followers

Blog Archive

Keywords