Wednesday, September 23, 2009

Sequencing Data Using SQL Server 2000+ (Part 2)

This is part two of a two part series where I will be demonstrating partitioned ranking or what I call partitioned sequences.  In part one of this series, we looked at sequencing data, http://jahaines.blogspot.com/2009/09/sequencing-data-in-sql-server-part-1.html.  A sequence number is often attributed as a row number given to each row of a result set.  A partitioned sequence is a sequence of numbers that is distinct across a grouping of data.  In my opinion, I believe it to be easier to partition sequence data, if you have a somewhat sequential surrogate key, like a date column or even better an identity column.  It is a lot easier to handle sequenced partitioning when you have a unique sequential  key, but not required. Let’s get started by creating our same sample table.

SET NOCOUNT ON
GO
 
USE [tempdb]
GO
 
IF OBJECT_ID('tempdb.dbo.Orders') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[Orders]
END;
GO
 
CREATE TABLE dbo.[Orders](
OrderId INT IDENTITY(1,1),
CustId INT,
OrderDate DateTIME
);
GO
 
INSERT INTO [dbo].[Orders] (CustId,OrderDate)
SELECT 1,'1/1/09 08:00 AM' UNION ALL 
SELECT 1,'1/1/09 10:00 AM' UNION ALL 
SELECT 1,'2/1/09 02:00 PM' UNION ALL
SELECT 1,'3/1/09 03:15 PM' UNION ALL
SELECT 1,'3/8/09 10:00 PM' UNION ALL
SELECT 1,'3/8/09 10:00 PM' UNION ALL --Show that ties do not affect the output
SELECT 2,'1/1/09 07:30 AM' UNION ALL
SELECT 2,'1/1/09 05:45 PM' UNION ALL
SELECT 2,'2/2/09 11:38 PM';
GO

Now that we have our sample table, we can begin our first partitioning sequence, using the row_number function (SQL Server 2005+). 

--Row Sequence Partitioning: 2005+
SELECT *, ROW_NUMBER() OVER(PARTITION BY o1.CustId ORDER BY o1.OrderId ASC) AS seq
FROM [dbo].[Orders]o1
ORDER BY o1.OrderId, o1.CustId

In the above example, the Row_Number, http://msdn.microsoft.com/en-us/library/ms186734.aspx ,  function is driving the results.  As you can see, I am using the partition by clause to “group” the sequence by o.CustId.  The partitioning clause creates a new sequence for each grouping.  The order by clause of the Row_Number function dictates the sequence order.  I have chosen ASC order, so the sequence will be generated on each CustId starting with the smallest OrderId. until it reaches the largest OrderId.

Results:

image

Now, we will move onto to solutions for SQL Server 2000.  The next example will use a correlated subquery, which can really degrade query performance.  I do not really like correlated subqueries, but I will list this as a possible method, but advise you to use another method, http://jahaines.blogspot.com/2009/06/correlated-sub-queries-for-good-or-evil.html

--Partitioning: 2000 slow
SELECT *,(SELECT COUNT(*) FROM [dbo].[Orders]o2 WHERE o2.OrderId <= o1.OrderId AND o1.CustId=o2.[CustId]) AS seq
FROM [dbo].[Orders]o1
ORDER BY o1.OrderId, o1.CustId

Results:

image

There really is not a whole lot of magic to the correlated subquery solution.  All we are doing is counting the number of rows less than the current OrderId, which effectively generates our partitioned sequence.  The next method is the derived table method. 

--Partitioning: 2000 much faster
SELECT o1.OrderId,o1.CustId,o1.OrderDate,COUNT(*) AS seq
FROM [dbo].[Orders]o1
INNER JOIN [dbo].[Orders]o2 ON o2.OrderId <=o1.OrderId AND o1.[CustId]=o2.[CustId]
GROUP BY o1.OrderId,o1.CustId, o1.OrderDate
ORDER BY o1.OrderId, o1.CustId, o1.OrderDate

Again, there is no magic formula here.  We are using the same logic, as the correlated subquery, yet we using a derived table.  The derived table method is a better solution than the correlated subquery method and CAN be more performant than the row_number method.  Whether or not the derived table method is better performing than the Row_Number function depends on several factors, but the biggest factor is the number of sort operators.  the derived table method will mostly use one sort operator (Order by Clause) and a stream aggregate, whereas, the Row_Number solution can use two sort operators, if the ORDER BY clause does not align with the ORDER BY clause of the Row_Number function.  In this scenario, the Row_Number solution can be more expensive than the derived table solution.  In most cases the derived table method is more costly than the row_number function because it requires another join, which translates into higher IO, while the Row_Number function reads from a single table , in this scenario.

Results:

image

There you have it.  We have successfully implemented sequenced partitioning, in SQL 2000 and greater versions.  The main thing to remember for solutions prior to SQL Server 2005 is that in some cases you may have to handle ties.  I find it best to handle sequence partitioning using a ranking function, such as Row_Number.  If you are using SQL 2000 and do not have a sequential key like an identity, you may have a little heartache, especially if you are trying to handle ties.  It is not impossible, but it is more tricky and you have to be more creative with your solution.  I hope that you have learned something that you can use in your environment and as always, happy coding.

3 comments:

Unknown said...

I have also heard about the way of microsoft outlook repair

Anonymous said...

Sir,
Thank You for your Row Number implementation.It is working in SQL Server-2000 as i wanted.

James Zicrov said...

Hello Sir,I just want to know that does SSIS Upsert has only role to play in data sequencing.