I have seen a lot of questions regarding how to obtain the most current row, for a given grouping. The type of query is very typical for a one-to-many relationship. In our example, we will be creating an Orders table and our objective is to find the most recent order for each customer. Let’s start by creating our table structure.
USE [tempdb]
GO
IF OBJECT_ID('tempdb.dbo.Customers') IS NOT NULL
BEGIN
DROP TABLE dbo.Customers;
END
GO
CREATE TABLE dbo.Customers(
CustomerID INT IDENTITY(1,1) PRIMARY KEY,
EmpSalesID INT,
FName VARCHAR(25),
LName VARCHAR(25),
Phone CHAR(12)
);
INSERT INTO Customers VALUES (1,'Adam','Haines','555-555-5555');
INSERT INTO Customers VALUES (1,'John','Deere','555-555-7777');
INSERT INTO Customers VALUES (2,'Allision','Smith','555-555-8888');
IF OBJECT_ID('tempdb.dbo.Orders') IS NOT NULL
BEGIN
DROP TABLE dbo.Orders;
END
GO
CREATE TABLE dbo.Orders(
OrderId INT IDENTITY(1,1) PRIMARY KEY,
CustomerID INT,
Order_Amt NUMERIC(9,2),
Order_Dt DATETIME
);
INSERT INTO dbo.Orders VALUES (1,199.50,'2009-08-30');
INSERT INTO dbo.Orders VALUES (1,49.99,'2009-09-05');
INSERT INTO dbo.Orders VALUES (2,2500.00,'2009-09-08');
INSERT INTO dbo.Orders VALUES (3,1.00,'2009-08-08');
INSERT INTO dbo.Orders VALUES (3,2.00,'2009-09-07');
INSERT INTO dbo.Orders VALUES (3,3.00,'2009-09-08');
Now that we have our table structure, we can start writing our query. As you can see, we have a one-to-many relationship between customer and orders. When we join our Customers table to the Orders table we will see multiple rows returned, but our objective is to return the most recent row for each customer. I will illustrate two different methods, for obtaining this data. The first method uses a CTE (Common Table Expression). The key to this method is the row_number function, as this can be used to partition and sort the data to our liking. In my example, I partition by the customer id and ordered by the order_dt DESC because I want to return the newest order placed by each customer. This gives the newest record a sequence value of 1 per CustomerID. I can then filter the outermost CTE for all sequences that are equal to 1, or the most current.
--SQL 2005+
--create cte to select the data we want
;WITH Orders (seq,CustomerID,FName,LName,OrderId,Order_Amt,Order_Dt)
AS
(
SELECT --use the row_number function to get the newest record
ROW_NUMBER() OVER(PARTITION BY c.[CustomerID] ORDER BY Order_Dt DESC) AS seq,
c.CustomerID,c.FName,c.LName,o.OrderId,o.Order_Amt,o.Order_Dt
FROM dbo.Customers c
INNER JOIN dbo.Orders o
ON c.[CustomerID] = o.[CustomerID]
)
SELECT CustomerID,FName,LName,OrderId,Order_Amt,Order_Dt
FROM Orders
WHERE seq = 1 --filter for the newest record only
ORDER BY CustomerID
Next, I will show you a method that works on SQL Server versions less than 2005. This method works on the same principal but uses the max aggregate to get the most current order_dt per customerId. We can then join this order_dt back to the orders table to get the most recent order row data.
--SQL 2000
SELECT o.CustomerID,c.FName,c.LName,o.OrderId,o.Order_Amt,o.Order_Dt
FROM dbo.Customers c
INNER JOIN dbo.Orders o
ON c.[CustomerID] = o.[CustomerID]
INNER JOIN(
SELECT CustomerId, MAX(Order_Dt) AS Order_Dt
FROM dbo.Orders
GROUP BY [CustomerId]
) AS Newest_Order
ON O.CustomerID = Newest_Order.CustomerId
AND o.[Order_Dt] = Newest_Order.[Order_Dt]
ORDER BY o.[OrderId]
There you have it!!! I have provided two simplistic methods to get the most current row. This method is very straight forward and very easy to implement.
Happy Coding.
12 comments:
Hi Adam...
Great post as always.
I would suggest that the SQL2000 query could be rewritten as below. It creates the exact same query plan as your query, but I think it's a little more readable as to what's going on:
SELECT whatever
FROM Customers c
JOIN Orders o ON c.CustomerID=o.CustomerID
WHERE Order_Dt=
(SELECT MAX(Order_Dt)
FROM Orders o2
WHERE o2.CustomerID=o.CustomerID)
Unfortunately another problem with the SQL2000 method is that it can return multiple entries for the same customer. If you INSERTed another row into the Orders table...
INSERT INTO Orders
VALUES (3,4.00,'2009-09'08');
...and execute the query again, you'd get two entries for CustomerID 3.
You'd have to introduce a tiebreaker of some kind like this ugly method:
SELECT whatever
FROM Customers c
JOIN Orders o ON c.CustomerID=o.CustomerID
WHERE CONVERT(varchar,Order_Dt,126)+STR(OrderID)=
(SELECT MAX(CONVERT(varchar,Order_Dt,126)+STR(OrderID))
FROM Orders o2
WHERE o2.CustomerID=o.CustomerID)
Also, here's a really clever out-of-the-box way (SQL2005 only) to also get the most recent order information that I saw posted on the T-SQL Forum:
SELECT TOP 1 WITH TIES Whatever
FROM Customers c
JOIN Orders o on c.CustomerID=o.CustomerID
ORDER BY
RANK() OVER
(PARTITION BY c.CustomerID
ORDER BY o.Order_Dt desc, o.OrderID desc)
--Brad
Thanks for the input Brad.
I agree that the SQL 2000 method could have some problems, although unlikely. It comes back to knowing your data. I could have used the identity column to get the most current record, for each customer. Using the surrogate key eliminates any tie issues, but I wanted to choose a column that more closely reflected the data I am trying to return.
I would hope that no customer can place multiple orders at the exact ms, but anything is possible :). If this scenario does occur you can use a surrogate key instead of the date or you can create logic to eliminate ties.
Thanks for sharing your thoughts!!
@Adam
At the same time can happen when loading a history from elsewhere. Or back- or forward- dating.
Later, people forget these things.
Just some thought.
@Brian,
Good point :). In this scenario it is likely that data integrity can be smudged.
My thoughts are that no person should ever have the same exact timestamp on an order, regardless of backloading data. A person is only able to make a single order in a millisecond's time, so no other record should have this exact timestamp. If you are transfering data from another system that has a lesser precise data type, this is definitely possible. I think we are back to the precision of the data type, knowing our data and maintaining the integrity of our data.
Thanks for the comment.
Unless the time is part of a natural key, i'd use a surrogate. Leaving data integrity up to common sense is a sure way to find out that it isn't so common. :)
There is also a "vector" method for doing this kind of query that is usually more efficient thant the given method and works in all version of SQL Server from the current millenium; however, this method is anything but straight forward.
Kent Waldrop
Kent,
Do you have an example of this other method, or a link that may demonstrate it? I would love to see this method and how it works.
Do you prefer that I post here or email you? I will gather the information and check back.
Kent
Here are several examples from Uma:
http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/4b569f6a-de09-476f-8d3c-293d8d89aa4a/
in this one it is Uma's last example:
http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/8bedb6c8-9bdb-49b9-8824-d332c3eb78a2/
In this post, Denis Repin suggests something similar:
http://social.msdn.microsoft.com/forums/en-US/transactsql/thread/1409d8e6-257b-4096-9df4-d992895c65f5/
I believe you have also heard about the exchange ost file recovery tool
Thank you so much for this post. You saved my day at work! I was trying to figure out how to accomplish this with my client's data.
I'm very grateful for the extensive explanation and great details!
Cheers from Costa Rica!
I guess one needs to always look ahead and search for the most happening aspect of SSIS which makes it able enough to provide large complex database operations.
SSIS Upsert
Post a Comment