Tuesday, July 28, 2009

The Real SQL Pages, A Beginner’s Guide To Indexing

Okay, so the title may be a little lame :-) , but I think it suits the basic analogy quite well. Today I was asked on the MSDN forums how indexing really works in SQL Server. I began typing a lot of technobabble and then thought to myself, will the OP even understand what I am talking about? The answer is probably not. I deleted all my text and came up with a simple analogy that is easy for everyone to visualize. Essentially, indexing in SQL Server works like a phonebook. That’s right, ladies and gentlemen, a phonebook.

How is indexing like a phonebook? Imagine that you have a phonebook in front of you and a handful of bookmarks. If you want to find all the phone numbers where the person’s name begins with A, you place a bookmark on the first page of that letter and on each subsequent page of that letter. If you want more granular bookmarks, you can create a composite key that also contains the names within the letter A. The additional key columns provide density details that allow the optimizer to make better cardinality estimates, or in phonebook terms, allow you to find the pages you are looking for more quickly. For example, suppose you have a composite index on (Letter, Name). The first column allows you to quickly identify all entries that start with A, and the second column allows you to quickly identify the entries equal to ‘Adam Haines’ within that letter. You cannot directly retrieve the phone number for ‘Adam Haines’ without using the letter bookmark, because ‘Adam Haines’ is contained within the letter A. Essentially, we use one of our letter A bookmarks and then jump to ‘Adam Haines’.

Think about it this way. If I asked you to find all the phone numbers where the person’s name starts with the letter A, you would place bookmarks on each page where the names start with the letter A. If I then asked you to find all the phone numbers where the person’s name is ‘Adam Haines’, those bookmarks alone would be useless, because you are looking for a specific name and not a letter. Without knowing the letter, we have to flip through, or scan, the pages to find the value. If you want to be able to search for a specific name directly, you have to place a bookmark on the page where that name exists. Let’s see this in action.

We will start by creating our table structure:

IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'PhoneBook')
BEGIN
    DROP TABLE dbo.PhoneBook;
END
GO
 
CREATE TABLE dbo.PhoneBook(
Letter CHAR(1),
Name VARCHAR(100),
PhoneNumber VARCHAR(12),
PRIMARY KEY (Letter,Name,PhoneNumber)
);
GO
 
INSERT INTO dbo.PhoneBook VALUES ('A','Adam''s DB Consulting','555-555-5555');
INSERT INTO dbo.PhoneBook VALUES ('A','Alex BI Consulting','555-555-1234');
GO

Now we can run a few simple queries. The first query filters the data on the letter A. This query yields an index seek because Letter is the leading column of our index.

SELECT *
FROM dbo.[PhoneBook]
WHERE [Letter] = 'A' --Index seek

Results:

[screenshot: query results, with an execution plan showing a clustered index seek]

The second query uses the same letter, but also includes the specific name we want to find. The query plan is the same as the prior one. On larger datasets, this query should be less costly, because the more granular predicate lets the optimizer find the row more quickly.

SELECT *
FROM dbo.[PhoneBook]
WHERE [Letter] = 'A' AND Name = 'Adam''s DB Consulting' --Index seek

[screenshot: execution plan showing a clustered index seek]

In the third query we will try searching for just the specific name.

SELECT *
FROM dbo.[PhoneBook]
WHERE Name = 'Adam''s DB Consulting' --Clustered Index scan
GO

Results:

[screenshot: execution plan showing a clustered index scan]

Wait, what the heck happened? The optimizer scanned the index even though the Name column is part of the index key! The answer is quite simple. As I stated earlier, the only column that is directly “seekable” is the leading column of the index key. The other key columns feed the density vector, which helps with cardinality estimates; they cannot satisfy a predicate on their own when the leading column is not present. Think back to where we put that bookmark. We put the bookmark on the letter A and the names within A, but we did not include the letter in our search. How do you know where that name is without knowing the letter? The reality is we can’t, so we have to scan. This is why we have to create an index that leads with the Name column.

--Create an index on Name to get an index seek
CREATE UNIQUE NONCLUSTERED INDEX ncl_idx_PhoneBook_Name ON dbo.[PhoneBook](Name);
GO

Now let’s run the same query.

SELECT *
FROM dbo.[PhoneBook]
WHERE Name = 'Adam''s DB Consulting' --Nonclustered Index seek
GO

Results:

[screenshot: execution plan showing a nonclustered index seek]
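
If you want to double-check which access paths your queries actually used, one quick option (just a sanity check; the graphical execution plan tells you the same thing) is the index usage stats DMV:

--index usage counters for dbo.PhoneBook (cumulative since the last service restart)
SELECT i.name AS index_name, ius.user_seeks, ius.user_scans, ius.user_lookups
FROM sys.dm_db_index_usage_stats ius
INNER JOIN sys.indexes i
    ON i.[object_id] = ius.[object_id]
    AND i.index_id = ius.index_id
WHERE ius.database_id = DB_ID()
    AND ius.[object_id] = OBJECT_ID('dbo.PhoneBook');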

I hope this analogy has made things a little clearer. This is by no means all there is to know about indexing, but it is a good fundamental look at how indexes work. Let me know if you guys have anything to add, or anything I may have missed.

Happy coding.

Monday, July 27, 2009

SQL Server 2005/2008 Querying Hierarchical Data

SQL Server 2005 introduced a new query construct that has the look and feel of a derived table, but offers much more.  This construct is called a CTE, or common table expression.  CTEs allow developers to name a result set and query the named set directly, as if it were a table.  The feature we will primarily be looking at today is the CTE’s recursive functionality.  If you are not familiar with CTEs, you can read more about them in BOL: http://msdn.microsoft.com/en-us/library/ms175972.aspx.  The other method we will explore uses a new data type.  SQL Server 2008 introduced the hierarchyid data type to make querying hierarchical data more efficient.  Hierarchyid is a great construct that really makes hierarchical data more manageable and more performant.  Here is the BOL documentation for the hierarchyid data type: http://technet.microsoft.com/en-us/library/bb677290.aspx.  We will use both the recursive CTE and the hierarchyid to query our hierarchical data.  Let’s get started.

We will start by creating our table structure.

The first table we are going to create is our employees table.  This table will house all the pertinent information related to each employee.

USE [tempdb]
GO
 
IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'Employee')
BEGIN
    DROP TABLE dbo.Employee;
END
GO
 
--Table to hold employee information
CREATE TABLE dbo.Employee(
EmployeeId INT IDENTITY(1,1) PRIMARY KEY,
FName VARCHAR(50) NOT NULL,
MInitial CHAR(1),
LName VARCHAR(50) NOT NULL
);
GO
 
--Insert employee information
INSERT INTO dbo.Employee VALUES ('James','A','Haines');
INSERT INTO dbo.Employee VALUES ('John','A','Smith');
INSERT INTO dbo.Employee VALUES ('Jason','B','Jones');
INSERT INTO dbo.Employee VALUES ('Michelle','L','Williams');
INSERT INTO dbo.Employee VALUES ('Lauren','T','Gaubert');
INSERT INTO dbo.Employee VALUES ('Addison','S','Keller');
INSERT INTO dbo.Employee VALUES ('Avery','C','Ramos');
INSERT INTO dbo.Employee VALUES ('Justin',null,'Matherne');
GO

The next table we will create is the positions table.  This table stores all the positions/job titles, within the company.

IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'Positions')
BEGIN
    DROP TABLE dbo.Positions;
END
GO
 
--Table to hold employee positions
CREATE TABLE dbo.Positions(
PositionId INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
Position VARCHAR(50) NOT NULL UNIQUE
);
GO
 
--Insert Employee Position Information
INSERT INTO dbo.Positions VALUES ('CEO');
INSERT INTO dbo.Positions VALUES ('Senior Director - Development');
INSERT INTO dbo.Positions VALUES ('Development Manager');
INSERT INTO dbo.Positions VALUES ('QA Manager');
INSERT INTO dbo.Positions VALUES ('Project Manager');
INSERT INTO dbo.Positions VALUES ('Project lead');
INSERT INTO dbo.Positions VALUES ('Developer');
INSERT INTO dbo.Positions VALUES ('QA Tester');
GO

Next, we create the Employee_Position table.  Employee_Position stores the employee id from the Employee table and the position id from the Positions table.  This table relates a given employee to a position.

IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'Employee_Position')
BEGIN
    DROP TABLE dbo.Employee_Position;
END
GO
 
--Table to relate employees to positions
CREATE TABLE dbo.Employee_Position(
EmployeeId INT NOT NULL REFERENCES dbo.[Employee](EmployeeId),
PositionId INT NOT NULL REFERENCES dbo.[Positions](PositionId),
PRIMARY KEY (EmployeeId,PositionId)
);
GO
 
--Insert Employee Position relations
INSERT INTO dbo.Employee_Position VALUES (1,1);
INSERT INTO dbo.Employee_Position VALUES (2,2);
INSERT INTO dbo.Employee_Position VALUES (3,3);
INSERT INTO dbo.Employee_Position VALUES (4,4);
INSERT INTO dbo.Employee_Position VALUES (5,5);
INSERT INTO dbo.Employee_Position VALUES (6,6);
INSERT INTO dbo.Employee_Position VALUES (7,7);
INSERT INTO dbo.Employee_Position VALUES (8,8);
GO

The final table we will be creating is the primary table for this article, Employee_Hierarchy.  Employee_Hierarchy stores hierarchical employee data.  Essentially, this table has two main columns, EmployeeId and ManagerEmployeeId.  EmployeeId is the employee id, which relates to the Employee table.  ManagerEmployeeId is the employee id of the employee’s manager.

IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'Employee_Hierarchy_2005')
BEGIN
    DROP TABLE dbo.Employee_Hierarchy_2005;
END
GO
 
--Table to hold employee hierarchal data
CREATE TABLE dbo.Employee_Hierarchy_2005(
EmployeeId INT NOT NULL REFERENCES dbo.[Employee](EmployeeId),
ManagerEmployeeId INT REFERENCES dbo.[Employee](EmployeeId),
PRIMARY KEY (EmployeeId)
);
GO
 
--Insert employee hierarchial data
INSERT INTO dbo.Employee_Hierarchy_2005 VALUES (1,NULL);
INSERT INTO dbo.Employee_Hierarchy_2005 VALUES (2,1);
INSERT INTO dbo.Employee_Hierarchy_2005 VALUES (3,2);
INSERT INTO dbo.Employee_Hierarchy_2005 VALUES (4,2);
INSERT INTO dbo.Employee_Hierarchy_2005 VALUES (5,3);
INSERT INTO dbo.Employee_Hierarchy_2005 VALUES (6,5);
INSERT INTO dbo.Employee_Hierarchy_2005 VALUES (7,3);
INSERT INTO dbo.Employee_Hierarchy_2005 VALUES (8,4);
GO
 
--Create index to help query
CREATE NONCLUSTERED INDEX ncl_idx_Employee_Hierarchy_2005_ManagerEmployeeId ON [Employee_Hierarchy_2005](ManagerEmployeeId);
GO

The first solution we are going to address is the recursive CTE solution.  The CTE solution works in SQL Server 2005 and greater.  I have placed comments to help explain what each step of the CTE is doing.  Essentially, the anchor query selects the root of the hierarchy, and the recursive member then walks down the hierarchy based on the relationship between EmployeeId and ManagerEmployeeId.  For example, the CEO has an EmployeeId of 1.  The anchor query selects the CEO’s EmployeeId (1), and that value is joined onto the Employee_Hierarchy table using the ManagerEmployeeId column.  Once the optimizer has the next EmployeeId (2), it uses that EmployeeId to find all the employees that employee manages.  The query starts from the anchor EmployeeId and walks its way down the table, returning all the child employees.  Because of the nature of recursion, a recursive CTE can sometimes be inefficient.  To optimize recursive CTEs, one should have the proper indexing in place and try to limit the amount of data in the recursive construct.

;WITH cte AS
(
--This is the anchor query
SELECT eh.EmployeeId, eh.ManagerEmployeeId, 0 AS Position_Depth --default level of the root employee
FROM dbo.Employee_Hierarchy_2005 eh
WHERE eh.[ManagerEmployeeId] IS NULL --get the root level (CEO)
    
UNION ALL --this joins the anchor query and appends the below query results
 
SELECT eh.EmployeeId, eh.ManagerEmployeeId, cte.Position_Depth + 1 AS Position_Depth --for each level down, increment the depth by 1
FROM dbo.Employee_Hierarchy_2005 eh INNER JOIN [cte] ON eh.ManagerEmployeeId = cte.EmployeeId --join the recursive member to the prior level
)
SELECT cte.EmployeeId, e.FName, e.MInitial, e.LName, cte.[ManagerEmployeeId], p.Position, cte.Position_Depth
FROM cte
INNER JOIN [dbo].[Employee] e ON e.[EmployeeId] = cte.[EmployeeId] --get the employee details
INNER JOIN [dbo].[Employee_Position] ep ON e.[EmployeeId] = ep.[EmployeeId] --get the employee positions
INNER JOIN [dbo].[Positions] p ON ep.[PositionId] = p.[PositionId] --get the employee position description 

Results:

[screenshot: query results listing each employee with manager, position, and Position_Depth]
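
One safeguard worth adding here (my addition, not something the example strictly needs): if bad data ever introduces a cycle, a recursive CTE will keep going until it hits the default limit of 100 recursion levels.  The MAXRECURSION hint lets you fail fast at a depth you know your data can never legitimately reach, for example:

;WITH cte AS
(
SELECT eh.EmployeeId, eh.ManagerEmployeeId, 0 AS Position_Depth
FROM dbo.Employee_Hierarchy_2005 eh
WHERE eh.ManagerEmployeeId IS NULL
 
UNION ALL
 
SELECT eh.EmployeeId, eh.ManagerEmployeeId, cte.Position_Depth + 1
FROM dbo.Employee_Hierarchy_2005 eh
INNER JOIN cte ON eh.ManagerEmployeeId = cte.EmployeeId
)
SELECT cte.EmployeeId, cte.ManagerEmployeeId, cte.Position_Depth
FROM cte
OPTION (MAXRECURSION 32); --fail fast if the data ever loops or grows deeper than expected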

Next we will take a look at the new hierarchyid data type.  The hierarchyid data type is only available in SQL Server 2008.  It is a compact binary representation of a position in a hierarchy, and it can be a very effective way to store hierarchical data.  More information about the hierarchyid data type can be found in BOL and in this great SSC article: http://www.sqlservercentral.com/articles/SQL+Server+2008/62204/.

Let’s create a new hierarchical table for the hierarchyid example.

IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'Employee_Hierarchy_2008')
BEGIN
    DROP TABLE dbo.Employee_Hierarchy_2008;
END
GO
 
--Table to hold employee hierarchal data
CREATE TABLE dbo.Employee_Hierarchy_2008(
EmployeeId INT NOT NULL PRIMARY KEY REFERENCES dbo.[Employee](EmployeeId),
ManagerEmployeeId INT REFERENCES dbo.[Employee](EmployeeId),
EmpHierarchy HIERARCHYID
);
GO
 
CREATE NONCLUSTERED INDEX ncl_idx_Employee_Hierarchy_2008_EmpHierarchy ON [dbo].[Employee_Hierarchy_2008](EmpHierarchy) INCLUDE(ManagerEmployeeId);
GO

We can now populate the new Employee_Hierarchy_2008 table.  We will need to use a recursive CTE to populate our hierarchyid column.  The code to populate the new table is below.  Note that I am building each child node from its parent’s path plus a sibling number; calling GetDescendant(NULL,NULL) for every child, as is sometimes shown, would hand the exact same node value to all siblings under a given parent.

;WITH Emp_Hierarchy(EmployeeId, ManagerEmployeeId, EmpHierarchy) AS
(
SELECT eh.EmployeeId, eh.ManagerEmployeeId, hierarchyid::GetRoot() AS EmpHierarchy
FROM dbo.[Employee_Hierarchy_2005] eh
WHERE eh.ManagerEmployeeId IS NULL
    
UNION ALL
 
--build each child's node from the parent's path plus a sibling number;
--GetDescendant(NULL,NULL) would assign the same node to every sibling
SELECT eh.EmployeeId, eh.ManagerEmployeeId,
    CAST(Parent.EmpHierarchy.ToString()
        + CAST(ROW_NUMBER() OVER(PARTITION BY eh.ManagerEmployeeId ORDER BY eh.EmployeeId) AS VARCHAR(10))
        + '/' AS HIERARCHYID)
FROM dbo.[Employee_Hierarchy_2005] eh
INNER JOIN Emp_Hierarchy AS Parent ON eh.[ManagerEmployeeId] = Parent.[EmployeeId]
)
INSERT INTO Employee_Hierarchy_2008([EmployeeId],ManagerEmployeeId, EmpHierarchy)
SELECT [EmployeeId], [ManagerEmployeeId], [EmpHierarchy]
FROM [Emp_Hierarchy]
GO

Just to show you what the hierarchy looks like, run a select statement on the table and convert the hierarchyid column with ToString().  As you can see, EmpHierarchy itself is stored in a binary format.

SELECT *, EmpHierarchy.ToString() AS EmpHierarchyPath
FROM Employee_Hierarchy_2008 eh2008
 
Results:
 
[screenshot: Employee_Hierarchy_2008 rows showing the binary EmpHierarchy values alongside their ToString() paths]
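
As a side note, hierarchyid also exposes a GetAncestor method (a built-in, like ToString and GetLevel) for walking back up the tree; a quick illustration against our sample table:

--walk one level up the tree for every row
SELECT eh.EmployeeId,
    eh.EmpHierarchy.ToString() AS NodePath,
    eh.EmpHierarchy.GetAncestor(1).ToString() AS ParentPath --NULL for the root node
FROM dbo.Employee_Hierarchy_2008 eh;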

We now have our table and our data, so we can proceed to query the hierarchy.  We will use the built-in function IsDescendantOf to check whether or not a given node is a descendant of the node passed in.

SELECT eh.EmployeeId, e.FName, e.MInitial, e.LName, eh.ManagerEmployeeId, p.Position, EmpHierarchy.GetLevel() AS Position_Depth
FROM dbo.[Employee_Hierarchy_2008] eh
INNER JOIN [Employee] e ON eh.[EmployeeId] = e.[EmployeeId]
INNER JOIN [Employee_Position] ep ON e.[EmployeeId] = ep.[EmployeeId]
INNER JOIN [Positions] p ON ep.[PositionId] = p.[PositionId]
WHERE EmpHierarchy.IsDescendantOf(hierarchyid::GetRoot()) = 1

Note: above I have chosen to use the built-in function GetRoot() to get the topmost employee.  You can use a variable or a subquery to search for employees under any node; it does not have to be the root.  A sketch of that follows the results below.

Results:

[screenshot: query results listing each employee with manager, position, and Position_Depth]
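
Here is what that variable approach might look like, picking EmployeeId 3 (one of the managers in our sample data) as the starting node:

DECLARE @mgr HIERARCHYID;
SET @mgr = (SELECT EmpHierarchy FROM dbo.[Employee_Hierarchy_2008] WHERE EmployeeId = 3);
 
--everyone at or below employee 3 (a node counts as a descendant of itself)
SELECT eh.EmployeeId, eh.ManagerEmployeeId, eh.EmpHierarchy.ToString() AS NodePath
FROM dbo.[Employee_Hierarchy_2008] eh
WHERE eh.EmpHierarchy.IsDescendantOf(@mgr) = 1;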

There you have it!  Two methods to query hierarchical data.  If you are curious about the performance difference between the two methods, the hierarchyid approach can be more performant and less costly than the recursive CTE.  Each of these methods shines in its own way, and both have limitations.  I leave it to you to decide which is best for your environment.

Happy Coding.

Tuesday, July 21, 2009

Centralized Data Collecting, Using SSIS

What is the best method to centralize data collection and/or history metadata for SQL Server instances?  This is a common question that a lot of companies and developers ask themselves.  A lot of solutions, especially pre-SQL Server 2008, require objects to exist on each participating instance, or require a batch process that loops through a text file of server names.  SQL Server 2008 introduced a centralized storage mechanism to collect and manage performance data for one or more instances, known as the Management Data Warehouse.  Here is a great link that briefly covers the new 2008 feature: http://www.simple-talk.com/sql/learn-sql-server/sql-server-2008-performance-data-collector/.  Today I will be focusing on a highly scalable SSIS solution.  SSIS offers a lot of flexibility and scalability, and has security options that are not available to TSQL and command line programming.  For those unfamiliar with it, SSIS (SQL Server Integration Services) is a tool used to transform and load data amongst a variety of sources, including SQL Server, flat files, Oracle, AS400, MS Access, etc.  In a recent blog post, I demonstrated how to set up a new SSIS package and how to dynamically change the destination file name.  That post has a lot of meat for beginners and may be worth a read: http://jahaines.blogspot.com/2009/07/ssis-dynamically-naming-destination.html.  I will not get into the particulars of creating the SSIS package here, but I recommend that anyone not familiar with SSIS read the above linked post.

Let’s get started by opening SSMS and browsing to the SQL Server instance you want to use as your central data store.  I chose to create a new database called “SQLMngt_DW”.  The ultimate plan for this database is to store job history metadata for all SQL Server Instances.  Feel free to create your database where you see fit.

USE [master]
GO
 
CREATE DATABASE [SQLMngt_DW] 
GO

Now let’s create a table to store server names.

USE [SQLMngt_DW]
GO
 
IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'Servers')
BEGIN
    DROP TABLE dbo.[Servers];
END
GO
 
--Create a Servers table to house all your instances
CREATE TABLE dbo.Servers(
InstanceId INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
InstanceName VARCHAR(50)
);
GO

Now we can insert some ServerNames into our dbo.Servers table.

USE [SQLMngt_DW]
GO
 
INSERT INTO dbo.Servers VALUES ('Instance1\dev');
INSERT INTO dbo.Servers VALUES ('Instance2\dev');

**Note: Make sure to insert some instance names into the dbo.Servers table; otherwise, your package will time out when trying to dynamically change the connection string.  Also, make sure you enter valid instance names.  The names above are dummy server names; you should alter the inserts for your environment.

We need to create one last table, called JobHistory.  This table will store job history data for all servers listed in the Servers table.

IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'JobHistory')
BEGIN
    DROP TABLE dbo.[JobHistory];
END
GO
 
--Create a job history table
CREATE TABLE dbo.JobHistory(
JobHistoryId INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
ServerName VARCHAR(50),
JobId UNIQUEIDENTIFIER,
JobName VARCHAR(50),
RunDate DATETIME,
JobStatus VARCHAR(11),
Duration VARCHAR(10)
)

Now that we have our database and tables, we can start creating our package.  Create a new SSIS package and create two variables: one of type Object and one of type String, as shown below.  The String variable, which I named “ConnectionString”, should have a value equal to the name of the server where you placed your centralized database.

[screenshot: Variables window with the object and string variables]

Next drag the Execute SQL Task component to the canvas.  Your canvas should look like below.

[screenshot: control flow canvas with the Execute SQL Task]

Open the Execute SQL Task and configure the general tab, as shown below. Make sure to create a connection to the instance where you created your database and JobHistory table. I named my connection SQLMngt_DW.

[screenshot: Execute SQL Task editor, general tab]

The code for the SQL Statement is below.

SELECT InstanceName FROM dbo.Servers

Now, we need to configure the result set tab.  Make the appropriate configurations, as shown below.

[screenshot: Execute SQL Task editor, result set tab]

Our SQL Task is fully configured.  We can proceed to creating the loop.  Drag the ForEach Loop Container to the canvas.  Make sure to create a precedence constraint between the SQL Task and the loop; essentially, all you need to do is drag the green arrow from the SQL Task to the ForEach Loop Container.  Once the loop container is in place, drag a Data Flow Task into the container.  Your canvas should look like the screenshot below.

[screenshot: control flow canvas with the Execute SQL Task wired to a ForEach Loop Container holding a Data Flow Task]

Now we have to configure both controls.  Let’s start with the loop.  Double-click the ForEach Loop Container to configure its properties.  I will start with the collection tab.

[screenshot: ForEach Loop editor, collection tab]

Next configure the variable mappings tab, as shown below.

[screenshot: ForEach Loop editor, variable mappings tab]

Now we have to configure the Data Flow Task.  Create both an OLE DB source and an OLE DB destination within the Data Flow Task, and create a precedence constraint between the two.  Once the source and destination have been created, create a new OLE DB connection.  This connection needs to point to any valid instance; it does not matter which one you choose, because we will overwrite the connection string at run time.  I have chosen to point my new connection to the same server as SQLMngt_DW.  I named my new connection dyn_source (dynamic source).  Your canvas should look like below.

[screenshot: data flow canvas with the OLE DB source wired to the OLE DB destination]

Now open the OLE DB Source, in the data flow task and configure it, as shown below.

Note: The code for the SQL Command is below the screenshot.

[screenshot: OLE DB Source editor with the SQL command text]

SQL Command Code:

SELECT 
    CONVERT(VARCHAR(50),@@SERVERNAME) AS ServerName,
    j.[job_id] AS JobId,
    CONVERT(VARCHAR(50),j.NAME) AS JobName,
    CONVERT(DATETIME,CONVERT(VARCHAR(8),jh.[run_date]) 
    + SPACE(1) 
    + STUFF(STUFF(RIGHT(REPLICATE('0',6) 
    + CONVERT(VARCHAR(6),run_time),6),3,0,':'),6,0,':')) AS rundate,
    Run_Status.descr AS jobstatus,
    STUFF(STUFF(RIGHT(REPLICATE('0',6) 
    + CONVERT(VARCHAR(6),Run_Duration),6),3,0,':'),6,0,':') AS duration
FROM msdb..[sysjobs] j
INNER JOIN msdb..[sysjobschedules] jsch
    ON j.[job_id] = jsch.[job_id]
INNER JOIN msdb..[sysschedules] sch
    ON jsch.[schedule_id] = sch.[schedule_id]
INNER JOIN msdb..[sysjobhistory] jh
    ON jh.[job_id] = j.[job_id]
INNER JOIN(
    SELECT 0 AS run_status ,'Failed' AS descr UNION ALL
    SELECT 1,'Succeeded' UNION ALL
    SELECT 2,'Retry' UNION ALL
    SELECT 3,'Canceled' UNION ALL
    SELECT 4,'In progress'
) AS Run_Status
    ON jh.[run_status] = Run_Status.[run_status]
WHERE
    j.NAME NOT LIKE 'Tivoli%' --tivoli backups
    AND j.[enabled] = 1
    AND sch.[enabled] = 1
    AND jh.[step_name] = '(Job outcome)'
    AND CONVERT(DATETIME,CONVERT(VARCHAR(8),jh.[run_date]) 
        + SPACE(1) 
        + STUFF(STUFF(RIGHT(REPLICATE('0',6) 
        + CONVERT(VARCHAR(6),run_time),6),3,0,':'),6,0,':')) 
        BETWEEN DATEADD(DAY,DATEDIFF(DAY,0,GETDATE()),0)
            AND DATEADD(DAY,DATEDIFF(DAY,0,GETDATE())+1,0)

Make sure to click the columns tab so SSIS can automatically set your columns for you.

Now let’s configure the destination.  The destination is actually quite simple.  Open the properties and choose the SQLMngt_DW connection and the JobHistory table, as shown below.  Make sure to select the mappings tab to verify that all the columns align properly.  You may see the JobHistoryId column set to Ignore, which is fine, because it is an identity column and we do not want to insert into it directly.

[screenshot: OLE DB Destination editor pointing at the JobHistory table]

Now we get to do the fun part :-), which is setting up the dynamic connection string via the ConnectionString variable.  Right-click the dyn_source connection manager and click Properties.  Once the properties display, click the ellipsis next to Expressions and choose ConnectionString.  Click the ellipsis in the expression textbox to open the expression editor and paste the following expression:

"Data Source=" + @[User::ConnectionString] + ";Initial Catalog=msdb
;Provider=SQLNCLI10.1;Integrated Security=SSPI"
 
Note: I am using integrated security.  If your connection uses SQL Authentication, you will need to supply a user name and password in the connection string.

[screenshot: connection manager properties with the ConnectionString expression]
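
For reference, a SQL Authentication variant of the expression might look like the following; YourSqlLogin and YourPassword are placeholders for your own credentials (and consider package configurations or protection levels rather than hardcoding a password):

"Data Source=" + @[User::ConnectionString] + ";Initial Catalog=msdb;Provider=SQLNCLI10.1;User ID=YourSqlLogin;Password=YourPassword;"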

Believe it or not, that is it!  Click the play button at the top of the SSIS designer to run the package.  We should see all tasks light up green; if something is not configured correctly, you will see red.

[screenshot: package run with all tasks green]

Our last step is to validate that the package did what it is supposed to do.  We can validate this by querying our job history table.
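
A minimal sanity check (my query, not from the original package; it just counts what landed per instance) could be:

USE [SQLMngt_DW]
GO
 
--one aggregate row per instance we collected from
SELECT ServerName, COUNT(*) AS JobHistoryRows
FROM dbo.JobHistory
GROUP BY ServerName;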

[screenshot: rows returned from dbo.JobHistory]

There you have it.  You now have a very flexible and scalable method to collect data from one or more sources.  The SSIS method really beats having to manage a command line solution or having to manage and deploy stored procedures to each server.  Configuring the SSIS package is not that tedious and can be done fairly quickly; it actually takes a lot longer to describe how to do this than to actually do it.  Once you get used to using SSIS, this process should only take you 10-15 minutes to set up and configure.  I hope that you have learned something new, and I know this will help some of you out there.  I know this post may seem overwhelming for those of you who have not really had the opportunity to dive into SSIS yet, so I have provided a link where you can download the project.  Please be aware that I am using SQL 2008 and BIDS 2008, so my package will not work with BIDS 2005, as the package structure changed in BIDS 2008.

Download the project files: http://cid-6f041c9a994564d8.skydrive.live.com/self.aspx/.Public/SSIS%7C_Data%7C_Collection/SSIS%7C_DataCollection.zip

Happy coding!!

Thursday, July 16, 2009

SSIS – Dynamically Naming Destination Output Files

When I am not wearing my DBA hat, I put on my BI (Business Intelligence) hat.  BIDS (Business Intelligence Development Studio) is a very powerful and scalable business intelligence environment that was introduced in SQL Server 2005.  Within BIDS, I will be focusing on SSIS (SQL Server Integration Services) projects.  SSIS projects are designed to extract, transform, and load data, which is more commonly known as ETL.  In this post, I will be focusing on extracting data and loading it to a destination; there is no need for me to transform the data.  I will start by creating a new SSIS package.  You can launch BIDS by browsing to Start –> Programs –> SQL Server 2005/2008 –> SQL Server Business Intelligence Development Studio.  Once BIDS has launched, click New Project and select Integration Services Project, as shown below.

[screenshot: New Project dialog with Integration Services Project selected]

We will name our project DynamicDestinationFileName.  Once the project is created, we will switch over to SSMS and create a sample table in tempdb.  Log into your instance of SQL Server and execute the code below.

SET NOCOUNT ON;
GO
 
USE [tempdb]
GO
 
IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'MyTable')
BEGIN
    DROP TABLE dbo.MyTable;
END
GO
CREATE TABLE dbo.MyTable(
Id INT IDENTITY(1,1) PRIMARY KEY,
Col CHAR(1)
);
GO
 
INSERT INTO dbo.MyTable VALUES ('A');
INSERT INTO dbo.MyTable VALUES ('B');
INSERT INTO dbo.MyTable VALUES ('C');
INSERT INTO dbo.MyTable VALUES ('D');
INSERT INTO dbo.MyTable VALUES ('E');
GO

We have just created our SSIS data source, which means this is the table that will feed our destination flat file.  Switch back to BIDS and drag a Data Flow Task onto the designer from the toolbox.  Your screen should look like below.

[screenshot: designer with the Data Flow Task on the canvas]

Now we can double-click the Data Flow Task to configure it, or click the Data Flow tab above the canvas.  We will need to drag an OLE DB source and a flat file destination from the toolbox.  Your canvas should look like below.

[screenshot: data flow canvas with an OLE DB source and a flat file destination]

Select the OLE DB Source and drag the green arrow to the flat file destination.  Now, let’s configure the source.  Double-click the source and click New.

[screenshot: OLE DB Source editor]

Click New again and then type in the database server where you just ran the scripts to create the table.  In the database dropdown, select tempdb.  Your connection should look like below.  You can test your connection by clicking the Test Connection button.  After testing the connection, click OK to apply the changes, and then OK again.

[screenshot: connection manager dialog pointing at tempdb]

Now choose MyTable from the table dropdown. 

[screenshot: OLE DB Source editor with MyTable selected]

Once you select the table, click the columns tab to assign columns to the data source.  Click OK and we’re done with the data source!  Now let’s move on to the destination.

[screenshot: OLE DB Source editor, columns tab]

Double-click the flat file destination and create a new connection.  When you click New to create the destination connection, a window will pop up and ask for the flat file destination format.

The options include delimited, ragged right, fixed width, etc.  The differences between these types are not relevant to the task at hand and will not be covered here; BOL has a lot of information about the differences between output types.  Okay, let’s move on by selecting delimited and clicking OK.

[screenshot: Flat File Format dialog with Delimited selected]

Now we will need to define the properties of the destination file.  Click the Browse button to choose where the destination file will be created.  I will be creating my file at C:\MyTable.txt.  You may be screaming, HUH?!  YOU SAID THE DESTINATION NAME WOULD BE DYNAMIC!  Do not fret; it will be, and I will show you how to transform the static destination into a dynamic one in a few minutes.

[screenshot: Flat File Connection Manager editor pointing at C:\MyTable.txt]

Okay, now click the columns tab on the left to make sure the columns look correct.  You should see the columns Id and Col.  If everything looks good, click OK.  Note: there should not be any data, because the flat file is empty at present, but the columns should still be correct.

[screenshot: Flat File Connection Manager editor, columns tab showing Id and Col]

Now we should be in the destination editor window.  Click the mappings tab, as shown below, and make sure everything is aligned properly.  If so, click OK.

[screenshot: Flat File Destination editor, mappings tab]

I hope you are still with me :-).  Finally we can get to the meat of the post.  Select the flat file connection manager in the Connection Managers pane, right-click and choose Properties, and click the ellipsis next to Expressions, as shown below.

[screenshot: connection manager properties with the Expressions ellipsis]

Choose ConnectionString from the dropdown and then click the ellipsis in the expression box.  You will have to build the expression that creates the destination connection string.  The expression editor uses the SSIS expression language; its functions, like DatePart and GetDate, will look familiar to VB users.  I have provided the code below to build the expression.  Click the Evaluate Expression button to see what the connection string will evaluate to.  Note: I had to use double backslashes because the backslash acts as an escape character in the expression editor.

Note: I like to use variables to store file paths because it makes the code cleaner and the package easier to modify: you only need to update the variable in one place, whereas you may have to update lots of expressions if the paths are hardcoded.  A variable-based sketch follows the screenshot below.

"C:\\MyTable_" +
Right("0" + (DT_STR,4,1252) DatePart("m",getdate()),2) +
Right("0" + (DT_STR,4,1252) DatePart("d",getdate()),2) + 
(DT_STR,4,1252) DatePart("yyyy",getdate()) + ".txt"

Here is what the expression looks like and should evaluate to.

[screenshot: expression editor with the evaluated connection string]
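
Here is what that variable-based variant might look like; User::OutputFolder is a hypothetical string variable (holding something like C:\\) that you would create yourself, so treat this as a sketch rather than part of the package built above:

@[User::OutputFolder] + "MyTable_" +
Right("0" + (DT_STR,4,1252) DatePart("m",getdate()),2) +
Right("0" + (DT_STR,4,1252) DatePart("d",getdate()),2) +
(DT_STR,4,1252) DatePart("yyyy",getdate()) + ".txt"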

Now let’s run our package and test it out.  Click the green play symbol to run the package.  When you run the package, all the tasks should turn green, which indicates they succeeded.  If yours turned red, something went wrong.

[screenshot: package run with all tasks green]

Now we just have to make sure our file is where we put it and that the name contains today’s date in MMDDYYYY format.

[screenshot: directory listing showing the output file named with today’s date]

We are good; the package did everything we wanted it to.  So there you have it: a method to create a destination file with a dynamic name.  It is really quite easy to do; the tedious and mundane part is creating and setting up the package itself.  I hope you learned something new and can use this now or somewhere down the road.

Stay tuned, I will be posting a lot more BI solutions, tips, and tricks.

Friday, July 10, 2009

Should I Intersect or Except?

Two very useful set operators were introduced in SQL Server 2005: INTERSECT and EXCEPT.  These two operators can greatly limit the amount of coding required to find the differences or similarities between two result sets.  Let’s start with the EXCEPT operator.  EXCEPT returns the distinct rows from the left query that do not exist in the right query.  When I say left and right query, I am referring to the position of the query in relation to EXCEPT.  Take a look at the example below.  In this example, the Table1 query is to the left and Table2 is to the right.  This means the result will include all the rows from Table1 that do not exist in Table2.

SELECT * FROM Table1 EXCEPT SELECT * FROM Table2

You are probably thinking okay…, but how can this help me?  I believe the best way to demonstrate the behavior is through example.  Let’s start by creating test tables.

DECLARE @t1 TABLE(
id INT,
col CHAR(1)
)
 
INSERT INTO @t1 VALUES (1,'a');
INSERT INTO @t1 VALUES (2,'b');
INSERT INTO @t1 VALUES (3,'c');
INSERT INTO @t1 VALUES (4,'d');
 
DECLARE @t2 TABLE(
id INT,
col CHAR(1)
)
 
INSERT INTO @t2 VALUES (1,'a');
INSERT INTO @t2 VALUES (2,'b');
INSERT INTO @t2 VALUES (4,'d');
INSERT INTO @t2 VALUES (5,'e');

Now, we can test.  In this example we will use the same logic that we presented earlier. 

--use EXCEPT
SELECT 
    *
FROM @t1
 
EXCEPT
 
SELECT *
FROM @t2

The results are as follows:

[screenshot: EXCEPT results showing id 3, col c]

The result is only id=3, col=’c’, because this row is the only row that exists in @t1 that does not exist in @t2.  This example may not seem like it saves you a lot of coding, but imagine how many statements you would need to check each column, especially when the table has lots of columns.  Now let’s expand the example to show the distinct values in both tables.  So how do we get the distinct rows for each table?  If you said use two EXCEPT operators, you are correct.  We can simply reverse the left and right queries to change the logic so that we get the distinct rows from @t2.  We will use the UNION ALL operator to append the results together, so that we get one final result set of rows and the table they exist in.

--use EXCEPT
SELECT *
FROM(
    SELECT 
        't1' AS tbl,id,col
    FROM @t1
 
    EXCEPT --original except
 
    SELECT 
        't1' AS tbl,id,col
    FROM @t2
 
    UNION ALL
    
    SELECT 
        't2' AS tbl,id,col
    FROM @t2
 
    EXCEPT --reverse the except logic
 
    SELECT 
        't2' AS tbl,id,col
    FROM @t1
) AS a    

Results:

[screenshot: results showing the distinct rows from each table]

Now let’s move on to the INTERSECT operator.  INTERSECT does what the name implies: it returns all the distinct rows that appear in both the left and the right query results.

Let’s see it in action:

--use intersect
SELECT 
    *
FROM @t1
 
INTERSECT
 
SELECT *
FROM @t2

[screenshot: INTERSECT results showing the rows common to both tables]

The rows returned are the values that exist in both query result sets.  This is a very quick and easy way to get like values.  The alternative to using the INTERSECT method is to use a join predicate that contains every column in the table, as shown below.

--use join
SELECT 
    t1.*
FROM @t1 t1
INNER JOIN @t2 t2
    ON  t1.id = t2.id 
    AND t1.col = t2.col

You may be thinking that this does not look so bad, which I can agree with; however, what happens when your queries have 10 or more columns?  Can you imagine having to scale this code to work with 20 filters?  I just think it is a lot easier to use INTERSECT for this, as opposed to hand coding the predicate.  The join also behaves differently around NULLs: INTERSECT treats NULLs as equal, while a join predicate never matches them.  One could argue that CHECKSUM is a valid alternative.  CHECKSUM can be a good alternative; however, wrapping columns in it prevents an index on those columns from being used, so I would be a little apprehensive to use it, especially if you are implementing the solution in a production environment.  A sketch of that approach follows below.
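
For completeness, here is a rough sketch of the CHECKSUM approach just mentioned (run it in the same batch as the table variables above; and remember that different rows can hash to the same CHECKSUM value, so a careful implementation would re-verify the underlying columns):

--checksum-based match (quick but approximate: collisions are possible)
SELECT t1.*
FROM @t1 t1
INNER JOIN @t2 t2
    ON CHECKSUM(t1.id, t1.col) = CHECKSUM(t2.id, t2.col);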

As you can see, these operators remove a lot of coding complexity.  The performance of each of these operators varies depending on the size and complexity of the results returned by each query and the indexing in place.  I typically use these operators to quickly identify RI (referential integrity) violators and to quickly identify data that exists in one table and not another.  I hope that this tip saves you time and that you learned something new.

Happy coding!