Chapter One
Getting Started
This book is about applications. Specifically, this book is about applying the functionality of SQL Server 2005 Integration Services (SSIS) to help you envision, develop, and implement your data processing needs. The discussions throughout the book spotlight how SSIS can help you accomplish your data integration and processing requirements.
Core to the data processing that SSIS does best is extraction, transformation, and loading (ETL). Over the years, this ETL has taken on a range of different meanings, from the general perspective of moving data from somewhere to somewhere else, to the specific application of data warehousing ETL. In fact, ETL has its roots in business intelligence (BI) and data warehouse processing.
This chapter provides important background information for generalized ETL that DBAs will need, as well as basic data warehousing ETL concepts. In addition, this chapter includes a practical review of SSIS functionality and provides the foundation for building the book''s examination of applying the functionality of SSIS to help you accomplish your individual goals in data integration and processing requirements.
Choosing the Right Tool for the Job
If you have any inclination toward home remodeling, chances are you enjoy walking through the tools area of your local home improvement store. Hundreds of different tools have been manufactured that perform a variety of functions and, in some cases, some fairly esoteric uses.
Any novice handyman can attest to the adage that the right tool for the job makes the job easier. The same concept applies when it comes to handling data. There''s no doubt that, depending on the right situation, there may be a specific tool to handle such a function. Think about all the different types of data processing needs that you have across your organization: Data synchronization between systems Data extraction from ERP systems Ad hoc reporting Replication (both homogeneous and heterogeneous) PDA data synchronization Legacy system integration Vendors and partner data files integration Line of business data Customer and employee directory synchronization Data warehouse ETL processing
As you may know, when it comes to data processing, there are a lot of tools out there. Some are created for specific situations (such as folder synchronizing tools), whereas other tools are designed to perform a variety of functions for different situations. So, the traditional question often posed is which tool can best meet the business and logical requirements to perform the tasks needed?
Consider the host of tools found in the ever-evolving Microsoft toolset. You can use Transact SQL (TSQL) to hand-code a load, Host Integration Server to communicate with a heterogeneous data source, BizTalk to orchestrate messages in a transactional manner, or SSIS to load data in batches. Each of these tools plays a role in the data world.
Although there can be overlaps, each tool has a distinct focus and target purpose. When you become comfortable with a technology, there''s always the tendency to want to apply that technology beyond its intended "sweet spot" when another tool would be better for the job. You''ve no doubt heard the phrase "when you''re a hammer, everything looks like a nail." For example, C# developers may want to build an application to do something that SSIS could potentially do in an hour of development time. The challenge everyone faces entails time and capacity. There is no way everyone can be an expert across the board. Therefore, developers and administrators alike should be diligent about performing research on tools and technologies that complement each other, based on different situations.
For example, many organizations use BizTalk for a host of purposes beyond the handling of business-to-business communication and process workflow automation. These same organizations may be perplexed as to why BizTalk doesn''t scale to meet the needs of the organization''s terabyte data warehousing ETL. The easy answer is that the right tool for bulk BI processing is an ETL tool such as SSIS. In fact, as shown in Figure 1-1, SSIS provides an excellent platform for leveraging its high-performance data pipeline.
The process outlined in Figure 1-1 may be simple enough, but essentially what SSIS does is to provide the technology to make the process efficient and scalable, and provide the functionality to handle data errors.
This chapter reviews ETL concepts in more detail, and gets you started with an SSIS example. Before diving into the expert level details found in the ensuing chapters, reminding you about ETL concepts and SSIS features will help solidify the background needed before moving to the details of the SSIS application.
This book focuses on the three most common categories of SSIS usage: Data warehouse ETL Data integration SSIS administration
Before going any further, it makes sense to consider the purpose and background of each of these types of ETL.
Data Warehousing ETL
Some of you may be well-versed in data warehousing and related ETL concepts, but for those who are not, here is a high-level overview of data warehousing. Data warehousing focuses on decision support, or enabling better decision making through organized accessibility of information. As opposed to a transactional system such as a point of sale (POS), Human Resources (HR), or customer relationship management (CRM) that is designed to allow rapid transactions to capture information data, a data warehouse is tuned for reporting and analysis. In other words, instead of focusing on the entry of information, data warehousing is focused on the extraction and reporting of information to show trending, summary, and data history.
Databases designed for data warehousing are created in a structure called a dimensional model, which involves two types of tables. Dimension tables hold informational data or attributes that describe entities. Fact tables capture metrics or numeric data that describe quantities, levels, sales, or other statistics. A data warehouse may involve many dimension tables and fact tables. Figure 1-2 shows the relationships between several dimension tables and one fact table in a structure often called a star schema.
The focus of this book is not on the design of the dimension tables and fact tables, but rather on getting data into these structures from other repositories. Processing ETL for data warehousing involves extracting data from source systems or files, performing transformation logic on the data to correlate, cleanse, and consolidate, and then loading a data warehouse environment for reporting and analysis (see Figure 1-3).
For those who are already versed in ETL concepts and practice, you''ll know that when it comes to developing a data warehouse ETL system, moving from theory to practice often presents the biggest hurdle. Did you know that ETL typically takes up between 50 and 70 percent of a data warehousing project? That is quite a daunting statistic. What it means is that even though presenting the data is the end goal and the driving force for business, the largest portion of developing a data warehouse is spent not on the presentation and organization of the data, but in the behind-the-scenes processing to get the data ready.
Data Integration
You can also use SSIS to synchronize data between systems, or to replicate data. For example, you may want to create a business-to-business portal site, and you may need the site to interface with the source data on the mainframe. In this case, you may get the data delivered in nightly extracts from the mainframe and load it into your SQL Server table. Another very common ETL task that DBAs face is receiving files from File Transfer Protocol (FTP) servers (or on network shares) that must be processed and loaded into another system. This type of process involves moving files and then processing the data, which may involve de-duping, combining files, cleaning bad data, and so on.
Part of the job of integrating data may include data extraction. Data extraction is moving data out of a source, and, although it sounds easy enough, some of the challenges involve extracting only changes and also optimizing the extraction to make it scalable. Chapter 3 provides more information on data extraction for applying SSIS to data sources and extraction techniques.
In addition, if you think you have a perfect data source, think again! Chances are you will be dealing with missing data, mistyped data, NULL values, and just plain dirty data. Refer to Chapters 5 and 7 to learn how to handle real-world data situations. Data quality issues span a range of challenges, and you will need to plan your data cleansing to ensure you can accommodate your final goal of data processing.
SSIS Administration
If you are a DBA responsible for SSIS packages (whether you created them or were given responsibility), then chances are you will have the responsibility of monitoring the package execution and ensuring that the transactions are correctly implemented to keep the database in a consistent state. Some of you will also be responsible for package deployment with a team of people, and securing packages so that only the right people have the ability to access and execute packages.
Although not every environment will be upgrading from SQL Server 2000 DTS, many DBAs do face this situation. DTS adoption was broad because of its ease of use and execution. And now that SSIS is here and proven, you may need to take your packages and move them to SSIS. This is easier said than done, given the architectural changes in the products. Chapter 11 provides more information on DTS migration to help get you there without losing your mind.
Yet another aspect of SSIS is database administration. In fact, in SQL Server 2005, SSIS is also used to help manage your database environment. For example, SQL Server maintenance plans (such as database backups, index defragmentation, database consistency checking, and so on) use SSIS behind the scenes to coordinate the administration operations.
Optimizing and scaling SSIS is another common responsibility for both DBAs and developers alike. Chapter 12 targets scaling SSIS, including ways to optimize destinations and how to take advantage of SSIS functionality to make faster, more scalable packages.
SSIS Review
Many of you have practical, learn-on-the-go experience using SSIS, and are looking to this book to take your knowledge to the next level and to fill in any knowledge gaps. Others of you have a good book knowledge of SSIS and want to extend your skills. And there may be those of you (like us) who like to dive right in, skip the intro books, and go right for the expert material.
To set the stage and provide a common starting point, we will walk you through a package creation that involves many of the SSIS key concepts. It''s impossible to wrap all the SSIS functionality into this one example, but you will get a good review of the basic features. If you feel you already know how to use the basic SSIS features comfortably, feel free to skip this section. The ensuing discussions, however, assume that you know some of these basic concepts and will not walk you through these steps again. You should be familiar with some basics such as how to create a solution. This example assumes some SSIS background, so if you need a more fundamental review, see one of the starter SSIS books available such as Professional SQL Server 2005 Integration Services (Wiley Publishing, 2006).
In this package creation walk-through, you will start by developing a simple package Data Flow that pulls a limited range of data (by using package variables) from a source and adds some data transformation steps. Since SSIS includes more than just processing data, you will then be working with the control flow and creating complementary tasks and containers with Precedence Constraints to control what happens in what order in this example. The final step is executing the package.
To begin, first start by opening the Business Intelligence Development Studio (BIDS) tool and creating a new SSIS project called ExpertSSIS. Then, rename the default package that is created (Package.dtsx) to Chapter1.dtsx. Confirm that you''d like to rename the package object as well.
Creating a Connection Manager
After creating the package, you can create the first connection manager. Right-click in the Connection Manager pane at the bottom of the package editor and select New OLE DB Connection. On the Configure OLE DB Connection Manager screen, if you do not have an AdventureWorks connection in the Data Connections list, click New. Type localhost for the Server Name property, and select AdventureWorks from the Database drop-down box. Click OK to save the connection, and click OK again when you have the localhost.AdventureWorks connection selected. Right-click the newly created localhost.AdventureWorks connection manager and rename it to AdventureWorks.
Using the Control Flow
The Control Flow tab is where you perform the package''s workflow. For example, you may decide to receive a file through FTP, transform the data within the file, and then archive the file. In that example, you would have three tasks: An FTP Task A Data Flow Task A File System Task
If you need help with the configuration of these tasks, consult one of the starter books such as the Professional SQL Server 2005 Integration Services book. Configuration of these tasks is not addressed in this book. Instead, we will concentrate on how to build solutions using the tasks.
In the Control Flow tab of the example package, drag over a single Data Flow Task onto the design pane. This Data Flow Task will perform the transformation of data. Rename the task Create Product File by right-clicking the task and choosing Rename.
Select Variables from the SSIS menu, which opens the Variables window. Create three variables in the window: RowCount as an int32 data type StartDate as a datetime data type EndDate as a datetime data type
Create the default value of the StartDate to some date in 2001, and set the default value of the EndDate variable to some date in 2007.
Working in the Data Flow
When you double-click on the Data Flow Task, you are taken to the Data Flow tab. The Data Flow Task streams data from nearly any structured, semi-structured, or non-structured source to nearly any destination. In this case, you will pull data from a SQL Server source, transform the data by using a number of transformation components, and then write the result to a text file. Figure 1-4 shows the Data Flow tab in a package without the components defined yet. Notice that the toolbox contains adapters and transformations that you will be using throughout this book.
Drag over an OLE DB Source from the toolbox. Name the source Products. Double-click the source to configure it, as shown in Figure 1-5. Ensure that you''re pointing to the AdventureWorks connection manager and then change the Data access mode entry to SQL command. Enter the following command in the SQL command text window:
SELECT * FROM Production.Product WHERE SellStartDate > ? and SellStartDate < ?
The query returns all the products that are within a certain date range. The question marks represent parameter values that will be passed in through a variable. Click the Parameters button to map the question marks to the variables you''ve already created. Each question mark parameter you see in the Set Query Parameters window is ordinal (see Figure 1-6). So, the first question mark is represented by Parameter0, the second by Parameter1, and so on. Map Paramater0 to User::StartDate and Parameter1 to User::EndDate. Click OK to go back to the Data Flow tab.
Next, drag a Lookup transform onto the design pane. Connect the Product source to the Lookup transform by dragging the green arrow from the source to the transform. Name the Lookup transform Find Model Name.
(Continues...)
Excerpted from Expert SQL Server 2005 Integration Servicesby Brian Knight Erik Veerman Copyright © 2007 by Brian Knight . Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.