The following is a brief, generalized overview of a framework I developed on one of my projects for ETL processing using SQL Server Integration Services (SSIS). It is meant to provide an organized, consistent, centrally configured and managed environment, friendly to disaster recovery and auditing, in which SSIS ETL processes can be developed and executed. This SSIS framework uses a master package from which all ETL processes are initiated, a set of auxiliary packages that support the master package, a small set of tables that store ETL configuration, status, and execution log information, and a collection of data load packages that do the actual ETL processing.
The Master Package
Almost all SSIS data load ETL packages you create using this method are initiated through a master package (pictured above). The master package provides a centralized control and management process that your ETL data loads must follow. The master package is run with the name of a specific load passed in as a parameter. Each object in the master package executes an auxiliary package that performs a specific operation, passing variables into and out of it.
The Auxiliary Packages
START creates an entry in a log table and pulls load-specific configuration data from a configuration database. I reused the SSIS package configurations table to store this configuration data, but at this point I pull the data using scripts instead of package configurations. This configuration data is then passed into many of the other child packages as needed.
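The shape of this lookup can be sketched in Python. This is only an illustration: it assumes a configuration table with the same three columns as the stock SSIS package configurations table (ConfigurationFilter, PackagePath, ConfiguredValue); in the framework itself the retrieval is done with SQL scripts inside the START package.

```python
def load_configuration(rows, load_name):
    """Filter configuration rows for one load and key them by variable path.

    rows: (ConfigurationFilter, PackagePath, ConfiguredValue) tuples, as
    they would come back from the configuration table. Illustrative only.
    """
    return {
        package_path: configured_value
        for config_filter, package_path, configured_value in rows
        if config_filter == load_name
    }

# Hypothetical configuration rows for two different loads.
rows = [
    ("CustomerLoad", r"\Package.Variables[SourcePath]", r"\\fileserver\inbound"),
    ("CustomerLoad", r"\Package.Variables[TargetTable]", "dbo.Customer"),
    ("OrderLoad",    r"\Package.Variables[SourcePath]", r"\\fileserver\orders"),
]
config = load_configuration(rows, "CustomerLoad")
```

The point of keying by the load name is that a single table can hold the configuration for every load, and START only pulls the slice it needs.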
FileCheck makes sure the file exists and is readable if the load is file-based. Various operations are then performed on the file depending on the load. Typically, the file is renamed so that no automatic attempt will be made to run the load against it again without user intervention. If the file is a zip archive, the files within it are extracted.
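Outside of SSIS, the same sequence of checks can be sketched in Python; the `.processing` suffix and the function name are illustrative, not part of the framework.

```python
import os
import zipfile

def check_and_prepare(path, processing_suffix=".processing"):
    """Verify the source file exists and is readable, rename it so the load
    will not automatically be re-run against it, and extract it if it is a
    zip archive. Returns the list of files to load. Names are illustrative.
    """
    if not (os.path.isfile(path) and os.access(path, os.R_OK)):
        raise FileNotFoundError(f"source file missing or unreadable: {path}")
    renamed = path + processing_suffix
    os.rename(path, renamed)  # prevents an automatic re-run against this file
    if zipfile.is_zipfile(renamed):
        extract_dir = renamed + ".extracted"
        with zipfile.ZipFile(renamed) as zf:
            zf.extractall(extract_dir)
            names = zf.namelist()
        return [os.path.join(extract_dir, n) for n in names]
    return [renamed]
```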
LoadCheck verifies whether this ETL load should be run. It checks the status of the previous load attempt in a high-level ETLStatus table, determines the time period the load is meant to run for, checks whether data already exists for that period, and so on. Using this information, LoadCheck decides whether the load should continue, and the ETLStatus table is then updated based on the decision made.
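A minimal sketch of that decision, with illustrative status values and an in-memory stand-in for the ETLStatus lookup:

```python
def should_run(previous_status, target_period, loaded_periods):
    """Decide whether this load should continue.

    previous_status and the period values are illustrative; in the framework
    they come from the ETLStatus table.
    """
    if previous_status == "Running":
        return False, "previous attempt still running"
    if target_period in loaded_periods:
        return False, f"data already loaded for {target_period}"
    return True, "ok to run"

decision, reason = should_run("Succeeded", "2013-06", {"2013-04", "2013-05"})
```

Whatever the outcome, both branches write back to ETLStatus, so the decision and its reason are always visible after the fact.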
LoadDatabase performs the actual SSIS ETL data load processing. More details of how a data load package can be set up can be found below.
FileArchive performs end-of-load file operations. Similar in function to the FileCheck step, this step often includes the cleanup of temporary or unzipped files. FileArchive also usually zips and timestamps the source files and moves them to an archive location.
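The zip-timestamp-move sequence might look like this in Python; the archive naming convention here is an assumption for illustration, not the framework's actual one.

```python
import os
import time
import zipfile

def archive_source(path, archive_dir, when=None):
    """Zip the processed source file with a timestamp in its name and move
    it to the archive location, removing the original. Illustrative names.
    """
    stamp = time.strftime("%Y%m%d_%H%M%S", when or time.localtime())
    base = os.path.basename(path)
    target = os.path.join(archive_dir, f"{base}.{stamp}.zip")
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(path, arcname=base)
    os.remove(path)  # clean up the original once it is safely archived
    return target
```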
END updates the ETLStatus table to indicate the outcome of the ETL load, completes the log entry for the load in the database, and emails a log journal to interested parties.
A Data Load Package
The following is an example of how a data load package can be structured. Different types of data loads naturally have different requirements, and so can require different control and data manipulation logic. A fact table, for example, will likely involve all or mostly inserts, whereas a dimension table may need lots of updating. A type 2 slowly changing dimension requires more work to load than a simple dimension table. But even though there are variations in what you will need to include to manipulate your data and update your tables correctly, you can design your SSIS data load packages to follow a similar operational pattern.
The control flow is usually kept simple. First, a script component adds a note to the log that the data is about to be loaded. You may also want to add an entry to an audit table at this point with the load step and the start time. If you have a package that calls a bunch of child packages to do the actual work, it’s nice to be able to look up individual stats for the child packages in an audit table for debugging or performance monitoring.
There may be logic at this point that decides whether existing data in the table should be deleted before the load is performed, whether the load should be done using one data flow method or another, and so on. Then the data flow is entered. If an error occurs in the data flow, it is logged. The results captured within the data flow (record counts, error conditions, etc.) are then checked to verify that everything ran as expected, and information about the results is logged.
A debug component is added as the last component of the control flow. I set this debug component to do nothing when running in a scheduled production environment, but when running interactively in Visual Studio it pops up a notification window at the end of the load displaying all of the information that the package logged.
The first object of the data flow is the data source. This gets its data from the connection manager that has been set up for the data source. Usually the connection manager collection contains one connection to the database we are saving the data to and one or more connections for the source data. In almost all cases, when I set up a connection manager based on a file source I read in all of the fields as Unicode text. This prevents the data source from generating errors due to field type incompatibilities. Reading from other types of databases, such as Oracle, can also introduce data type differences that are often best handled by first pulling the data in as Unicode.
Then we continue along the data flow.
A row number is added for each incoming record. This aids in the analysis and correction of rows that error. Some ETL tools automatically keep track of row numbers but SSIS does not, and especially for file-based data sources (or for ordered database sources) it can be helpful to have this information. Row count objects are spread throughout the data flow, keeping track of how many records pass through different stages of the load. These counts are used later for load statistics and to verify that all incoming rows are properly accounted for.
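In SSIS this is done with a script component in the data flow; the equivalent logic is trivial to sketch in Python (the `RowNumber` column name is illustrative):

```python
def number_rows(rows, start=1):
    """Attach a source row number to each incoming record. SSIS does not
    track this automatically, so a script component adds it; the number is
    also useful later for reconciling row counts."""
    return [dict(row, RowNumber=i) for i, row in enumerate(rows, start)]

numbered = number_rows([{"Name": "a"}, {"Name": "b"}])
```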
The derived column object is used to convert the incoming columns (which, as I mentioned, are all read in as Unicode text) into the correct data types required for loading into the database. Simple data transformations are also done at this step. If data for a column cannot be properly typed or transformed, the data row is directed through the error leg of the data flow.
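The convert-or-divert behavior can be sketched like this; the column names and target types are illustrative:

```python
from datetime import datetime

# Hypothetical target types for the all-text incoming columns.
COLUMN_TYPES = {
    "CustomerId": int,
    "Amount": float,
    "OrderDate": lambda s: datetime.strptime(s, "%Y-%m-%d"),
}

def derive_columns(rows):
    """Convert text columns to their target types; rows that fail conversion
    go to the error leg instead of failing the whole load."""
    good, errors = [], []
    for row in rows:
        try:
            good.append({col: conv(row[col]) for col, conv in COLUMN_TYPES.items()})
        except (ValueError, KeyError) as exc:
            errors.append((row, str(exc)))  # error leg of the data flow
    return good, errors

good, errors = derive_columns([
    {"CustomerId": "1", "Amount": "9.50", "OrderDate": "2013-06-01"},
    {"CustomerId": "x", "Amount": "2.00", "OrderDate": "2013-06-01"},
])
```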
The advanced derived columns object is a place for column creation and transformation operations that cannot easily be accomplished in the simple derived column object. It is often necessary or much easier to perform transformation logic using C# than with the simple set of operations that SSIS natively provides. It is usually also easier to read and understand code here than it is trying to decode a long derived column one-liner full of logic, type casts, and transformation functions.
Next some additional control logic or data transformation work is done. As shown here, you might have a conditional split that filters out rows that you do not care about that match or do not match a certain condition. You might have a lookup that filters out rows that already exist in the destination, or that pulls in additional columns to the data flow from other tables. You might have a second advanced derived columns object that performs additional logic and transformation work on the data after new columns have been pulled in from a lookup.
After this, the SSIS insert and/or update operations are performed. A special waterfall insert model is illustrated here that is good for loads where mostly inserts are expected, as with fact tables. Attempts to fast (batch) load data are first made in large record collections (50,000 rows, for example). If there are no insert failures, all records get loaded in fast batches. If insert failures occur, fast inserts are attempted on smaller collections, and finally at the single-record level. This method increases the speed at which records are inserted while still allowing records that can’t be inserted to move to the next stage of the data flow.
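The waterfall logic itself is simple to state; here is a Python sketch, with `insert_batch` standing in for a fast-load destination that fails the whole batch when any row in it is bad. The batch sizes are parameters, not fixed values.

```python
def waterfall_insert(rows, insert_batch, sizes=(50000, 1000, 1)):
    """Try fast inserts in large batches; when a batch fails, retry it at the
    next smaller size, down to single rows. Only rows that fail individually
    are returned (they continue to the error leg of the data flow)."""
    failed = []

    def attempt(chunk, level):
        size = sizes[level]
        for i in range(0, len(chunk), size):
            piece = chunk[i:i + size]
            try:
                insert_batch(piece)
            except Exception:
                if level + 1 < len(sizes):
                    attempt(piece, level + 1)  # retry at the next smaller size
                else:
                    failed.extend(piece)       # failed even as a single row

    attempt(list(rows), 0)
    return failed
```

One good batch means one round trip for up to 50,000 rows; only the batches containing bad rows pay the price of being retried at finer granularity.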
Finally, all rows that produced errors are merged together, and a capture-error-rows component enumerates the details of those rows. SSIS generally only provides integer codes that identify errors, so this component uses methods to translate those codes into human-readable messages: the name of the column that caused the error to be generated, a description of the error, and so on.
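In a script component this works against the error output's error code and error column ID, typically using `ComponentMetaData.GetErrorDescription` for the message text. A Python sketch with illustrative lookup tables (the specific code and ID values are made up for the example):

```python
# Hypothetical lookup tables; in SSIS the description comes from
# ComponentMetaData.GetErrorDescription and the column name from the
# component's input column metadata (keyed by lineage/column ID).
ERROR_DESCRIPTIONS = {-1071607675: "The data value cannot be converted"}
COLUMN_NAMES = {137: "OrderDate"}

def describe_error(error_code, error_column_id):
    """Turn SSIS integer error info into a human-readable message."""
    desc = ERROR_DESCRIPTIONS.get(error_code, f"unknown error {error_code}")
    col = COLUMN_NAMES.get(error_column_id, f"column id {error_column_id}")
    return f"{col}: {desc}"
```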
ETL packages can use a variety of SSIS variables. Some provide the package with configuration information, like database connection strings, file locations, and the period the data should be loaded into. With others you may record results of the package execution, like row counts and status and error messages. These variables can be manually entered, populated from package configurations, or passed between parent and child packages.
SSIS Package Configurations
Most packages can benefit from the use of package configurations. SSIS package configurations are one way of utilizing dynamic configuration data at package runtime.
This example uses two package configurations. Configuration 1 gets the connection string for your main database from an environment variable on your system and saves it into a SSIS variable for future use by your connection managers. This is a good way to start out with your configurations because this single variable is all you will need to change between your various development, staging, and/or production environments to point your ETL packages in the right direction. Configuration 2 (using the connection string from Configuration 1) populates several other SSIS variables that you commonly use in most of your loads (UNC path of SSIS packages, database designation “Prod, Dev…”, etc.) with data from a package configuration table. After that, you might use some additional configurations that pull data specific to the load being run.
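The two-configuration pattern can be sketched as follows; the environment variable name and the returned keys are illustrative, and `fetch_common_config` stands in for the query against the package configuration table.

```python
import os

def resolve_configuration(fetch_common_config, env_var="ETL_MAIN_CONN"):
    """Configuration 1: the one per-machine value (the main connection
    string) comes from an environment variable. Configuration 2: that
    connection is then used to pull the rest of the common settings from
    the configuration table. Names are illustrative."""
    conn = os.environ[env_var]
    config = {"MainConnectionString": conn}
    config.update(fetch_common_config(conn))
    return config
```

Because the environment variable is the only thing that differs between machines, pointing a package at development, staging, or production is a matter of one setting per environment.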
There you have it, a brief overview of a framework for SSIS ETL development. This was built from the ground up using SSIS and SQL Server to replace a legacy ETL system, and was a solid performer in a production environment. Like what you see? Have any questions or comments on the general methods outlined here? Feel free to ask.