
Saturday, April 18, 2015

Dynamic Data Load - SSIS

I was recently tasked with loading tab-delimited flat files into a staging table in a SQL Server database. At first glance the requirement looks straightforward, doesn't it? But there is a catch: the format of the file was not fixed, the number of columns could vary from file to file, and a file could carry more than 200 columns at a time. The package must not fail while processing these files and has to load every file successfully at the end of the process, so the solution had to be a dynamic data-load approach rather than one that relies on data mappings fixed at design time.

I searched for this scenario on the web and came across a number of posts implementing similar logic in different ways.

To start my development I referred to the information at http://www.citagus.com/citagus/blog/importing-from-flat-file-with-dynamic-columns/, but for some reason it didn't work out in my environment, especially while looping through and loading files of varying formats, given that I was looking at files with more than 200 columns. So I enhanced and fine-tuned the logic a little and came up with the following solution. Below is a quick summary of the steps I followed; it has worked well for me.



 Fig 1: Snapshot of the SSIS package that I designed to achieve this.

SCT_Generate Staging Table - A Script Task (I used VB.NET, but one can use C# as well) customized to generate a dynamic SQL statement that creates the staging table from the header information retrieved from the incoming file. The dynamically generated SQL statement is loaded into an SSIS string variable; I used User::strSQLQuery in my case. Please refer to Exhibit 1 for the source code and additional comments.

SQL_Drop Staging Table - An Execute SQL Task that drops the staging table (dbo.StagingTable_BulkLoad) from the database. We need this step because dbo.StagingTable_BulkLoad is dropped and re-created on every run based on the dynamic header information available in the flat file.
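
For reference, the statement behind this Execute SQL Task can be as simple as the sketch below, using the same table name as the rest of the package:

-- Drop the staging table only if it already exists
IF OBJECT_ID(N'dbo.StagingTable_BulkLoad', N'U') IS NOT NULL
    DROP TABLE dbo.StagingTable_BulkLoad;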

SQL_Generate Staging Table - An Execute SQL Task that re-creates the dbo.StagingTable_BulkLoad table from the dynamic column information held in User::strSQLQuery. Please refer to Fig 2 for more details.

Bulk_Load Staging Table - A Bulk Insert Task that bulk-loads the tab-delimited flat file into the SQL table. Importantly, we cannot use a Data Flow Task here because the format of the incoming file is not known at design time and can vary from file to file, so we need flexibility when loading the file; the bulk-load feature provides that flexibility. Please refer to Fig 3 for more details.
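
For readers who like to see it in T-SQL terms, the Bulk Insert Task is roughly equivalent to a BULK INSERT statement along the following lines; the file path shown is only a placeholder for the file resolved at run time from the package variables:

-- Rough T-SQL equivalent of the Bulk Insert Task (file path is a placeholder)
BULK INSERT dbo.StagingTable_BulkLoad
FROM 'C:\ProcessingDirectory\SampleFile.txt'
WITH
(
    FIELDTERMINATOR = '\t',   -- tab-delimited columns
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2       -- skip the header row already used to build the table
);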


    Public Sub Main()

        Try
            ' Build the fully qualified file name from the package variables
            Dim strFileName As String = Dts.Variables("User::ProcessingDirectory").Value.ToString() _
                                + "\" + Dts.Variables("User::FileName").Value.ToString()

            ' StreamReader requires "Imports System.IO" at the top of the script
            Dim objStrReader As New StreamReader(strFileName)

            ' Read the header row only
            Dim strHeaderInfo As String = objStrReader.ReadLine()

            ' Dispose the StreamReader to release the file as we are interested only in the header information
            objStrReader.Dispose()

            ' Split the tab-delimited header information into an array of column names
            Dim ColumnHeaders As String() = strHeaderInfo.Split(vbTab)
            Dim strSQL As String = ""                  ' Holds the generated column definitions
            Dim ColumnHeaderList As New ArrayList      ' Tracks column names already used

            For Each ColumnName As String In ColumnHeaders

                ' Handle duplicate columns: if the name already exists, append _1, _2, ... to differentiate
                Dim UniqueName As String = ColumnName
                Dim Count As Integer = 1
                While ColumnHeaderList.Contains(UniqueName.ToUpper())
                    UniqueName = ColumnName + "_" + Count.ToString()
                    Count = Count + 1
                End While

                ColumnHeaderList.Add(UniqueName.ToUpper())

                ' Declare every column as VARCHAR(max) to avoid data overflow errors during the bulk load
                If strSQL <> "" Then
                    strSQL = strSQL & ", [" & UniqueName & "] VARCHAR(max) "
                Else
                    strSQL = "[" & UniqueName & "] VARCHAR(max) "
                End If

            Next

            ' Hand the generated CREATE TABLE statement back to the package variable
            Dim strSQLQuery As String
            strSQLQuery = "CREATE TABLE dbo.StagingTable_BulkLoad (" + strSQL + ")"
            Dts.Variables("User::strSQLQuery").Value = strSQLQuery
            Dts.TaskResult = ScriptResults.Success

        Catch ex As Exception
            Dts.TaskResult = ScriptResults.Failure
        End Try

    End Sub

Exhibit 1 – VB.NET code block for “SCT_Generate Staging Table” task to generate dynamic SQL statement based on header information retrieved from the sample file.
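
To make the output concrete: for a hypothetical header row containing EmployeeID, EmployeeName and Department, the script task would populate User::strSQLQuery with a statement along these lines (formatted here for readability):

CREATE TABLE dbo.StagingTable_BulkLoad
(
    [EmployeeID] VARCHAR(max),
    [EmployeeName] VARCHAR(max),
    [Department] VARCHAR(max)
)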



Fig 2: Execute SQL Task (SQL_Generate Staging Table) to generate the dbo.StagingTable_BulkLoad table based on the dynamic header information



Fig 3: Bulk Insert Task (Bulk_Load Staging Data) to bulk load data from Flat File to SQL table.

Apart from these settings, you will also need a Flat File connection manager and an OLE DB connection manager to connect to the flat file source and the back-end SQL database at run time.

Please give this a try and let me know how it goes. I have been using this approach in my environment for a while now and it has worked well for me.

Thanks,
Rishi Saxena

Saturday, July 26, 2014

Deleting Millions of Records in SQL

How many of you have been tasked with purging or archiving records from one of the biggest tables in your data warehouse solution? The scenario is quite common and I am pretty sure everyone gets to work on a task like this at some point in their career.
There is more than one way of doing this, and each has its own pluses and minuses, so I will try to cover them along with why and why not to use each.

1. DELETE Statement - The delete statement can be as simple as "DELETE FROM [Table] WHERE [Condition]". But wait a minute: before firing this statement to delete millions of records out of a billion, bear in mind that a delete is a fully logged operation, which means every deleted row is written to the transaction log to support rollback in case the batch fails. If you run the delete without any consideration, it will keep growing the transaction log and eventually fail when the log runs out of disk space. So running a plain DELETE is not a good option. Even if space is not a constraint, trust me, it is not good design to delete millions of rows with a single plain WHERE clause.

2. TRUNCATE Statement - Truncate operations are only minimally logged, which means they put very little pressure on the transaction log and are the fastest way to flush records out of a table. But you cannot use a WHERE clause in a truncate statement, so it is all or nothing. For getting rid of only part of your biggest table, truncate is not a viable option.

3. DELETE in Batches - The next best option is to use some batching logic to loop through the table and delete records in chunks of, say, 0.5 or 1 million rows. Refer to the following syntax to delete in batches of 1 million, or whatever size suits your environment best. The advantage of deleting data in batches is that the transaction log only has to hold one batch at a time and the space is released as each batch commits, thereby reusing log space and putting minimal overhead on storage.

DECLARE @BatchSize BIGINT
SET @BatchSize = 1000000      -- e.g. 1 million rows; tune to suit your environment

WHILE @BatchSize <> 0
BEGIN
    DELETE TOP (@BatchSize) FROM [TableName] WHERE [Condition]
    SET @BatchSize = @@ROWCOUNT
END

The only problem with deleting records in batches is the amount of time it takes to complete the operation. If timing is a concern and you only have a fixed window to flush out the records, consider this option carefully.

4. DELETE in Batches in a Phased Manner - This is another option that some teams have followed with great success. If the data being deleted gives you the flexibility to remove it in phases, the delete operation can be spread over multiple releases. Batched deletes coupled with a phased release (spanning multiple days or weekends) can be a good option, provided the deleted data doesn't interfere with the system and doesn't leave behind orphaned records that are hard to track.
If there is a dependency that all the data has to be removed in a single release, with minimal impact on resources and within an agreed time frame, the next option would be:

5. Create & Swap Approach - In this approach a table identical to the one being swapped out is created in the warehouse without any constraints or indexes. The data that is supposed to be retained is pushed into this heap, leaving behind the data that is no longer required. Once the new table is ready, the tables are swapped: indexes, constraints and so on are dropped from the original table and put back on the new one, followed by renaming the tables.
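
A minimal sketch of the create-and-swap pattern is shown below; the table names, the retention rule and the date column are purely illustrative and would need to be replaced with your own:

-- 1. Create an identical heap (no indexes or constraints) and copy only the rows to be retained
SELECT *
INTO   dbo.BigTable_New                    -- hypothetical table names
FROM   dbo.BigTable
WHERE  TransactionDate >= '2014-01-01'     -- illustrative retention rule

-- 2. Once the retained data is verified, swap the tables
EXEC sp_rename 'dbo.BigTable', 'BigTable_Old'
EXEC sp_rename 'dbo.BigTable_New', 'BigTable'

-- 3. Re-create indexes, constraints and permissions on dbo.BigTable, then drop dbo.BigTable_Old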

Whichever way you go, it will take good effort to delete millions of records from a table that is close to a billion rows, so the strategy has to be carefully planned before you hit the DELETE button.

With this, I will now sign off on this topic. All these options come from my personal experience working on warehouse solutions, but please feel free to share your feedback and comments and I will be more than happy to amend this posting further.

Thanks,
Rishi Saxena

Thursday, July 17, 2014

Data Testing & Automated Unit Test Framework - ETL Process

Data Testing is yet another important but mostly forgotten component of ETL development life cycle. A good ETL solution is incomplete without a good unit test framework.

The complexity of a project increases as more and more business rules are added to a package and more packages become part of the solution. This becomes even more difficult to manage if you are working in a multi-team environment and a code check-in from one developer starts breaking business logic for another. This is where a well-designed test framework comes to everybody's rescue.

A typical ETL package goes through a number of test scenarios, for example:
  • Mapping Data Sheet Validation
  • Table Structure Validation
  • Data Validation
  • Data Consistency Validation
  • Data Transformation Validation
  • Null Check Validation
  • Record Count Validation
  • Duplicate Record Validation

The underlying logic behind a unit test framework is this: if test cases are run on a fixed set of data and under fixed environment settings, you can expect consistent results. That said, it is essential to design the sample data so that it covers every aspect of the business rules, including data movements, transformations and data failures. During ongoing development cycles, if a new change breaks the underlying business logic, the test cases will flag the failures immediately, prompting corrective action from the developer.

All you need is a good understanding of your business flows, some sample files and a good understanding of SQL. If you have that information handy, you can be up and running with your own unit test framework in no time.

The steps that generally make up a unit test framework environment are:
  • Fix the sample data against which the test cases will be designed - sample data big enough to cover the majority of test scenarios.
  • Manipulate some of the sample data so that it exercises the different paths of the packages being designed.
  • Populate the master data/supporting tables with fixed data that will be used on an ongoing basis. Design a pre-requisite script that verifies these master tables and truncates and reloads them if the data in any of them is not as expected.
  • Design test-case SQL scripts against the sample data covering the different data verifications. The test cases should be designed carefully so that they cover the majority of scenarios for the entity being tested.
  • Schedule a SQL job that executes these scripts sequentially. Configure the job to write the outcome of each step to a log file that can be used for analysis and diagnostic purposes as required. And that's it.

When new code gets published, any failing test case will flag the discrepancy immediately. Some failures might be expected because the base functionality the cases were originally designed around has changed; if so, it is advisable to update the test cases to accommodate the new behaviour. If the failures were unexpected, perform some diagnostic checks to verify the reason and implement a fix as applicable.

In this way developers save a lot of time and effort testing their own work, and the framework also assists with regression testing as the product grows over time.

Sample master pre-requisite script: if the data in the supporting master tables doesn't match the data against which the test framework was originally designed, truncate those tables and reload them with the master data.
A similar code block applies to all the other supporting tables.
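
A minimal sketch of such a pre-requisite check is shown below; the master table, its expected contents and the baseline row count are hypothetical and are there only to illustrate the pattern:

-- Verify one master table and reload it if it has drifted from the baseline
IF (SELECT COUNT(*) FROM dbo.MasterCountry) <> 2    -- expected baseline row count for this sketch
BEGIN
    TRUNCATE TABLE dbo.MasterCountry

    INSERT INTO dbo.MasterCountry (CountryCode, CountryName)
    VALUES ('IN', 'India'),
           ('US', 'United States')
END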


Sample test case scripts – these individual SQL scripts comprise a collection of test cases (both record count verification and data verification) to check whether the test scenarios still hold good or whether any exceptions were introduced by a code change.
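
As a rough illustration, two such test cases might look like the sketch below; the table name, the expected count and the sample record are assumptions made purely for illustration:

-- TC01: record count verification against the fixed sample data
IF (SELECT COUNT(*) FROM stg.Employee) = 100
    PRINT 'TC01 - Record Count Validation : PASSED'
ELSE
    PRINT 'TC01 - Record Count Validation : FAILED'

-- TC02: data verification against a known sample record
IF EXISTS (SELECT 1 FROM stg.Employee WHERE EmployeeID = 1001 AND EmployeeName = 'John Smith')
    PRINT 'TC02 - Data Validation : PASSED'
ELSE
    PRINT 'TC02 - Data Validation : FAILED'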


Snapshot from Unit Test Framework – SQL Job:


In the figure above:

Step 1 sets up the master data in the supporting tables and truncates the warehouse tables to provide a clean slate before the packages load data.
Step 2 executes “Package 1” to perform its business processing; this could be downloading files from an FTP site or reading sample files from a shared directory and putting them in the staging area.
Step 3 runs the test cases against the data populated by “Package 1”. The results can be flushed to a log file to show how many test cases succeeded or failed.

Following the same logic, Steps 2 and 3 can be repeated for the other packages in the solution to unit test their business rules. Segregating the packages under different job steps provides the opportunity to run test cases against the snapshot of data available after each particular package run.

Configuration window to direct the output of the job to a log file


Once the job completes, its output is published to a log file for analysis, to identify whether new functionality broke any old test cases, or simply for general health checks.

With this, I will now sign off on this topic. Please do share your feedback & comments & I will be more than happy to amend this posting further. 

Thanks,
Rishi Saxena

Thursday, June 26, 2014

Decoding Data Mapping Sheets - ETL Process

An ETL (Extract Transform Load) process is all about moving a variety of data from Source System to the destination Data Warehouse System by taking data through a number of extraction, transformation, data cleansing & data validation processes.

Just imagine how much easier it gets for an ETL developer if he gets a chance to visualize all the transformations and business rules up front in an easy-to-interpret format. This is where mapping sheets come into the picture.

A carefully designed mapping sheet can save a lot of pain up front, as handling mapping information gets increasingly difficult as the application grows over time. The only downside is that it takes good effort to create the sheets and then keep them up to date. But, trust me, the rewards of using them easily outweigh the effort of maintaining one as the system grows.
Every data migration project uses mapping sheets in one form or another, but the one I have used most often, and which has worked exceptionally well for me, is what I am detailing here.

To start with, a sample mapping sheet for any given entity would look something like the one represented in Figure 1. The sheet can be extended further by using multiple sheets in the Excel workbook to represent the other entities of the system.

Fig 1 – Sample Mapping Sheet structure

Generally, the extreme left of the sheet represents data from the source system or the staging area. The middle layer represents data validation, data verification, data cleansing and business rule validation, with a number of data-centric transformation rules in place; there can be more than one transformation layer in the centre depending upon how complex the ETL process is. The extreme right mostly represents the data warehouse system.

In Figure 2 below, I have taken the example of an Employee table in the staging area to show how an actual table structure from the database gets represented in a mapping sheet.

Fig 2 – How a database table structure is represented in a Mapping Sheet Document

Looking carefully at the snapshot in Figure 2, it provides almost all the information related to a table in the database: the schema name (stg), the name of the table [Employee], the column names, their data types and whether they allow NULL values. For example, EmployeeID is INT and NOT NULL, whereas EmployeeName is a varchar(255) NULL column.

Along similar lines, mapping sheets can be extended further to represent the structure of the given tables across the different layers. Figure 3 below helps you visualize how a field gets mapped, validated and transformed through the different layers of the ETL process while being migrated from the source system to the ODS to the DW.

Fig 3 – Transformation rules at the table/field level

Another efficient use of a mapping sheet is to document the business rules in layman's terms right next to the respective fields. Please refer to Figure 4 below. The figure shows only a select few, very basic transformation rules; the rules represented here are for illustrative purposes only, and real transformations can vary considerably depending upon project requirements. The mapping sheet only works as a placeholder for the transformation information.

Some sample business rules could be:
1.  Use database defaults to set the CreatedOn and ModifiedOn fields to the current date and time.
2.  Use a SQL CASE statement to set a field in the warehouse to an integer value depending upon the text information coming in from the source feed (see the sketch after Figure 4 below).
3.  Use a placeholder to store I, U, D flags in the ODS layer to drive the respective actions in the warehouse for a given entity (see the sketch at the end of this section).
4.  Other transformations would be along the lines shown in Figure 4.

Figure 4 – Sample Transformation Rules :
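
To make rule 2 from the list above concrete, such a transformation typically reduces to a simple CASE expression; the field name and the code values below are purely illustrative:

-- Illustrative only: map free-text status from the source feed to an integer code for the warehouse
SELECT CASE EmploymentStatus
           WHEN 'Active'     THEN 1
           WHEN 'On Leave'   THEN 2
           WHEN 'Terminated' THEN 3
           ELSE 0                      -- unknown or unmapped values
       END AS EmploymentStatusCode
FROM   ods.Employee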

Some of the key points that must be considered while designing a mapping information document:
1. Staging Layer - Design the staging area with minimal constraints/indexes and with no, or minimal, data integrity checks in place. The priority for the extraction process must be to load data from the source system as quickly as possible with minimal data leakage. Ideally, the data types in this layer should be minimally restrictive so that the full data passes through to the staging area without any data loss, for example by allowing NULLs and sizing columns big enough to hold the data coming from the feeding source system.

2. Operational Data Store Layer - In this layer most of the business rules are defined and the data types are generally tightly coupled with the data types in the warehouse layer. Most of the data transformations, error handling and data filtering are done here. An ideal ODS layer should also maintain audit trail information to keep track of the I, U, D operations against the data warehouse. In this example, the “Action” field in the ods.Employee table can be used to maintain the current state of a record throughout the life cycle of the ETL process.

3. Data Warehouse Layer - This layer contains only the current version of the data. Records get inserted, updated and deleted in the data warehouse depending upon the incoming “Action” field from the ODS source. Normally only a select few transformations are performed in this layer, the ones that are specific to data already available in the data warehouse.
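
As a rough sketch of point 3 above, the “Action” field carried in ods.Employee can drive the warehouse load as follows; dw.Employee and the column list are assumptions used for illustration only:

-- Insert new records flagged in the ODS
INSERT INTO dw.Employee (EmployeeID, EmployeeName)
SELECT o.EmployeeID, o.EmployeeName
FROM   ods.Employee o
WHERE  o.[Action] = 'I'

-- Update existing records flagged in the ODS
UPDATE d
SET    d.EmployeeName = o.EmployeeName
FROM   dw.Employee d
JOIN   ods.Employee o ON o.EmployeeID = d.EmployeeID
WHERE  o.[Action] = 'U'

-- Delete records flagged in the ODS
DELETE d
FROM   dw.Employee d
JOIN   ods.Employee o ON o.EmployeeID = d.EmployeeID
WHERE  o.[Action] = 'D'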

With this, I will now sign off on this topic. I have tried to keep this information in its most basic and simplest form, so I hope it will come in handy in your respective projects. Please do share your inputs, feedback and comments and I will be more than happy to amend/improve this posting further.

Thanks,

Rishi Saxena

Thursday, June 12, 2014

Database Design Consideration

So, this is my second blog post, detailing various database design considerations that one should follow when designing a database. This is just a high-level overview of the guidelines I normally follow. The list is not exhaustive, so there will be a few more considerations here and there depending upon the people involved and the environment in which the database is designed. What this post doesn't include is any in-depth SQL syntax or code walk-through, as the sole intention behind this topic is to share how the different bits and pieces fit together from a database development perspective.

Database design is one of the most important aspects of any application development life cycle. Performance issues caused by poor design and technical debt are not always visible while the system is small and in its early production days, but as the system grows they become more apparent and increasingly difficult to manage. It takes far more effort to improve the performance of a live application than to identify and fix issues during the early phases of the development cycle. It is always recommended to spend quality time (at least a couple of iterative cycles) thinking about, discussing and designing the database architecture up front rather than leaving it with a “let's see how it performs” attitude.

OLTP/OLAP Systems & Normalization : 

OLTP systems, also known as transactional systems, contain day-to-day data processing information. These systems are built on relational database modelling techniques with optimal use of normalization for faster insert/update/delete (I/U/D) operations. In these systems, maintaining data integrity and consistency with minimum redundancy takes precedence, so the data is carefully normalized to make this task easier. Having said that, the tables still have to be normalized only to an optimal degree, because an overly normalized database has its own performance bottlenecks when performing I/U/D operations.

Online Analytical Processing (OLAP) applications are designed for reporting and analysis purposes and are built using dimensional modelling techniques. Here the preference is to get data back as quickly as possible in the required format, even at the cost of some redundant data and extra storage space. So some degree of de-normalization is commonly observed in OLAP systems, which means fewer joins among tables and faster retrievals.

The data between OLAP and OLTP systems is not normally in sync and usually lags behind by a few hours, or even a day in some cases. Automated jobs typically run in the background to sync the OLTP and OLAP systems overnight, as these systems are designed to work that way.

Based on the potential usage of the system, careful consideration is required to strike the right balance in the level of normalization.


Maintaining Data Integrity: 

There are certain business rules that can be designed and enforced at the database level, thereby providing some flexibility to not replicate them in the interfacing application or the feeder system. For example: log the current date and time as LastModifiedDate, log the current logged-in user's identity as LastModifiedBy, or require the age of an employee to be between 21 and 60, just to name a few. All these rules at the database level help maintain data integrity. Primary keys, foreign keys, check constraints and default constraints are some of the out-of-the-box features that can be used to maintain data integrity.

Primary Keys:
A primary key uniquely identifies a record in a table. Spend good time choosing the primary key for a table, giving future expansion plans due consideration; once the primary key is decided and the database architecture is designed around it, it takes great effort to change it to something else later. Apart from this, a primary key also carries an index that speeds up search requests, so choosing a key that doesn't change too often is another criterion for selecting the right one.

Foreign Keys:
Another aspect of data integrity is referential integrity, and foreign key constraints come in handy to enforce it. Putting foreign key constraints in place brings quality control and restricts data entry or deletion at the database level if the operation would result in orphaned and/or unaccounted-for records.
Consider a billing system: nobody wants to look at a bill or an order that doesn't have a customer associated with it. Proper use of foreign key constraints, as per the business requirement, takes care of scenarios like this without many add-on validations.

Default Constraints:
It is always advisable to put a default constraint on a database field if it can derive some business value from the default, instead of making the field nullable and storing NULL values. This also takes the burden off the feeding system, which no longer has to worry about keying in information for a NOT NULL field for which the information is not available at that moment.
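
The sketch below pulls together the integrity features discussed above (primary key, foreign key, check and default constraints); the table, the referenced dbo.Department table and the column list are assumptions used only for illustration:

CREATE TABLE dbo.Employee
(
    EmployeeID     INT      NOT NULL CONSTRAINT PK_Employee PRIMARY KEY,
    DepartmentID   INT      NOT NULL
        CONSTRAINT FK_Employee_Department REFERENCES dbo.Department (DepartmentID),
    Age            INT      NOT NULL CONSTRAINT CK_Employee_Age CHECK (Age BETWEEN 21 AND 60),
    LastModifiedBy SYSNAME  NOT NULL CONSTRAINT DF_Employee_ModifiedBy DEFAULT (SUSER_SNAME()),
    LastModifiedOn DATETIME NOT NULL CONSTRAINT DF_Employee_ModifiedOn DEFAULT (GETDATE())
)

Note that each constraint is given an explicit, easily identifiable name, which also ties in with the naming convention point discussed later.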

Optimal Use of Data Types :

The most important consideration of all is choosing the right data type. Remember, every data type consumes storage space, and indexes take time to rearrange data with every I/U/D operation, so choosing a data type of the right size makes a real difference. Imagine how much difference every byte saved by choosing the right data type makes when you are dealing with millions and millions of records.

Considering future expansion plans is another criterion that should be taken care of while deciding on data types. Choosing a data type that works at present but could potentially fail in the future is not good decision making. This is where data profiling comes into the picture; it has to be done carefully to analyse how the data is being used currently and how it may be used in the future.
For example, choosing the int data type to store 0s and 1s, or the datetime data type for storing a date of birth or joining date, is definitely not a good design approach; it would be wiser to use bit and date respectively.


Indexing Decisions :  

Indexes do speed up data access, as they cut down the number of records that need to be scanned to return the requested data, but they have a negative impact on insert, update and delete operations.

Creating an index is a trade-off: does the performance gain observed from a few indexes outweigh the extra storage requirement and the slowed-down I/U/D operations?

There is no golden rule for the number of indexes a table should have, and it also varies with whether you are designing an OLAP or an OLTP system, but trying out different combinations is the best way to figure out the optimum level of indexing for your tables.
For primary keys, Microsoft recommends using an integer, because SQL Server is optimized for indexing integer data types. Similarly, when joining multiple tables, it is advisable to create non-clustered indexes on the foreign key columns to enhance performance.
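
Continuing the illustrative dbo.Employee sketch from earlier, a non-clustered index on the foreign key column used in joins could look like this:

CREATE NONCLUSTERED INDEX IX_Employee_DepartmentID
    ON dbo.Employee (DepartmentID)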

Utilizing Stored Procedures / Views / Triggers :
An enterprise database design is incomplete without the use of any of these. There are always workarounds available to avoid using some or all of them, but they are an integral part of any database project and should be used where appropriate, because the advantages of using them mostly outweigh the disadvantages.

Stored Procedures: Stored procedures are compiled and their execution plans cached for reuse, so stored procedure calls are both efficient and quick. Apart from this, they also provide flexibility, scalability and easy maintainability. Remember, if you don't use stored procedures and instead embed all the database-related business processing in the application, how difficult it becomes to make even the slightest change to that code: every change calls for recompilation and redeployment of the entire application. Encapsulating business rules in stored procedures makes the code easier to maintain and also makes it unit-testable.

Views: For an enterprise-level application, no one wants to expose every table and database object to everyone. This is where views come in handy: they provide the flexibility to hide business complexity and mask sensitive data, exposing only what you want others to see. Views also provide code reusability. Suppose you build a long query with join after join to get some data; you don't have to write that big SQL statement every time - just put it in a view and reference the view wherever it is required.

Triggers: Triggers can be used to maintain audit trail information. They are fully customizable and are executed automatically when data is inserted, updated or deleted, so in theory they are designed once and done for all. Apart from this, they can be used to encapsulate business rules at the database level, thereby preventing errors from bubbling up to the presentation layer or the feeder system.
Triggers do have their disadvantages as well, such as putting extra overhead on the database server and being invisible to the outside world, which means it can at times be difficult to track what went wrong with the data. So the trade-offs have to be carefully analysed when deciding whether or not to use them.
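
A minimal audit-trail trigger sketch is shown below; dbo.EmployeeAudit and the logged columns are hypothetical and kept deliberately simple:

CREATE TRIGGER trg_Employee_Audit
ON dbo.Employee
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON

    -- Rows in "inserted" cover both new inserts and the new image of updated rows
    INSERT INTO dbo.EmployeeAudit (EmployeeID, AuditAction, AuditedOn)
    SELECT i.EmployeeID, 'I/U', GETDATE()
    FROM   inserted i

    -- Rows only in "deleted" are genuine deletes
    INSERT INTO dbo.EmployeeAudit (EmployeeID, AuditAction, AuditedOn)
    SELECT d.EmployeeID, 'D', GETDATE()
    FROM   deleted d
    WHERE  NOT EXISTS (SELECT 1 FROM inserted i WHERE i.EmployeeID = d.EmployeeID)
END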

Naming Conventions & Standards:
Working on somebody else's code is almost always a nightmare if proper naming conventions and standards are not followed. Following proper naming conventions brings consistency and reflects the quality of the work being developed. In a multi-team environment, where more often than not the team that supports an application is different from the team that actually developed it, just imagine how difficult it becomes for the support personnel to dig into each table to figure out what each bit and piece is doing. Proper documentation and comments also play their part in the successful implementation of any data warehouse or front-end application.

In general, the bare minimum requirements for any database system are to:
1. Provide access to data in the agreed format as quickly as possible.
2. Maintain data in a consistent, accurate and reliable state.
3. Keep data redundancy to a minimum, which results in easier maintainability and lower storage requirements.
4. Leave scope for scalability to meet future business requirements with minimum impact on the existing system.

Beyond the areas covered above, a few general points are worth keeping in mind.

Always remember to give unique and easily identifiable names to your constraints; it makes it much easier to diagnose issues and understand where the constraints are used when someone generates consolidated database scripts for any reason.

As I mentioned above, this is not an exhaustive list of guidelines, just the ones I refer to most often when designing my applications, so please do share your feedback and comments. I will be more than happy to amend/improve this posting further.


Thanks,
Rishi Saxena



References:
  • Microsoft 
  • Wikipedia
  • Other Blogging sites