Thursday, July 17, 2014

Data Testing & Automated Unit Test Framework - ETL Process

Data Testing is yet another important but mostly forgotten component of ETL development life cycle. A good ETL solution is incomplete without a good unit test framework.

Complexity of a project increases as more & more business rules are added to the package & more packages become part of the solution. This become further difficult to manage if you are working in a multi-team environment and code check-in from one developer starts breaking business logic for other developer. This is where a well-designed test framework can comes to everybody’s rescue.

A sample ETL package would go through a number of test scenarios, some of which would be something like:
  • Mapping Data Sheet Validation
  • Table Structure Validation
  • Data Validation
  • Data Consistency Validation
  • Data Transformation Validation
  • Null Check Validation
  • Record Count Validation
  • Duplicate Record Validation

The underlying logic behind Unit Test Framework is: if test cases are run on a fixed set of data & under fixed environment settings, you can expect consistent results. Having said that, it is essential to design sample data in a way that it covers each and every aspect of Business rules including data movements, transformations & data failures. As part of on-going development cycles, if new change breaks underlying Business Logic - the test cases will flag out failures immediately, prompting corrective action from Developer.

All you need is good understanding of your business flows, some sample files & good understanding of SQL and that’s it. If you have all this information handy, you can be up and running in no time with your own Unit Test framework J.

Steps that generally comprise of setting up a Unit Test Framework Environment are :
  • Fixing Sample data against which you are going to design test cases, sample data big enough to comprise majority of test scenarios.
  • Manipulating some of the sample data so that it channelize through different aspects of the packages being designed.
  • Populate master data/supporting tables with fixed data that is going to be used on an on-going basis. Design a pre-requisite script that verifies these master tables & truncate & re-load them if data in any one of them is not as per plan.
  • Design test cases “SQL Script” against sample data which comprise of different data verifications. The test cases should be designed carefully so that it covers majority of the scenarios for the given entity being tested.
  • Schedule a SQL job that would execute these scripts in a sequential manner. Configure SQL job to output outcome of each step to a log file which can be used for analysis/diagnostic purpose as required...... & That's it.

When new code gets published, if any of the test cases fail it will flag out discrepancy immediately. Some of the failure might be expected failures because of change in base functionality around which cases were originally designed - if so – it is advisable to update test cases to accommodate those scenarios. If those were unexpected failure – perform some diagnostic checks to verify reason for failure & implement a fix as applicable.

This way a lot of Time & Effort can be saved by developers for testing their own stuff including assisting with regression testing as product grows with time.

Sample Master pre-requisite Script: If data in supporting master tables doesn’t matches with the data against which Test Framework is originally designed, truncate those tables & reload with master data.
Similar code block applies to all other supporting tables. 

Sample Test Cases Script – These individual SQL scripts will comprise of collection of test cases (both Record Count Verification & Data Verification) to verify if the test scenarios still hold good of if there are any exception encountered because of code change.

Snapshot from Unit Test Framework – SQL Job:

In the figure above:

Step 1 signifies setting up master data in supporting tables & truncating data from warehouse tables to provide a clean slate before packages loads data.
Step 2 : Execute “Package 1” to perform business processing, this could be downloading files from FTP site or reading sample files from a shared directory & putting them in staging area.
Step 3: Running test cases against data populated by “Package 1”. The result of test cases can be flushed to a log file to signify how many test cases succeeded or failed.

On a similar logic, Step 2 & Step 3 can be repeated for the other packages of the solution for unit testing their business rules. Segregating packages under different steps provides opportunity to perform test cases against snapshot of data that is available after a particular package run.

Configuration Windows to direct output of the Job to a Log file

Once the job gets completed, output of the job is published in a log file for analysis purpose to identify if new functionality broke any old test cases or for general health check-ups as well.

With this, I will now sign off on this topic. Please do share your feedback & comments & I will be more than happy to amend this posting further. 

Rishi Saxena
View Rishi Saxena's profile on LinkedIn


  1. Thank you for sharing this knowledge in a blogpost.Really simple and even more effective and this worked great, very useful tips
    MSBI Training In Hyderabad

  2. Thank you Raju, glad you like the post :)