Data Testing is yet another important but often overlooked component of the ETL development life cycle. A good ETL solution is incomplete without a good unit test framework.
The complexity of a project increases as more and more business rules are added to the package and more packages become part of the solution. This becomes even harder to manage if you are working in a multi-team environment and a code check-in from one developer starts breaking business logic for another developer. This is where a well-designed test framework comes to everybody's rescue.
A typical ETL package would go through a number of test scenarios, such as the following (a couple of these are sketched in T-SQL right after the list):
- Mapping Data Sheet Validation
- Table Structure Validation
- Data Validation
- Data Consistency Validation
- Data Transformation Validation
- Null Check Validation
- Record Count Validation
- Duplicate Record Validation
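To give a flavour of what a couple of these look like in practice, below is a minimal T-SQL sketch of a null check and a duplicate record check; the table dbo.Customer and its columns are hypothetical placeholders for your own entities.

    -- Null Check Validation: key columns should never be NULL after the load
    -- (dbo.Customer, CustomerID and Email are hypothetical placeholders)
    SELECT COUNT(*) AS NullKeyRows
    FROM dbo.Customer
    WHERE CustomerID IS NULL OR Email IS NULL;

    -- Duplicate Record Validation: the business key should appear only once
    SELECT CustomerID, COUNT(*) AS Occurrences
    FROM dbo.Customer
    GROUP BY CustomerID
    HAVING COUNT(*) > 1;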
The underlying logic behind a unit test framework is simple: if the test cases are run against a fixed set of data and under fixed environment settings, you can expect consistent results. That said, it is essential to design the sample data so that it covers every aspect of the business rules, including data movements, transformations and data failures. As part of ongoing development cycles, if a new change breaks the underlying business logic, the test cases will flag the failures immediately, prompting corrective action from the developer.
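As a trivial illustration of this idea, because the sample data is fixed, the expected outcome of a check can simply be hard-coded (the table dbo.SalesStaging and the value 150 below are hypothetical placeholders):

    -- The sample data never changes, so the expected row count is a known constant
    IF (SELECT COUNT(*) FROM dbo.SalesStaging) = 150
        PRINT 'Sample data row count check: PASSED';
    ELSE
        PRINT 'Sample data row count check: FAILED';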
All you need is a good understanding of your business flows, some sample files and a good grasp of SQL, and that's it. If you have all this handy, you can be up and running with your own unit test framework in no time :)
The steps that generally go into setting up a unit test framework environment are:
- Fix the sample data against which you are going to design test cases; it should be big enough to cover the majority of test scenarios.
- Manipulate some of the sample data so that it exercises the different paths of the packages being designed.
- Populate master data/supporting tables with fixed data that will be used on an ongoing basis. Design a pre-requisite script that verifies these master tables and truncates and re-loads them if the data in any one of them is not as expected.
- Design test case SQL scripts against the sample data, comprising the different data verifications. The test cases should be designed carefully so that they cover the majority of scenarios for the entity being tested.
- Schedule a SQL job that executes these scripts sequentially. Configure the job to write the outcome of each step to a log file, which can be used for analysis/diagnostic purposes as required (a sketch of such a job follows below)... and that's it.
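For reference, here is a minimal T-SQL sketch of how such a job could be put together; the job name, procedure names, package path and log file path are all hypothetical placeholders, and the same job can of course be created through the SQL Server Agent UI instead.

    USE msdb;
    GO

    -- Create the job shell (all names/paths below are placeholders)
    EXEC dbo.sp_add_job @job_name = N'ETL Unit Test Framework';

    -- Step 1: pre-requisite script - reset master tables & truncate warehouse tables
    EXEC dbo.sp_add_jobstep
         @job_name          = N'ETL Unit Test Framework',
         @step_name         = N'Setup master data',
         @step_id           = 1,
         @subsystem         = N'TSQL',
         @command           = N'EXEC dbo.usp_PreRequisite_MasterData;',
         @on_success_action = 3,   -- go to the next step
         @output_file_name  = N'C:\ETLTests\UnitTestFramework.log',
         @flags             = 2;   -- append step output to the log file

    -- Step 2: execute "Package 1" via a CmdExec step calling dtexec
    EXEC dbo.sp_add_jobstep
         @job_name          = N'ETL Unit Test Framework',
         @step_name         = N'Execute Package 1',
         @step_id           = 2,
         @subsystem         = N'CmdExec',
         @command           = N'dtexec /FILE "C:\ETL\Packages\Package1.dtsx"',
         @on_success_action = 3,
         @output_file_name  = N'C:\ETLTests\UnitTestFramework.log',
         @flags             = 2;

    -- Step 3: test cases script for the data loaded by "Package 1"
    EXEC dbo.sp_add_jobstep
         @job_name          = N'ETL Unit Test Framework',
         @step_name         = N'Test cases - Package 1',
         @step_id           = 3,
         @subsystem         = N'TSQL',
         @command           = N'EXEC dbo.usp_TestCases_Package1;',
         @output_file_name  = N'C:\ETLTests\UnitTestFramework.log',
         @flags             = 2;

    -- Attach the job to the local server so SQL Server Agent can run it
    EXEC dbo.sp_add_jobserver @job_name = N'ETL Unit Test Framework';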
When new code gets published, any failing test case flags the discrepancy immediately. Some of the failures might be expected because the base functionality around which the cases were originally designed has changed; if so, it is advisable to update the test cases to accommodate those scenarios. If the failures were unexpected, perform some diagnostic checks to establish the reason for the failure and implement a fix as applicable.
This way developers can save a lot of time and effort testing their own work, and the same framework assists with regression testing as the product grows over time.
Sample Master pre-requisite Script: if the data in the supporting master tables doesn't match the data against which the test framework was originally designed, truncate those tables and reload them with the master data.
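Here is a minimal sketch of such a check for a single master table; the table names (dbo.ProductMaster and its fixed reference copy) and the expected row count are hypothetical placeholders.

    -- If the master table has drifted from the fixed reference copy, truncate & reload it
    IF (SELECT COUNT(*) FROM dbo.ProductMaster) <> 25   -- expected row count of the fixed data
       OR EXISTS (SELECT 1
                  FROM dbo.ProductMaster_Reference r
                  LEFT JOIN dbo.ProductMaster p ON p.ProductID = r.ProductID
                  WHERE p.ProductID IS NULL)
    BEGIN
        TRUNCATE TABLE dbo.ProductMaster;

        INSERT INTO dbo.ProductMaster (ProductID, ProductName, Category)
        SELECT ProductID, ProductName, Category
        FROM dbo.ProductMaster_Reference;   -- fixed master data kept aside for the framework
    END;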
A similar code block applies to all the other supporting tables.
Sample Test Cases Script – these individual SQL scripts comprise a collection of test cases (both record count verification and data verification) to verify whether the test scenarios still hold good or whether any exceptions have been introduced by a code change.
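A minimal sketch of what such a script could contain is shown below; the table names, the sample customer code and the expected values are hypothetical placeholders, and the PRINT messages end up in the job step's output/log file.

    -- Test case 1: Record Count Validation - staging row count should match the dimension
    IF (SELECT COUNT(*) FROM stg.Customer) = (SELECT COUNT(*) FROM dw.DimCustomer)
        PRINT 'Test 01 - Record count check  : PASSED';
    ELSE
        PRINT 'Test 01 - Record count check  : FAILED';

    -- Test case 2: Data Transformation Validation - a known sample row should arrive
    -- in the warehouse with the expected transformed value
    IF EXISTS (SELECT 1
               FROM dw.DimCustomer
               WHERE CustomerCode = 'CUST001'          -- known row in the fixed sample data
                 AND CountryName  = 'United Kingdom')  -- expected lookup result for that row
        PRINT 'Test 02 - Transformation check: PASSED';
    ELSE
        PRINT 'Test 02 - Transformation check: FAILED';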
Snapshot from Unit Test Framework – SQL Job:
In the figure above:
Step 1 sets up master data in the supporting tables and truncates data from the warehouse tables to provide a clean slate before the packages load data.
Step 2: Execute “Package 1” to perform business processing; this could be downloading files from an FTP site, or reading sample files from a shared directory and putting them in the staging area.
Step 3: Run the test cases against the data populated by “Package 1”. The results of the test cases can be flushed to a log file to show how many test cases succeeded or failed.
Following the same logic, Step 2 and Step 3 can be repeated for the other packages of the solution to unit test their business rules. Segregating packages under different steps provides the opportunity to run test cases against the snapshot of data that is available after a particular package run.
Configuration window to direct the output of the job to a log file:
Once the job completes, its output is published to a log file, which can be analysed to identify whether new functionality broke any old test cases, or simply used for general health check-ups.
With this, I will now sign off on this topic. Please do share your feedback and comments, and I will be more than happy to amend this post further.