Saturday, July 26, 2014

Deleting Millions of Records in SQL

How many of you have tasked with purging/archiving records from one of the biggest table of your data warehouse solution. The scenario looks quite common and I am pretty much sure everyone gets a chance to work on tasks like these sometime during their carrier.
There are more than 1 ways of doing this, each one has its own plus and minus, so I will try to cover them with why & why not to use them.

1. DELETE Statement - The delete statement can be as simple as "DELETE * FROM TABLE WHERE [Condition]". But wait a minute, before firing this statement to delete millions of records, out of a billion, bear in mind that deletion is a logged operation, which means all delete operations are logged in transaction log to assist with rollback just in case batch fails. If you try running delete statement without any consideration, the statement will go on and on putting pressure on the transaction log and eventually fail when it runs out of disk space. So running plain DELETE is not a good option. Even if space is not a constraint, trust me, it doesn't look good design to delete millions of rows with a plain where clause.

2. Truncate Statement - Truncate operations are not logged which means they don't exert any pressure on the transaction logs & is the fastest way to flush out records from a table. But - you cannot use WHERE clause in a truncate statement so in theory it is either delete all or none. So for getting rid of part of your biggest table, truncate is not a viable option.

3. DELETE in Batch - Next best option is to use some kind of batching logic to loop through the table & delete records in chunks of batch size of say for eg 0.5 m or 1 m. Please refer to following syntax to delete in batch size of 1 million or whatever suits best in your environment. Advantages of deleting data in batch is that it will only hold data equivalent to batch size in transaction logs and it will flush it out as soon as batch commits there way reusing log space & putting minimal overhead on the storage space.

SET @BatchSize = ?
 WHILE @BatchSize <> 0
   DELETE TOP (@BatchSize) FROM [Condition]
   SET @BatchSize = @@rowcount

Only problem with deleting records in batches is the amount of time it would take to complete the operation. If timings are concern and you only have a fix window to flush out records, consider this option carefully.

4. DELETE in a Batch in a Phased Manner - This is another option that some teams have followed with great success. If the data is being deleted provides you flexibility where you can delete data in a phased manner then the delete operation can be phased out over multiple releases. Delete in batch, coupled with phased release (spanning over multiple days or weekends) can be a good option provided the deleted data doesn't interfere with the system & doesn't introduce orphaned records which are hard to track.
If there is dependency that all data has to be removed in a given release, with minimal impact to resources and within agreed time frame, the next option would be to :

5. Create & Swap Approach - In this approach a table, identical to original table being swapped is created in warehouse without any constraints or indexes on them. Data which is supposed to be retained is pushed in the heap, leaving behind data which is not required. Once the tables are ready, tables are swapped out by dropping indexes, constraints etc from original table and putting them back on the new table followed by renaming the tables.

Whichever way it is, it will take good effort to delete millions of records from a table which is close to a billion. So the strategy has to be carefully planned before you hit the 'DELETE' button.

With this, I will now sign off on this topic. All these options are more from my personal experience while working on the warehouse solutions but please free to share your feedback & comments & I will be more than happy to amend this posting further.

Rishi Saxena
View Rishi Saxena's profile on LinkedIn

Thursday, July 17, 2014

Data Testing & Automated Unit Test Framework - ETL Process

Data Testing is yet another important but mostly forgotten component of ETL development life cycle. A good ETL solution is incomplete without a good unit test framework.

Complexity of a project increases as more & more business rules are added to the package & more packages become part of the solution. This become further difficult to manage if you are working in a multi-team environment and code check-in from one developer starts breaking business logic for other developer. This is where a well-designed test framework can comes to everybody’s rescue.

A sample ETL package would go through a number of test scenarios, some of which would be something like:
  • Mapping Data Sheet Validation
  • Table Structure Validation
  • Data Validation
  • Data Consistency Validation
  • Data Transformation Validation
  • Null Check Validation
  • Record Count Validation
  • Duplicate Record Validation

The underlying logic behind Unit Test Framework is: if test cases are run on a fixed set of data & under fixed environment settings, you can expect consistent results. Having said that, it is essential to design sample data in a way that it covers each and every aspect of Business rules including data movements, transformations & data failures. As part of on-going development cycles, if new change breaks underlying Business Logic - the test cases will flag out failures immediately, prompting corrective action from Developer.

All you need is good understanding of your business flows, some sample files & good understanding of SQL and that’s it. If you have all this information handy, you can be up and running in no time with your own Unit Test framework J.

Steps that generally comprise of setting up a Unit Test Framework Environment are :
  • Fixing Sample data against which you are going to design test cases, sample data big enough to comprise majority of test scenarios.
  • Manipulating some of the sample data so that it channelize through different aspects of the packages being designed.
  • Populate master data/supporting tables with fixed data that is going to be used on an on-going basis. Design a pre-requisite script that verifies these master tables & truncate & re-load them if data in any one of them is not as per plan.
  • Design test cases “SQL Script” against sample data which comprise of different data verifications. The test cases should be designed carefully so that it covers majority of the scenarios for the given entity being tested.
  • Schedule a SQL job that would execute these scripts in a sequential manner. Configure SQL job to output outcome of each step to a log file which can be used for analysis/diagnostic purpose as required...... & That's it.

When new code gets published, if any of the test cases fail it will flag out discrepancy immediately. Some of the failure might be expected failures because of change in base functionality around which cases were originally designed - if so – it is advisable to update test cases to accommodate those scenarios. If those were unexpected failure – perform some diagnostic checks to verify reason for failure & implement a fix as applicable.

This way a lot of Time & Effort can be saved by developers for testing their own stuff including assisting with regression testing as product grows with time.

Sample Master pre-requisite Script: If data in supporting master tables doesn’t matches with the data against which Test Framework is originally designed, truncate those tables & reload with master data.
Similar code block applies to all other supporting tables. 

Sample Test Cases Script – These individual SQL scripts will comprise of collection of test cases (both Record Count Verification & Data Verification) to verify if the test scenarios still hold good of if there are any exception encountered because of code change.

Snapshot from Unit Test Framework – SQL Job:

In the figure above:

Step 1 signifies setting up master data in supporting tables & truncating data from warehouse tables to provide a clean slate before packages loads data.
Step 2 : Execute “Package 1” to perform business processing, this could be downloading files from FTP site or reading sample files from a shared directory & putting them in staging area.
Step 3: Running test cases against data populated by “Package 1”. The result of test cases can be flushed to a log file to signify how many test cases succeeded or failed.

On a similar logic, Step 2 & Step 3 can be repeated for the other packages of the solution for unit testing their business rules. Segregating packages under different steps provides opportunity to perform test cases against snapshot of data that is available after a particular package run.

Configuration Windows to direct output of the Job to a Log file

Once the job gets completed, output of the job is published in a log file for analysis purpose to identify if new functionality broke any old test cases or for general health check-ups as well.

With this, I will now sign off on this topic. Please do share your feedback & comments & I will be more than happy to amend this posting further. 

Rishi Saxena
View Rishi Saxena's profile on LinkedIn