How many of you have tasked with purging/archiving records from one of the biggest table of your data warehouse solution. The scenario looks quite common and I am pretty much sure everyone gets a chance to work on tasks like these sometime during their carrier.
There are more than 1 ways of doing this, each one has its own plus and minus, so I will try to cover them with why & why not to use them.
1. DELETE Statement - The delete statement can be as simple as "DELETE * FROM TABLE WHERE [Condition]". But wait a minute, before firing this statement to delete millions of records, out of a billion, bear in mind that deletion is a logged operation, which means all delete operations are logged in transaction log to assist with rollback just in case batch fails. If you try running delete statement without any consideration, the statement will go on and on putting pressure on the transaction log and eventually fail when it runs out of disk space. So running plain DELETE is not a good option. Even if space is not a constraint, trust me, it doesn't look good design to delete millions of rows with a plain where clause.
2. Truncate Statement - Truncate operations are not logged which means they don't exert any pressure on the transaction logs & is the fastest way to flush out records from a table. But - you cannot use WHERE clause in a truncate statement so in theory it is either delete all or none. So for getting rid of part of your biggest table, truncate is not a viable option.
3. DELETE in Batch - Next best option is to use some kind of batching logic to loop through the table & delete records in chunks of batch size of say for eg 0.5 m or 1 m. Please refer to following syntax to delete in batch size of 1 million or whatever suits best in your environment. Advantages of deleting data in batch is that it will only hold data equivalent to batch size in transaction logs and it will flush it out as soon as batch commits there way reusing log space & putting minimal overhead on the storage space.
DECLARE @BatchSize BIGINT
SET @BatchSize = ?
WHILE @BatchSize <> 0
DELETE TOP (@BatchSize) FROM [Condition]
SET @BatchSize = @@rowcount
Only problem with deleting records in batches is the amount of time it would take to complete the operation. If timings are concern and you only have a fix window to flush out records, consider this option carefully.
4. DELETE in a Batch in a Phased Manner - This is another option that some teams have followed with great success. If the data is being deleted provides you flexibility where you can delete data in a phased manner then the delete operation can be phased out over multiple releases. Delete in batch, coupled with phased release (spanning over multiple days or weekends) can be a good option provided the deleted data doesn't interfere with the system & doesn't introduce orphaned records which are hard to track.
If there is dependency that all data has to be removed in a given release, with minimal impact to resources and within agreed time frame, the next option would be to :
5. Create & Swap Approach - In this approach a table, identical to original table being swapped is created in warehouse without any constraints or indexes on them. Data which is supposed to be retained is pushed in the heap, leaving behind data which is not required. Once the tables are ready, tables are swapped out by dropping indexes, constraints etc from original table and putting them back on the new table followed by renaming the tables.
Whichever way it is, it will take good effort to delete millions of records from a table which is close to a billion. So the strategy has to be carefully planned before you hit the 'DELETE' button.
With this, I will now sign off on this topic. All these options are more from my personal experience while working on the warehouse solutions but please free to share your feedback & comments & I will be more than happy to amend this posting further.