How many of you have been tasked with purging or archiving records from one of the biggest tables of your data warehouse solution? The scenario is quite common and I am pretty sure everyone gets a chance to work on a task like this at some point in their career.
There is more than one way of doing this, and each one has its own pros and cons, so I will try to cover them along with why & why not to use them.
1. DELETE Statement - The delete statement can be as simple as "DELETE FROM [Table] WHERE [Condition]". But wait a minute, before firing this statement to delete millions of records out of a billion, bear in mind that DELETE is a fully logged operation, which means every deleted row is recorded in the transaction log to allow a rollback in case the batch fails. If you run the delete statement without any consideration, it will go on and on, putting pressure on the transaction log, and may eventually fail when the log runs out of disk space. So running a plain DELETE is not a good option. Even if space is not a constraint, trust me, it is not good design to delete millions of rows with a plain WHERE clause.
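To illustrate, here is a minimal sketch of the naive approach (the table name dbo.FactSales and the EventDate column are hypothetical), together with a quick way to watch how much log space such a statement eats:

-- Naive single-statement delete: every removed row is written
-- to the transaction log before the statement can commit.
DELETE FROM dbo.FactSales
WHERE EventDate < '2015-01-01';

-- Check transaction log size and percent used for each database.
DBCC SQLPERF (LOGSPACE);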
2. TRUNCATE Statement - Truncate operations are minimally logged (only the page deallocations are recorded, not the individual rows), which means they put almost no pressure on the transaction log & are the fastest way to flush out records from a table. But - you cannot use a WHERE clause in a truncate statement, so it is all or nothing. For getting rid of only part of your biggest table, truncate is not a viable option.
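For completeness, the syntax is a one-liner (again, the table name is hypothetical):

TRUNCATE TABLE dbo.FactSales;  -- removes ALL rows; no WHERE clause allowed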
3. DELETE in Batch - The next best option is to use some kind of batching logic to loop through the table & delete records in chunks of, say, 0.5 or 1 million rows. Please refer to the following syntax to delete in batches of 1 million, or whatever size suits your environment best. The advantage of deleting data in batches is that each iteration only holds data equivalent to the batch size in the transaction log, and the log records can be flushed out as soon as each batch commits (assuming the database is in the simple recovery model, or the log is backed up between batches), thereby reusing log space & putting minimal overhead on the storage.
DECLARE @BatchSize BIGINT;
SET @BatchSize = 1000000;  -- 1 million, or whatever suits your environment

WHILE @BatchSize <> 0
BEGIN
    DELETE TOP (@BatchSize)
    FROM [Table]
    WHERE [Condition];

    -- @@ROWCOUNT drops to 0 once no matching rows remain, ending the loop
    SET @BatchSize = @@ROWCOUNT;
END
The only problem with deleting records in batches is the amount of time it takes to complete the operation. If timing is a concern and you only have a fixed window to flush out the records, consider this option carefully.
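If blocking or log growth is a worry, a common variation (a sketch, not a prescription; the delay and the checkpoint are assumptions to adapt to your recovery model) is to pause briefly between batches and force a checkpoint so log space can be reused under the simple recovery model:

WHILE 1 = 1
BEGIN
    DELETE TOP (1000000)
    FROM [Table]
    WHERE [Condition];

    IF @@ROWCOUNT = 0 BREAK;

    CHECKPOINT;                 -- lets the log truncate under simple recovery
    WAITFOR DELAY '00:00:05';   -- give concurrent workloads room to breathe
END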
4. DELETE in a Batch in a Phased Manner - This is another option that some teams have followed with great success. If the data being deleted gives you the flexibility to remove it in a phased manner, then the delete operation can be spread out over multiple releases. Deleting in batches, coupled with a phased release (spanning multiple days or weekends), can be a good option, provided the deleted data doesn't interfere with the system & doesn't introduce orphaned records which are hard to track.
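As an illustration, each release could purge one more year of history by reusing the batching loop above (the table, column, and cutoff value are hypothetical):

-- Release 1: purge everything before 2014; Release 2: before 2015; and so on.
DECLARE @Cutoff DATE = '2014-01-01';  -- advance this cutoff each release

WHILE 1 = 1
BEGIN
    DELETE TOP (1000000)
    FROM dbo.FactSales
    WHERE EventDate < @Cutoff;

    IF @@ROWCOUNT = 0 BREAK;
END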
If there is a dependency that all of the data has to be removed in a single release, with minimal impact on resources and within an agreed time frame, the next option would be:
5. Create & Swap Approach - In this approach a table identical to the original is created in the warehouse without any constraints or indexes on it. The data that is supposed to be retained is pushed into this heap, leaving behind the data which is not required. Once the new table is ready, the tables are swapped: the indexes, constraints etc. are dropped from the original table and recreated on the new table, followed by renaming the tables.
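A minimal sketch of the idea (the table name dbo.FactSales, the date column, and the rename steps are assumptions; the index & constraint scripts are environment-specific and not shown):

-- 1. Create an empty heap with the same structure (no indexes/constraints).
SELECT *
INTO dbo.FactSales_New
FROM dbo.FactSales
WHERE 1 = 0;

-- 2. Push the rows to be retained into the heap.
--    TABLOCK allows minimal logging under simple/bulk-logged recovery.
INSERT INTO dbo.FactSales_New WITH (TABLOCK)
SELECT *
FROM dbo.FactSales
WHERE EventDate >= '2015-01-01';

-- 3. Recreate the indexes & constraints on the new table, then swap names.
EXEC sp_rename 'dbo.FactSales', 'FactSales_Old';
EXEC sp_rename 'dbo.FactSales_New', 'FactSales';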
Whichever way you go, it will take a good amount of effort to delete millions of records from a table that is close to a billion rows, so the strategy has to be carefully planned before you hit the 'DELETE' button.
With this, I will now sign off on this topic. All these options come from my personal experience working on warehouse solutions, but please feel free to share your feedback & comments & I will be more than happy to amend this post further.
Thanks,