Saturday, July 26, 2014

Deleting Millions of Records in SQL

How many of you have been tasked with purging or archiving records from one of the biggest tables in your data warehouse solution? The scenario is quite common and I am pretty sure everyone gets a chance to work on a task like this at some point in their career.
There is more than one way of doing this, each with its own pros and cons, so I will try to cover them along with why & why not to use them.

1. DELETE Statement - The delete statement can be as simple as "DELETE FROM [Table] WHERE [Condition]". But wait a minute: before firing this statement to delete millions of records out of a billion, bear in mind that deletion is a logged operation, which means all delete operations are written to the transaction log to assist with rollback in case the batch fails. If you run the delete statement without any consideration, it will go on and on, putting pressure on the transaction log, and eventually fail when it runs out of disk space. So running a plain DELETE is not a good option. Even if space is not a constraint, trust me, it is not a good design to delete millions of rows with a plain WHERE clause.

2. TRUNCATE Statement - Truncate operations are only minimally logged, which means they put very little pressure on the transaction log & are the fastest way to flush out records from a table. But - you cannot use a WHERE clause in a truncate statement, so it is either delete all or none. So for getting rid of only part of your biggest table, truncate is not a viable option.

3. DELETE in Batches - The next best option is to use some kind of batching logic to loop through the table & delete records in chunks of, say, 0.5 or 1 million rows. Please refer to the following syntax to delete in batches of 1 million, or whatever size suits your environment best. The advantage of deleting data in batches is that only one batch worth of data is held in the transaction log at a time, and that space is released as soon as each batch commits, thereby reusing log space & putting minimal overhead on storage.

DECLARE @BatchSize INT = 1000000;   -- pick a batch size that suits your environment
DECLARE @RowsDeleted INT = 1;

WHILE @RowsDeleted > 0
BEGIN
    DELETE TOP (@BatchSize)
    FROM [Table]                    -- table being purged
    WHERE [Condition];              -- rows to be removed

    SET @RowsDeleted = @@ROWCOUNT;  -- loop ends when no more rows match
END

The only problem with deleting records in batches is the amount of time it takes to complete the operation. If timing is a concern and you only have a fixed window to flush out the records, consider this option carefully.

4. DELETE in Batches in a Phased Manner - This is another option that some teams have followed with great success. If the data being deleted gives you the flexibility to remove it in phases, the delete operation can be spread over multiple releases. Deleting in batches, coupled with phased releases (spanning multiple days or weekends), can be a good option provided the already-deleted data doesn't interfere with the system & doesn't introduce orphaned records which are hard to track.
If there is a dependency that all the data has to be removed in a single release, with minimal impact on resources and within an agreed time frame, the next option would be:

5. Create & Swap Approach - In this approach a table identical to the original table being swapped out is created in the warehouse, without any constraints or indexes on it. The data that is supposed to be retained is pushed into this heap, leaving behind the data which is not required. Once the new table is ready, the tables are swapped: indexes, constraints etc. are dropped from the original table and recreated on the new table, followed by renaming the tables.
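A minimal sketch of the create & swap idea is shown below (the table name and retention condition are illustrative). In practice you would also script out & recreate the original indexes, constraints and permissions on the new table before the final rename.

-- 1. Create an identical heap with only the rows to keep
SELECT *
INTO   dbo.BigTable_New
FROM   dbo.BigTable
WHERE  [RetentionCondition];

-- 2. Recreate indexes & constraints on dbo.BigTable_New here

-- 3. Swap the tables inside a short maintenance window
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.BigTable', 'BigTable_Old';
    EXEC sp_rename 'dbo.BigTable_New', 'BigTable';
COMMIT TRANSACTION;

-- 4. Drop dbo.BigTable_Old once the swap has been verified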

Whichever way you go, it will take good effort to delete millions of records from a table which is close to a billion rows. So the strategy has to be carefully planned before you hit the 'DELETE' button.

With this, I will now sign off on this topic. All these options come from my personal experience while working on warehouse solutions, but please feel free to share your feedback & comments & I will be more than happy to amend this posting further.

Thanks,
Rishi Saxena

Thursday, July 17, 2014

Data Testing & Automated Unit Test Framework - ETL Process

Data Testing is yet another important but mostly forgotten component of the ETL development life cycle. A good ETL solution is incomplete without a good unit test framework.

The complexity of a project increases as more & more business rules are added to a package & more packages become part of the solution. This becomes even more difficult to manage if you are working in a multi-team environment and a code check-in from one developer starts breaking business logic for another developer. This is where a well-designed test framework can come to everybody's rescue.

A sample ETL package would go through a number of test scenarios, such as:
  • Mapping Data Sheet Validation
  • Table Structure Validation
  • Data Validation
  • Data Consistency Validation
  • Data Transformation Validation
  • Null Check Validation
  • Record Count Validation
  • Duplicate Record Validation

The underlying logic behind a Unit Test Framework is: if test cases are run against a fixed set of data & under fixed environment settings, you can expect consistent results. Having said that, it is essential to design the sample data in a way that it covers every aspect of the business rules, including data movements, transformations & data failures. As part of on-going development cycles, if a new change breaks the underlying business logic, the test cases will flag the failures immediately, prompting corrective action from the developer.

All you need is a good understanding of your business flows, some sample files & a good understanding of SQL, and that's it. If you have all this information handy, you can be up and running in no time with your own Unit Test framework.

The steps that generally go into setting up a Unit Test Framework environment are:
  • Fix the sample data against which you are going to design test cases; the sample data should be big enough to cover the majority of test scenarios.
  • Manipulate some of the sample data so that it exercises the different paths of the packages being designed.
  • Populate the master data/supporting tables with fixed data that is going to be used on an on-going basis. Design a pre-requisite script that verifies these master tables & truncates & re-loads them if the data in any of them is not as expected.
  • Design test case SQL scripts against the sample data which comprise the different data verifications. The test cases should be designed carefully so that they cover the majority of scenarios for the entity being tested.
  • Schedule a SQL job that executes these scripts in a sequential manner. Configure the SQL job to output the outcome of each step to a log file which can be used for analysis/diagnostic purposes as required... & that's it.

When new code gets published, if any of the test cases fail it will flag the discrepancy immediately. Some of the failures might be expected because of a change in the base functionality around which the cases were originally designed - if so, it is advisable to update the test cases to accommodate those scenarios. If they are unexpected failures, perform some diagnostic checks to verify the reason for the failure & implement a fix as applicable.

This way a lot of time & effort can be saved by developers in testing their own work, and it also assists with regression testing as the product grows with time.

Sample Master pre-requisite Script: If the data in the supporting master tables doesn't match the data against which the Test Framework was originally designed, truncate those tables & reload them with the master data.
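For illustration, a minimal sketch of such a pre-requisite check, assuming a hypothetical master table dbo.Currency that should always hold exactly three fixed rows:

-- Reload the master table if its contents drift from the expected baseline
IF (SELECT COUNT(*) FROM dbo.Currency) <> 3
   OR EXISTS (SELECT 1 FROM dbo.Currency
              WHERE CurrencyCode NOT IN ('USD', 'GBP', 'INR'))
BEGIN
    TRUNCATE TABLE dbo.Currency;

    INSERT INTO dbo.Currency (CurrencyCode, CurrencyName)
    VALUES ('USD', 'US Dollar'),
           ('GBP', 'British Pound'),
           ('INR', 'Indian Rupee');
END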
A similar code block applies to all the other supporting tables.


Sample Test Cases Script – These individual SQL scripts comprise a collection of test cases (both record count verification & data verification) to verify whether the test scenarios still hold good or whether any exceptions have been introduced by a code change.
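As a flavour of what such a script can contain, here is a minimal sketch of two test cases against a hypothetical stg.Employee staging table; the expected count and values are purely illustrative.

-- Test case 1: record count verification
IF (SELECT COUNT(*) FROM stg.Employee) = 250
    PRINT 'Test 1 - Employee record count : PASSED';
ELSE
    PRINT 'Test 1 - Employee record count : FAILED';

-- Test case 2: data verification (a known employee must be loaded as expected)
IF EXISTS (SELECT 1 FROM stg.Employee
           WHERE EmployeeID = 101 AND EmployeeName = 'John Smith')
    PRINT 'Test 2 - Employee 101 data check : PASSED';
ELSE
    PRINT 'Test 2 - Employee 101 data check : FAILED';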


Snapshot from Unit Test Framework – SQL Job:


In the figure above:

Step 1 sets up the master data in the supporting tables & truncates the data from the warehouse tables to provide a clean slate before the packages load data.
Step 2 executes "Package 1" to perform its business processing; this could be downloading files from an FTP site or reading sample files from a shared directory & putting them in the staging area.
Step 3 runs the test cases against the data populated by "Package 1". The results of the test cases can be flushed to a log file to show how many test cases succeeded or failed.

Following the same logic, Step 2 & Step 3 can be repeated for the other packages of the solution to unit test their business rules. Segregating packages under different steps provides the opportunity to run test cases against the snapshot of data that is available after a particular package run.

Configuration window to direct the output of the job to a log file


Once the job completes, its output is published to a log file for analysis, to identify whether new functionality broke any old test cases, or simply for general health check-ups.

With this, I will now sign off on this topic. Please do share your feedback & comments & I will be more than happy to amend this posting further. 

Thanks,
Rishi Saxena

Thursday, June 26, 2014

Decoding Data Mapping Sheets - ETL Process

An ETL (Extract Transform Load) process is all about moving a variety of data from the Source System to the destination Data Warehouse system by taking the data through a number of extraction, transformation, data cleansing & data validation processes.

Just imagine how much easier it gets for an ETL developer if he gets a chance to visualize all the transformations & business rules up front in an easy-to-interpret format. This is where mapping sheets come into the picture.

A carefully designed mapping sheet can save a lot of pain up front, as handling mapping information gets increasingly difficult as the application grows with time. The only downside of using them is that it takes good effort to create them & then keep them up-to-date. But, trust me, the rewards of using them easily outweigh the pain of maintaining one as the system grows over time.
Every data migration project uses mapping sheets in one form or the other, but the one which I have used most often & which has worked exceptionally well for me is what I am detailing here.

To start with, a sample Mapping Sheet would look something like the one represented in Figure 1 for any given entity. The sheet can be extended further by using multiple sheets in the Excel workbook to represent the other entities of the system.

Fig 1 – Sample Mapping Sheet structure

Generally, the extreme left of the sheet represents data from the Source System or the staging area. The middle layer represents data validation, data verification, data cleansing & business rule validation, with a number of data-centric transformation rules in place. There can be more than one transformation layer in the centre depending upon how complex the ETL process is. The extreme right mostly represents the Data Warehouse system.

In Figure 2 below, I have taken the example of an Employee table in the staging area to show how an actual table structure from the database is represented in a mapping sheet.

Fig 2 – How a database table structure is represented in a Mapping Sheet Document

Looking carefully at the snapshot in Figure 2, it provides almost all the information related to a table in a database. This includes the schema name (stg), the name of the table [Employee], the names of the columns, their data types & whether they allow NULL or NOT NULL values. For example, EmployeeID is an INT NOT NULL column whereas EmployeeName is a varchar(255) NULL column.
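In other words, that row of the mapping sheet translates into a table definition along these lines (only the two columns discussed above are shown):

CREATE TABLE stg.Employee
(
    EmployeeID   INT          NOT NULL,   -- as per the mapping sheet
    EmployeeName VARCHAR(255) NULL
    -- remaining columns as defined in the mapping sheet
);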

Along similar lines, mapping sheets can be extended to represent the structure of the given tables across the different layers. Figure 3 below helps you visualize how a field gets mapped, validated & transformed through the different layers of the ETL process while being migrated from the Source System to ODS to DW.

Fig 3 – Transformation rules at the table/field level

Another efficient use of a mapping sheet is to document the business rules in layman's terms right next to the respective fields. Please refer to Figure 4 below. The figure represents only a selective few, very basic transformation rules. Please note, the transformation rules represented here are for informative purposes only & real transformations could vary considerably depending upon project requirements. The mapping sheet only works as a placeholder to store the transformation information.

Some of the sample business rules could be:
1.  Use database defaults to set the CreatedOn & ModifiedOn fields to the current date & time.
2.  Use a SQL CASE statement to set a field to an integer value in the warehouse depending upon the text information coming in from the source feed (see the sketch after Figure 4 below).
3.  A placeholder to store I, U, D flags in the ODS layer to perform the respective actions in the warehouse for a given entity - just to name a few.
4.  Other transformations along similar lines, as shown in Figure 4.

Figure 4 – Sample Transformation Rules
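As a flavour of rule 2 above, here is a minimal sketch of a CASE-based transformation in the ODS-to-warehouse load (the Gender field and its integer codes are purely illustrative):

SELECT  EmployeeID,
        CASE Gender                 -- text coming in from the source feed
            WHEN 'Male'   THEN 1    -- integer codes stored in the warehouse
            WHEN 'Female' THEN 2
            ELSE 0                  -- unknown / not provided
        END AS GenderCode
FROM    ods.Employee;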

Some of the key points that must be considered while designing a mapping information document:
1. Staging Layer - Design the staging area with minimal constraints/indexes and with no or minimal data integrity checks in place. The consideration for the extraction process must be to load data from the Source System as quickly as possible with minimal data leakage. Ideally, the data types in this layer should be minimally restrictive, to allow the full data to pass through to the staging area without any data loss - e.g. allowing NULLs, with data sizes big enough to hold the data coming from the feeding source system.

2. Operational Data Store Layer - In this layer most of the business rules are defined and the data types are generally tightly coupled with the data types in the warehouse layer. Most of the data transformations, error handling & data filtering are done in this layer. An ideal ODS layer should also maintain audit trail information to keep track of the I, U, D operations against the Data Warehouse. In this example, the "Action" field in the ods.Employee table can be used to maintain the current state of a record throughout the life cycle of the ETL process.

3. Data Warehouse Layer - This layer contains only the current version of the data. Records get inserted, updated & deleted in the Data Warehouse depending upon the incoming "Action" field from the ODS source. Normally, only a selective few transformations are performed in this layer - the ones which are specific to data already available in the data warehouse.
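A minimal sketch of how the incoming "Action" flag can drive the warehouse load for the Employee entity (the dw schema name and the trimmed column list are illustrative; a real statement would cover all mapped fields):

-- Apply inserts
INSERT INTO dw.Employee (EmployeeID, EmployeeName)
SELECT  EmployeeID, EmployeeName
FROM    ods.Employee
WHERE   [Action] = 'I';

-- Apply updates
UPDATE  tgt
SET     tgt.EmployeeName = src.EmployeeName
FROM    dw.Employee  AS tgt
JOIN    ods.Employee AS src ON src.EmployeeID = tgt.EmployeeID
WHERE   src.[Action] = 'U';

-- Apply deletes
DELETE  tgt
FROM    dw.Employee  AS tgt
JOIN    ods.Employee AS src ON src.EmployeeID = tgt.EmployeeID
WHERE   src.[Action] = 'D';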

With this, I will now sign off on this topic. I have tried to keep this information in its most basic & simplest form, so I hope it will come in handy in your respective projects. Please do share your inputs, feedback & comments & I will be more than happy to amend/improve this posting further.

Thanks,

Rishi Saxena

Thursday, June 12, 2014

Database Design Consideration

So, this is my 2nd blog, which details various database design considerations that one should follow while developing a database design. This is just a high-level overview of the guidelines that I normally follow. The list is not exhaustive, so there will be a few more considerations here and there depending upon the people involved and the environment under which the database is designed. What this blog doesn't include is any in-depth SQL syntax or code walk-through, as the sole intention behind blogging on this topic is to share information about how the different bits and pieces fit together from a database development perspective.

Database design is one of the most important aspects of any application development life cycle. Performance issues caused by poor design & technical debt are not always visible when the system is small & in its early production days, but as the system grows with time they become more apparent and increasingly difficult to manage. It takes far more effort to improve performance for a live application than to identify & fix issues during the early phases of the development cycle. It is always recommended to spend quality time (at least a couple of iterative cycles) thinking about, discussing & designing the database architecture up front rather than leaving it with a "let's see how it will perform" attitude.

OLTP/OLAP Systems & Normalization:

OLTP systems, also known as transactional systems, contain day-to-day data processing information. These systems are built on top of relational database modelling techniques with optimal use of normalization for faster I/U/D operations. In these systems, maintaining data integrity & consistency with minimum redundancy takes precedence, so the data is carefully normalized to make this task easier. Having said that, the tables still have to be optimally normalized, because an overly normalized database has its own performance bottlenecks when performing I/U/D operations.

Online Analytical Processing (OLAP) applications are designed for reporting & analysis purposes & are built using dimensional modelling techniques. Here the preference is to get data out as quickly as possible in the required format, even at the cost of some redundant data & extra storage space. So some degree of de-normalization is commonly observed in OLAP systems, which means fewer joins among tables & faster retrieval.

The data between OLAP & OLTP systems is not normally in sync and usually lags behind by a few hours, or even a day in some cases. However, automated jobs run in the background to sync up the OLTP/OLAP systems overnight, as these systems are designed to work that way anyway.

Based on the potential usage of the system, careful consideration is required to strike the right balance in the level of normalization.


Maintaining Data Integrity: 

There are certain business rules that can be designed & controlled at the database level, thereby providing some flexibility to not replicate them in the interfacing application or the feeder system. For example - logging the current date & time as LastModifiedDate, logging the current logged-in user's identity as LastModifiedBy, or requiring the age of an employee to be between 21 & 60, just to name a few. All these rules at the database level help in maintaining data integrity. Primary keys, foreign keys, default constraints & check constraints are some of the out-of-the-box features that can be used to maintain data integrity.

Primary Keys:
A primary key uniquely identifies a record in a table. Spend good time choosing the primary key for a table, giving future expansion plans some consideration. Once the primary key is decided and the database architecture is designed around it, it will take great effort to change it to something else in future. Apart from this, a primary key also holds an index that speeds up search requests, so choosing one which doesn't change too often is another criterion behind selecting the right key.

Foreign Keys:
Another aspect of data integrity is referential integrity, and foreign key constraints come in handy to control these validations. Putting foreign key constraints in place brings quality control as well as restricting data entry or deletion at the database level if the operation would result in orphaned and/or unaccounted-for records.
Consider a billing system where nobody would like to look at a bill or an order which doesn't have a customer associated with it. Proper use of foreign key constraints as per the business requirement will always take care of scenarios like this without much add-on validation.

Default Constraints:
It is always advisable to put a default constraint on a database field if it can derive some business value out of a default value, instead of making the field nullable & storing NULL values. This also takes the burden off the feeding system of keying in information for NOT NULL fields for which the information is not available at that moment.
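Putting these together, a minimal sketch of how such rules can be declared at the database level (the table and column names are illustrative):

CREATE TABLE dbo.Department
(
    DepartmentID     INT          NOT NULL PRIMARY KEY,
    DepartmentName   VARCHAR(100) NOT NULL
);

CREATE TABLE dbo.Employee
(
    EmployeeID       INT      NOT NULL PRIMARY KEY,                 -- primary key
    DepartmentID     INT      NOT NULL
        REFERENCES dbo.Department (DepartmentID),                   -- foreign key (referential integrity)
    Age              INT      NOT NULL
        CHECK (Age BETWEEN 21 AND 60),                              -- business rule enforced by the database
    LastModifiedDate DATETIME NOT NULL DEFAULT (GETDATE()),         -- current date & time
    LastModifiedBy   SYSNAME  NOT NULL DEFAULT (SUSER_SNAME())      -- current logged-in user
);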

Optimal Use of Data Types:

The most important of them all is choosing the right data type. Remember, every data type consumes some storage space, and indexes take time to re-arrange data with every I/U/D operation, so choosing a data type of the right size does make its own difference. Imagine how much difference every byte saved by choosing the right data type can make when we are dealing with millions & millions of records.

Considering future expansion plans is another criterion that should be taken care of while deciding data types. Choosing a data type which works at present but could potentially fail in future is not good decision making. This is where data profiling comes into the picture, and it has to be done carefully to analyse how the data is being used currently and how it may be used in future.
For example - choosing the int data type to store 0s & 1s, or choosing the datetime data type for storing a date of birth/joining date, is definitely not a good design approach. It would be wiser to use bit and date respectively.


Indexing Decisions:

Indexes do speed up data access as they cut down the number of records that need to be searched through to return the requested data, but they do have a negative impact on insert, update & delete operations.

Creating an index is choosing a trade-off: whether the performance gain observed by creating a few indexes outweighs the cost of the extra storage requirement as well as the slowed-down I/U/D operations.

There is no golden rule for the number of indexes a table should have, and it also varies depending on whether you are designing an OLAP or an OLTP system, but trying out different combinations is the best way to figure out the optimum level of indexing your database tables should have.
For primary keys, Microsoft recommends using integer keys because SQL Server is optimized to index integer data types. Similarly, when joining multiple tables, it is advisable to create non-clustered indexes on the foreign key columns to enhance performance.
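For instance, assuming the illustrative Employee/Department tables sketched earlier, a non-clustered index on the foreign key column would look something like this:

-- Support joins/lookups on the foreign key column
CREATE NONCLUSTERED INDEX IX_Employee_DepartmentID
    ON dbo.Employee (DepartmentID);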

Utilizing Stored Procedures / Views / Triggers:
An enterprise database design is incomplete without the use of any of these. There are always workarounds available to avoid using any or all of them, but they are an integral part of any database project and should be used for the respective cases, because the advantages of using them mostly outweigh the disadvantages.

Stored Procedures: Stored procedures are compiled once & their execution plan is cached, so stored procedure calls are both efficient & quick. Apart from this, they also provide flexibility, scalability & easy maintainability. Remember, if you don't use stored procedures and embed all database-related business processing in the application, how difficult it will become to make even the slightest change to that code: every change will call for re-compilation & deployment of the entire application. Encapsulating business rules in stored procedures makes the code easier to maintain & also provides unit testability.

Views: For an enterprise-level application, no one wants to expose each and every table & database object to everyone. This is where views come in handy. They provide the flexibility to hide business complexity as well as mask out the sensitive data of your database and expose only what you want others to see. They also provide code reusability: suppose you build a long query with join after join to get some data - you don't have to write that big SQL statement every time, just put it in a view & reuse the view wherever it is required.
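A small sketch of that reuse idea, again using the illustrative Employee/Department tables from above:

-- Hide the join (and any sensitive columns) behind a view
CREATE VIEW dbo.vw_EmployeeDepartment
AS
SELECT  e.EmployeeID,
        d.DepartmentName
FROM    dbo.Employee   AS e
JOIN    dbo.Department AS d ON d.DepartmentID = e.DepartmentID;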

Triggers: Triggers can be used to maintain audit trail information. They are fully customizable & are executed automatically when data is inserted/updated/deleted, so in theory they are designed once and done for all. Apart from this, they can be used to encapsulate business rules at the database level, thereby preventing errors from bubbling through to the presentation layer or the feeder system.
Triggers do have their own disadvantages as well - such as putting some extra overhead on the database server & being invisible to the outside world, which means at times it becomes difficult to track what went wrong with the data in the database. So the trade-offs have to be carefully analysed to make the decision whether or not to use them.
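A minimal sketch of an audit-trail trigger along those lines (the audit table and its columns are illustrative):

CREATE TABLE dbo.EmployeeAudit
(
    EmployeeID INT         NOT NULL,
    ChangedOn  DATETIME    NOT NULL DEFAULT (GETDATE()),
    ChangedBy  SYSNAME     NOT NULL DEFAULT (SUSER_SNAME()),
    ChangeType VARCHAR(10) NOT NULL
);
GO

CREATE TRIGGER trg_Employee_Audit
ON dbo.Employee
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Inserted & updated rows
    INSERT INTO dbo.EmployeeAudit (EmployeeID, ChangeType)
    SELECT EmployeeID,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'UPDATE' ELSE 'INSERT' END
    FROM   inserted;

    -- Deleted rows (no matching row in inserted)
    INSERT INTO dbo.EmployeeAudit (EmployeeID, ChangeType)
    SELECT d.EmployeeID, 'DELETE'
    FROM   deleted AS d
    WHERE  NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.EmployeeID = d.EmployeeID);
END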

Naming Conventions & Standards: 
Working on somebody else's code is almost always a nightmare if proper naming conventions & standards are not followed. Following proper naming conventions brings consistency & reflects the quality of the work being produced. In a multi-team environment, where more often than not the team that supports an application is different from the team that actually developed it, just imagine how difficult it becomes for the support personnel to dig into each table to figure out what each bit and piece is doing. Proper documentation & comments also play their own part in the successful implementation of any data warehouse/front-end application.

In general, the bare minimum requirement for any database system is to:
1. Provide access to data in the agreed format as quickly as possible.
2. Maintain data in a consistent, accurate & reliable state.
3. Keep data redundancy to a minimum, which results in easier maintainability as well as lower storage requirements.
4. Leave scope for scalability to meet future business requirements with minimum impact on the existing system.

Apart from these, there are a few more points on which good database design depends:

Always remember to give unique & easily identifiable names to your constraints; it makes it easier to diagnose & understand where they are used when someone generates consolidated database scripts for any reason.


Like I mentioned above, this is not an exhaustive list of guidelines, but it is something which I refer to quite often when designing my applications, so please do share your feedback & comments. I will be more than happy to amend/improve this posting further.


Thanks,
Rishi Saxena



References:
  • Microsoft 
  • Wikipedia
  • Other Blogging sites

Tuesday, June 3, 2014

A quick introduction to BIDS & MSBI

So, after thinking about writing a blog for quite some time, this is my first attempt at blogging. I have gained a lot of technical information as I have grown as a professional over the years, thanks to all the active bloggers out there :-), so I believe it is now time to give something back to the community. So here is my first blog, a quick introduction to BIDS & MSBI.

In 2005, Microsoft released an IDE by the name of BIDS (Business Intelligence Development Studio), which was more like an extension of the Visual Studio IDE to allow developers to design data-centric applications using MSBI technologies with a SQL Server database as the backend. BIDS has been part of all the major SQL Server releases since then & is available as part of the SQL Server installation. It is tightly integrated with the Visual Studio .NET IDEs for the 2005, 2008 & 2010/2008 R2 installations. With the SQL Server 2012 release, it is no longer known as BIDS but as SQL Server Data Tools (SSDT) and is released as a separate add-on product.

MSBI stands for Microsoft Business Intelligence. It is a complete suite from Microsoft to design & develop high-performance Data Integration & Analysis solutions using SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS) & the SQL Server database.

Typical Use of MSBI Components:

SSIS:
SQL Server Integration Services is an ETL (Extract, Transform and Load) tool which can be used for a variety of tasks, such as -
  • High-performance data migration & integration with legacy systems
  • Automating extraction, transformation & loading from OLTP sources to DW solutions & data marts
  • Automating repetitive administrative tasks like backing up a cube/database server, incremental data loads etc.

SSRS:
SQL Server Reporting Services provides out-of-the-box tools & components to allow a developer to design, develop, deploy & maintain a variety of reports for critical decision making. It also provides the following capabilities -
  • Interfaces & tools to create complex reports that support critical decision making.
  • "Report Builder" capabilities for less technical users.
  • Full administrative control through the web interface.
  • Integration with SharePoint & ASP.NET websites.
  • Support for subscriptions to send out reports to intended recipients as per pre-defined schedules & formats.

SSAS:
SQL Server Analysis Services provides Online Analytical Processing (OLAP) & data mining capabilities for data warehousing applications.
  • Provides On-Line Analytical Processing (OLAP) capabilities
  • Builds multi-dimensional structures called cubes to store pre-calculated complex aggregations
  • Provides data mining capabilities to analyse patterns, trends, relationships etc. within the data

Major releases of SQL Server since 2005:
SQL Server 2005 - October 2005
SQL Server 2008 - August 2008
SQL Server 2008 R2 - April 2010
SQL Server 2010 - Yes, you read it right. There is no released version of SQL Server 2010
SQL Server 2012 - March, 2012
SQL Server 2014 - March, 2014

This is just a quick introductory overview to explain how the different components of MSBI fit together for any data-centric enterprise-level project. I will continue to add more blogs/information in the coming days for each of these components, which will probably be more detailed & extensive.

Please keep looking at this space for more updates & also do share your feedback & comments. I will be more than happy to amend/improve this posting further.

Thanks,

Rishi Saxena



References:
  • Microsoft 
  • Wikipedia