Thursday, June 26, 2014

Decoding Data Mapping Sheets - ETL Process

An ETL (Extract, Transform, Load) process is all about moving a variety of data from a Source System to the destination Data Warehouse system by taking the data through a number of extraction, transformation, data cleansing & data validation steps.

Just imagine how much easier it gets for an ETL developer if they get a chance to visualize all the transformations & business rules up front in an easy-to-interpret format. This is where mapping sheets come into the picture.

A carefully designed mapping sheet can save a lot of pain up front, as handling mapping information gets increasingly difficult as the application grows with time. The only downside of using them is that it takes a good amount of effort to create them & then keep them up to date. But, trust me, the rewards of maintaining one easily outweigh the pain of not having one as the system grows over time.

Each data migration project uses mapping sheets in one form or the other, but the one which I have used quite often & which has worked exceptionally well for me is what I am detailing here.

To start with, a sample Mapping Sheet for any given entity would look something like the one represented in Figure 1. The sheet can be extended further by using multiple sheets in the Excel workbook to represent the other entities of the system.

Fig 1 – Sample Mapping Sheet structure

Generally, the extreme left of the sheet represents data from the Source System or the staging area. The middle layer represents data validation, data verification, data cleansing & business rule validation, with a number of data-centric transformation rules in place. There can be more than one transformation layer in the centre depending upon how complex the ETL process is. The extreme right mostly represents the Data Warehouse system.

In Figure 2 below, I have taken the example of an Employee table in the staging area to show how an actual table structure from the database gets represented in a mapping sheet.

Fig 2 – How a database table structure is represented in a Mapping Sheet Document

Looking carefully at the snapshot in Figure 2, it provides almost all the information related to a table in the database: the schema name (stg), the name of the table [Employee], the names of the columns, their data types & whether they allow NULL or NOT NULL values. For example, EmployeeID is an INT NOT NULL column whereas EmployeeName is a varchar(255) NULL column.
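
To make this a little more concrete, here is a minimal T-SQL sketch of how the stg.Employee structure described above might be declared. Only EmployeeID and EmployeeName are taken from the mapping sheet; everything else about the script is illustrative.

    -- Illustrative sketch only: a permissive staging table mirroring the mapping sheet.
    -- Assumes the stg schema already exists.
    CREATE TABLE stg.Employee
    (
        EmployeeID   INT          NOT NULL,  -- INT, NOT NULL as per the mapping sheet
        EmployeeName VARCHAR(255) NULL       -- NULLs allowed at the staging layer
    );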

Along similar lines, mapping sheets can be extended to represent the structure of the given tables across the different layers. Figure 3 below helps you visualize how a field gets mapped, validated & transformed through the different layers of the ETL process while being migrated from the Source System to the ODS to the DW.

Fig 3 – Transformation rules at the table/field level

Another efficient use of a mapping sheet is to document business rules in layman's terms right next to the respective fields. Please refer to Figure 4 below. The figure shows only a select few, very basic transformation rules. Please note, the transformation rules represented here are for informative purposes only & real-world transformations could vary considerably depending upon project requirements. The mapping sheet merely works as a placeholder to store transformation information.

Some of the sample business rules could be:
1.  Use database defaults to set the CreatedOn & ModifiedOn fields to the current date & time.
2.  Use a SQL CASE statement to set a field to an integer value in the warehouse depending upon the text information coming in from the source feed.
3.  Use a placeholder to store I, U, D flags in the ODS layer to perform the respective actions in the warehouse for a given entity (a small T-SQL sketch of rules 2 & 3 follows Figure 4) – just to name a few.
4.  Other transformations would be something like the ones shown in Figure 4 below -

Fig 4 – Sample Transformation Rules
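
To make rules 2 & 3 a little more concrete (without turning this into a code walk-through), here is a small, purely illustrative T-SQL fragment. Only the ods.Employee table and the Action flag are taken from the text; the EmploymentType column and the code values are hypothetical.

    -- Illustrative sketch only.
    -- Rule 2: map incoming text to an integer code for the warehouse.
    -- Rule 3: carry the I/U/D action flag forward from the ODS layer.
    SELECT
        e.EmployeeID,
        CASE e.EmploymentType              -- hypothetical source text column
            WHEN 'Permanent' THEN 1
            WHEN 'Contract'  THEN 2
            ELSE 0                         -- default bucket for unknown values
        END AS EmploymentTypeCode,
        e.[Action]                         -- 'I', 'U' or 'D', set in the ODS layer
    FROM ods.Employee AS e;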

Some of the key points that must be considered while designing a mapping information document:    
1. Staging Layer – Design the staging area with minimal constraints/indexes and with no or minimal data integrity checks in place. The primary consideration for the extraction process must be to load data from the Source System as quickly as possible with minimal data leakage. Ideally, the data types in this layer should be minimally restrictive so that the full data set can pass through to the staging area without any data loss – for example, by allowing NULLs and by sizing columns big enough to hold whatever the feeding source system sends.

2. Operational Data Store Layer – In this layer most of the business rules are defined, and data types are generally tightly coupled with the data types in the warehouse layer. Most of the data transformations, error handling & data filtering are done here. An ideal ODS layer should also maintain audit trail information to keep track of the I, U, D operations applied to the Data Warehouse. In this example, the "Action" field in the ods.Employee table can be used to maintain the current state of a record throughout the life cycle of the ETL process.

3. Data Warehouse Layer – This layer contains only the current version of the data. Records get Inserted, Updated & Deleted in the Data Warehouse depending upon the incoming "Action" field from the ODS source. Normally, only a select few transformations are performed in this layer – the ones which are specific to data already available in the data warehouse.
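
As a rough illustration of point 3, assuming hypothetical ods.Employee and dw.Employee tables that share an EmployeeID key and the "Action" flag described above, the warehouse load could be sketched like this:

    -- Illustrative sketch only: apply I/U/D actions from the ODS to the warehouse.
    DELETE dw
    FROM   dw.Employee AS dw
    JOIN   ods.Employee AS o ON o.EmployeeID = dw.EmployeeID
    WHERE  o.[Action] = 'D';

    UPDATE dw
    SET    dw.EmployeeName = o.EmployeeName
    FROM   dw.Employee AS dw
    JOIN   ods.Employee AS o ON o.EmployeeID = dw.EmployeeID
    WHERE  o.[Action] = 'U';

    INSERT INTO dw.Employee (EmployeeID, EmployeeName)
    SELECT o.EmployeeID, o.EmployeeName
    FROM   ods.Employee AS o
    WHERE  o.[Action] = 'I';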

With this, I will now sign off on this topic. I have tried to keep this information in its most basic & simple form, so I hope it will come in handy in your respective projects. Please do share your inputs, feedback & comments & I will be more than happy to amend/improve this post further.

Thanks,

Rishi Saxena

Thursday, June 12, 2014

Database Design Consideration

So, this is my 2nd blog, and it details various database design considerations that one should follow while developing a database design. This is just a high-level overview of the guidelines that I normally follow. The list is not exhaustive, so there will be a few more considerations here and there depending upon the people involved and the environment under which the databases are designed. What this blog doesn't include is any in-depth SQL syntax or code walk-throughs, as the sole intention behind blogging on this topic is to share information about how the different bits and pieces fit together from a database development perspective.

Database design is one of the most important aspects of any application development life cycle. Performance issues caused by poor design & technical debt are not always visible while the system is small & in its early production days, but as the system grows with time they become more apparent and increasingly difficult to manage. It takes far more effort to improve the performance of a live application than to identify & fix the same issues during the early phases of the development cycle. It is always recommended to spend quality time (at least a couple of iterative cycles) thinking about, discussing & designing the database architecture up front rather than leaving it with a "let's see how it performs" attitude.

OLTP/OLAP Systems & Normalization:

OLTP systems, also known as transactional systems, contain day-to-day data processing information. These systems are built on top of relational database modelling techniques with optimal use of normalization for faster I/U/D operations. In these systems, maintaining data integrity & consistency with minimum redundancy takes precedence, so data is carefully normalized to make this task easier. Having said that, the tables still have to be normalized only to an optimal degree, because an overly normalized database has its own performance bottlenecks: queries end up joining many tables, and a single logical insert or update may have to touch several of them.

Online Analytical Processing (OLAP) applications are designed for reporting & analysis purposes & are built using dimensional modelling techniques. Here the preference is to get data back as quickly as possible in the required format, even at the cost of some redundant data & extra storage space. So some degree of de-normalization is commonly observed in OLAP systems, which means fewer joins among tables & faster retrievals.

The data between the OLAP & OLTP systems is not normally in sync and usually lags behind by a few hours, or even by a day in some cases. Automated jobs typically run in the background overnight to sync the OLTP/OLAP systems up, as these systems are designed to work that way anyway.

Based on the potential usage of the system, careful consideration is required to strike the right balance in the level of normalization.


Maintaining Data Integrity: 

There are certain business rules that can be designed & controlled at the database level, thereby providing the flexibility to not replicate them in the interfacing application or the feeder system. For example – log the current date & time as LastModifiedDate, log the current logged-in user's identity as LastModifiedBy, or ensure the age of an employee is between 21 & 60, just to name a few. All these rules at the database level help in maintaining data integrity. Primary keys, foreign keys, check constraints & default constraints are some of the out-of-the-box features that can be used to maintain data integrity.
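
As a small, hypothetical sketch of the example rules above (the table and column names are assumptions, not a prescription):

    -- Illustrative sketch only: the example rules expressed as constraints.
    CREATE TABLE dbo.Employee
    (
        EmployeeID       INT      NOT NULL PRIMARY KEY,
        Age              INT      NOT NULL
            CONSTRAINT CK_Employee_Age CHECK (Age BETWEEN 21 AND 60),
        LastModifiedDate DATETIME NOT NULL
            CONSTRAINT DF_Employee_LastModifiedDate DEFAULT (GETDATE()),
        LastModifiedBy   SYSNAME  NOT NULL
            CONSTRAINT DF_Employee_LastModifiedBy DEFAULT (SUSER_SNAME())
    );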

Primary Keys:
A primary key uniquely identifies a record in a table. Spend good time choosing the primary key for a table, giving future expansion plans due consideration: once the primary key is decided and the database architecture is designed around it, it takes great effort to change it to something else later. Apart from this, a primary key also holds an index that speeds up search requests, so choosing a key which doesn't change too often is another criterion for selecting the right one.

Foreign Keys:
Another aspect of data integrity is referential integrity, and foreign key constraints come in handy to enforce it. Putting foreign key constraints in place brings quality control and restricts data entry or deletion at the database level if the operation would result in orphaned and/or unaccounted-for records.
Consider a billing system where nobody would want to look at a Bill or an Order which doesn't have a customer associated with it. Proper use of foreign key constraints as per the business requirement will take care of scenarios like this without much add-on validation.
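
A hypothetical sketch of the billing example (the table names are illustrative):

    -- Illustrative sketch only: an Order cannot exist without an associated Customer.
    CREATE TABLE dbo.Customer
    (
        CustomerID INT NOT NULL PRIMARY KEY
    );

    CREATE TABLE dbo.[Order]
    (
        OrderID    INT NOT NULL PRIMARY KEY,
        CustomerID INT NOT NULL
            CONSTRAINT FK_Order_Customer REFERENCES dbo.Customer (CustomerID)
    );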

Default Constraints:
It is always advisable to put a default constraint on a database field, if a default value carries some business meaning, instead of making the field nullable & storing NULL values. This also takes the burden off the feeding system, which no longer has to worry about keying information into NOT NULL fields for which the information is not available at that very moment.

Optimal Use of Data Types:

Most important of them all is choosing the right data type. Remember, every data type consumes storage space, and indexes take time to re-arrange data with every I/U/D operation, so choosing a data type of the right size does make a difference. Imagine how much difference every byte saved by choosing the right data type can make when we are dealing with millions & millions of records.

Considering future expansion plans is another criterion that should be taken care of while deciding data types. Choosing a data type which works at present but could potentially fail in the future is not good decision making. This is where data profiling comes into the picture; it has to be done carefully to analyse how the data is being used currently and how it may be used in the future.
For example, choosing the int data type to store 0s & 1s, or the datetime data type to store a Date of Birth/Joining Date, is definitely not a good design approach. It would be wiser to use bit and date respectively.
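
For instance, the right-sized choices side by side with their storage cost (a hypothetical sketch):

    -- Illustrative sketch only: right-sized data types.
    CREATE TABLE dbo.EmployeeDates
    (
        IsActive    BIT  NOT NULL,  -- up to 8 bit columns share a single byte, vs. 4 bytes for an INT
        DateOfBirth DATE NOT NULL   -- 3 bytes, vs. 8 bytes for DATETIME
    );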


Indexing Decisions:

Indexes do speed up data access, as they cut down the number of records that need to be scanned to return the requested data, but they have a negative impact on insert, update & delete operations.

Creating an index is a trade-off: the performance gain observed on reads has to outweigh the extra storage requirement as well as the slowed-down I/U/D operations.

There is no golden rule for the number of indexes a table should have, and it also varies with whether you are designing an OLAP or an OLTP system, but trying out different combinations is the best way to figure out the optimum level of indexing for your database tables.
For primary keys, Microsoft recommends using integer data types because SQL Server is optimized to index them. Similarly, when joining multiple tables, it is advisable to create non-clustered indexes on the foreign key columns to enhance performance.
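
Continuing the earlier hypothetical billing example, a non-clustered index on the foreign key column might look like this:

    -- Illustrative sketch only: support joins from Order to Customer.
    CREATE NONCLUSTERED INDEX IX_Order_CustomerID
        ON dbo.[Order] (CustomerID);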

Utilizing Stored Procedures / Views / Triggers:
An enterprise database design is incomplete without the use of these. There are always workarounds available to avoid some or all of them, but they are an integral part of any database project and should be used where appropriate, because the advantages of using them mostly outweigh the disadvantages.

Stored Procedures: Stored procedures are compiled only once & then stored in an executable form, so stored procedure calls are both efficient & quick. Apart from this, they also provide flexibility, scalability & easy maintainability. Remember, if you don't use stored procedures and embed all database-related business processing in the application, even the slightest change to that code will call for re-compilation & deployment of the entire application. Encapsulating business rules in stored procedures makes the code easier to maintain and also makes it unit-testable.
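
A minimal, hypothetical example of encapsulating one such database call in a procedure (the table comes from the earlier sketch):

    -- Illustrative sketch only: a thin procedure the application calls
    -- instead of embedding the SQL statement in application code.
    CREATE PROCEDURE dbo.usp_GetEmployeeByID
        @EmployeeID INT
    AS
    BEGIN
        SET NOCOUNT ON;
        SELECT EmployeeID, Age, LastModifiedDate, LastModifiedBy
        FROM   dbo.Employee
        WHERE  EmployeeID = @EmployeeID;
    END;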

Views: For an enterprise-level application, no one wants to expose each and every table & database object to everyone. This is where views come in handy. They provide the flexibility to hide business complexity as well as to mask the sensitive data of your database and expose only what you want others to see. They also provide code reusability: if you build a long query with join after join to get some data, you don't have to write that big SQL statement every time – just put it in a view & reference the view wherever it is required.
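
For instance, a multi-table join could be wrapped up once, roughly like this (hypothetical names again):

    -- Illustrative sketch only: hide the join behind a view and expose only selected columns.
    CREATE VIEW dbo.vw_CustomerOrders
    AS
    SELECT c.CustomerID, o.OrderID
    FROM   dbo.Customer AS c
    JOIN   dbo.[Order]  AS o ON o.CustomerID = c.CustomerID;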

Triggers: These can be used to maintain audit trail information (see the sketch below). They are fully customizable & are executed automatically when data is inserted/updated/deleted, so in theory they are designed once and done for all. Apart from this, they can be used to encapsulate business rules at the database level, thereby preventing errors from bubbling through to the presentation layer or the feeder system.
Triggers do have their own disadvantages as well, such as putting some extra overhead on the database server & being invisible to the outside world, which means at times it becomes difficult to track what went wrong with the data in the database. So the trade-offs have to be carefully analysed before deciding whether or not to use them.
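
A bare-bones sketch of an audit-trail trigger, assuming a hypothetical audit table alongside the Employee table used earlier:

    -- Illustrative sketch only: record which employee rows changed and when.
    CREATE TABLE dbo.EmployeeAudit
    (
        EmployeeID INT      NOT NULL,
        ChangedBy  SYSNAME  NOT NULL DEFAULT (SUSER_SNAME()),
        ChangedOn  DATETIME NOT NULL DEFAULT (GETDATE())
    );
    GO
    CREATE TRIGGER trg_Employee_Audit
    ON dbo.Employee
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO dbo.EmployeeAudit (EmployeeID)
        SELECT EmployeeID FROM inserted
        UNION
        SELECT EmployeeID FROM deleted;
    END;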

Naming Conventions & Standards: 
Working on somebody else's code is almost always a nightmare if proper naming conventions & standards are not followed. Following proper naming conventions brings consistency and reflects the quality of the work being developed. In a multi-team environment, where more often than not the team that supports an application is different from the team that actually developed it, just imagine how difficult it becomes for the support personnel to dig into each table to figure out what each bit and piece is doing. Proper documentation & comments also play their part in the successful implementation of any data warehouse/front-end application.

In general, the bare minimum requirements for any database system are to:
1. Provide access to data in the agreed format as quickly as possible.
2. Maintain data in a consistent, accurate & reliable state.
3. Keep data redundancy to a minimum, which results in easier maintainability and lower storage requirements.
4. Leave scope for scalability to meet future business requirements with minimum impact on the existing system.

Finally, always remember to give unique & easily identifiable names to your constraints; it makes it much easier to diagnose issues & understand where they are used when someone generates consolidated database scripts for any reason.

As I mentioned above, this is not an exhaustive list of guidelines, but it is the one I refer to most often when designing my applications, so please do share your feedback & comments. I will be more than happy to amend/improve this post further.


Thanks,
Rishi Saxena



References:
  • Microsoft 
  • Wikipedia
  • Other Blogging sites

Tuesday, June 3, 2014

A quick introduction to BIDS & MSBI

So, after thinking about writing a blog for quite some time, this is my first attempt at blogging. I have gained a lot of technical information as I have grown as a professional over the years, thanks to all the active bloggers out there :-), so I believe it is now time to give something back to the community. So here is my first blog, a quick introduction to BIDS & MSBI.

In 2005, Microsoft released an IDE by the name of BIDS (Business Intelligence Development Studio), which was essentially an extension of the Visual Studio IDE that allowed developers to design data-centric applications using MSBI technologies with a SQL Server database as the backend. BIDS has been part of all the major SQL Server releases since then & is available as part of the SQL Server installation. It integrates tightly with the Visual Studio .NET IDEs for the 2005, 2008 & 2008 R2/2010 installations. With the SQL Server 2012 release it is no longer known as BIDS; it is now called SQL Server Data Tools (SSDT) and is released as a separate add-on product.

MSBI:
MSBI stands for Microsoft Business Intelligence. It is a complete suite from Microsoft for designing & developing high-performance data integration & analysis solutions using SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS) & the SQL Server database.

Typical Use of MSBI Components:

SSIS:
 SQL Server Integration Services is an ETL (Extract, Transform and Load) tool which can be used for a variety of tasks like -
  • High-performance Data Migration & Integration with Legacy Systems
  • Automating Extraction, Transformation & Loading capabilities from OLTP Source to DW Solutions & Data Marts 
  • Automating repetitive Administrative tasks like backing up a cube/database server, incremental data loads etc

SSRS:
 SQL Server Reporting Services provides out-of-the-box tools & components that allow a developer to design, develop, deploy & maintain a variety of reports for critical decision making. It also provides the following capabilities -
  • Interfaces & tools to create complex reports that support critical decision making.
  • "Report Builder" capabilities for less technical IT worker groups.
  • Full administrative control through the web interface.
  • Integration with SharePoint & ASP.NET websites.
  • Support for subscriptions to send out reports to the intended recipients on pre-defined schedules & in pre-defined formats.

SSAS:
Microsoft SQL Server Analysis Services provides Online Analytical Processing (OLAP) & data mining capabilities for Data Warehousing applications.
  •  Provides On-Line Analytical Processing (OLAP) capabilities
  •  Builds multi-dimensional structures called Cubes to store pre-calculated complex aggregations
  •  Provides Data Mining capabilities to analyse patterns, trends, relationships etc. within the data

Major releases of SQL Server since 2005:
SQL Server 2005 - October 2005
SQL Server 2008 - August 2008
SQL Server 2008 R2 - April 2010
SQL Server 2010 - Yes, you read it right. There is no released version of SQL Server 2010
SQL Server 2012 - March, 2012
SQL Server 2014 - March, 2014

This is just a quick introductory overview to explain how the different components of MSBI fit together in any data-centric enterprise-level project. I will continue to add more blogs/information in the coming days for each of these components, which will probably be more detailed & extensive.

Please keep watching this space for more updates & also do share your feedback & comments. I will be more than happy to amend/improve this post further.

Thanks,

Rishi Saxena



References:
  • Microsoft 
  • Wikipedia