An ETL (Extract
Transform Load) process is all about moving a variety of data from Source System
to the destination Data Warehouse System by taking data through a number of
extraction, transformation, data cleansing & data validation processes.
Just imagine how easy it will get for someone as an ETL
developer if he gets a chance to visualize all the transformations &
business rules up front in an easy to interpret format. This is where mapping
sheets come into picture.
A carefully designed mapping sheet up-front can save a
lot of pain as handling mapping information increasingly gets difficult as application
grows with time. Only downside of using them is it takes good effort to create them
& then keeping them up-to-date. But, trust me, rewards of using them easily
outnumber the pain of not maintaining one as the system grows overtime
Each data migration project uses mapping sheets in one
form or the other but the one which I have used too often & has worked
exceptionally well for me is what I am detailing in this here.
To start with a sample Mapping Sheet would look
something like this for any given entity as represented in Figure 1. The sheet can be extended further by using multiple
sheets in the Excel workbook to represent other entities of the system.
Fig 1 – Sample Mapping Sheet structure
Generally, the extreme left of the sheet represents data
from the Source System or the staging area. The middle layer represents data
validation, data verification, data cleansing & business rule validation with
a number of data-centric transformation rules in place. There can be more than
1 transformation layer in the centre depending upon how complex the ETL process
is. The extreme right mostly represent Data Warehouse system.
In the Figure 2
below, I have taken example of Employee table in staging area to represent how actual
table structure from database gets represented in a mapping sheet.
Fig 2 – How a database table structure is represented in a Mapping Sheet Document
Carefully looking at the snapshot in figure 2, it provides
almost all the information related to a table in a database. The information includes
everything like - Schema name (stg), Name of the Table [Employee] along with name
of the columns, data types & whether they allows NULL or NOT NULLS values.
For eg - EmployeeID is INT & NOT NULL whereas EmployeeName is going to be
varchar(255) NULL column.
On a similar line, mapping sheets can be extended to further
represent structure of the given tables across different layers. Figure 3 below further helps you visualize how a field gets mapped, validated &
transformed through different layers of the ETL process while being migrated
from Source System to ODS to DW.
Fig 3 – Transformation rules at the table/field level
Another efficient use of mapping sheet would be to document
Business rules in layman’s term against respective field right next to them.
Please refer to Figure 4 below. The figure represents only a selective few & a very
basic level transformation rules. Please
note, the transformation rules represented here are only from informative
purpose & real time transformation could vary considerably depending upon
project requirements. The mapping sheets only works as placeholders to
store transformation information.
Some of the sample business rules could be like
1. Use database defaults to set to Current
Date & Time for CreatedOn, ModifiedOn fields.
2. Use SQL Case statement to set a field to an
integer value in warehouse depending upon text information coming in from source
feed.3. Placeholder to store I, U, D flags in ODS layer to perform respective actions in warehouse for a given entity – just to name a few
4. Other transformations would be something like these -
Figure 4 – Sample Transformation Rules :
Some of the key points that must be considered while designing a mapping information document:
1 Design staging
area with minimal constraints/indexes and with no or minimum data integrity checks
in place. Consideration for Extraction process must be to load data from Source
System as quickly as possible with minimal data leakage. Ideally, data type in
this layer should be minimally restrictive to allow full data to pass through
to staging area without any data loss. Like allowing NULLs, data size
sufficiently big enough to hold data from feeding source system.
2 Operational Data Store Layer - In this layer most of the business rules are defined and data types are generally tightly coupled with data types in warehouse layer. Most of the data transformations, error handling & data filtering are done in this layer. An ideal ODS layer should be able to maintain audit trail information to keep track of I, U, D operations to Data Warehouse. In this example, “Action" field in ods.Employee table can be used to maintain current state of a record throughout the life cycle of ETL process.
2 Operational Data Store Layer - In this layer most of the business rules are defined and data types are generally tightly coupled with data types in warehouse layer. Most of the data transformations, error handling & data filtering are done in this layer. An ideal ODS layer should be able to maintain audit trail information to keep track of I, U, D operations to Data Warehouse. In this example, “Action" field in ods.Employee table can be used to maintain current state of a record throughout the life cycle of ETL process.
3 Data
warehouse Layer - This layer contains only current version of data. The records
normally gets Inserted, Updated & Deleted in Data Warehouse depending upon
incoming “Action” field from ODS source. Normally, only selective few
transformation are performed in this layer, the ones which are specific to data
available in data warehouse s.
With this, I will now sign off on this topic. I have
tried keeping this information in its basic & simplest form so I hope this
information will come handy in your respective projects. Please do share your inputs, feedback &
comments & I will be more than happy to amend/improve this posting further.
Rishi Saxena