Effective ETL (Extract Transform Load)

What is ETL?

Extract, transform, and load (ETL) in database usage and especially in data warehousing involves:

  • Extracting data from outside sources
  • Transforming it to fit operational needs (which can include quality levels)
  • Loading it into the end target (database or data warehouse)

ETL is most commonly related to the data warehouse implementations. So, it is important to take a brief overview of the data warehouse model structure.

A Data Warehouse and ETL

A Data warehouse is a backbone of the entire BI solution architecture. Moreover, the data warehouse is intended to be a reference repository of an organization's electronically stored data. Data warehouse static architecture is represented by several logical parts listed below:

  • Staging database – represents staging data interface layer between operational source systems and data warehouse
  • System of records and Summary area database – major part of data  warehouse static architecture, repository of organization’s detailed data
    • System of records – repository of most detailed (reference, profile  and transactional) data
    • Summary area – repository of aggregated data from System of records. Typical aggregations are by product, customer, branch, etc.
    • Data Marts - is a subset of an organizational data store, usually oriented to a specific purpose or major data subject that may be distributed to support business needs. Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization.
    • Data Exchange Interface - Data Interfaces represents data structures and application services required data exchange between data warehouse and other systems.
    • Metadata – one of the most important part of static architecture of data warehouse, consist of information required for existence of data warehouse (description of structure of DW and Source Systems, business rules definition, mapping expressions, etc). Quality of metadata model in terms of comprehensives and functionality supported could have significant impact on efficiency and effectives of Data Integration process.

Regarding the model architecture, there are consequently different layers of ETL (shown in picture below).

ETL architecture

ETL is most resource and time consuming part of the project and requires significant effort form implementation team. However, that effort has to be mitigated with appropriate input of previous phases of the implementation process, especially, phase of mapping data sources to the data model. Result of this model is in fact specification for ETL application.

ETL applications are divided into segments according to logical and physical definition of data warehouse model. In accordance with that fact, we should consider:

  • ETL applications that realize feeding  data from data sources to staging model
  • ETL applications that realize feeding  data from staging model to the data warehouse
  • ETL applications that realize feeding  data from the data warehouse to data mars
  • ETL applications that realize feeding  data from the data warehouse to third party applications

Technically, ETL applications can be developed in SQL or some specific purpose tool such as SQL Server System Integration Services (SSIS), Informatika, Data Stage, etc. One of the important features of ETL platform should be ability to extend. For instance in company I have worked for we have used  SSIS tool for ETL. Beside number of built-in SSIS components we have develop several custom made: surrogate key, classification resolver, etc. Custom developed components significantly improve our ETL application performance, speed up development and finally simplify applications. Picture below shows difference between application developed with original components and one developed with custom components.

 

Custom Developed SSIS Components

Guidelines For successful ETL

Since that ETL (Extract Transform Load) phase consume more about 70% of effort and time in data warehouse projects, therefore is reasonably important to define  factors required for successful completion of ETL process.

 There are top guidelines for successful ETL:

  1. Chose the right ETL tool – large volume of data and complex transformations requires powerful tool available to dealing with these challenges. Consider tool extension  according to your particular need (e.g. you can develop components that can improve efficiency of implementation IBM IFW model based data warehouse by developing components that’s fits the model).
  2. Use a lot metadata – since that essence of ETL is composed by mapping of operational systems to destination as well as data transformation, store all of that in metadata instead of applications. Over the time you will get benefits through maintenance lower costs.
  3. Engage experienced ETL developers – they are expensive but several times more efficient then inexperienced
  4. Make/Choose the right destination data model -  operational sources you cannot choose, but good destination models allows you less complex transformations and better information potential
  5. Formalize the process – use formal templates for  ETL application specification, mappings, classifications
  6. ETL project plan should consist of certain number of milestones – you have to know where you are in evry moment of the project
  7. During ETL development process is extremely important all-inclusive and timely testing of data in order to verify mappings simplify data cleansing process and achieve maximum of data accuracy.
  8. Establish Data Governance – during ETL process, it is extremely important and quite efficient to establish data governance within the organization if no any. A proper Data Governance policy and procedures is essential environment for the ETL process during the operational work.  
Rate:
 
3 comments

Comments

   Here is sample of ETL mapping:

Sample of Mapping

Hi Raj

for the moment I dont have a white paper. For the maping we are using a custom developed tool leveraging om metadata and using custom developed expression editor for mapping definitions. However, you can do mappings even in Excel (but in this case is  questionable how you will use these mappings in your ETL applications). In general, mapping cosist of source part (sorce tables and colums), destination part (data warehouse tables, columns) and transformation expressions.

Do you have a white paper on this?

How would you build data quality and data validation within the informatica processes?

How would a typical mapping look like?

In our company, we do similar things, but not too much. What we really want to drive data validation and metrics (could be an independent process) so that we can generate automatic report on tolerances.

Thanks.
Raj

Post new comment

The content of this field is kept private and will not be shown publicly.
CAPTCHA
This question is for testing whether you are a human visitor and to prevent automated spam submissions.
Image CAPTCHA
Enter the characters (without spaces) shown in the image.