"Successful and unsuccessful people do not vary greatly in their abilities. They vary in their desires to reach their potential."
Inside of a Data Warehouse Model
A Data Warehouse Logical Parts
A Data warehouse is a backbone of the entire BI solution architecture. Moreover, the data warehouse is intended to be a reference repository of an organization's electronically stored data. Data warehouse static architecture is represented by several logical parts listed below:
- Staging database – represents staging data interface layer between operational source systems and data warehouse
- System of records and Summary area database – major part of data warehouse static architecture, repository of organization’s detailed data
- System of records – repository of most detailed (reference, profile and transactional) data
- Summary area – repository of aggregated data from System of records. Typical aggregations are by product, customer, branch, etc.
- Data Marts - is a subset of an organizational data store, usually oriented to a specific purpose or major data subject that may be distributed to support business needs. Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization.
- Data Exchange Interface - Data Interfaces represents data structures and application services required data exchange between data warehouse and other systems.
- Metadata – one of the most important part of static architecture of data warehouse, consist of information required for existence of data warehouse (description of structure of DW and Source Systems, business rules definition, mapping expressions, etc). Quality of metadata model in terms of comprehensives and functionality supported could have significant impact on efficiency and effectives of Data Integration process.
Let’s go to see in details each of the data warehouse logical parts listed above.
Staging database
Staging database represents data interface layer between operational source systems and the data warehouse. This layer is intended to accept data from operational source systems, including certain level of data transformation, and prepare data for loading into the data warehouse. A Staging area is mainly required in a Data Warehousing Architecture for timing reasons. In short, all required data must be available before data can be integrated into the Data Warehouse. Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical factors, it is not feasible to extract all the data from all Operational databases at exactly the same time.
One of the crucial activities in the data warehouse implementation is mapping operational data sources to the data warehouse entities. Consequently, design of staging database (in terms of defining entities, attributes, relationships) has significant importance. Well designed staging database could improve whole ETL (Extract, Transform, and Load) process and simplify mapping (operational systems to the data warehouse) process. There is a different approach and guidelines for staging database design. Here is described the original one, used in company I working for (Pexim solutions) for IBM IFW (Information Framework) BDW (Banking Data Warehouse) implementations.
Mapping of Operational data to the data warehouse model usually means mapping of hundreds of entities form both sides. Considering this fact, mapping process requires significant effort to be completed. So, the question is: can we make some improvement id design of staging area database in order to improve efficiency of that proces?
The answer is YES, we can!
In order to do that, we need to apply certain principles for data staging design.
Presumption for staging area design:
- Staging area should consist of relatively small number of entities, up to 20, enable to receive a whole informational potential form the operational sources. Those entities are named “messages”
- Each “message” entity represents subject related generalization of number of entities from the data warehouse. Logical foundation for generalization and grouping is subject areas in the data warehouse.
- Physical representation of staging database is relevant: could be relation database, files (txt, xml), in memory, etc.
- Data loaded into Staging Area entities (“messages”) must be in their native format or slightly transformed, in order to prepare data for the next ETL stage ( form Staging database to data warehouse)
- classification resolving is not considered at this stage - should be a part of the next ETL stage
Benefits of using this approach for Staging area design:
- Now, we have mapping of hundreds of entities form operational sources to up to 20 entities in the staging database, unlike of mapping hundreds of entities from both sides, That is much faster and less complex
- Usage of the data warehouse subject areas as a foundation for building staging area entities making mapping process much more effective and efficient
The benefits of such design are practically confirmed.
System of Records and Summary Area
System of Records is a heart of the data warehouse. This part of the data warehouse keeps most detail level of data including fundamental entities data, classification data and transactional data. Well designed System of Records is great fundamental for overall reporting and analytical capabilities of whole business intelligence system.
Presumptions for System of Records design:
- Should consist following types of entities: fundamental entities, housekeeping entities, classification entities, transactional entities, relationship entities
- Fundamental entities describe participants (any kind) in the business processes: products, customers, organization, geography, etc.
- Housekeeping entities are basic reference data such as time, currency, etc
- Classification entities classify other entities. This is one of the most important steps in data warehouse design and has to be proceeded carefully – well defined classification models means a great reporting and analytical potential.
- Transactional entities should include all kind of transactions (e.g. loan repayments, current account transactions, etc)
- Relationship entities keep data about relationship between all other entity types.
System of records does not contain any summarization – just pure, native, detail level data.
Summary area contains summarized data form System of records. Logical foundation for definition of Summary area entities are usually fundamental entities. For instance we have to define summary data by product, customer, organization, etc. Data is always summarized according to certain level of time dimension.
Data marts
Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization. They are usually represented as a star or snowflakes schemas which contains data from System of records or Summary area. Data Marts are foundation of multidimensional (OLAP) analysis. There are three major considerations regarding the design of data marts:
- Since that data marts are foundation for multidimensional analysis it is important to consider all dimensions, measures and hierarchies.
- OLAP applications are leveraging on data marts. Consequently data marts physical design cold be influenced by features of the tool used for OLAP development (e.g. Microsoft Analysis Services supports multiple measure groups within one Cube, so you should consider that in physical design of data mart)
- Data Marts are business focused – in design process you should consider all aspects of business processes that could be a subject of particular data mart
Data exchange Interface
A Data Warehouse has been central repository of organization’s information potential. Accordingly there is a substantial need for data exchange between the data warehouse and other systems (e.g. calculation engines for Basel II). For that purpose you should design data exchange interface represented by set of entities available to receive data from data warehouse or third party systems. The major consideration for data Exchange interface design is similar as for design of Staging area – it should be good to establish data exchange interface as derivation of a data warehouse model (or part of them) in order to simplify mapping between models.
Data exchange interface is also known as Feedback Area.
Metadata
The system leveraging on well designed metadata is much more efficient, effective and flexible in comparison with one does not. Besides standard “data about data” story, the data warehouse metadata should consist all kind of mappings between structures (e.g. between operational systems ad Staging Area, or between Staging Area and Data Warehouse), classification resolving expressions as well as data about data warehouse processes (e.g. ETL packages description, Control Flows, etc). Well defined metadata will increase flexibility of system an maintenance will be easier (e.g. if you change some of the mappings you should change only metadata, but no applications)
1 vote | 7 comments

Comments
Clarification for tdespirlet regarding the comment “I don't agree on "Since that data marts are foundation for multidimensional analysis it is important to consider all dimensions, measures and hierarchies.". It depends on your business needs.”
By my opinion, in an integral approach it is very important to realize all aspects of artifacts used across the whole organization in different applications, even in different business domains. For instance if you consider Customer dimension, it is very important to understand usage of customer attribute members across the whole enterprise. Such kind of input will make your Customer dimension design/ development much more flexible.
I see,
Within a Data Vault the Data Mart data layer (not the OLAP or datamining repositories) is sometimes called the Staging Out layer, because it stages the data required for the different OLAP cubes, semantical layers etc. It is a layer and as such not a set of datamarts. In this layer data re-use between data marts/cubes/whatever can be handled. This layer should not be shown siloed like you see on most pictures as a set of datamarts. If you want a kimball approach on top of your EDW/Data Vault, it should be implemented here also.
This is a classical datawarehouse schema. Note, in some cases, you may have to load complete data, not only changed data.
Regarding the location of the data mart, this is not absolutely false depending on the situation.
If you have a complex EDW, it may be difficult to limit the number of "messages" in staging.
I don't agree on "Since that data marts are foundation for multidimensional analysis it is important to consider all dimensions, measures and hierarchies.". It depends on your business needs.
I suggest to take a look at link bellow - I hope it could clarify position of Data Marts inside of DW architecture.
http://www.globaldataconsulting.net/open-lab-topics/poster-anatomy-data-...
The architecture you lay out looks like a good candidate for the Data Vault appreach www.datavaultinstitute.com created by Dan Linstedt www.danlinstedt.com
there are a few differences though. the Data Mart layer should not be part of the central EDW (repository), although it is part of the full EDW.
Another interesting issue is the use of staging messaging tables and clasifications because this is prohibited in a Data Vault.
The reason for this is that Data Vault compromises a model transaformation from source system model (or 1-1 staging model) to central EDW model, so when required the mappping can be generated/automated, and does not have to be manually maintained. Also, by doing specialisation instead of generalisation the datamodel and its data will be more stable.
But a good architecture nonetheless.
It is typing gaffe. I apologize for that. Article is updated now.
you mention "data warts" in a couple of places in the sub section data marts. am a little confused. can you clarify the term data warts?
Post new comment