Optimizing Data Quality

Signs of an Unstable Data Foundation

It is relatively easy to perform a high-level data quality check in your organization. You just need honest answers to the following questions:

  • Is there a single enterprise view of data?
  • Are you able to gather data for as yet unspecified reporting requirements?
  • When senior management requests information, do you need intensive manual effort to respond?
  • Is there a common data “dictionary” across the enterprise?
  • Do you have multiple databases or spreadsheets storing similar data?
  • Is there defined ownership of data?
  • How difficult is it for your organization to comply with regulatory requirements?
  • What role does timely, comprehensive and accurate information play in multi-million dollar decisions?
  • How much effort do you need to consolidate data from multiple diverse sources?
  • Do you require significant effort in building a single architecture to address both data consolidation and data aggregation requirements?

The more questions that get an unfavorable answer, the worse the data quality level and the more effort it will take to improve it.

Data Profiling – More Detailed Analysis of Data Quality Status

Data profiling is the systematic analysis of the content of operational data sources. The analysis should be very detailed, from counting bytes and checking cardinalities up to the most thoughtful diagnosis of whether the data can meet the high-level goals of the organization (or data warehouse).

Data profiling analysis can be divided into a series of tests, starting with individual fields and ending with whole suites of tables comprising extended databases. Individual fields are checked to see that their contents agree with their basic data definitions and domain declarations. It is especially valuable to see how many rows have null values or have contents that violate the domain definition. For example, if the domain definition is “telephone number”, then alphanumeric entries clearly represent a problem. The best data profiling tools count, sort, and display the entries that violate data definitions and domain declarations. Moving beyond single fields, data profiling then describes the relationships discovered between fields in the same table. Fields that implement a key to the data table can be displayed, together with higher-level many-to-1 relationships that implement hierarchies. Checking what should be the key of a table is especially helpful because the violations (duplicate instances of the key field) are either serious errors or reflect a business rule that has not been incorporated into the ETL design. Relationships between tables are also checked in the data profiling step, including assumed foreign key to primary key relationships and the presence of parents without children.
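The field, key, and relationship checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular profiling tool works: the table and field names (customers, orders, customer_id, phone) and the telephone pattern are hypothetical, and rows are assumed to be plain dictionaries.

```python
import re
from collections import Counter

# Hypothetical domain rule for "telephone number": an optional leading "+",
# then digits mixed with spaces, parentheses, or dashes.
PHONE_PATTERN = re.compile(r"^\+?[0-9][0-9 ()\-]{5,}$")


def profile_field(rows, field, is_valid):
    """Count nulls and domain violations for one field and tally offending values."""
    nulls = 0
    violations = Counter()
    for row in rows:
        value = row.get(field)
        if value is None or value == "":
            nulls += 1
        elif not is_valid(value):
            violations[value] += 1
    return {"rows": len(rows), "nulls": nulls, "violations": violations}


def duplicate_keys(rows, key):
    """Return key values that occur more than once, with their counts."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


def orphans(children, fk, parents, pk):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk] for row in parents}
    return [row for row in children if row[fk] not in parent_keys]


def childless_parents(parents, pk, children, fk):
    """Return parent key values that no child row references."""
    used = {row[fk] for row in children}
    return sorted({row[pk] for row in parents if row[pk] not in used})


# Made-up sample data, for illustration only.
customers = [
    {"customer_id": 1, "phone": "+1 555 0100"},
    {"customer_id": 2, "phone": "CALL-ME"},
    {"customer_id": 2, "phone": None},
    {"customer_id": 3, "phone": "555-0199"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 9},
]

phone_profile = profile_field(
    customers, "phone", lambda v: bool(PHONE_PATTERN.match(v))
)
key_violations = duplicate_keys(customers, "customer_id")
orphan_orders = orphans(orders, "customer_id", customers, "customer_id")
unused_customers = childless_parents(customers, "customer_id", orders, "customer_id")
```

Each function returns the offending values rather than just a pass/fail flag, so the results can be counted, sorted, and displayed in the way the paragraph above describes.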

Finally, data profiling can be custom programmed to check complex business rules unique to a business such as verifying that all the preconditions have been met for granting approval of a major funding initiative.
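A custom business-rule check of this kind might look like the sketch below. The preconditions, field names (budget_signed_off, risk_review_done, sponsor, status), and record layout are all hypothetical; a real rule set would come from the business owners of the approval process.

```python
# Hypothetical preconditions for approving a funding initiative.
# Each entry is (description, predicate over one record).
PRECONDITIONS = [
    ("budget signed off", lambda r: r.get("budget_signed_off") is True),
    ("risk review completed", lambda r: r.get("risk_review_done") is True),
    ("executive sponsor named", lambda r: bool(r.get("sponsor"))),
]


def unmet_preconditions(record):
    """List the descriptions of preconditions that the record fails."""
    return [name for name, check in PRECONDITIONS if not check(record)]


def audit_approvals(records):
    """Map initiative id to unmet preconditions, for approved initiatives only."""
    findings = {}
    for record in records:
        if record.get("status") == "approved":
            missing = unmet_preconditions(record)
            if missing:
                findings[record["id"]] = missing
    return findings


# Made-up sample records, for illustration only.
initiatives = [
    {"id": "A", "status": "approved", "budget_signed_off": True,
     "risk_review_done": True, "sponsor": "CFO"},
    {"id": "B", "status": "approved", "budget_signed_off": True,
     "risk_review_done": False, "sponsor": ""},
    {"id": "C", "status": "draft", "budget_signed_off": False,
     "risk_review_done": False, "sponsor": None},
]

findings = audit_approvals(initiatives)
```

Keeping the rules in a list separate from the audit loop makes it easier to add or change preconditions without touching the checking logic.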

Key Roles: Data Ownership and Data Stewardship

As the name implies, data owners are those individuals or groups within the organization that are in the position to obtain, create, and have significant control over the content of the data (and sometimes over access to it and its distribution). Data owners often belong to a business rather than a technology organization. For example, an insurance agent may be the owner of the list of contacts of his or her clients and prospects.

The concept of data stewardship is different from data ownership. Data stewards do not own the data and do not have complete control over its use. Their role is to ensure that adequate, agreed-upon quality metrics are maintained on a continuous basis. In order to be effective, data stewards should work with data architects, database administrators, ETL (Extract-Transform-Load) designers, business intelligence and reporting application architects, and business data owners to define and apply data quality metrics. These cross-functional teams are responsible for identifying deficiencies in systems, applications, data stores, and processes that create and change data and thus may introduce or create data quality problems. One consequence of having a robust data stewardship program is its ability to help the members of the IT organization to enhance appropriate architecture components to improve data quality.

Data stewards must help create, and actively participate in, processes that allow the establishment of business-context-defined, measurable data quality goals. Only after an organization has defined and agreed on the data quality goals can the data stewards devise appropriate data quality improvement programs.

These data quality goals and the improvement programs should be driven primarily by business units, so it stands to reason that in order to gain full knowledge of the data quality issues, their roots, and the business impact of these issues, a data steward should be a member of a business team. Regardless of whether a data steward works for a business team or acts as a "virtual" member of the team, a data steward has to be very closely aligned with the information technology group in order to discover and mitigate the risks introduced by inadequate data quality.

Extending this logic even further, we can say that a data steward would be most effective if he or she can operate as close to the point of data acquisition as technically possible. For example, a steward for customer contact and service complaint data that is created in a company's service center may be most effective when operating inside that service center.

Finally, and in accordance with data governance principles, data stewards have to be accountable for improving the data quality of the information domain they oversee. This means not only appropriate levels of empowerment but also the organization's willingness and commitment to make the data steward's data quality responsibility his or her primary job function, so that data quality improvement is recognized as an important business function required for treating data as a valuable corporate asset.

Data Quality Management Process

In the previous sections we took a brief look at some important aspects of data quality. Now I will focus on the whole process. The picture below represents the activities required to improve the data quality management process:

[Figure: Data Quality Management Process]

The whole process consists of six major activities, each divided into a number of sub-activities.

The major activities are:

  • Data Domain Definition consists of the following sub-activities: Build Team, Build Common Definition, Build Data Quality Rules and Standards, Augment Data Quality Architecture.
  • System Mapping consists of the following sub-activities: Identify Source Systems, Map Data Elements to Source Systems.
  • Gap Analysis consists of the following sub-activities: Identify Initial Issues, Profile Data Elements, Build Data Quality Dashboard Reports, Develop List of Data Quality Issues.
  • Data Quality Remediation Strategy consists of the following sub-activities: Perform Root Cause Analysis and Build Reports, Develop Remediation Options, Develop Estimates and Business Case, Prioritize and Recommend Remediation Options.
  • Remediate consists of the following sub-activities: Rationalize Remediation Portfolio, Initiate Remediation Project, Cleanse and Correct Data, Implement Remediation Plan and Track Status.
  • Monitor and Control consists of the following sub-activities: Build Data Quality Scorecards/Reports, Assess Remediation Results, Provide Feedback.
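As an illustration of the Monitor and Control activity, the sketch below builds a very small scorecard that compares rule pass rates before and after remediation. The rule names, row counts, and the 98% target are assumptions made up for the example; the process described here does not prescribe any particular metric or threshold.

```python
def pass_rate(rows_checked, rows_failed):
    """Share of checked rows that satisfy a data quality rule."""
    if rows_checked == 0:
        return 1.0  # nothing checked, so nothing failed
    return (rows_checked - rows_failed) / rows_checked


def scorecard(before, after, target=0.98):
    """Compare per-rule pass rates before and after remediation.

    `before` and `after` map a rule name to (rows_checked, rows_failed).
    `target` is a hypothetical minimum acceptable pass rate.
    """
    report = []
    for rule, (checked, failed) in after.items():
        current = pass_rate(checked, failed)
        previous = pass_rate(*before[rule]) if rule in before else None
        report.append({
            "rule": rule,
            "before": previous,
            "after": current,
            "improved": previous is not None and current > previous,
            "meets_target": current >= target,
        })
    return report


# Made-up measurements, for illustration only.
before = {"phone is valid": (1000, 100), "customer_id is unique": (1000, 40)}
after = {"phone is valid": (1000, 10), "customer_id is unique": (1000, 40)}

report = scorecard(before, after)
```

Rules that did not improve, or that still miss the target, are natural candidates for the feedback step back into the remediation strategy.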

In general, an effective data quality management process has a few dimensions:

  • Evaluate the current state of data quality in the organization.
  • Identify and resolve issues.
  • Establish a data quality environment (people, policies and procedures, common definitions) capable of sustaining the required level of data quality after the initial data quality remediation.
  • Data quality activities have costs (time, money, people); be aware of them.

References:

  • IBM Research
  • www.ralphkimball.com
  • Master Data Management and Customer Data Integration for a Global Enterprise, by Alex Berson and Larry Dubov