Columnar Databases and Data Warehouse
Columnar Databases Overview
Columnar databases are an old idea that becomes actually again. The firs introduction of columnar databases was in 1970s. However in the 1980s and 1990s the huge improvement of hardware puts relational databases in a front page. Nowadays, a huge expansion of analytical (business intelligence/data warehouse) solutions working with large volume of data allow columnar databases to back in the game - as approach that could improve efficiency of analytical systems.
A columnar DBMS stores its content by column rather that by row. However, a database program must show data in a format of two-dimensional tables, but store it as one-dimensional strings. For instance look at following table:
|
Id |
Last Name |
First Name |
Salary |
|
1 |
Smith |
John |
10,000 |
|
2 |
Bolton |
Nick |
50,000 |
|
3 |
Jones |
Leanne |
40,000 |
A row-oriented database serializes all of the values in a row together, then the values in the next row, and so on. In a row-oriented database, accessing specific columns requires reading all records.
Here is the example of row serialization:
1, Smith, John, 10000, 2, Bolton, Nick, 50000, 3, Jones, Leanne, 40000
A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on. In a column-oriented database, only the columns in the query need to be retrieved.
Here is the example of row serialization:
1, 2, 3, Smith, Bolton, Jones, John, Nick, Leanne, 10000, 50000, 40000
The major, simplified advantage of column oriented databases over row oriented databases is in the efficiency of hard-disk access.
Columnar Databases Solving Data Warehouse Challenges
Columnar databases approach is especially important in data warehouse domain. Data warehouse are facing with large volume of data, complex data transformations and huge data aggregation. Consequently, appropriate data warehouse systems performance could be achieved by using an adequate approach in data management.
Why is row oriented databases are less performing?
In a typical relational database management system, data values are collected and managed as individual rows and events containing related rows (customer and order for example). This reflects the history wherein most data begins life in transactional applications which generally create or modify one or a few records at a time for performance reasons. Conversely, business intelligence and analytic applications, generated reports, and ad hoc queries often call upon the database to analyze selected attributes of vast number of rows or records, needing only those columns or aggregates of those columns to support the user’s needs.
Because of their row-based functions, a row-oriented database must read the entire record or “row” in order to access the needed attributes or column data. As a result, analytic and BI queries most often end up reading significantly more data than is needed to satisfy the request. This creates very large I/O burdens. In addition, the row-oriented RDBMS, having been designed for transactional activities, is most often built for optimum retrieval and joining of small data sets rather than large ones, further burdening the I/O subsystems that support the analytic store.
In response, system architects and DBAs often tune the environment for the different queries by building additional indexes, pre-aggregating data, and creating special materialized views and cubes. These require yet more processing time and consume additional persistent data storage. Because they are often quite query-specific, these tunings only address the performance of the queries that are known, and do not even touch upon general performance of ad hoc queries.
Another approach: Columnar databases
If the issues associated with poor performance can be attributed to the row-oriented, horizontal layout of the data, then consider the alternative: organizing the data vertically along the columns. As the name implies, a column-oriented database has its data organized and stored by columns. Because each column can be stored separately, for any query, the system can evaluate which columns are being accessed and retrieve only the values requested from the specific columns. Instead of requiring separate indexes for optimally tuned queries, the data values themselves within each column form the index, reducing I/O, enabling rapid access to the data without the need for expanding the database footprint, all while simultaneously and dramatically improving query performance.
Here is the list of particular benefits of such approach:
Better analytic performance: row oriented approach allows better performance in running a large number of simultaneous queries
Rapid joins and aggregation: data access streaming along column-oriented data allows for incrementally computing the results of aggregate functions, which is critical for data warehouse applications. In addition, there is no requirement for different columns of data to be stored together; allocating columnar data across multiple processing units and storage allows for parallel accesses and aggregations as well, increasing the overall query performance.
Suitability for compression: The columnar storage of data not only eliminates storage of multiple indexes, views and aggregations, but also facilitates vast improvements in compression, which can result in an additional reduction in storage while maintaining high performance.
Rapid data loading: The typical process for loading data into a data warehouse involves extracting data into a staging area, performing transformations, joining data to create denormalized representations and loading the data into the warehouse as fact and dimension tables, and then creating the collection of required indexes and views. In a row-based arrangement, all of the data values in each row need to be stored together, and then indexes must be constructed by reviewing all the row data. In a columnar arrangement the system effectively allows one to segregate storage by column. This means that each column is built in one pass, and stored separately, allowing the database system to load columns in parallel using multiple threads. Further, related performance characteristics of join processing built atop a column store is often sufficiently fast that the load-time joining required to create fact tables is unnecessary, shortening the latency from receipt of new data to availability for query processing. Finally, since columns are stored separately, entire table columns can be added and dropped without downing the system, and without the need to re-tuning the system following the change
Column Databases Challenges
Columnar databases have very clearly defined advantages against row oriented databases in a large volume of data. But, let’s see some other aspects:
Load time: Converting the data source into columnar format can be unbearably slow where tens or hundreds of gigabytes of data are involved.
Incremental loads: Incremental loads can be performance problematic.
Data compression: Some columnar systems greatly compress the source data. However, uncompressing the data to read it can slow performance.
Structural limitations: Columnar databases use different techniques to simulate a relational structure. Some require the same primary key on all tables, meaning the database hierarchy is limited to two levels. The limits imposed by a particular system may not seem to matter, but remember that your needs may change tomorrow. Constraints that seem acceptable now could prevent you from expanding the system in the future.
Scalability: Columnar databases major advantage is to get good performance on large databases. However, is there is reasonable to use columnar databases in case you are dealing with common size database?
Conclusion
Columnar databases are an interesting and actual approach that is quite suitable for data warehouse implementations. However, every pro have a con. There is no “one approach fit all”. Choosing the right approach, architecture, platform and technology for particular data warehouse implementation depends of number of various factors. Only careful evaluation can result with optimal choice for particular case.
References:
Wikipedia
David M. Raab , How To Judge A Columnar Database, Information Management Magazine, 2007
Lou Agosta, Columnar databases, appliances, cloud computing top BI trends, SearchDataManagement, 2009
David Loshin,Gaining the Performance Edge Using a Column-Oriented Database Management System, Sybase whitepaper
2 votes | 2 comments
Comments
Mr. Ciric:
Thanks for your post - it is a great summary of column-orientation, something we produce at Sybase.
I did find that your comments about compression and potential cost might be a bit misleading, so I have addressed them in the Sybase IQ blog at:
http://blogs.sybase.com/sybaseiq/2010/01/compression_versus_enumeration/
I'd welcome your comments on my attempt at explaining how we achieve data size reduction without explicit decompression costs.
Best regards,
Bill Jacobs
Director, Technical Marketing, Sybase IQ
Sometimes colums are great in BI, you are right!
You have probably never heard of APLDI in the mid 80's that used this approach....
From Grand daddy
Post new comment