BI / ETL / DWH: Column Oriented Database Technologies

My recent blog (Big Data & NoSQL Technologies) discussed various NoSQL technologies and market vendors. Today let’s dive into column-oriented databases and why they should play an important role in any data warehouse whose focus is on aggregations or metrics (and whose isn’t?).

So you are all probably familiar with row-oriented databases. Tables of data where rows of fields (also called columns) represent the structural storage and the corresponding SQL queries that select, insert, update, and delete that data. Most database vendors like Oracle, Microsoft, Sybase, Informix, and many others all base their technology on this ANSI standard. Column-oriented databases are indeed what you might surmise; tables of data where columns of data values represent the structural storage. What you might not expect is that on the surface many column-oriented databases look and feel like row oriented databases also using SQL queries in much the same way. Creating tables, storing data, querying them are all pretty much identical. They may appear similar, but two principal things to understand is that the significant differences under the hood, in particular, physical storage and query optimization.

As noted in my previous blogs on NoSQL, there is also a column-store technology out there. Let’s not confuse that with column oriented databases. They are different. Since several NoSQL column-store vendors were highlighted before, we will focus instead on the column oriented database vendors here.

First, some key benefits to column oriented databases:

High performance on aggregation queries (like COUNT, SUM, AVG, MIN, MAX)
Highly efficient data compression and/or partitioning
True scalability and fast data loading for Big Data
Accessible by many 3^rd party BI analytic tools
Fairly simple systems administration

Due to their aggregation capabilities which compute large numbers of similar data items, column oriented databases offer key advantages for certain types of systems, including:

Data Warehouses and Business Intelligence
Customer Relationship Management (CRM)
Library Card Catalogs
Ad hoc query systems

Column oriented database technology has actually been around for many years originating in 1969 with an application called TAXIR which provided abstracts for mathematical biosciences. In 1976, Canada implemented the RAPID system for processing and retrieval of population and housing census statistics. Sybase IQ was the only commercially available column-oriented database for many years, yet that has changed rapidly in the last few years. Let’s take a quick look at some of today’s key players:

SAP Sybase IQ (www.sybase.com)

A highly optimized analytics server designed specifically to deliver superior performance for mission-critical business intelligence, analytics and data warehousing solutions on any standard hardware and operating system. Its column oriented grid-based architecture, patented data compression, and advanced query optimizer delivers high performance, flexibility, and economy in challenging reporting and analytics environments.

Essentially a data partitioned, index based storage technology, Sybase IQ’s engine offers several key features which include:

Web enabled analytics
Communications & Security
Fast Data Loading
Query Engine supporting Full Text Search
Column Indexing Sub System
Column Storage Processor
User Friendly CUI based Administration & Monitoring
Multiplex Grid Architecture
Information Live-cycle management

The Sybase IQ Very Large Data Base (VLDB) option provides partitioning and placement where a table can have a specified column partition key with value ranges. This partition allows data that should be grouped together to be grouped together and separates data where they should be separated. The drawback to this methodology is that it is not always known which is which.

Infobright (www.infobright.com)

Offering both a commercial (IEE) and a free community (ICE) edition, the combination of a column oriented database with their Knowledge Grid architecture delivers a self-managed, scalable, high performance analytics query platform. Allowing 50Tb using a single server, their industry-leading data compression (10:1 up to 40:1) significantly reduces storage requirements and expensive hardware infrastructures. Delivered as a MySQL engine, Infobright runs on multiple operating systems and processors needing only a minimum of 4Gb of RAM (however 16Gb is a recommended starting point).

Avoiding partition schemes, Infobright data is stored in data packs, each node containing pre-aggregated statistics about the data stored within them. The Knowledge Grid above provides related metadata providing a high level view of the entire content of the database. Indexes, projections, partitioning or aggregated tables are not needed as these metadata statistics are managed automatically. The granular computing engine processes queries using the Knowledge Grid information to optimize query processing eliminating or significantly reducing the amount of data required for decompressing and access to answer a query. Some queries may not need to access the data at all, finding instead the answer in the Knowledge Grid itself.

The Infobright Data Loader is highly efficient so data inserts are very fast. This performance gain does come at a price so avoid updates unless absolutely necessary, design de-normalized tables, and don’t plan on any deletes. New features to the data loader include a reject option which allows valid rows to commit while invalid rows are logged. This is highly useful when loading millions of rows and only having a few rows with bad data. Without this feature the entire data load would be rolled back.

Vertica (HP) (www.vertica.com)

Recently acquired by Hewlett Packard, this platform was purpose built from the ground up to enable data values having high performance real-time analytics needs. With extensive data loading, queries, columnar storage, MPP architecture, and data compression features, diverse communities can develop and scale with a seamless integration ecosystem.

Claiming elasticity, scale, performance, and simplicity the Vertica analytics platform uses transformation partitioning to specify which rows belong together and parallelism for speed. Several key features include:

Columnar Storage & Execution
Real-Time Query & Loading
Scale-out MPP Architecture
Automatic High Availability
Aggressive Data Compression
Extensible In-Database Analytics Framework
In-Database Analytics Library
Database Designer & Administration Tools
Native BI & ETL support for MapReduce & Hadoop

The Vertica Optimizer is the brains of the analytics platform producing optimal query execution plans where several choices exist. It does this through traditional considerations like disk I/O and further incorporates CPU, memory, network, concurrency, parallelism factors and the unique details of the columnar operator and runtime environment.

ParAccel (www.paraccel.com)

Analytic-driven companies need a platform, not just a database where speed, agility, and complexity drive the data ecosystem. The ParAccel Analytic Platform streamlines the delivery of complex business decisions through its high performance analytic database. Designed for speed, its extensible framework supports on-demand integration and embedded functions.

The ParAccel Database (PADB) present four main components: the ‘Leader’ node, the ‘Compute’ node, the Parallel Communications Fabric, and an optional Storage Area Network (SAN). The ‘Leader’ controls the execution of the ‘Compute’ nodes and all nodes communicate with each other via the ‘Fabric’ running on standard x86 Linux servers. Each ‘Compute’ node is subdivided into a set of parallel processes called ‘slices’ that include a CPU core, and thier allocation of memory, and local disk storage. The ‘Communication Fabric’ provides a low-level MPP network protocol for increased performance.

Key PADB features include:

High Performance & Scalability
Columnar Orientation
Extensible Analytics
Query Compilation
High Availability
Solution Simplicity

ParAccel Integrated Analytics Library and Extensibility Framework incorporates advanced functions along with an API to add your own functions to help address complex business problems right in the core database enabling customers to focus upon their specific data complexities.

Microsoft SQL Server 2012 (www.microsoft.com)

Released this year, Microsoft has now embraced the columnar database idea. The latest SQL Server release 2012 includes xVelocity, a column-store index feature that stores data similar to a column-oriented DBMS. While not a true column oriented database, this technique allows for the creation of a memory optimized index that groups and stores data for each column then and joins them together to complete the index. For certain types of queries, like aggregations, the query processor can take advantage of the column-store index to significantly improve execution times. Column store indexes can be used with partitioned tables providing a new way to think about how to design and process large datasets.

The column-store index can be very useful on large fact tables in a Star schema improving overall performance, however the cost model approach utilized may choose the column-store index for a table when a row based index would have been better. Using the IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX query hint will work around this if it occurs. When data is stored with a column-store index, data can often be compressed more effectively over a row based index. This is accomplished as typically there is more redundancy within a column than within a row. Higher compression means less IO is required to retrieve data into memory which can significantly reduce response times.

There are several restrictions and limitation in using a column-store index. For example, which data types are supported or not and that you can only create one column-store index on any table can be problematic. Become familiar with what it can do and where best to use it. Currently the column-store index is not supported on Microsoft Azure.

Column-oriented databases provide significant advantages over traditional row oriented system applied correctly; In particular for data warehouse and business intelligence environments where aggregations prevail. It would not be fair however to ignore the disadvantages. Let’s look at these two:

Column-Oriented Advantages
- Efficient storage and data compression
- Fast data loads
- Fast aggregation queries
- Simplified administration & configuration

‘Column-Oriented Disadvantages
- Transactions are to be avoided or just not supported
- Queries with table joins can reduce high performance
- Record updates and deletes reduce storage efficiency
- Effective partitioning/indexing schemes can be difficult to design

The real value in using column-oriented database technology comes from high performance, scalable storage and retrieval of large to massive datasets (Big Data) focused on aggregation queries. Simply put: Reports! You can design Star schema’s or Data Vaults (The Data Vault – What is it? – Why do we need it?) incorporating these technologies and you will find that column-oriented databases provide a clear solution in data warehouse and business intelligence.

Look for future blogs on Hadoop/Hive/HBase and Extract-Transform-Load (ETL) technologies, and don’t be afraid to comment, question, or debate, there is always room to learn new things…

BI / ETL / DWH

Friday, 13 November 2015

Column Oriented Database Technologies

No comments:

Post a Comment