My recent blog (Big Data & NoSQL Technologies)
discussed various NoSQL technologies and market vendors. Today let’s
dive into column-oriented databases and why they should play an
important role in any data warehouse whose focus is on aggregations or
metrics (and whose isn’t?).
So you are all probably familiar with
row-oriented databases. Tables of data where rows of fields (also
called columns) represent the structural storage and the corresponding
SQL queries that select, insert, update, and delete that data. Most
database vendors like Oracle, Microsoft, Sybase, Informix, and many
others all base their technology on this ANSI standard. Column-oriented
databases are indeed what you might surmise; tables of data where
columns of data values represent the structural storage. What you might
not expect is that on the surface many column-oriented databases look
and feel like row oriented databases also using SQL queries in much the
same way. Creating tables, storing data, querying them are all pretty
much identical. They may appear similar, but two principal things to
understand is that the significant differences under the hood, in
particular, physical storage and query optimization.
As noted in my previous blogs on NoSQL,
there is also a column-store technology out there. Let’s not confuse
that with column oriented databases. They are different. Since several
NoSQL column-store vendors were highlighted before, we will focus
instead on the column oriented database vendors here.
First, some key benefits to column oriented databases:
- High performance on aggregation queries (like COUNT, SUM, AVG, MIN, MAX)
- Highly efficient data compression and/or partitioning
- True scalability and fast data loading for Big Data
- Accessible by many 3rd party BI analytic tools
- Fairly simple systems administration
Due to their aggregation capabilities
which compute large numbers of similar data items, column oriented
databases offer key advantages for certain types of systems, including:
- Data Warehouses and Business Intelligence
- Customer Relationship Management (CRM)
- Library Card Catalogs
- Ad hoc query systems
Column oriented database technology has
actually been around for many years originating in 1969 with an
application called TAXIR which provided abstracts for mathematical
biosciences. In 1976, Canada implemented the RAPID system for
processing and retrieval of population and housing census statistics.
Sybase IQ was the only commercially available column-oriented database
for many years, yet that has changed rapidly in the last few years.
Let’s take a quick look at some of today’s key players:
SAP Sybase IQ (www.sybase.com)
A highly optimized
analytics server designed specifically to deliver superior performance
for mission-critical business intelligence, analytics and data
warehousing solutions on any standard hardware and operating system.
Its column oriented grid-based architecture, patented data compression,
and advanced query optimizer delivers high performance, flexibility, and
economy in challenging reporting and analytics environments.
Essentially a data partitioned, index based storage technology, Sybase IQ’s engine offers several key features which include:
- Web enabled analytics
- Communications & Security
- Fast Data Loading
- Query Engine supporting Full Text Search
- Column Indexing Sub System
- Column Storage Processor
- User Friendly CUI based Administration & Monitoring
- Multiplex Grid Architecture
- Information Live-cycle management
The Sybase IQ Very
Large Data Base (VLDB) option provides partitioning and placement where a
table can have a specified column partition key with value ranges.
This partition allows data that should be grouped together to be grouped
together and separates data where they should be separated. The
drawback to this methodology is that it is not always known which is
which.
Infobright (www.infobright.com)
Offering both a
commercial (IEE) and a free community (ICE) edition, the combination of a
column oriented database with their Knowledge Grid architecture
delivers a self-managed, scalable, high performance analytics query
platform. Allowing 50Tb using a single server, their industry-leading
data compression (10:1 up to 40:1) significantly reduces storage
requirements and expensive hardware infrastructures. Delivered as a
MySQL engine, Infobright runs on multiple operating systems and
processors needing only a minimum of 4Gb of RAM (however 16Gb is a
recommended starting point).
Avoiding partition
schemes, Infobright data is stored in data packs, each node containing
pre-aggregated statistics about the data stored within them. The
Knowledge Grid above provides related metadata providing a high level
view of the entire content of the database. Indexes, projections,
partitioning or aggregated tables are not needed as these metadata
statistics are managed automatically. The granular computing engine
processes queries using the Knowledge Grid information to optimize query
processing eliminating or significantly reducing the amount of data
required for decompressing and access to answer a query. Some queries
may not need to access the data at all, finding instead the answer in
the Knowledge Grid itself.
The Infobright Data
Loader is highly efficient so data inserts are very fast. This
performance gain does come at a price so avoid updates unless absolutely
necessary, design de-normalized tables, and don’t plan on any deletes.
New features to the data loader include a reject option which allows
valid rows to commit while invalid rows are logged. This is highly
useful when loading millions of rows and only having a few rows with bad
data. Without this feature the entire data load would be rolled back.
Vertica (HP) (www.vertica.com)
Recently acquired by
Hewlett Packard, this platform was purpose built from the ground up to
enable data values having high performance real-time analytics needs.
With extensive data loading, queries, columnar storage, MPP
architecture, and data compression features, diverse communities can
develop and scale with a seamless integration ecosystem.
Claiming elasticity,
scale, performance, and simplicity the Vertica analytics platform uses
transformation partitioning to specify which rows belong together and
parallelism for speed. Several key features include:
- Columnar Storage & Execution
- Real-Time Query & Loading
- Scale-out MPP Architecture
- Automatic High Availability
- Aggressive Data Compression
- Extensible In-Database Analytics Framework
- In-Database Analytics Library
- Database Designer & Administration Tools
- Native BI & ETL support for MapReduce & Hadoop
The Vertica
Optimizer is the brains of the analytics platform producing optimal
query execution plans where several choices exist. It does this through
traditional considerations like disk I/O and further incorporates CPU,
memory, network, concurrency, parallelism factors and the unique details
of the columnar operator and runtime environment.
ParAccel (www.paraccel.com)
Analytic-driven
companies need a platform, not just a database where speed, agility, and
complexity drive the data ecosystem. The ParAccel Analytic Platform
streamlines the delivery of complex business decisions through its high
performance analytic database. Designed for speed, its extensible
framework supports on-demand integration and embedded functions.
The ParAccel
Database (PADB) present four main components: the ‘Leader’ node, the
‘Compute’ node, the Parallel Communications Fabric, and an optional
Storage Area Network (SAN). The ‘Leader’ controls the execution of the
‘Compute’ nodes and all nodes communicate with each other via the
‘Fabric’ running on standard x86 Linux servers. Each ‘Compute’ node is
subdivided into a set of parallel processes called ‘slices’ that include
a CPU core, and thier allocation of memory, and local disk storage.
The ‘Communication Fabric’ provides a low-level MPP network protocol for
increased performance.
Key PADB features include:
- High Performance & Scalability
- Columnar Orientation
- Extensible Analytics
- Query Compilation
- High Availability
- Solution Simplicity
ParAccel Integrated
Analytics Library and Extensibility Framework incorporates advanced
functions along with an API to add your own functions to help address
complex business problems right in the core database enabling customers
to focus upon their specific data complexities.
Microsoft SQL Server 2012 (www.microsoft.com)
Released this year,
Microsoft has now embraced the columnar database idea. The latest SQL
Server release 2012 includes xVelocity, a column-store index feature
that stores data similar to a column-oriented DBMS. While not a true
column oriented database, this technique allows for the creation of a
memory optimized index that groups and stores data for each column then
and joins them together to complete the index. For certain types of
queries, like aggregations, the query processor can take advantage of
the column-store index to significantly improve execution times. Column
store indexes can be used with partitioned tables providing a new way
to think about how to design and process large datasets.
The column-store
index can be very useful on large fact tables in a Star schema improving
overall performance, however the cost model approach utilized may
choose the column-store index for a table when a row based index would
have been better. Using the IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX query
hint will work around this if it occurs. When data is stored with a
column-store index, data can often be compressed more effectively over a
row based index. This is accomplished as typically there is more
redundancy within a column than within a row. Higher compression means
less IO is required to retrieve data into memory which can significantly
reduce response times.
There are several
restrictions and limitation in using a column-store index. For example,
which data types are supported or not and that you can only create one
column-store index on any table can be problematic. Become familiar
with what it can do and where best to use it. Currently the
column-store index is not supported on Microsoft Azure.
Column-oriented databases provide
significant advantages over traditional row oriented system applied
correctly; In particular for data warehouse and business intelligence
environments where aggregations prevail. It would not be fair however
to ignore the disadvantages. Let’s look at these two:
- Column-Oriented Advantages
- Efficient storage and data compression
- Fast data loads
- Fast aggregation queries
- Simplified administration & configuration
- ‘Column-Oriented Disadvantages
- Transactions are to be avoided or just not supported
- Queries with table joins can reduce high performance
- Record updates and deletes reduce storage efficiency
- Effective partitioning/indexing schemes can be difficult to design
The
real value in using column-oriented database technology comes from high
performance, scalable storage and retrieval of large to massive
datasets (Big Data) focused on aggregation queries. Simply put:
Reports! You can design Star schema’s or Data Vaults (The Data Vault – What is it? – Why do we need it?)
incorporating these technologies and you will find that column-oriented
databases provide a clear solution in data warehouse and business
intelligence.
Look for future blogs on
Hadoop/Hive/HBase and Extract-Transform-Load (ETL) technologies, and
don’t be afraid to comment, question, or debate, there is always room to
learn new things…