Showing posts with label Database. Show all posts
Showing posts with label Database. Show all posts

Friday, 13 November 2015

Column Vs Row Data Storage

 Overview of Row Data Storage and Column Data Storage

Relational databases typically use row-based data storage. However Column-based storage is more suitable for many business applications. SAP HANA supports both row-based and column-based storage, and is particularly optimized for column-based storage.

As shown in the figure below, a database table is conceptually a two-dimensional structure composed of cells arranged in rows and columns.

Because computer memory is structured linearly, there are two options for the sequences of cell values stored in contiguous memory locations:

Row Storage - It stores table records in a sequence of rows.
Column Storage - It stores table records in a sequence of columns i.e. the entries of a column is stored in contiguous memory locations.

Traditional databases store data simply in rows. The HANA in-memory database stores data in both rows and columns. It is this combination of both storage approaches that produces the speed, flexibility and performance of the HANA database.

Advantages of column-based tables:

Faster Data Access:
Only affected columns have to be read during the selection process of a query. Any of the columns can serve as an index.

Better Compression:
Columnar data storage allows highly efficient compression because the majority of the columns contain only few distinct values (compared to number of rows).

Better parallel Processing:
In a column store, data is already vertically partitioned. This means that operations on different columns can easily be processed in parallel. If multiple columns need to be searched or aggregated, each of these operations can be assigned to a different processor core

Advantages and disadvantages of row-based tables:

Row based tables have advantages in the following circumstances:
  • The application needs to only process a single record at one time (many selects and/or updates of single records).
  • The application typically needs to access a complete record (or row).
  • Neither aggregations nor fast searching are required.
  • The table has a small number of rows (e. g. configuration tables, system tables).

Row based tables have dis-advantages in the following circumstances:
  • In case of analytic applications where aggregation are used and fast search and processing is required. In row based tables all data in a row has to be read even though the requirement may be to access data from a few columns.

Which type of tables should be preferred - Row-based or Column-based?

In case of analytic applications where aggregations are used and fast search and processing is required row-based storage are not good. In row based tables all data in a row has to be read even though the requirement may be to access data from a few columns. Hence these queries on huge amounts of data take a lot of time.

In columnar tables, this information is stored physically next to each other, significantly increasing the speed of certain data queries.

The following example shows the different usage of column and row storage, and positions them relative to row and column queries. Column storage is most useful for OLAP queries (queries using any SQL aggregate functions) because these queries get just a few attributes from every data entry. But for traditional OLTP queries (queries not using any SQL aggregate functions), it is more advantageous to store all attributes side-by-side in row tables. HANA combines the benefits of both row- and column-storage tables.

Conclusion:
To enable fast on-the-fly aggregations, ad-hoc reporting, and to benefit from compression mechanisms it is recommended that transaction data is stored in a column-based table.

The SAP HANA data-base allows joining row-based tables with column-based tables. However, it is more efficient to join tables that are located in the same row or column store. For example, master data that is frequently joined with transaction data should also be stored in column-based tables. 

Column Oriented Database Technologies

My recent blog (Big Data & NoSQL Technologies) discussed various NoSQL technologies and market vendors.  Today let’s dive into column-oriented databases and why they should play an important role in any data warehouse whose focus is on aggregations or metrics (and whose isn’t?).
So you are all probably familiar with row-oriented databases.  Tables of data where rows of fields (also called columns) represent the structural storage and the corresponding SQL queries that select, insert, update, and delete that data.  Most database vendors like Oracle, Microsoft, Sybase, Informix, and many others all base their technology on this ANSI standard.  Column-oriented databases are indeed what you might surmise; tables of data where columns of data values represent the structural storage.  What you might not expect is that on the surface many column-oriented databases look and feel like row oriented databases also using SQL queries in much the same way.  Creating tables, storing data, querying them are all pretty much identical.  They may appear similar, but two principal things to understand is that the significant differences under the hood, in particular, physical storage and query optimization.
As noted in my previous blogs on NoSQL, there is also a column-store technology out there.  Let’s not confuse that with column oriented databases.  They are different.  Since several NoSQL column-store vendors were highlighted before, we will focus instead on the column oriented database vendors here.
First, some key benefits to column oriented databases:

  • High performance on aggregation queries (like COUNT, SUM, AVG, MIN, MAX)
  • Highly efficient data compression and/or partitioning
  • True scalability and fast data loading for Big Data
  • Accessible by many 3rd party BI analytic tools
  • Fairly simple systems administration
Due to their aggregation capabilities which compute large numbers of similar data items, column oriented databases offer key advantages for certain types of systems, including:
  • Data Warehouses and Business Intelligence
  • Customer Relationship Management (CRM)
  • Library Card Catalogs
  • Ad hoc query systems
Column oriented database technology has actually been around for many years originating in 1969 with an application called TAXIR which provided abstracts for mathematical biosciences.  In 1976, Canada implemented the RAPID system for processing and retrieval of population and housing census statistics.  Sybase IQ was the only commercially available column-oriented database for many years, yet that has changed rapidly in the last few years.  Let’s take a quick look at some of today’s key players:
SAP Sybase IQ  (www.sybase.com)
A highly optimized analytics server designed specifically to deliver superior performance for mission-critical business intelligence, analytics and data warehousing solutions on any standard hardware and operating system.  Its column oriented grid-based architecture, patented data compression, and advanced query optimizer delivers high performance, flexibility, and economy in challenging reporting and analytics environments.
Essentially a data partitioned, index based storage technology, Sybase IQ’s engine offers several key features which include:
    • Web enabled analytics
    • Communications & Security
    • Fast Data Loading
    • Query Engine supporting Full Text Search
    • Column Indexing Sub System
    • Column Storage Processor
    • User Friendly CUI based Administration & Monitoring
    • Multiplex Grid Architecture
    • Information Live-cycle management
The Sybase IQ Very Large Data Base (VLDB) option provides partitioning and placement where a table can have a specified column partition key with value ranges.  This partition allows data that should be grouped together to be grouped together and separates data where they should be separated.  The drawback to this methodology is that it is not always known which is which.
Infobright  (www.infobright.com)
Offering both a commercial (IEE) and a free community (ICE) edition, the combination of a column oriented database with their Knowledge Grid architecture delivers a self-managed, scalable, high performance analytics query platform.  Allowing 50Tb using a single server, their industry-leading data compression (10:1 up to 40:1) significantly reduces storage requirements and expensive hardware infrastructures.  Delivered as a MySQL engine, Infobright runs on multiple operating systems and processors needing only a minimum of 4Gb of RAM (however 16Gb is a recommended starting point).
Avoiding partition schemes, Infobright data is stored in data packs, each node containing pre-aggregated statistics about the data stored within them.  The Knowledge Grid above provides related metadata providing a high level view of the entire content of the database.  Indexes, projections, partitioning or aggregated tables are not needed as these metadata statistics are managed automatically.  The granular computing engine processes queries using the Knowledge Grid information to optimize query processing eliminating or significantly reducing the amount of data required for decompressing and access to answer a query.  Some queries may not need to access the data at all, finding instead the answer in the Knowledge Grid itself.
The Infobright Data Loader is highly efficient so data inserts are very fast.  This performance gain does come at a price so avoid updates unless absolutely necessary, design de-normalized tables, and don’t plan on any deletes.  New features to the data loader include a reject option which allows valid rows to commit while invalid rows are logged.  This is highly useful when loading millions of rows and only having a few rows with bad data.  Without this feature the entire data load would be rolled back.
Vertica (HP)  (www.vertica.com)
Recently acquired by Hewlett Packard, this platform was purpose built from the ground up to enable data values having high performance real-time analytics needs.  With extensive data loading, queries, columnar storage, MPP architecture, and data compression features, diverse communities can develop and scale with a seamless integration ecosystem.
Claiming elasticity, scale, performance, and simplicity the Vertica analytics platform uses transformation partitioning to specify which rows belong together and parallelism for speed.  Several key features include:
    • Columnar Storage & Execution
    • Real-Time Query & Loading
    • Scale-out MPP Architecture
    • Automatic High Availability
    • Aggressive Data Compression
    • Extensible In-Database Analytics Framework
    • In-Database Analytics Library
    • Database Designer & Administration Tools
    • Native BI & ETL support for MapReduce & Hadoop
The Vertica Optimizer is the brains of the analytics platform producing optimal query execution plans where several choices exist.  It does this through traditional considerations like disk I/O and further incorporates CPU, memory, network, concurrency, parallelism factors and the unique details of the columnar operator and runtime environment.
ParAccel  (www.paraccel.com)
Analytic-driven companies need a platform, not just a database where speed, agility, and complexity drive the data ecosystem.  The ParAccel Analytic Platform streamlines the delivery of complex business decisions through its high performance analytic database.  Designed for speed, its extensible framework supports on-demand integration and embedded functions.
The ParAccel Database (PADB) present four main components: the ‘Leader’ node, the ‘Compute’ node, the Parallel Communications Fabric, and an optional Storage Area Network (SAN).  The ‘Leader’ controls the execution of the ‘Compute’ nodes and all nodes communicate with each other via the ‘Fabric’ running on standard x86 Linux servers.  Each ‘Compute’ node is subdivided into a set of parallel processes called ‘slices’ that include a CPU  core, and thier allocation of memory, and local disk storage.  The ‘Communication Fabric’ provides a low-level MPP network protocol for increased performance.
Key PADB features include:
    • High Performance & Scalability
    • Columnar Orientation
    • Extensible Analytics
    • Query Compilation
    • High Availability
    • Solution Simplicity
ParAccel Integrated Analytics Library and Extensibility Framework incorporates advanced functions along with an API to add your own functions to help address complex business problems right in the core database enabling customers to focus upon their specific data complexities.
Microsoft SQL Server 2012  (www.microsoft.com)
Released this year, Microsoft has now embraced the columnar database idea. The latest SQL Server release 2012 includes xVelocity, a column-store index feature that stores data similar to a column-oriented DBMS.  While not a true column oriented database, this technique allows for the creation of a memory optimized index that groups and stores data for each column then and joins them together to complete the index.  For certain types of queries, like aggregations, the query processor can take advantage of the column-store index to significantly improve execution times.  Column store indexes can be used with partitioned tables providing a new way to think about how to design and process large datasets.
The column-store index can be very useful on large fact tables in a Star schema improving overall performance, however the cost model approach utilized may choose the column-store index for a table when a row based index would have been better.  Using the IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX query hint will work around this if it occurs.  When data is stored with a column-store index, data can often be compressed more effectively over a row based index.  This is accomplished as typically there is more redundancy within a column than within a row.  Higher compression means less IO is required to retrieve data into memory which can significantly reduce response times.
There are several restrictions and limitation in using a column-store index.  For example, which data types are supported or not and that you can only create one column-store index on any table can be problematic.  Become familiar with what it can do and where best to use it.  Currently the column-store index is not supported on Microsoft Azure.
Column-oriented databases provide significant advantages over traditional row oriented system applied correctly; In particular for data warehouse and business intelligence environments where aggregations prevail.  It would not be fair however to ignore the disadvantages.  Let’s look at these two:
  • Column-Oriented Advantages
    • Efficient storage and data compression
    • Fast data loads
    • Fast aggregation queries
    • Simplified administration & configuration
  • ‘Column-Oriented Disadvantages
    • Transactions are to be avoided or just not supported
    • Queries with table joins can reduce high performance
    • Record updates and deletes reduce storage efficiency
    • Effective partitioning/indexing schemes can be difficult to design
The real value in using column-oriented database technology comes from high performance, scalable storage and retrieval of large to massive datasets (Big Data) focused on aggregation queries.  Simply put: Reports!  You can design Star schema’s or Data Vaults (The Data Vault – What is it? – Why do we need it?) incorporating these technologies and you will find that column-oriented databases provide a clear solution in data warehouse and business intelligence.
Look for future blogs on Hadoop/Hive/HBase and Extract-Transform-Load (ETL) technologies, and don’t be afraid to comment, question, or debate, there is always room to learn new things…