Friday, 13 November 2015

Column Vs Row Data Storage

 Overview of Row Data Storage and Column Data Storage

Relational databases typically use row-based data storage. However Column-based storage is more suitable for many business applications. SAP HANA supports both row-based and column-based storage, and is particularly optimized for column-based storage.

As shown in the figure below, a database table is conceptually a two-dimensional structure composed of cells arranged in rows and columns.

Because computer memory is structured linearly, there are two options for the sequences of cell values stored in contiguous memory locations:

Row Storage - It stores table records in a sequence of rows.
Column Storage - It stores table records in a sequence of columns i.e. the entries of a column is stored in contiguous memory locations.

Traditional databases store data simply in rows. The HANA in-memory database stores data in both rows and columns. It is this combination of both storage approaches that produces the speed, flexibility and performance of the HANA database.

Advantages of column-based tables:

Faster Data Access:
Only affected columns have to be read during the selection process of a query. Any of the columns can serve as an index.

Better Compression:
Columnar data storage allows highly efficient compression because the majority of the columns contain only few distinct values (compared to number of rows).

Better parallel Processing:
In a column store, data is already vertically partitioned. This means that operations on different columns can easily be processed in parallel. If multiple columns need to be searched or aggregated, each of these operations can be assigned to a different processor core

Advantages and disadvantages of row-based tables:

Row based tables have advantages in the following circumstances:
  • The application needs to only process a single record at one time (many selects and/or updates of single records).
  • The application typically needs to access a complete record (or row).
  • Neither aggregations nor fast searching are required.
  • The table has a small number of rows (e. g. configuration tables, system tables).

Row based tables have dis-advantages in the following circumstances:
  • In case of analytic applications where aggregation are used and fast search and processing is required. In row based tables all data in a row has to be read even though the requirement may be to access data from a few columns.

Which type of tables should be preferred - Row-based or Column-based?

In case of analytic applications where aggregations are used and fast search and processing is required row-based storage are not good. In row based tables all data in a row has to be read even though the requirement may be to access data from a few columns. Hence these queries on huge amounts of data take a lot of time.

In columnar tables, this information is stored physically next to each other, significantly increasing the speed of certain data queries.

The following example shows the different usage of column and row storage, and positions them relative to row and column queries. Column storage is most useful for OLAP queries (queries using any SQL aggregate functions) because these queries get just a few attributes from every data entry. But for traditional OLTP queries (queries not using any SQL aggregate functions), it is more advantageous to store all attributes side-by-side in row tables. HANA combines the benefits of both row- and column-storage tables.

Conclusion:
To enable fast on-the-fly aggregations, ad-hoc reporting, and to benefit from compression mechanisms it is recommended that transaction data is stored in a column-based table.

The SAP HANA data-base allows joining row-based tables with column-based tables. However, it is more efficient to join tables that are located in the same row or column store. For example, master data that is frequently joined with transaction data should also be stored in column-based tables. 

No comments:

Post a Comment