Tuesday, 24 November 2015

How-to: Mapviewer Integration with OBIEE 11g (11.1.1.6 and higher)

I've seen quite a few articles on OTN and Google which outline how to configure Mapviewer for your 11g solution.  The problem is that many of these articles are:

1) overly complex
2) out of date

With the release of OBIEE 11.1.1.6, MapViewer comes pre-configured within WebLogic, and the only installation steps required are:


  • installing the NAVTEQ map data into your Oracle database
  • establishing a column-based relationship between the map and a subject area

You do not have to modify any WebLogic XML files, install any MapViewer jar files, or do any configuration within WebLogic. Below is a step-by-step guide on how to configure and use MapViewer on your 11.1.1.6.x box:

Step 1: Download the mapviewer 'mvdemo' data into your Oracle database

Oracle provides pre-configured maps, coordinates, and NAVTEQ data that you can use in your reports. You need to download the MVDemo Sample Data Set.

Step 2: Create the mvdemo schema in your Oracle database

We're going to import the NAVTEQ map data into an mvdemo schema in step 3. Let's go ahead and create the mvdemo schema by typing the following command at the SQL*Plus prompt:

grant connect, resource, create view to mvdemo identified by mvdemo;

Note that you will need the appropriate privileges to create this account. If denied, try logging in as sysdba by typing the following command into SQL*Plus:

CONNECT / AS sysdba
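Putting the two commands together, a minimal SQL*Plus session might look like this (the / AS SYSDBA login assumes you are on the database host with OS authentication available; otherwise connect with any account that has CREATE USER privileges):

-- log in with a privileged account
CONNECT / AS SYSDBA

-- create the mvdemo user and grant the privileges it needs in one statement
GRANT CONNECT, RESOURCE, CREATE VIEW TO mvdemo IDENTIFIED BY mvdemo;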

Step 3: Import navteq data dump to your Oracle database

Unzip the MV Demo Sample Data Set you just downloaded, and note the location of the 'mvdemo.dmp' file. This is the file we will import to the database.


Step 3.1)
Find the imp utility on your machine. It is usually located in the BIN folder of your Oracle home, e.g. ...\product\11.x.x\dbhome_1\BIN

Step 3.2) Navigate to that folder via command line and run the following command:

imp mvdemo/mvdemo@ORCL file=mvdemo.dmp full=y ignore=y
where ORCL is your database SID
and file=mvdemo.dmp is the path (including the mvdemo.dmp file name) to the dump file

You should get the following result:



Step 4) Import the map metadata

The Map Viewer Sample Data Set comes with city, county, state, highway, and other geographical map-based data that you can impose onto the map. We're going to import this data set by running the following command in SQL*Plus:

@mvdemo.sql

Note that you must include the path of the mvdemo.sql file e.g. @C:\folder\folder1\mvdemo\mvdemo.sql
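As an optional sanity check, you can confirm from the mvdemo session that both the spatial tables and the MapViewer metadata were loaded. The queries below assume the standard Oracle Spatial metadata views (USER_SDO_MAPS, USER_SDO_THEMES, USER_SDO_STYLES) that mvdemo.sql populates:

-- spatial tables imported from the dump (COUNTIES, STATES, INTERSTATES, etc.)
SELECT table_name FROM user_tables ORDER BY table_name;

-- map, theme, and style definitions created by mvdemo.sql
SELECT name FROM user_sdo_maps;
SELECT name FROM user_sdo_themes;
SELECT count(*) FROM user_sdo_styles;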

Step 5) Add your Oracle Database as a MapViewer Data Source

No, we're not doing this in WebLogic - MapViewer data source configuration is still done at your http://localhost:9704/mapviewer location.


You'll arrive at a landing page like below, where you must click the 'Admin' button to log in:





5.1) Log in using your WebLogic username/password

5.2) You should arrive at a home page with a link to view Datasources. Click it and you'll land on the screen below:


Name = name/description of your data source
Host = hostname/ip address of your database
Port = database port number
SID = the system identifier of your Oracle database (by default it is orcl)
user/password: If you followed my above steps, it will be mvdemo/mvdemo
# Mappers and Max Connections specify how many simultaneous users can connect to the MapViewer database. For diagnostic purposes I would set these relatively high; once development is complete you can adjust as needed.
Step 6) Modify the mapViewerConfig.xml file to include the new data source
I'm a little surprised as to why this must be done manually - if anyone has any insight, please feel free to leave feedback. After you add the data source as outlined in step 5, you must then add it to the mapViewerConfig.xml file; otherwise, when the BI Server is rebooted, your data source connection will be removed!
Luckily, this step is not too difficult.
6.1) In :7001/mapviewer , log into your Admin screen and navigate to Management -> Configuration
Then add the following XML to the bottom of the config file, right above the </MapperConfig> line.
  <map_data_source name="mvdemo"
                   jdbc_host="db1.my_corp.com"
                   jdbc_sid="orcl"
                   jdbc_port="1521"
                   jdbc_user="scott"
                   jdbc_password="!tiger" 
                   jdbc_mode="thin"
                   number_of_mappers="3"
                   allow_jdbc_theme_based_foi="false"
   />


Modify each line using the inputs you provided in step 5. The jdbc_password value should have a ! in front of it, as that is Oracle's indicator to encrypt the password upon restart.
Step 7)  Import a map layer into Answers 

We've completed all of the back end work required to create a map. Now we'll go into Answers -> Administration -> Manage Map Data and import a layer (theme) that we'll use for our map.
A theme is a visual representation of the data, and arguably the most important component in creating a map. In this example let's use the "THEME_DEMO_COUNTIES" layer, which will give us the ability to impose a dataset over the various counties in the USA.
Step 8) Specify the BI Column used to associate the THEME_DEMO_COUNTIES layer to a dataset

The theme we're using, "THEME_DEMO_COUNTIES" stores attributes of Counties (county name, county lines, etc) which we can visualize on a map. We need to identify a way to 'join' the data set in OBIEE to the dataset of THEME_DEMO_COUNTIES.
After saving the layer you just created, click the 'Edit' button (pencil) to bring up the screen below.
Notice there is a 'Layer Key' with the following column values: County, Fips, Population, State.  We are going to use 'County' as the map column to join to the subject area.
Next we need to specify a column from our subject area which contains 'County'.  
Step 9) Specify the Background Map

In steps 7 and 8 we specified a theme (a visual representation of the data) and identified how to join the map data to our subject area (via the County column). Now we need to specify which map the theme will use.
In the 'Background' tab, create a new background map and specify 'DEMO_MAP' as the background map.
After saving, edit the map to ensure the THEME_DEMO_COUNTIES layer has successfully been applied:
It will default to the middle of the USA but I decided to zoom into California :)
Step 10) Create a Report using the County Column 

Now we're ready to create the report! Create a new analysis, select the County column you specified in step 8 and a fact column which joins to the county dimension. Click the Results tab, then New View -> Maps.
The result below outlines only California because the dataset I created uses only California Counties. 
Note that I did not go into the MapBuilder tool, which you can use if you want to create custom themes and maps (e.g. map of a building, school, casino, etc). But this works great for a proof of concept!


How-to: Bridge Tables and Many to Many Relationships Demystified in OBIEE 11g

Bridge tables - entire books have been devoted to this concept, countless blogs write about it, and organizations offer entire classes dedicated to demystifying this idea. Ralph Kimball, creator of Kimball Dimensional Modeling and founder of the Kimball Group has written quite a few great articles discussing the theory of bridge tables.

Yet when researching for comprehensive guides on how to actually implement a bridge table in OBIEE 11g, the documentation available is either:

  • Out of date
    • Contains implementation steps for OBIEE 10g, which has since been deprecated
  • Lacking adequate detail
    • e.g. missing key steps
This guide is going to outline the basic use case of a many to many relationship, how OBIEE 11g resolves this dilemma and how to successfully implement a bridge table model within the 11g platform.

First things first - what is a bridge table and why do we need it?

At its core, a bridge table solves the many to many relationship we encounter in many datasets. Many to many relationships are not in themselves "bad", but when attempting to conform a data set to a star schema, many to many relationships just do not work. Star schemas assume a one to many (1:N) cardinality from the dimension to the fact. This means "one attribute of a dimension row can be found in many rows of the fact table".

For Example:
  • One job (job dimension) can be performed by many people
    • You would see the same JOB_WID repeating in the fact table
  • One employee (employee dimension) can have many jobs
    • You would see the same EMPLOYEE_WID  repeating in the fact table
  • One call at a call center(ticket dimension) can have many ticket types
    • You would see the same CALL_WID repeating in the fact table
  • One patient (patient dimension) can have many diagnoses
    • You would see the same PATIENT_WID repeating in the fact table
This 1:N cardinality is represented in OBIEE (using the call center/employee example) as:
"Cardinality of '1' applied to the dimension and cardinality of 'N' applied to the fact"

But what happens when in the above examples, the cardinality is actually N:N? 

For Example:
  • Many employees can have multiple jobs and each job can be performed by multiple employees
  • Many patients can have multiple diagnoses and each diagnosis can be 'assigned' to many patients
  • Many calls can have multiple call ticket types and each ticket type can belong to multiple calls
This many to many relationship is initially (and incorrectly) represented in OBIEE 11g as:
'Cardinality of '1' is applied to the two dimension tables and cardinality of 'N' is applied to the fact'


Any BI Architect should recognize the above model - it's a traditional star schema! If you stop here, decide to ignore the issue with your dataset, and 'hope' OBIEE aggregates the model correctly, you're about to be disappointed.

Why star schemas don't work for N:N cardinality

Consider the following scenario: You're a call center manager and you want to capture the number of calls with positive feedback per employee. You also want to capture the type of calls an employee answers in any given day.
Upon analysis of the requirements you conclude that "each call received by an employee can have many call types and each call type can be answered by multiple employees".
For example:
  • I take a call that is classified as a 'new call', 'urgent', and 'out of state transfer' (three different call types) - this is the "each call received by an employee can have many call types" part.
  • A colleague also receives a phone call that is classified as 'out of state transfer' - this is the "each call type can be answered by multiple employees" part.
Now let's put this data in a traditional star schema fact table as modeled below:
ID  EMPLOYEE_WID  CALL_TYPE_WID  NUMBER_OF_GOOD_CALLS
1   1             1              300
2   1             2              300
3   1             3              300
4   2             2              500
5   2             3              500
6   3             1              200
You can see in the above data set that:

  • EMPLOYEE 1:
    • Has 3 different call types
    • Has 300 positive reviews (NUMBER_OF_GOOD_CALLS) 
      • This metric is at the EMPLOYEE level and not the call type level!
  • EMPLOYEE 2:
    • Has 2 different call types
    • Has 500 positive reviews (NUMBER_OF_GOOD_CALLS)
      • This metric is at the EMPLOYEE level and not the call type level
  • EMPLOYEE 3:
    • Has 1 different call type
      • Has 200 positive reviews (NUMBER_OF_GOOD_CALLS)
Now you receive a requirement to create a KPI that displays the Number of Good Calls as a standalone widget.

PROBLEM 1 - Aggregation :
The total number of good calls is not 2100 (which is what a straight SUM over the above fact table returns) - it's 300 + 500 + 200 = 1000

  • Employee 1 received 300 good calls
  • Employee 2 received 500 good calls
  • Employee 3 received 200 good calls
but due to the many to many cardinality of the data, the star schema duplicates the measures because each employee can take calls of many call types and each call type can be assigned to many employees!
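To see the double counting in SQL terms, here is a rough sketch of the query the KPI would effectively issue against the star-schema fact table above (calls_fact is just a placeholder name for that table):

-- naive aggregation over the N:N fact table
SELECT SUM(number_of_good_calls) AS total_good_calls
FROM calls_fact;

-- returns 2100 (300+300+300 + 500+500 + 200), not the true 1000,
-- because each employee's measure repeats once per call type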
PROBLEM 2 - Grand Totaling:

What if you don't care about aggregates? What if you just want a report that contains the employee, call type and a summation/grand total?

Notice how NUMBER_OF_GOOD_CALLS is repeated across multiple call types and the grand total is still incorrect. It's being duplicated due to the many to many relationship that exists between call type and employee. Furthermore, it paints an incorrect picture that NUMBER_OF_GOOD_CALLS is somehow related to CALL_TYPE.

How do we resolve this many to many cardinality with a bridge table?

When all is said and done, the incorrectly built star schema:
should be modified to:

Let's break this down:

The bridge table:

The purpose of the bridge table is to resolve the many to many relationship between the call type and employee. It will contain, at a minimum, the following four columns:
  1. The primary key of the table
  2. The EMPLOYEE_WID
  3. The CALLTYPE_WID
  4. The weight factor
The weight factor is what's going to resolve the issue of double counting. 
  • If an employee has 3 call types, there would be 3 rows and the weight factor of each row would be .33
  • If an employee has 10 call types, there would be 10 rows and the weight factor of each row would be .1
In our bridge table data set, we're going to use the same 3 EMPLOYEE_WIDs and create the following:
ID  CALL_TYPE_WID  EMPLOYEE_WID  WEIGHT
11  1              1             0.33
12  2              1             0.33
13  3              1             0.33
23  2              2             0.5
24  3              2             0.5
31  1              3             1
You can see from this example that we've taken the N:N dataset in the fact table and moved it into this bridge.
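If you build the bridge in the database, the weight factor can be derived directly from the distinct employee/call-type pairs. A minimal sketch, assuming a staging table of those pairs (the table name employee_calltype_stg is illustrative):

-- one row per distinct (employee_wid, call_type_wid) pair
SELECT employee_wid,
       call_type_wid,
       1 / COUNT(*) OVER (PARTITION BY employee_wid) AS weight
FROM employee_calltype_stg;

-- an employee with 3 call types gets 3 rows of weight 0.33,
-- one with 2 call types gets 0.5, and one with a single call type gets 1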

The dimension that is joined to both the fact and bridge

This is a generic dimension that contains the unique EMPLOYEE IDs in your organization's dataset.
For example:
ID  EMPLOYEE_ID
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
10  10

The dimension that is joined to only the bridge table

This dimension contains all of the possible call types. Note how this table is not physically joined to the fact. This is because this specific dimension (CALL_TYPE) is what's causing the N:N cardinality.
For example:

ID  DESC
1   Call Type 1
2   Call Type 2
3   Call Type 3
4   Call Type 4
5   Call Type 5
6   Call Type 6
7   Call Type 7
8   Call Type 8
9   Call Type 9
10  Call Type 10

The Fact Table

We've moved the N:N cardinality from the original fact table to the bridge table so the new fact table now contains exactly one row per employee and does not have the CALL_TYPE_WID.
ID  EMPLOYEE_WID  NUMBER_OF_GOOD_CALLS
1   1             300
2   2             500
3   3             200
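With the N:N rows moved into the bridge, the standalone KPI can be answered from the fact table alone, and a per-call-type breakdown goes through the bridge with the weight applied. A rough SQL sketch of both queries (the table and column names follow the physical model used later in this guide and are illustrative):

-- standalone KPI: one row per employee, so a plain SUM now returns the correct 1000
SELECT SUM(f.number_of_good_calls) AS total_good_calls
FROM w_calls_f f;

-- breakdown by call type: join through the bridge and weight the measure
SELECT ct.call_type_desc,
       SUM(f.number_of_good_calls * b.weight) AS weighted_good_calls
FROM w_calls_f f
JOIN w_bridge_d b ON b.employee_wid = f.employee_wid
JOIN w_calltype_demo_d ct ON ct.id = b.call_type_wid
GROUP BY ct.call_type_desc;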

How do we implement this model in OBIEE 11g?

Step 1: Import Tables into Physical Layer

This is always the first step performed when creating a model regardless of its type. In the above example I'm importing four tables:
Step 2: Create the Physical Data Model
Based on our data set above the join conditions would be implemented as follows:
  • 1:N relationship from employee dimension to fact table
  • 1:N relationship from employee dimension to bridge
  • 1:N relationship from call type dimension to bridge
Notice how employee_demo_d is the only dimension that is joined to the fact. w_call_type_d is not joined to the fact because that is the dimension that is causing the many to many relationship issue.


Step 3:  Create the Logical Data Model
The creation of the BMM is where we deviate from our standard build steps of a traditional star schema:

  1. All associated dimension tables referencing the bridge table will be stored in a single BMM table
  2. The single BMM table will have two logical table sources
Step 3.1 : Drag the fact table and dimension table that is connected to the fact table into the BMM. 
In our example, we are dragging w_calls_f and w_employee_demo_d into the BMM:



Step 3.2: Create a 2nd LTS in the existing dimension table

  1. Right click W_EMPLOYEE_DEMO_D -> New Object -> New Logical Table Source
  2. Name it 'Bridge'
  3. Add W_BRIDGE_D and W_CALLTYPE_DEMO_D (the two dimensions not directly joined to the fact table) under the 'Map to these tables' section
  4. Next add the remaining dimension columns from W_CALLTYPE_DEMO_D and W_BRIDGE_DEMO_D to the Dimension table in the BMM
Step 3.3: Create a level-based dimension hierarchy for the dimension BMM
  1. This step should be completed whether the schema is a star or a bridge model
Step 3.4: Confirm the BMM model has a 1:N relationship from the dimension to fact
Step 3.5: Set aggregation rule of NUMBER_OF_GOOD_CALLS to sum 
All measures in the BMM must have a mathematical operation applied to the column
Step 3.6: Set the Content level of the dimension table to 'detail' within the LTS of the fact table
Again, this is something that should always take place regardless of the type of model

Step 4: Create the Presentation Layer
This part is straightforward - just drag the folders from the BMM into the new subject area:
The moment of truth
So why did we go through this elaborate exercise again? To fix the aggregation issues we were having with NUMBER_OF_GOOD_CALLS due to the N:N cardinality of the data set. Let's create that 'standalone KPI' Number of Good Calls:
Notice how the metric correctly sums to 1000. Let's check the back end physical query to confirm:
Notice how it's hitting the fact table and not the bridge or the call type dimension. 
But what about the weight factor?
Let's go back to the scenario where we want to compare across dimensions joined via the bridge table (EMPLOYEE and CALL_TYPE):
  • When creating a report that uses a measure from the fact table, a dimension value from the employee table, and a dimension value from the table that causes the N:N cardinality - you need to use the weight factor to make sure your measure isn't getting double or triple counted:

  • Notice column 1 uses NUMBER_OF_GOOD_CALLS multiplied by the WEIGHT factor shown in column 2
  • Each row in column 1 correctly represents the NUMBER_OF_GOOD_CALLS in the fact table despite having the repeated values of multiple call types
  • Note that the grand total sums to 997. This is because the weight factor is rounded to two decimal places for EMPLOYEE_WID = 1 (0.33 instead of exactly one third)
In order for grand totaling to work correctly with bridge table measures that use weight factors, you must set the aggregation rule of the column (in this case column 1) to Sum within Answers:
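For reference, the weighted column is just the measure multiplied by the weight; the column formula in Answers would look something along the lines of the expression below, where the folder and column names are whatever you exposed in your presentation layer:

"Calls Facts"."Number of Good Calls" * "Bridge"."Weight"

With that formula in place, setting the column's aggregation rule to Sum gives the correct grand total described above.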


So what did we accomplish in this guide?
  • A basic understanding of many to many (N:N) cardinality
  • A basic understanding of why the star schema won't work for N:N cardinality
  • How to resolve the cardinality issue with a bridge table
  • How to implement a bridge table in OBIEE 11g

Friday, 13 November 2015

Column Vs Row Data Storage

 Overview of Row Data Storage and Column Data Storage

Relational databases typically use row-based data storage. However, column-based storage is more suitable for many business applications. SAP HANA supports both row-based and column-based storage, and is particularly optimized for column-based storage.

As shown in the figure below, a database table is conceptually a two-dimensional structure composed of cells arranged in rows and columns.

Because computer memory is structured linearly, there are two options for the sequences of cell values stored in contiguous memory locations:

Row Storage - It stores table records in a sequence of rows.
Column Storage - It stores table records in a sequence of columns, i.e. the entries of a column are stored in contiguous memory locations.

Traditional databases store data simply in rows. The HANA in-memory database stores data in both rows and columns. It is this combination of both storage approaches that produces the speed, flexibility and performance of the HANA database.
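In SAP HANA the storage type is chosen per table at creation time (and can be changed later). A minimal sketch in plain HANA SQL, with illustrative table and column names:

-- transaction data: column store, suited to aggregation and compression
CREATE COLUMN TABLE sales_item (
    id         INTEGER PRIMARY KEY,
    product_id INTEGER,
    region     NVARCHAR(40),
    amount     DECIMAL(15,2)
);

-- small, frequently updated configuration data: row store
CREATE ROW TABLE app_config (
    config_key   NVARCHAR(100) PRIMARY KEY,
    config_value NVARCHAR(500)
);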

Advantages of column-based tables:

Faster Data Access:
Only affected columns have to be read during the selection process of a query. Any of the columns can serve as an index.

Better Compression:
Columnar data storage allows highly efficient compression because the majority of the columns contain only a few distinct values (compared to the number of rows).

Better parallel Processing:
In a column store, data is already vertically partitioned. This means that operations on different columns can easily be processed in parallel. If multiple columns need to be searched or aggregated, each of these operations can be assigned to a different processor core

Advantages and disadvantages of row-based tables:

Row based tables have advantages in the following circumstances:
  • The application needs to only process a single record at one time (many selects and/or updates of single records).
  • The application typically needs to access a complete record (or row).
  • Neither aggregations nor fast searching are required.
  • The table has a small number of rows (e. g. configuration tables, system tables).

Row based tables have disadvantages in the following circumstances:
  • In analytic applications, where aggregations are used and fast search and processing are required: in row-based tables all data in a row has to be read, even though the requirement may be to access data from only a few columns.

Which type of tables should be preferred - Row-based or Column-based?

For analytic applications, where aggregations are used and fast search and processing are required, row-based storage is a poor fit: all data in a row has to be read even though only a few columns may be needed, so queries over huge amounts of data take a long time.

In columnar tables, the values of each column are stored physically next to each other, significantly increasing the speed of such queries.

The following example shows the different usage of column and row storage, and positions them relative to row and column queries. Column storage is most useful for OLAP queries (queries using any SQL aggregate functions) because these queries get just a few attributes from every data entry. But for traditional OLTP queries (queries not using any SQL aggregate functions), it is more advantageous to store all attributes side-by-side in row tables. HANA combines the benefits of both row- and column-storage tables.
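As a concrete illustration of the two query shapes, the first query below touches only two columns of a (possibly very wide) table while aggregating many rows, which favours column storage; the second fetches every column of a single record by key, which favours row storage. Table and column names are illustrative:

-- OLAP-style: scan two columns, aggregate millions of rows -> column store friendly
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;

-- OLTP-style: fetch one complete record by its key -> row store friendly
SELECT *
FROM sales
WHERE order_id = 4711;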

Conclusion:
To enable fast on-the-fly aggregations and ad-hoc reporting, and to benefit from compression mechanisms, it is recommended that transaction data be stored in a column-based table.

The SAP HANA database allows joining row-based tables with column-based tables. However, it is more efficient to join tables that are located in the same row or column store. For example, master data that is frequently joined with transaction data should also be stored in column-based tables.

OBIEE - Selecting columns dynamically - part 4, working with and selecting none


In this part I will talk about how to remove a column from an analysis with a column selector or prompt. That means selecting a column I don't want to present (or to actually influence the analysis at all). Why do I want such a stupid thing? When working with dynamic columns, I want the ability to control the number of columns. So if the user wants only 3 columns out of a possible 5, only 3 columns would appear in the analysis; the other 2 columns will not show.

How do I add such "None" column?
I will describe 2 options. The first is created at the User Interface level, the second in the RPD only and works better with session variables. Usually the second should be your default option, and it is the one used in the next post.

Solution 1 

I add a column in the repository with an empty value and default this column to be hidden, system wide. Now I can use the column anywhere I want. Let's have a detailed explanation.

I'm working with the extended sample available as part of the VM here. So I'm working with real Oracle DB tables in this case. (The following doesn't work well with the XML files of the default Sample).
In the Business layer I randomly selected the Time dimension and added a column with '' value. I named it None.

As a next step I created a new Presentation Table under Sample Sales and named the table None, then dragged the None column under it.
Now I have the None column. OBIEE enables me to create analysis with this column:

What is nice here, is the fact that the SQL doesn't join unnecessary tables:
This is the left one (only the fact table in the original From):
 
WITH 
SAWITH0 AS (select sum(T42433.Revenue) as c1,
     '' as c2
from 
     BISAMPLE.SAMP_REVENUE_F T42433 /* F10 Billed Rev */ ),
SAWITH1 AS (select D1.c1 as c1,
     D1.c2 as c2,
     D1.c3 as c3
from 
     (select 0 as c1,
               D1.c2 as c2,
               D1.c1 as c3,
               ROW_NUMBER() OVER (PARTITION BY D1.c2 ORDER BY D1.c2 ASC) as c4
          from 
               SAWITH0 D1
     ) D1
where  ( D1.c4 = 1 ) )
select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3 from ( select D1.c1 as c1,
     D1.c2 as c2,
     D1.c3 as c3
from 
     SAWITH1 D1
order by c1, c2 ) D1 

In a similar way, the right one is correct as well, with no unnecessary joins to date dimension.
 SAWITH0 AS (select sum(T42433.Units) as c1,
     T42412.Department as c2,
     '' as c3,
     T42412.Dept_Key as c4
from
     BISAMPLE.SAMP_OFFICES_D T42412 /* D30 Offices */ ,
     BISAMPLE.SAMP_REVENUE_F T42433 /* F10 Billed Rev */

where  ( T42412.Office_Key = T42433.Office_Key )
group by T42412.Department, T42412.Dept_Key)...


Next step, as administrator I will create an Analysis with the None column and select its Column Properties.


In Column Format I'll select "Hide".


And at the bottom of that screen I will select Save As default / Save as system-wide default for "None"."None": 

the same with Zoom-In:

That's all. Now I can use the None column in any other analysis as Column Selector, described in the first post, or as Presentation variable, described in second and third.

For example, here is the same analysis with a Column Selector, selecting the None column once and then some other column:

When None is in the column of Column selector:
When Per Name Year is in the column of Column selector:



Solution 2

Just as in solution 1, I add a column in the repository with empty value and default this column to be hidden, system wide. Now I can use the column anywhere I want. I'll name it None1. The process of creating this table is the same as in Solution 1, with the formula '':
 
With one critical difference: the hiding is done in the repository (Presentation layer) and not in the UI, as in solution 1.
At the Presentation layer I will go to the None1 properties and set the "Hide object if" condition to 1=1. This way the column is always hidden.
Now when going to Analysis we don't see the column None1:
But that doesn't mean it's not there; you just have to know its name.
For example an analysis with Revenue and the new None1 column:
 Results:
(Thank you Dov Ziskin from VisionBI for reminding me of the second solution).


Now we are ready to do a real Dynamic Columns example, in the next post.

Column Oriented Database Technologies

My recent blog (Big Data & NoSQL Technologies) discussed various NoSQL technologies and market vendors.  Today let’s dive into column-oriented databases and why they should play an important role in any data warehouse whose focus is on aggregations or metrics (and whose isn’t?).
So you are all probably familiar with row-oriented databases: tables of data where rows of fields (also called columns) represent the structural storage, and the corresponding SQL queries that select, insert, update, and delete that data.  Most database vendors like Oracle, Microsoft, Sybase, Informix, and many others base their technology on this ANSI standard.  Column-oriented databases are indeed what you might surmise: tables of data where columns of data values represent the structural storage.  What you might not expect is that on the surface many column-oriented databases look and feel like row-oriented databases, also using SQL queries in much the same way.  Creating tables, storing data, and querying them are all pretty much identical.  They may appear similar, but the two principal things to understand are the significant differences under the hood, in particular physical storage and query optimization.
As noted in my previous blogs on NoSQL, there is also a column-store technology out there.  Let’s not confuse that with column oriented databases.  They are different.  Since several NoSQL column-store vendors were highlighted before, we will focus instead on the column oriented database vendors here.
First, some key benefits to column oriented databases:

  • High performance on aggregation queries (like COUNT, SUM, AVG, MIN, MAX)
  • Highly efficient data compression and/or partitioning
  • True scalability and fast data loading for Big Data
  • Accessible by many 3rd party BI analytic tools
  • Fairly simple systems administration
Due to their aggregation capabilities which compute large numbers of similar data items, column oriented databases offer key advantages for certain types of systems, including:
  • Data Warehouses and Business Intelligence
  • Customer Relationship Management (CRM)
  • Library Card Catalogs
  • Ad hoc query systems
Column oriented database technology has actually been around for many years originating in 1969 with an application called TAXIR which provided abstracts for mathematical biosciences.  In 1976, Canada implemented the RAPID system for processing and retrieval of population and housing census statistics.  Sybase IQ was the only commercially available column-oriented database for many years, yet that has changed rapidly in the last few years.  Let’s take a quick look at some of today’s key players:
SAP Sybase IQ  (www.sybase.com)
A highly optimized analytics server designed specifically to deliver superior performance for mission-critical business intelligence, analytics and data warehousing solutions on any standard hardware and operating system.  Its column oriented grid-based architecture, patented data compression, and advanced query optimizer deliver high performance, flexibility, and economy in challenging reporting and analytics environments.
Essentially a data partitioned, index based storage technology, Sybase IQ’s engine offers several key features which include:
    • Web enabled analytics
    • Communications & Security
    • Fast Data Loading
    • Query Engine supporting Full Text Search
    • Column Indexing Sub System
    • Column Storage Processor
    • User Friendly CUI based Administration & Monitoring
    • Multiplex Grid Architecture
    • Information Lifecycle Management
The Sybase IQ Very Large Data Base (VLDB) option provides partitioning and placement, where a table can have a specified column partition key with value ranges.  This partitioning keeps data that belongs together grouped together and separates data that should be kept apart.  The drawback to this methodology is that it is not always known which is which.
Infobright  (www.infobright.com)
Offering both a commercial (IEE) and a free community (ICE) edition, the combination of a column oriented database with their Knowledge Grid architecture delivers a self-managed, scalable, high performance analytics query platform.  Allowing 50 TB on a single server, their industry-leading data compression (10:1 up to 40:1) significantly reduces storage requirements and expensive hardware infrastructures.  Delivered as a MySQL engine, Infobright runs on multiple operating systems and processors needing only a minimum of 4 GB of RAM (however, 16 GB is a recommended starting point).
Avoiding partition schemes, Infobright data is stored in data packs, each node containing pre-aggregated statistics about the data stored within it.  The Knowledge Grid above provides related metadata offering a high-level view of the entire content of the database.  Indexes, projections, partitioning, and aggregated tables are not needed, as these metadata statistics are managed automatically.  The granular computing engine processes queries using the Knowledge Grid information to optimize query processing, eliminating or significantly reducing the amount of data that must be decompressed and accessed to answer a query.  Some queries may not need to access the data at all, finding the answer in the Knowledge Grid itself instead.
The Infobright Data Loader is highly efficient so data inserts are very fast.  This performance gain does come at a price so avoid updates unless absolutely necessary, design de-normalized tables, and don’t plan on any deletes.  New features to the data loader include a reject option which allows valid rows to commit while invalid rows are logged.  This is highly useful when loading millions of rows and only having a few rows with bad data.  Without this feature the entire data load would be rolled back.
Vertica (HP)  (www.vertica.com)
Recently acquired by Hewlett-Packard, this platform was purpose built from the ground up for high performance, real-time analytics needs.  With extensive data loading, query, columnar storage, MPP architecture, and data compression features, diverse communities can develop and scale with a seamless integration ecosystem.
Claiming elasticity, scale, performance, and simplicity the Vertica analytics platform uses transformation partitioning to specify which rows belong together and parallelism for speed.  Several key features include:
    • Columnar Storage & Execution
    • Real-Time Query & Loading
    • Scale-out MPP Architecture
    • Automatic High Availability
    • Aggressive Data Compression
    • Extensible In-Database Analytics Framework
    • In-Database Analytics Library
    • Database Designer & Administration Tools
    • Native BI & ETL support for MapReduce & Hadoop
The Vertica Optimizer is the brains of the analytics platform producing optimal query execution plans where several choices exist.  It does this through traditional considerations like disk I/O and further incorporates CPU, memory, network, concurrency, parallelism factors and the unique details of the columnar operator and runtime environment.
ParAccel  (www.paraccel.com)
Analytic-driven companies need a platform, not just a database where speed, agility, and complexity drive the data ecosystem.  The ParAccel Analytic Platform streamlines the delivery of complex business decisions through its high performance analytic database.  Designed for speed, its extensible framework supports on-demand integration and embedded functions.
The ParAccel Database (PADB) presents four main components: the ‘Leader’ node, the ‘Compute’ node, the Parallel Communications Fabric, and an optional Storage Area Network (SAN).  The ‘Leader’ controls the execution of the ‘Compute’ nodes, and all nodes communicate with each other via the ‘Fabric’ running on standard x86 Linux servers.  Each ‘Compute’ node is subdivided into a set of parallel processes called ‘slices’ that each include a CPU core, their own allocation of memory, and local disk storage.  The ‘Communication Fabric’ provides a low-level MPP network protocol for increased performance.
Key PADB features include:
    • High Performance & Scalability
    • Columnar Orientation
    • Extensible Analytics
    • Query Compilation
    • High Availability
    • Solution Simplicity
ParAccel Integrated Analytics Library and Extensibility Framework incorporates advanced functions along with an API to add your own functions to help address complex business problems right in the core database enabling customers to focus upon their specific data complexities.
Microsoft SQL Server 2012  (www.microsoft.com)
Released this year, SQL Server 2012 shows that Microsoft has now embraced the columnar database idea.  The release includes xVelocity, a column-store index feature that stores data similarly to a column-oriented DBMS.  While not a true column oriented database, this technique allows for the creation of a memory optimized index that groups and stores the data for each column and then joins them together to complete the index.  For certain types of queries, like aggregations, the query processor can take advantage of the column-store index to significantly improve execution times.  Column-store indexes can be used with partitioned tables, providing a new way to think about how to design and process large datasets.
The column-store index can be very useful on large fact tables in a Star schema, improving overall performance; however, the cost model approach utilized may choose the column-store index for a table when a row-based index would have been better.  Using the IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX query hint will work around this if it occurs.  When data is stored with a column-store index, it can often be compressed more effectively than with a row-based index, as there is typically more redundancy within a column than within a row.  Higher compression means less IO is required to retrieve data into memory, which can significantly reduce response times.
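For reference, creating a nonclustered column-store index and overriding the optimizer when that index is a poor choice look roughly like the sketch below; the dbo.fact_sales table and its columns are placeholders, not objects from any particular system:

-- build a column-store index over the columns the aggregation queries touch
CREATE NONCLUSTERED COLUMNSTORE INDEX csix_fact_sales
    ON dbo.fact_sales (date_key, product_key, store_key, sales_amount);

-- fall back to the row-based plan when the column-store index hurts a query
SELECT product_key, SUM(sales_amount) AS total_sales
FROM dbo.fact_sales
GROUP BY product_key
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX);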
There are several restrictions and limitations in using a column-store index.  For example, only certain data types are supported, and you can only create one column-store index on any table, which can be problematic.  Become familiar with what it can do and where best to use it.  Currently the column-store index is not supported on Microsoft Azure.
Column-oriented databases, applied correctly, provide significant advantages over traditional row-oriented systems, in particular for data warehouse and business intelligence environments where aggregations prevail.  It would not be fair, however, to ignore the disadvantages.  Let's look at both sides:
  • Column-Oriented Advantages
    • Efficient storage and data compression
    • Fast data loads
    • Fast aggregation queries
    • Simplified administration & configuration
  • Column-Oriented Disadvantages
    • Transactions are to be avoided or just not supported
    • Queries with table joins can hurt performance
    • Record updates and deletes reduce storage efficiency
    • Effective partitioning/indexing schemes can be difficult to design
The real value in using column-oriented database technology comes from high performance, scalable storage and retrieval of large to massive datasets (Big Data) focused on aggregation queries.  Simply put: Reports!  You can design Star schemas or Data Vaults (The Data Vault – What is it? – Why do we need it?) incorporating these technologies and you will find that column-oriented databases provide a clear solution for data warehouse and business intelligence environments.
Look for future blogs on Hadoop/Hive/HBase and Extract-Transform-Load (ETL) technologies, and don’t be afraid to comment, question, or debate, there is always room to learn new things…