BI / ETL / DWH

Friday, 13 November 2015

Column Oriented Database Technologies

My recent blog (Big Data & NoSQL Technologies) discussed various NoSQL technologies and market vendors. Today let’s dive into column-oriented databases and why they should play an important role in any data warehouse whose focus is on aggregations or metrics (and whose isn’t?).

So you are all probably familiar with row-oriented databases. Tables of data where rows of fields (also called columns) represent the structural storage and the corresponding SQL queries that select, insert, update, and delete that data. Most database vendors like Oracle, Microsoft, Sybase, Informix, and many others all base their technology on this ANSI standard. Column-oriented databases are indeed what you might surmise; tables of data where columns of data values represent the structural storage. What you might not expect is that on the surface many column-oriented databases look and feel like row oriented databases also using SQL queries in much the same way. Creating tables, storing data, querying them are all pretty much identical. They may appear similar, but two principal things to understand is that the significant differences under the hood, in particular, physical storage and query optimization.

As noted in my previous blogs on NoSQL, there is also a column-store technology out there. Let’s not confuse that with column oriented databases. They are different. Since several NoSQL column-store vendors were highlighted before, we will focus instead on the column oriented database vendors here.

First, some key benefits to column oriented databases:

High performance on aggregation queries (like COUNT, SUM, AVG, MIN, MAX)
Highly efficient data compression and/or partitioning
True scalability and fast data loading for Big Data
Accessible by many 3^rd party BI analytic tools
Fairly simple systems administration

Due to their aggregation capabilities which compute large numbers of similar data items, column oriented databases offer key advantages for certain types of systems, including:

Data Warehouses and Business Intelligence
Customer Relationship Management (CRM)
Library Card Catalogs
Ad hoc query systems

Column oriented database technology has actually been around for many years originating in 1969 with an application called TAXIR which provided abstracts for mathematical biosciences. In 1976, Canada implemented the RAPID system for processing and retrieval of population and housing census statistics. Sybase IQ was the only commercially available column-oriented database for many years, yet that has changed rapidly in the last few years. Let’s take a quick look at some of today’s key players:

SAP Sybase IQ (www.sybase.com)

A highly optimized analytics server designed specifically to deliver superior performance for mission-critical business intelligence, analytics and data warehousing solutions on any standard hardware and operating system. Its column oriented grid-based architecture, patented data compression, and advanced query optimizer delivers high performance, flexibility, and economy in challenging reporting and analytics environments.

Essentially a data partitioned, index based storage technology, Sybase IQ’s engine offers several key features which include:

Web enabled analytics
Communications & Security
Fast Data Loading
Query Engine supporting Full Text Search
Column Indexing Sub System
Column Storage Processor
User Friendly CUI based Administration & Monitoring
Multiplex Grid Architecture
Information Live-cycle management

The Sybase IQ Very Large Data Base (VLDB) option provides partitioning and placement where a table can have a specified column partition key with value ranges. This partition allows data that should be grouped together to be grouped together and separates data where they should be separated. The drawback to this methodology is that it is not always known which is which.

Infobright (www.infobright.com)

Offering both a commercial (IEE) and a free community (ICE) edition, the combination of a column oriented database with their Knowledge Grid architecture delivers a self-managed, scalable, high performance analytics query platform. Allowing 50Tb using a single server, their industry-leading data compression (10:1 up to 40:1) significantly reduces storage requirements and expensive hardware infrastructures. Delivered as a MySQL engine, Infobright runs on multiple operating systems and processors needing only a minimum of 4Gb of RAM (however 16Gb is a recommended starting point).

Avoiding partition schemes, Infobright data is stored in data packs, each node containing pre-aggregated statistics about the data stored within them. The Knowledge Grid above provides related metadata providing a high level view of the entire content of the database. Indexes, projections, partitioning or aggregated tables are not needed as these metadata statistics are managed automatically. The granular computing engine processes queries using the Knowledge Grid information to optimize query processing eliminating or significantly reducing the amount of data required for decompressing and access to answer a query. Some queries may not need to access the data at all, finding instead the answer in the Knowledge Grid itself.

The Infobright Data Loader is highly efficient so data inserts are very fast. This performance gain does come at a price so avoid updates unless absolutely necessary, design de-normalized tables, and don’t plan on any deletes. New features to the data loader include a reject option which allows valid rows to commit while invalid rows are logged. This is highly useful when loading millions of rows and only having a few rows with bad data. Without this feature the entire data load would be rolled back.

Vertica (HP) (www.vertica.com)

Recently acquired by Hewlett Packard, this platform was purpose built from the ground up to enable data values having high performance real-time analytics needs. With extensive data loading, queries, columnar storage, MPP architecture, and data compression features, diverse communities can develop and scale with a seamless integration ecosystem.

Claiming elasticity, scale, performance, and simplicity the Vertica analytics platform uses transformation partitioning to specify which rows belong together and parallelism for speed. Several key features include:

Columnar Storage & Execution
Real-Time Query & Loading
Scale-out MPP Architecture
Automatic High Availability
Aggressive Data Compression
Extensible In-Database Analytics Framework
In-Database Analytics Library
Database Designer & Administration Tools
Native BI & ETL support for MapReduce & Hadoop

The Vertica Optimizer is the brains of the analytics platform producing optimal query execution plans where several choices exist. It does this through traditional considerations like disk I/O and further incorporates CPU, memory, network, concurrency, parallelism factors and the unique details of the columnar operator and runtime environment.

ParAccel (www.paraccel.com)

Analytic-driven companies need a platform, not just a database where speed, agility, and complexity drive the data ecosystem. The ParAccel Analytic Platform streamlines the delivery of complex business decisions through its high performance analytic database. Designed for speed, its extensible framework supports on-demand integration and embedded functions.

The ParAccel Database (PADB) present four main components: the ‘Leader’ node, the ‘Compute’ node, the Parallel Communications Fabric, and an optional Storage Area Network (SAN). The ‘Leader’ controls the execution of the ‘Compute’ nodes and all nodes communicate with each other via the ‘Fabric’ running on standard x86 Linux servers. Each ‘Compute’ node is subdivided into a set of parallel processes called ‘slices’ that include a CPU core, and thier allocation of memory, and local disk storage. The ‘Communication Fabric’ provides a low-level MPP network protocol for increased performance.

Key PADB features include:

High Performance & Scalability
Columnar Orientation
Extensible Analytics
Query Compilation
High Availability
Solution Simplicity

ParAccel Integrated Analytics Library and Extensibility Framework incorporates advanced functions along with an API to add your own functions to help address complex business problems right in the core database enabling customers to focus upon their specific data complexities.

Microsoft SQL Server 2012 (www.microsoft.com)

Released this year, Microsoft has now embraced the columnar database idea. The latest SQL Server release 2012 includes xVelocity, a column-store index feature that stores data similar to a column-oriented DBMS. While not a true column oriented database, this technique allows for the creation of a memory optimized index that groups and stores data for each column then and joins them together to complete the index. For certain types of queries, like aggregations, the query processor can take advantage of the column-store index to significantly improve execution times. Column store indexes can be used with partitioned tables providing a new way to think about how to design and process large datasets.

The column-store index can be very useful on large fact tables in a Star schema improving overall performance, however the cost model approach utilized may choose the column-store index for a table when a row based index would have been better. Using the IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX query hint will work around this if it occurs. When data is stored with a column-store index, data can often be compressed more effectively over a row based index. This is accomplished as typically there is more redundancy within a column than within a row. Higher compression means less IO is required to retrieve data into memory which can significantly reduce response times.

There are several restrictions and limitation in using a column-store index. For example, which data types are supported or not and that you can only create one column-store index on any table can be problematic. Become familiar with what it can do and where best to use it. Currently the column-store index is not supported on Microsoft Azure.

Column-oriented databases provide significant advantages over traditional row oriented system applied correctly; In particular for data warehouse and business intelligence environments where aggregations prevail. It would not be fair however to ignore the disadvantages. Let’s look at these two:

Column-Oriented Advantages
- Efficient storage and data compression
- Fast data loads
- Fast aggregation queries
- Simplified administration & configuration

‘Column-Oriented Disadvantages
- Transactions are to be avoided or just not supported
- Queries with table joins can reduce high performance
- Record updates and deletes reduce storage efficiency
- Effective partitioning/indexing schemes can be difficult to design

The real value in using column-oriented database technology comes from high performance, scalable storage and retrieval of large to massive datasets (Big Data) focused on aggregation queries. Simply put: Reports! You can design Star schema’s or Data Vaults (The Data Vault – What is it? – Why do we need it?) incorporating these technologies and you will find that column-oriented databases provide a clear solution in data warehouse and business intelligence.

Look for future blogs on Hadoop/Hive/HBase and Extract-Transform-Load (ETL) technologies, and don’t be afraid to comment, question, or debate, there is always room to learn new things…

Wednesday, 11 November 2015

Security Module – Main concepts and use

Hi Friends,
We wrote this article to try to send some light on Security Module (10g for now) once several emails asked us for do it.
I hope that it can be useful!
The creation of users is relatively simple. Just open the Security module, go to the tab “Users” (Users) clicking the right mouse button and choose “Add User”

After opening the window only enough to fill the required information.
IMPORTANT: IF YOU CHECKED THE CHECK BOX OF “SUPERVISOR” CREATED YOU WILL HAVE NO RESTRICTION ON USE OF ENVIRONMENTAL ODI. This option enables the user CALLED “superuser”
At the end of user creation, you must add this to the desired profiles. I will use as an example, a developer can access any object in the repository development.
To do so, simply drag (drag and drop) the desired profiles to the newly created user:

IMPORTANT: THE PROFILE “CONNECT” IS MANDATORY FOR ALL USERS
Completed these steps the user is able to connect the repository with rights that were signed.

Granting privileges on projects and their components.

Granting privileges to a specific project can only be assigned to users have “Supervisor” and also with the profile of the “NG” (not generic).
In our example we will use a user named “Developer” which has only the profiles of “Connect” and “Designer NG

To determine if access should open side by side on screen, both modules, and Security Developer and then drag the desired design for the user who is entitled to access it as:

It will be shown a confirmation window
After confirming opens a window where rights of association have the following concepts:
a. The mark “Checked” means that the user can perform that action in any repository.

The mark of “Denied” means that the user can not use that method:

It is the mark of “Interrogation” means that the user can only use that method in the specified repository:

Established these conditions must then use them in the security window to set the user access to each object of that project:

Note that the methods “Edit Project” and “Duplicate Project” will only be available in the Repository “REPWORK.”
Only the method “Edit in new Window” will be allowed to all repositories
All other methods are denied (this is the default, should be configured as needed.)
We can see that after connecting the User “Developer” he has no rights other than those set:

It is interesting to inform you that the “rights” controlled by the security window can be understood as the options that appear when you right click on an object.
Other relevant information and that in this case example, there were signed on Direct Folders , Packages, Interfaces, etc.. And with that, the project does not bring any of your folders and if there is an attempt to create, an error such as:

The configuration of users without the standards should be made at the level of each method on that object.
These are the main tasks and concepts on Security Module.
Thanks for visit us!

Friday, 23 October 2015

OBIEE - Selecting columns dynamically - part5, the single prompt solution

If you read posts 2,3,4 of this "OBIEE - Selecting columns dynamically" saga, you can create a set of prompts where each prompt controls a column and enables the user to select the column formula or decide not to show it at all.

Something like this:

But the users might ask you to have a single prompt with friendly column names, in which they can select all the relevant columns for the analysis (or 2 prompts, one for attributes and one for measures).
Something like one of those 2:

This is what this post talks about.

What I intend to do?

Create the above prompt to allow the user to select desired columns.
Create a hidden set of prompts where I split the above prompt to column names and have additional hidden prompts for their actual formula.
Create an analysis where the column values are either the selected column or the None / Hidden column from solution 2 of post 4 of this series.

As a result while user doesn't select any columns, the analysis is mostly empty and as columns are selected it grows.

For example in my analysis there is only one default measure column, so all I see is:

But if I select few values in the prompt:

I get the following result of the same analysis:

How to achieve it?

To get such a list I will create a table of prompt values, in this case, column names. Since I was lazy, I created it by a set of "select from dual", instead of actual table (don't do it in real life):

With 4 columns:

Column Name - A user friendly representation of the column (example 'Year')
Column Code - The formal Column definition in Analysis formula (example ( "Time"."T05 Per Name Year")
Column Type - In case I decide to split the prompt between attributes and measures and deal with special formats such as date.
Column Sort - This will allow me to control the selected values in prompt. (better explanation later in this post).

Next I created a specific Business Mode just for the Prompt table and use one of the options for single table Model described here. In my case, I duplicated the table at the Business layer and defined the copy as Dimension (after adding a key). Later I used only the "Fact" part in the Subject Area (See presentation layer on the left, Business Model on the right):

With that, I can create a prompt based on the column col_name. It can be a check boxes:

Or a choice list:

Now I have a Presentation Variable (P_col_list) with all the column "friendly" names I want.
It's value might be something like this:'Company, Customer Name, Department'.
What I need is to find columns that are in that string and control how do I get them in some controlled method.

The naive approach is treating the P_col_list Variable as one long string and cut it to peaces. I don't recommend it. The functions are long and unfriendly. It's much easier to filter the column table by this presentation Variable and then select a specific row, using the Rank function (with different value of Rank for each final Column).
So I need to check for each potential column, it's in the P_Col_list and select it's rank to get them in an orderly manner.

To check if the column is in the P_col_list I will use the function Locate: LOCATE("Prompt List"."Col_name",'@{P_col_list}') .To get the Ranking I will use RANK("Prompt List"."col_sort") . (The "col_sort' column from above allows me to control the columns. I can set numeric values there and use them. You might notice my values of col_sort column goes from higher to lower. Since the Ranking goes, by default, from larger to smaller values, that is the way to control the order of columns.)

(There is another / better option instead of the Locate above. See P.S 2 why I didn't use it)

For example, the first selected column formal column name is:

SELECT "Prompt List"."Col_code" saw_0
FROM "ZZ_Prompt_list"
WHERE (LOCATE("Prompt List"."Col_name",'@{P_col_list}') > 0)
AND RANK("Prompt List"."col_sort") = 1

and the first "friendly" name is:

SELECT "Prompt List"."Col_name" saw_0
FROM "ZZ_Prompt_list"
WHERE (LOCATE("Prompt List"."Col_name",'@{P_col_list}') > 0)
AND RANK("Prompt List"."Col_sort") = 1

In a similar way the second pair would be:

SELECT "Prompt List"."Col_code" saw_0
FROM "ZZ_Prompt_list"
WHERE (LOCATE("Prompt List"."Col_name",'@{P_col_list}') > 0)
AND RANK("Prompt List"."col_sort") = 2

SELECT "Prompt List"."Col_name" saw_0
FROM "ZZ_Prompt_list"
WHERE (LOCATE("Prompt List"."Col_name",'@{P_col_list}') > 0)
AND RANK("Prompt List"."Col_sort") = 2

So I decided that my case would have up to 5 dynamic columns (you can do as many as you like) and my second (soon to be hidden) prompt is, named one_list_actual:

Where P_C1A is the friendly Prompt name and P_C1 is the formal formula.

P_C1:

P_CA1:

The analysis would be:
1. A set of 5 columns with formula like:IFNULL(@{P_C1}, "None"."None") and Column Header like: @{P_C1A}:

2. The original '@{P_col_list}' prompt, defined as hidden in the column properties. (I need it so the analysis "knows" that prompt changed values and refresh).
3. As many columns I want that would be constant in the analysis. It should be at least one measure from a fact for the system to always work properly. It's "Base Facts"."2- Billed Quantity" in my case.

This is how the criteria looks:

The last step i did was to put the 2 Dashboard Prompts and the analysis in a single dashboard and see it all works fine and then hide the second prompt (one_list_actual) with all the P_C... variables.

After selecting few values:

Last part is to hide the second prompt.I already explained this step in the post: OBIEE - Creating dependent - implicitly declared presentation variable: In Dashboard Properties, under Filters and Variables I'll add the Dashboard Prompt one_list_actual.

and delete it from the dashboard.
this is how it looks with no values selected.

thi sis after selecting several values in the prompt:

You might want to split the Prompt that select column to 2 prompts, one for measures and one for attributes. The "Null" column for measures should probably be default as 0.0, to force it to be float.

P.S. 1

In case I wanted the Prompt solution to be reusable in many dashboards, I don't need many tables of column lists. I could add a column / columns with filtering values and filter the prompt by it, each time.

P.S. 2

LOCATE("Prompt List"."Col_name",'@{P_col_list}') will return numbers greater than zero if the column name is in the Presentation Variable. So we can use the condition: (LOCATE("Prompt List"."Col_name",'@{P_col_list}') > 0). It is problematic in cases where one value is contained in another, for example 'Product' and 'Product Type'. To prevent this problem you might add a comma string before and after the P_col_list an search for values with comma before and after.
(LOCATE(','||"Prompt List"."Col_name"||',',','||'@{P_col_list}'||',') > 0

So the select statement for P_C1 should be actually:
SELECT "Prompt List"."Col_code" saw_0
FROM "ZZ_Prompt_list"
WHERE locate(','||"Prompt List"."Col_name"||',', ','||'@{P_col_list}'||',')>0
AND (RANK("Prompt List"."Col_sort") = 1)

Why did I use it and not a better option of looking for specific value in multivalued Presentation variable,cover here by Deepak:
If I want to check whether the column "Prompt List"."Col_name" is part of the values in presentation Variable P_col_list, I can run the following: WHERE "Prompt List"."Col_name" IN (@{P_col_list}['@']) .
Well I had some voodoo problem with. When I didn't select any value in the initial P_col_list variable I had an error:

Error Codes: OPR4ONWY:U9IM8TAC:OI2DL65P:OI2DL65P

State: HY000. Code: 10058. [NQODBC] [SQL_STATE: HY000] [nQSError: 10058] A general error has occurred. [nQSError: 43113] Message returned from OBIS. [nQSError: 27002] Near <,>: Syntax error [nQSError: 26012] . (HY000)

SQL Issued: {call NQSGetQueryColumnInfo('SELECT IFNULL(, "None"."None") FROM "A - Sample Sales"')}

SQL Issued: SELECT IFNULL(, "None"."None") FROM "A - Sample Sales"

It happens only if no columns at all where selected. One default was enough to prevent this error.

P.S. 3

I would like to thank Rakefet Barad from Oracle ACS and Dov Ziskin from Vision BI for the chats we had on the issue.

OBIEE - Selecting columns dynamically - part4, working with and selecting none

In this part I will talk about how to remove column from analysis with column selector or prompt. That means, selecting a column I don't want to present (or actually influence the analysis at all). Why do I want such a stupid thing? When working with dynamic columns, I want the ability to control the number of columns. So if the user wants only 3 columns, out of possible 5, only 3 columns would appear in the analysis. The other 2 columns will not show.

How do I add such "None" column?
I will describe 2 options. First is created at the User Interface level, the second in RPD only and works better with session variables. Usually the second should be your default option and it is the one to be used in the next post.

Solution 1

I add a column in the repository with empty value and default this column to be hidden, system wide. Now I can use the column anywhere I want. Lets have a detailed explanation.

I'm working with the extended sample available as part of the VM here. So I'm working with real Oracle DB tables in this case. (The following doesn't work well with the XML files of the default Sample).
In the Business layer I randomly selected the Time dimension and added a column with '' value. I named it None.

As a next step I created a new Presentation Table under the Sample Sales and named the table None, then dragged the none column under it.
Now I have the None column. OBIEE enables me to create analysis with this column:

What is nice here, is the fact that the SQL doesn't join unnecessary tables:
This is the left one (only the fact table in the original From):

WITH 
SAWITH0 AS (select sum(T42433.Revenue) as c1,
     '' as c2
from 
     BISAMPLE.SAMP_REVENUE_F T42433 /* F10 Billed Rev */ ),
SAWITH1 AS (select D1.c1 as c1,
     D1.c2 as c2,
     D1.c3 as c3
from 
     (select 0 as c1,
               D1.c2 as c2,
               D1.c1 as c3,
               ROW_NUMBER() OVER (PARTITION BY D1.c2 ORDER BY D1.c2 ASC) as c4
          from 
               SAWITH0 D1
     ) D1
where  ( D1.c4 = 1 ) )
select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3 from ( select D1.c1 as c1,
     D1.c2 as c2,
     D1.c3 as c3
from 
     SAWITH1 D1
order by c1, c2 ) D1

In a similar way, the right one is correct as well, with no unnecessary joins to date dimension.
SAWITH0 AS (select sum(T42433.Units) as c1,
     T42412.Department as c2,
     '' as c3,
     T42412.Dept_Key as c4
from
     BISAMPLE.SAMP_OFFICES_D T42412 /* D30 Offices */ ,
     BISAMPLE.SAMP_REVENUE_F T42433 /* F10 Billed Rev */
where ( T42412.Office_Key = T42433.Office_Key )
group by T42412.Department, T42412.Dept_Key)...

Next step, as administrator I will create an Analysis with the None column and select it's Column Properties.

In Column Format I'll select "Hide".

And at the bottom of that screen I will select Save As default / Save as system-wide default for "None"."None":

the same with Zoom-In:

That's all. Now I can use the None column in any other analysis as Column Selector, described in the first post, or as Presentation variable, described in second and third.

For example the same analysis with Column selector, selecting once the None column and then some other column:

When None is in the column of Column selector:

When Per Name Year is in the column of Column selector:

Solution 2

Just as in solution 1, I add a column in the repository with empty value and default this column to be hidden, system wide. Now I can use the column anywhere I want. I'll name it None1. The process of creating this table is the same as in Solution 1, with the formula '':

With one critical difference, the hiding will be done at the repository (presentation level) and not UI, as solutions 1.

At the presentation level I will go to the None properties and set the "Hide object if "condition to 1=1. This way the column is always hidden.