Showing posts with label Database. Show all posts
Showing posts with label Database. Show all posts

Friday, July 01, 2016

Redshift Cloud-based data warehousing solution


Architecture


Redshift is implemented using the ParAccel Analytical Database or PADB (ref). From ParAccel definition:

PADB is architected as a columnar, compressed and massively parallel relational database management system that is capable of all-in-memory, if desired.

The main component is the Leader Node (interface to the outside world) whose responsibilities are to manage communication between nodes, to control session, to parse and optimize queries and to schedule execution of workload (excluding participation to any data operations).

The second component are the Compute Nodes responsible in processing, storing data and sharing workload evenly among themselves. These nodes receive query compiled code from Leader node (based on the execution plan) and send back result after execution is finished.

Each node work with its own disks against its own memory and CPU, but may depend on each other when intermediate result sets are transferred during join/aggregations operation. To increase parallelization (through multi-core capability) these nodes are furthermore partitioned into slice, with each having a dedicated portion of memory and disk space, where queries and database processing are finally executed.

The other component is the Communication Fabric using custom Gigabit Ethernet and allowing for communication between Compute and Leader to be high-bandwidth, private, close proximity and very high speed. The Compute Nodes are located on an isolated network not accessible to client applications.

SAN backend is not mandatory but useful for backup, snapshot and restore when in High Availability (HA) mode. In fact, Redshift is actually optimized to work with fast direct-attached storage (DAS) as typical or Cloud-based cluster based on commodity hardware.

Source: Amazon Redshift Database Developer Guide

Key features


Redshift main selling points are

  1. Shared-nothing MPP high performance DB not requiring to build redundant data access structure. No need to create and maintain redundant structures like indexes, materialized view, physical aggregations and summary tables, ..;
  2. Suited to a broad range of schema model. No need to denormalize or build dimensional data model for improved performance;
  3. Columnar storage offering high compression ratio and efficient utilization of I/O. Only data block needed are read, as opposed to row-oriented storage DB (BI application's queries usually involve a fraction of available attributes and thus penalized row-oriented DB where all attributes are read for each block I/O).
  4. Leverage low-cost commodity computer with DAS storage (either mechanical HDD or onboard flash SSD) common to all Cloud-based environments

Schema model


The Redshift/PADB is supposedly schema agnostic, in the sense that you don't need to favor one modeling technique over another. Let me comment on that a bit.

Data warehouse implementation using RDBMS uses a variety of techniques built solely to improve performance. We add physical-based structure such as indexes and Materialized Views, also schema-based structure like denormalized Dimensional models, and potential external one using complete different physical schema (like MOLAP). All these add up both in complexity and storage requirement. Redshift moves away from this trend by only storing the needed data (typically modeled in 3rd NF) without redundancy.

The most performance gain from Redshift is coming mainly from 1) columnar orientation and 2) MPP features. First point discourages the use of highly normalized models. These models tend to have narrow tables (with few attributes) and consequently will not gain much I/O performance (for each block read) compared to row-oriented databases. Furthermore the MPP features induces locality sensitivity during join and aggregation operations. Joining rows locally is best, but performance will degrade quickly once these rows get distributed across different nodes. This again discourages highly normalized model, especially for relationships/association modeled as m-to-n structures where rows are distributed using only one foreign key.

Personal recommendation: I would favor some level of de-normalization of the data model (especially if your model was designed for integration purposes) but without going to the extreme of defining only a few wide denormalized dimensions.

Redshift Construct


There are numerous constructs in Redshift not commonly found in typical relational DB. These need to be considered during physical data modeling and are determinant for performance aspects.

Data Distribution style

As with any cluster-based MPP database, Redshift scalability is obtained by distributing data among cluster nodes. The goal is to distribute workload uniformly among all nodes and to have data collocated (data from different tables participating in joins and aggregate stored on same computer node).

Rows can be distributed in different ways:

1. EVEN Distribution: rows are distributed evenly in a round-robin fashion.
2. KEY Distribution: rows are distributed according to a specified key (known as sharding)
3. ALL Distribution: rows are simply replicated on all nodes

During query execution, the goal is to minimize the (costly) redistribution of rows needed to perform joins and aggregation. So KEY distribution for two larges tables frequently joined together should be defined with their FK/PK pairs. Smaller and rarely updated table will benefit greatly from ALL distribution without much impact on storage space. Volatility is critical as updating tables to all computer nodes can be costly. EVEN distribution is the default style used when no clear a-priori information can dictate which distribution to use.

Column Compression

The columnar storage format implies that data belonging to same column are stored sequentially on disk. This offers good compression potential (column-data has same format and often similar in content) and consequently, better read performance as more data (RS store data in 1Mb disk block) is read per I/O.

The appropriate compression encoding must be chosen for each data column. Besides the basic raw encoding that simply stores data in raw format (no compression), one can choose between:
  1. Byte-dictionary encoding (BYTEDICT)
    • Use a BYTE dictionary where each value is mapped to (up to 256) unique value
    • Applicable to all types except BOOLEAN
    • Ideal for long characters type having less than 256 values (per block)
  2. Delta encoding (DELTA, DELTA32)
    • Store only the delta difference from preceding row
    • Applicable to types with natural ordering (INT, DATE, TIMESTAMP,..)
    • Ideal for datetime or sequential number columns as delta between rows can be represented more compactly
  3. LZO encoding (LZO)
    • Compress data using LZO scheme frequently used with text
    • Applicable to all character types (CHAR, VARCHAR,..)
    • Ideal for long free form text
  4. Mostly encoding (MOSTLY8, MOSTLY16, MOSTLY32)
    • Compress "small size" values while leaving larger one in raw format
    • Applicable to number types (INT, BIGINT, DECIMAL,..)
    • Ideal when data type used is larger than most values stored (only a small fraction needs the extra space)
  5. Run-length encoding (RUNLENGTH)
    • Store values as token plus a count giving the number of consecutive repetition
    • Applicable to all types
    • Ideal when data is often repeated in sequential rows
  6. Text encoding (TEXT255, TEXT32K)
    • Store a separate dictionary of unique words (up to 245 unique values) and store other words uncompressed
    • Applicable to VARCHAR types
    • Ideal when data contains the same words frequently

Sort key

Redshift can define one or more columns per table as sort keys. Rows will be sorted during initial load accordingly. This is especially useful during read where data block can be skipped entirely to minimize I/O, typical with range-restricted predicate (where population between 1000 and 2000). This is possible for every column as Redshift keep track of minimum/maximum values for each data block.

Sort keys can be either 1) Compound or 2) Interleaved. The first is useful when we use the same prefix keys as predicate (or a subset of), i.e. if index: K1, K2, K3 then where K1 = .. and K2 = .. is ok, but not where K2 = ... The second one is used when any combination of keys are found as predicate, without worrying about which one is used.

You define sort keys based on the query that frequently same columns as equality/range predicate. For example, if recent rows are more frequently accessed, then you'd sort on the Timestamp column. If some tables are very frequently joined together, you can even define their Sort key to be also their Distribution key and let the query optimizer use sort merge join (very fast as sorting is skipped) instead of slower hash join.

Best Practices

From the Redshift developer guide, there are a few important practices to keep in mind:
  • Make sure you define a Distribution style for each table, unless the default one EVEN is satisfactory
  • Make sure you define a good Sort key
  • Let COPY command decide on the best Compression encodings
  • Define data relationship integrity constraint (will not be enforced but still help query optimizer)
  • Use smallest format size for data type (temporary table created during long queries uses uncompress data with format defined in DDL)

Martin

Note: I’m now using Markdown for all my notes, so I can convert any part of them into html using a conversion tool such as text-to-HTML tool. Writing blog post this way is such a big time saver!

Wednesday, February 24, 2016

Neo4J, the Graph database

At work, we are confronted with the implementation of a database involving rich and densely interconnected data (.. more relationships than entities) and involving complex recursive query to run.  We thought it could be a good opportunity to test out Graph database.

In a nutshell, Graph database are a good candidate with:
  • highly connected dataset
  • data access through a multi-steps traversals (finding stuff in a network/hierarchy of nodes)
  • data access that is both flexible and unknown (path is changed according to nodes encounter in the way)
  • ultimate flexibility with thousands of classes (node type) having different properties with new classes/properties defined everyday!
We decided to go with Neo4J which seems to be the most stable implementation having the longest track record.  The following are notes that are relevant to Neo4J but some general advice are probably applicable to other implementations. 

I've taken some notes that I share here... (I've also worked more on the Cypher language and will try to post my Python notebook experimentation later).

1) Nodes

These are our entities characterised by attributes, called properties.  It is schema-less, so entity may have different properties. Things to avoid:

  • Nodes with rich properties.  Ex. Country node type with language spoken, currency, etc..  (it's a smell indicating these properties should probably correspond to Nodes) 
  • Nodes should be fine-grained, so avoid nodes representing multiple concept.
2) Relationship

The relationship are for access.  Access using global index is not typical, we do traversing instead using relationship are the primary means for nodes access.
  • Properties on relationship that represent entity (frequent with relationship like "WorkFor", "GrantedTo"..)
  • Used to access neighbouring nodes starting from one nodes and traverse many using the Relationship type and properties.
3) Model

The data model is fun and easy to do simply with whiteboard drawing: you model by Examples and not by Class! So quite suitable to do with business and used a communication tool.
The model is typically normalised and any properties appearing in multiple places should probably be nodes.


4) Anti-Patterns

- Avoid hefty nodes: User nodes with many properties like names, email, address, job, ... a lot of times, these correspond to nodes.

- missing nodes: John emailed Tim, make it 3 nodes with email a node also, and not a relationship between node John and Tim.

- Avoid hot nodes: nodes having a very large number of incoming relationships.


5) Graph Querying

2 approaches are possible for querying a Graph DB:
  1. Graph traversing
    • Start off with some nodes (the roots)
    • Spread out from these according to given path
    • Yield some side-effect in the process (intermediate computation, ..)
  2. Graph pattern matching 
    • Provide an abstract (sub)graph with any number of defined instances
    • Graph engine determines all subgraphs of the larger one fitting the pattern

The first category (imperative language) is used by generic RDF language like SPARQL and Gremlin (this one can support the two). The second one (declarative language like SQL) and is used by language like Cypher.

6) Loading Data 

There are a few ways we can load data into Neo4J: 1) transactional (live update through its REST-API), 2) batch (for high volume like initial load), 3) Cypher (similar to suing DML commands in SQL), or 4) through some API based on Java or other language.

Various tools can also be used for data loading, whether they are XLS-based, Neo4J-shell based, CLI-based (command line programs like batch importer), Neo4J browser application, or ETL-based.


Martin 






Sunday, July 27, 2014

PostgreSQL database


We have been deploying a database service at work which is based on PostgreSQL.    Here's some notes I gathered for the installation and basic configuration of the service.  So far, we are quite impressed and satisfied by the stability of this "World's most advanced open source Database"  (or maybe the quote should be "Oracle wannabe") !

Memory Setting:


Global Wide Server Setting
Shared_buffers* :
This global setting defines amount of memory shared across all connections/process where all data in/out the DB pass through.  Unlike other DBMS, PS should not grab all physical RAM available, as most read/write occurs at the OS level and thus depend on allowing OS's cache to work as usual.   The only exception could be to force WAL writes to bypass the OS cache.   
There is no easy rule exist to get proper sizing of this buffer.   Too large values will exacerbate some issues (ex. "Doubly cached Data", "checkpoint I/O spike" due to large quantity of dirty data regularly stored in cache, and having too small value mean we won't benefit from PS smart eviction buffer cache rule (see clock-sweep algorithm).   But the default value set by initdb is definitely too low as define to guarantee a proper PS start-up and must be changed.
We should start with a value of at least 25% of your on- board memory for server where OS overhead is small (typically the case for any modern system with 1Gb or more), to make sure we avoid "Doubly cached Data" data.
* updating this global setting requires a database restart 

Wal_Buffers : 
Another shared memory setting which controls how much memory used for buffering write-ahead log data.  Default suggests a value of 64 KB, but this is only sufficient as long as no single transaction exceeds that value, and in practice it is recommended to set higher values (not too high as this takes up shared memory).   With current modern system, a general guideline propose to set this to : Wal_buffers = 16 MB.

Maintenance_work_mem: 
This is the total memory allocated for housekeeping activities like vacuuming (getting rid of dead records), as well as a few operations requiring larger memory than normal sorting Work_mem (ex of operations: VACUUM, CREATE INDEX, ALTER TABLE ADD FOREIGN KEY) .   There shouldn’t be many concurrent operations like these, so this value can be set to 5% of total server RAM  (50 Mb of maintenance_work_mem per GB of server RAM). 

Default_statistics_target:
This controls how much statistics are gathered about each table (ANALYZE statement) so that the Query optimization produce better execution plan.  Increasing this value only makes sense if it improves some long queries since it also increases the background overhead of databse maintenance.  The value is now fixed to 100, before increasing it, it is worthwhile to fine-tune specific Table/Column where we know Query plan could be improved (ex. indexes using the LIKE operator) with ALTER TABLE SET STATISTICS statement.

Maximum_connections:
This must be estimated generously as it represent a hard-limit and new client connections will simply be refused once reached (the maximum value estimated after on new installation is = 100).   Each connection takes a small amount of shared memory, so the memory footprint related to this will be quite small in relation to the shared_buffers and work_mem values.  However, setting this to too large value will limit the value we can set for the per-client setting of work_mem value, so we must balance the two.   

Per-client Connection Setting

Work_mem :
This controls the maximum amount of memory allocated per client connection for operation such as sorting, hash join, and others.  Based on the estimate of memory footprint of these operations, PS will more frequently decide to operate on disk instead of memory ! To avoid this one should set large enough value, however be careful that this setting is defined per Sort and not at client connection.   If you have many users connecting, but fairly simple queries, you want this to be relatively low. If you do lots of intensive processing, like building a data warehouse with few users, you want this to be high. (see http://www.depesz.com/2011/07/03/understanding-postgresql-conf-work_mem/).  
This non-shared setting can always be adjusted by client after connecting, but a rule of thumb is to estimate the free RAM after shared_buffers allocation, divided by the max_connections and take a fraction of that figure (ex. 1/3, to remain moderate since it can always be increased on specific client connection where we know we'll need more memory):
work_mem = 1/3 * (free memory - shared buffers ) / max_connection  

Effective_cache_size:
This is an estimate of how much memory you expect to be available in the OS and PostgreSQL buffer caches. It has no affect on actual allocation, but is used only by the PostgreSQL query planner to figure out whether plans under consideration would fit in RAM or not. If it’s set too low, indexes may be underutilized. If you have a dedicated PostgreSQL server, then setting this to 50% of your on-board memory would be a good start, assuming the shared_buffers is set to less than 50% of total memory.  It must be set according to the shared_buffers global setting, as the two cannot exceed total physical memory of server (a too high value may result in both the database and OS cached being disrupted by resulting large number of blocks created to satisfy some query).  

File Systems:

Tuning:

* Read-ahead 
Linux may read ahead disk-block of various size.   Default disk blocks read-ahead is too low: 265K for most DB use-case.  Increasing this value will benefit all sequential read done by PS... so should be set to 4096 or more (4K):
# blockdev --getra /dev/sda
256
# blockdev --setra 4096 /dev/sda  
(to be set for each disk device, and added in the boot script: rc.local)

* File access time 
Disable "last access time" (atime) overhead by adding: noatime to the volume mount options in /etc/fstab:   
/dev/sda1 / ext3 noatime, errors=remount-ro 0 1

* Swapping
For better predictable DB perf, should not let OS swap inactive disk pages out when running low on memory (Linux has caching for its filesystem).  This is done permanently by adding:
vm.swappiness=0 to /etc/sysctl.conf
Forcing the OS to shrink filesystem cache rather than swap.  

* Memory overallocation
Disable Linux behavior to overcommit memory by process.  This is done by adding: 
vm.overcommit_memory=2 to /etc/sysctl.conf

* Write-back Cache:
Write-cache happens at these levels:  1) fileOS cache, 2) Disk controller (or RAID cards) and 3) Disk drive, and to ensure durability of Transactions, we must:
A. Force OS to use fsync call mechanisms to write data securely
B. Use a BBWC (battery-backed write cache) at controller and monitor its health state, so you can take advantage of fast write non-volatile cache!
C. Disable write cache set at the disk drive level (which is always volatile) 
Disabling write-cache at disk drive:
sudo hdparm -W 0 /dev/sda

* Write cache sizing:
We could change both dirty_background_ratio and dirty_ratio parameter (kernel param driving how aggressively cache is written out to disk), but it seems that it is more important when the FS is under ext3.   

Tablespace definition:
We should define different tablespace for these components:
    • 1- System catalog data  
    • 2- Application data
    • 3- Index data
For application table's data (stored within file (max 1Gig) in tablespace, i.e. directory), the goal being to separate Data from Index on different drive, as the two are frequently accessed concurrently.   Furthermore, very large table (many Gbs) could leverage Table partition which stored various chunks of table into separate Tablespace.   Use temp_tablespace for temporary storage are needed to some Query evaluation (eg. when sorting is required).  This tablespace should be designed with space efficiency rather than speed and subject to disk fragmentation. 

WAL (write-ahead logging):
We should add archiving to these WAL redo log files, unless we work in failover mode.

Another proposal to define finer grain Tablespace file systems (partition) for more flexibility: 
/pgarchiveDB Archive location containing the archive log files.
/pgbackupDB Backup location containing the physical and logical backups. For logical backups (pg_dump), use EXPORTS as a sub directory. For physical backups, use FULL{date}, convention for the sub directory. However, physical backups could be handled with file system snapshots. More on this later.
/pgclusterPostgreSQL Cluster data directory (PGDATA environment variable). This will contain all of the configuration files and directories of a PostgreSQL cluster. Note: the on-line WAL files would be located in /pgcluster/pg_xlog.
/pglogThe location of the server log files.
/pgdata-systemThe location of a database’s default tablespace. This is to be used at the creation of the database. The database catalog information will be stored.
/pgdata-tempThe location of the database’s default temporary tablespace. This is needed for temporary sort information. Note: The pgsql_tmp directory within the default tablespace will be used if a temporary tablespace is not defined.
/pgdata-app_tblspcApplication tablespaces. For every schema there should be a minimum of two tablespaces. One for tables and one for indexes.

Disk Minimal Recommendation 


There are no general one-size-fit-all solution as it comes with disk layout, as it will be very much application specific.  The general recommendation is that the more Disk we have the better we can spread/parallelize the workload into separate spindles, hence increasing throughput.  

The current hardware has only two disks (in RAID-1) and cannot be considered as a viable longterm solution.   So presuming, the PostgreSQL service will host many concurrent and highly demanding applications in future, here are the minimal disk requirements we should keep in mind for future improvement.

Disk Layout
1)  Have the OS on separate disk (ex. 2 disks on RAID-1)
2)  Have the WAL transaction-log on separate disk (optimally designed for sequential write)
3)  Have the temp tablespace (used for large user sorting) on a separate disk (optimally designed for random access, here SSD could be a viable solution)
4)  Have all database data on as many disks as possible with RAID 1+0
5)  Have all index-related data on separate disk with RAID 1+0

The principles behind these recommendations are:
  1. Avoiding putting the WAL onto the same drive as OS since it has a completely different access pattern (WAL are purely sequential write)
  2. If there will be large Sorting operation, the Temporary files (temporary_tablespace) should not be stored along the DB files (table + index)
  3. Application data is accessed concurrently with index-data, so separating the two will boost perf on all occasions.


Configuration Scenario
Now how do we configure all these drives is more challenging.   

For example, for a 14 Disk scenario, we could have the following configuration per disk:
1) Scenario 1:
Location
#Disks
RAID level
Purpose
/ (root)21OS
$PGDATA/data6+1+0Database data only
$PGDATA/index2+1+0
Database index only
$PGDATA/pg_xlog2
1

WAL
Tablespace-temp2NoneTemporary files

But, we could as well have the simple configuration:
2) Scenario 2
Location
#Disks
RAID level
Purpose
/ (root)141+0OS, DB, WAL

What scenario is going to perform better is impossible to say without benchmark on the target DB with real data access pattern.
It is only possible to predict that scenario 2 will perform better in workload with more Random access, while scenario 1 will perform better with more Sequential access
However the exact mix of sequential versus random seek access is simply not predictable, so optimization of disk layout in advance is equivalent to guessing!  

Martin

Wednesday, May 08, 2013

NoSQL: how disruptive for BI?


It seems the "NoSQL" term is now seen and used everywhere.  This post discusses this new technology, along with its impact in relation to BI world.


Definition

Trying to find a clear and consensual definition is always a challenge with over-hyped and buzz words more  used as a marketing advantage.  Recently, well respected authors Martin Folwer and Pramod Sadalage wrote a clear and informative book on the full spectrum of noSQL technologies called "NoSQL Distilled: a brief guide to the emerging world of polyglot persistence".

According to the book author, no well accepted definition exists for NoSQL, the original name from the NoSQL Meetup was "open source distributed nonrelational databases".  Rather than focusing on definition, the authors rather focus on a few common points all shared by NoSQL databases:

  • Often run off commodity cluster computing
  • Support of large or extremely large data volumes on clusters
  • Query language is not necessarily SQL (usually a close derivative language is used)
  • Usually Open-source and prominent in Web-based applications ecosystems
  • consistency is not done relying on relational ACID principles  (an effect of running off independent cluster nodes)
  • Use Schema-less, i.e. no static data model structure mandatory

What about Cluster-based Solution proposed by Relational vendor
The commodity cluster models lead to a much cheaper and more scalable solution (scaling-out with much small nodes) than the scale-up alternative (ever increasing size of big monolithic and expansive server).  The cluster version of relational databases (ex. Oracle RAC or Microsoft SQL-server) is not designed to be run off multiple independent nodes, as it actually relies on a common and share disk subsystem.   Their model leads to fast and performant throughput with cluster-aware filesystem writing to a single highly available disk subsystem.   This single point of failure is avoided in robust commodity cluster where a server node may go down (both server hardware internals  & disk)) without impacting the cluster on-going execution (full failover and high availability is guaranteed).


NoSQL database:  Metamodel classification 

The author proposed a neat and convenient way to classify current NoSQL database according to how they store data, i.e. the metamodel used.   The relational metamodel is based on relations (tables) and tupple (rows), current NoSQL uses rather these forms:

      1. Key-value: Data stored at Aggregation* level using an "opaque" form
        • Each aggregate has a key or ID to get at the data
        • Each aggregate is opaque as data is stored using unstructured blob
        • (+) use to store whatever we like with just a constraint on size 
        • (-) less flexible in querying & updating (cannot be done on part of a value) 
      2. Document: Data stored at Aggregation level using a visible and explicit semi-structured form (typical of JSON)
        • Each aggregate may have a key or ID to get at the data, but not necessarily as  query can be done on any fields within the aggregate
        • Each aggregate is stored within a defined structure imposing a constraint on what's data is allowed 
        • (+) more flexible in querying (part of a value can be queried or updated)
      3. Columnar: Data is stored using a two-level map with grouping of columns including all rows.
        • Data is identified with a row-ID followed by a column-name normally part of a column-family
        • Query sequence pattern:  Row-key -->  Column-key --> "values"
        • (+) optimal with read-oriented app (ex. BI)
        • (-) not ideal for writing-oriented app 
      4. Graph: Special database using graph metamodel (i.e. nodes & edges) to store records with complex interconnections
        • Each entity or nodes can be queried in relation to interconnected nodes typical of social networks, product preferences..
        • (+) querying is optimal compared to  the relational model which require FK relationship and expansive join operation
        • (-) focus on querying results in more complex & less performant data insertion 
        • (-) less suited with storage across distributed cluster nodes 

*The term aggregation is used to refer to data unit more complex than just a list of values or tuples.  Typically, data unit is consisting of a complex set of records with associations and nested structure, all of which must be manipulated atomically (ex. think about one customer with all related addresses, or Order with all item details, payment info and customer attached to it).


Technologies

Looking at impressive number of open source implementations available (the list just names a few out of the 150 databases currently listed here), we have an indication of the potential growth of this technology, but also an indication of its maturity level... only a few important implementations could likely survive!

A lot of NoSQL database stores may not have clear-cut storage metamodel fitting exactly the definition of M. Fowler et al. metamodel taxonomy.  So depending on interpretation, we may find some stores in various categories, please refer to NoSQL online reference for other interpretation and up-to-date data on this active field.

    1. Key-value Store: 
      • Redis
      • Voldemort
      • Oracle BDB 
      • Amazon SimpleDB 
      • Riak
    2. Document Store:
      • Apache CouchDB
      • MongoDB
      • BaseX
    3. Columnar or Column store: 
      • Cassandra
      • HBase (Hadoop)
      • HyperTable 
      • Apache Accumulo
      • MonetDB
    4. Graph Database: 
      • Neo4J
      • InfoGrid
      • Infinite Graph



BI perspective

I consider Graph database and Column-store to have the greatest impact in BI and to have the potential of being disruptive to current BI market.

For Column-store, this is easily understandable as BI is lot about scanning and other read-only data access where often just a few fields may be fetched out of much wider and flat record.  As a testimony of this, we can look at the number of commercial vendors purchasing existing solution and/or integrating column-store into their own platform :

    • Sybase IQ now part of SAP, probably one of the original pioneer of columnstore in commercial scene
    • Vertica (purchased by HP), the 5-year old vendor specialized and fully dedicated to column-store
    • Greenplum (purchased by EMC) with their "polymorphic" data storage  
    • Paraccel (very recently purchased by Actian) combine column-store with an in-memory engine  
    • Oracle with a twist that they refer to as Hybrid columnar (rows still being used) 
    • Microsoft SQL-Server proposes a compromised approach called "columnstore index" where data update is no longer permitted!  

As for Graph database, I think great opportunity is possible by mining the relationship and interactions contained explicitly or implicitly within these graph content.  Lots of research are currently done on work aiming at providing BI-style analysis against Graph-based data.  Here're two great examples:  http://www.edbt.org/Proceedings/2012-Berlin/papers/workshops/beweb2012/a3-bleco.pdf or else http://perso.ecp.fr/~aufaurema/Graphs%20for%20BI%20(printable).pdf


Martin

Saturday, September 17, 2011

Oracle choices for Multidimensional Analysis

Here’s a quick assessment highlighting the main differences between Hyperion Essbase vs Oracle OLAP. Knowing that both products are now under the same ownership, I thought this should be archived before it gets completely outdated!

Oracle Essbase Oracle OLAP
What-is? clip_image001 Standalone OLAP Server: separate to the Oracle database
clip_image001[1] End-user focused, popular among business users as data access done via Excel
clip_image001[2] Fully multidimensional OLAP engine with support of MDX and XML/A (rather than SQL)
clip_image001[3] Oracle OLAP option: in-database solution available in Oracle Enterprise (latest version 11g)
clip_image001[4] Aggregation management solution for SQL-based BI applications
clip_image001[5] Alternative to table-based materialized views solution, offering better query perf and fast incremental load
clip_image001[6] A “real” multidimensional OLAP server paradigm and not a simple SQL relational hack
Underlying Technology Storage Used: Server MOLAP proprietary storage
Concept Used : Full OLAP capabilities leveraging business rules and names, with Dimensions, facts, hierarchies outlines, consolidation rules, write-back, annotate, calculated metric, etc
Data Access : MDX access exposing full OLAP capabilities model, access API choices between Java and XML/A, Essbase query tools.
Storage Used: multidimensional arrays in DB
Concept Used : Structured around business rules and naming, with Dimensions, facts, hierarchies, aggregation rules etc
Data Access : Fast data access through “cells” (direct address) storing either details and summary data but leveraging SQL access mode
OLAP Engine : advanced analytics like time series analysis, non-additive calculations, financial and statistical models
Key Benefits Ø Renown and proven technology with important customer base and long history background

Ø Use to power many of the Hyperion performance management applications
Ø Oracle Backend-Only Solution, no need for:
a. external metadata
b. exporting data
c. separate server
d. other backend application
Ø Easier than Materialized View Solution,
a. Avoid the complexity of choosing which MV to create and how many possibly creatable
b. All summaries stored in a single OLAP multidimensional Cube
c. CBO treats OLAP cubes as MVs: rewrites queries to access cubes transparently; refresh using MVs standard

Ø Easy access to detail transaction as stored directly in Cube (transparently)

Drawbacks Standalone application requiring:
1. Dedicated server hardware
2. External metadata, hence business rules, def.. duplications
3. Extraction and import of DWH data
Currently, not fully integrated within Oracle infrastructure
Does not offer a full compliant multi-dimensional access to OLAP data :
1. No MDX, XML/A
2. No real OLAP-aware query, only support SQL language (olap extension)
3. No best suited for planning & forecasting tools

Martin

Monday, August 08, 2011

QlikView: introduction to "In-memory BI technology"



I've been introduced to QlikView lately and it has been somewhat a refreshing experience compared to more traditional BI solution.   So this post will give some of the notes taken out of training and after experimentation of the tool.

It turns out I do enjoy it today as a data explorer tool.   This allows me to easily navigate through some unknown data model and quickly understand data relationship and hidden rules (not enforced explicitly with defined database constraint).    I find this a faster way to explore new data stored in database than going through a lot of SQL request/result analysis iteration.  

Here's some additional notes I've gathered:

At 10,000 feet level, there are two major key differentiators :
    1. The associative nature of the data architecture domain*
    2. The all in-memory principle

*The associative term is somewhat abusive as it should not be confused with associate data structure (ref).  It seems to be rather related to the way front-end dynamically associates data element under selection, and also probably to the automatic association between elements having identical filed name in the data model.   It seems that the data is linked using Vector-based association with pointer-type reference.

The underlying data model is actually quite simple:  it is built around the familiar concept of datasets with related tables having a number of fields (tabular format).    However, there are fundamental rules to respect in regard to that simple model:

    1. Limit all relation key to a single field (any PK combined fields also used as FK will generate a so-called Synthetic table by QV....which must be avoided)
    2. Eliminate all circular reference within the whole Data model
    3. The field key used for relationship between tables must all have identical name (automatic relation generated by QV)  

The number of tables can easily reach hundreds while total number of fields reaching thousands.  Datasets are always pre-loaded into memory (mandatory) in a compressed format (apparently not using columnar data storage but rather record-based table representation) with very good compression ratio near 10:1**.   (from what I could find, the in-memory data is stored through direct compiled machine code ... explaining why only Intel-based processor -multi-core included- are supported as different port would involve intense engineering re-coding and re-optimization).

**As a sidenote: anyone familiar with the de-normalized structure of WH data will not be surprised by this ratio, considering that all data redundancy will be eliminated with the use of pointer in the data structure.

As stated in the 3rd rule, associations are generated automatically against the entire dataset by following this simple convention: all pairs of fields in different table having identical name are automatically part of an association.   No particular data model (like star-schema) is assumed, but to remove possible data incoherence/ambiguity, each pair of table must have a single join path (as stated in 2nd rule above).  Actually, QV will simply break all redundant links (randomly) to avoid the closed-loop within the data model.  This implies that for star-schema model it is practically impossible to traverse multiple fact table within the same QV dataset, except for the rare case where only one common dimension is found between two star-schema...   

After having configured all your input data and loaded up in QV, you get a data model similar to:
Figure 1.  Example of Data model defined in QlickView.

All associations are managed at the engine-level and not at the application-level (a typical BI app would  need to explicitly manage, store and maintain associations between individual queries association).  With QlikView, you always get the associated data as well as the un-associated data elements well highlighted (black and grey-out respectively).  That's highlighted by their commercial pitch :  QlikView is not a query-based tool!  Meaning that data is always linked or associated during analysis instead of having go through iterative query-based analysis in more or less isolated steps:

business question --> query definition --> launch query --> analysis query result -->refine business question --> redefine query definition --> etc...


The data access architecture is not well documented, and it seems to be always fetching data via memory scan, avoiding to have indexes, pre-computed aggregate typically found in other BI-tool in some form or another  (see reference The underlying techno..).  .  

The dataset maximum size that can fit within your host will not be straightforward to calculate as it will depend on the server memory size,  compression factor and obviously on other parameters like number of dashboard pages and objects, number of concurrent users connecting to Qlikview server, etc..   Typically, the solution is currently designed for gigabytes-scale databases, but obviously this keeps improving (thanks to Moore's law) especially with the 64-bit server architecture.   


Some drawbacks:
    • as always with this sort of tool, you need to extract the data outside your main data repository (whatever that may be) and loaded up inside other server host
    • this extract/re-hosting step opens the door to KPI rules duplications/inconsistencies, and uncontrolled data rule transformations that will likely  leak outside your central data transfo rules metadata (if you're lucky tro have one)
    • scaling and data volume limitations: I came across quite a few Memory error during loading process when dealing with bigger database ... happens especially when loading wider tables (many attributes).  The major drawback is that you learn this the hardway...yon only know your database will not fit inside your hardware after spending a considerable amount of time trying to load it into memory!!   On some occasions, you actually will need to kill the ‘Not Responding’ app before you can get any response:



Other miscellaneous notes and observations :
    • All fields within the data model end-up being an axis of analysis!  Even fact-type measure can be queried and searched by.value (more commonly by rinterval of values).
    • A file *.QVW is a self-contained QV dataset which potentially include : all data scripting (for data loading into memory), the data itself stored natively in-memory and the presentation stuff (all dashboards and report views)
    • Make use of the export native file format QVD, as it offers much better loading performance and good compression compared to the legacy data source.
    • we should limit the number of association as it will always be quicker to fetch data within same table.   In theory, the complete dataset could fit into a single wide table, however this has some drawbacks like data model understandability.
    • Ideally replace Distinct count formula (e.g. Count(distinct EmployeeId)) by adding a simple counter on the table on which we need to count distinct value (distinct involved creating a temporary structure with unique field value on-the-fly).
    • Limit the number of tables... so if you have lots of snowflaking inside your WH, try to merge during Data loading into QV.
    • When concatenating combined-primary keys into a single field, use Autonumber()  [e.g. Autonumber(OrderId & '-' & lineItemNo) ] function to end up with number-based key (instead of String-based key taking more memory)
    • All data storage is either :  1- String-based or 2-Number-based (only possible when all values within the table can be interpreted as valid number)
    • Number values are read in sequence... i.e. if we have these identical sequence of number:  1.0, 1, 1.000, then only the first value read will be kept (i.e. 1.0)

Martin



Tuesday, April 01, 2008

SQL Analytic


In my current consulting activity, I’m involved in a project relying on Oracle 10g for most of its data warehouse infrastructure. I thought it would be a nice opportunity for me to brush-up on my SQL skills.

The OLAP SQL functions (also referred to as Analytic Functions) come in very handy when dealing with data reporting or analytical needs. So this post will summarize the SQL analytic functions available in Oracle (some of them since 8i and other since 9i).

Although reports produced using these functions could very well be done using SQL or PL/SQL workarounds, these would be more clumsy and potentially less than optimal solution (depending on how well Oracle Optimize process these functions).


These functions can be roughly classified into these categories:

1. Ranking (show top 5 or bottom 5?),

2. Reporting Aggregation (compare values at different level of aggregation, e.g. region sales versus national, market share?),

3. Window Aggregation (moving average, cumulative sum?)

4. Row comparison (studying change or variation over sequential row, LAG/LEAD, or get first and last value of sorted groups with FIRST/LAST?)

It is important to note that these functions operate just before the ORDER BY clause of a query (which occurs last on the SQL process sequence order), or at the very last step when no such clause is present. Also important, these functions are not allowed within the WHERE, FROM and HAVING clauses, and are really meaningful within the result set of a query (i.e. in the SELECT clause) and more rarely in the ORDER BY clause.

The follow the following generic form:

function(<arguments>) OVER(<analytic clause>)

where

+function is the analytical function (e.g. AVG, CORR, COVAR_POP, COVAR_SAMP, COUNT, CUME_DIST, DENSE_RANK, FIRST, FIRST_VALUE, LAG, LAST, LAST_VALUE, LEAD, MAX, MIN, NTILE, PERCENT_RANK, PERCENTILE_CONT, PERCENTILE_DISC, RANK, RATIO_TO_REPORT, STDDEV, STDDEV_POP, STDDEV_SAMP, SUM, VAR_POP, VAR_SAMP, VARIANCE)

+arguments is optional and could include more than one (<argument1, argument2, argument3>)

+analytic clause can be of one the following: Query Partition clause (to break a result set into multiple partitions according to a criteria set expression); Order-By clause (for sorting data into each partition); and Windowing clause (to define a window of data).


1. Ranking functions

These functions role is to provide ordering info to records based on some given attribute. The basic three functions available (i.e. ROW_NUMBER, RANK and DENSE_RANK) only differ in the way they manage ordering ties. Here’s a very simple sample to illustrate the differences:


RANK and DENSE_RANK rules allow for the top-n query, which is not possible with ROW_NUMBER as it does not treat tie.


Note that we can also control the orderings of nulls value by specifying the self-explanatory clauses NULLS LAST or NULLS FIRST (default) after the “ORDER BY attr” clause.

Other functions can be classified within the Ranking category : PERCENT_RANK, CUME_DIST. They operate very similarly as the previous ones (e.x. PERCENT_RANK() OVER(ORDER BY attr), but compute the cumulative fraction typical of histograms.

In conclusion, the ranking functions offer a better alternative than relying on the well-known hack (creating a subselect virtual table with ORDER BY used in combination with the ROWNUM function) that can handle tie correctly.


2. Reporting Aggregation functions

Most common aggregation SQL functions (like SUM, COUNT, AVG, MAX, MIN, VARIANCE, etc) can be used as analytic functions as well, eliminating the constraint of having the aggregate value to be at the same level as the row set (mandatory when using GROUP BY). We can then have a global aggregated value presented with all data rows, as well as an aggregated value at a different partitioned window level all in a single query statement. Here’s a self-explanatory example with AVG:


Also belonging in this category is the RATIO_TO_REPORT which gives ratio of a value versus an aggregated comparator. Actually, this function is simply a shortcut version of : value / SUM(value) OVER() == RATIO_TO_REPORT (value) OVER().

We also include in this category the function ROLLUP and CUBE which both provide aggregated values at higher level (the difference being that CUBE will provide aggregate values throughout all combination of attributes) as illustrated by this simple example:


Notice the use of GROUPING function (which return 1 when aggregation was done over a specified attribute and 0 otherwise) to help differentiate between NULL values in the table and artificial NULL generated for reporting purposes.

Window Partitioning Operator

We often used in combination with the Ranking and Reporting aggregation functions, an operator known as the window partitioning operator. This operator role is to provide a way to separate data into grouping and perform the functions on these groupings. It relies on the use of PARTITION BY placed in the analytic clause. Here’s a simple example illustration the window partitioning used with the RANK() function:


We can see that the second RANK function now operates on a defined grouping (here defined by Grouping column) and not the overall set of rows.


3. Window Aggregation functions

These functions permit the calculation of moving average on a set of ordered data records. For example, to get a moving average using three data points window (one in the past, one current and one in future), we could use :

AVG(value) OVER(ORDER BY t ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

As found in the following sample:


Note that for missing values on the first and last record, the preceding and following records are considered NULL, and thus not used in the AVG function (see the first moving AVG value: 150 which is calculated from only 100 and 200).

This is a general rule applicable for treating NULL values when doing aggregation function, i.e. always ignored in the calculation.

The window aggregate function can also be used to calculate running or cumulative sum. This is possible using the ROWS UNBOUNDED PRECEDING and CURRENT ROW clauses for the window boundary definition. This gives:


As a final comment about this category of functions, let’s note that as in the case of ranking and report aggregation, it is possible to partition their calculation using the PARTITION BY clause.


4. Row Comparison

The row comparison functions are very useful for showing previous and subsequent records data on the same row as the current one within an ordered sequence of records. These functions are LAG and LEAD. These again are very simple to use as demonstrated in this small example:


The more generic format allow us to define offset and default values for NULL, i.e.: LAG(attr, offset, default_null).

Again, these functions can also be partitioned similarly as the previous ones. Anyone who has hacked a solution for this type of report by relying on self-join query can certainly appreciate the clearness and compactness of these functions.

Martin