Friday, July 01, 2016

Redshift Cloud-based data warehousing solution


Architecture


Redshift is implemented using the ParAccel Analytical Database, or PADB (ref). From ParAccel's definition:

PADB is architected as a columnar, compressed and massively parallel relational database management system that is capable of all-in-memory, if desired.

The main component is the Leader Node (the interface to the outside world), whose responsibilities are to manage communication between nodes, to control sessions, to parse and optimize queries, and to schedule execution of the workload (it does not participate in any data operations).

The second component is the set of Compute Nodes, responsible for processing and storing data and for sharing the workload evenly among themselves. These nodes receive compiled query code from the Leader Node (based on the execution plan) and send back results once execution is finished.

Each node works with its own disks, memory and CPU, but nodes may depend on each other when intermediate result sets are transferred during join/aggregation operations. To increase parallelization (through multi-core capability), these nodes are further partitioned into slices, each with a dedicated portion of memory and disk space, where queries and database processing are finally executed.

The other component is the Communication Fabric, which uses custom Gigabit Ethernet to give the Compute and Leader Nodes high-bandwidth, private, close-proximity and very high-speed communication. The Compute Nodes are located on an isolated network that is not accessible to client applications.

A SAN backend is not mandatory but is useful for backup, snapshot and restore when running in High Availability (HA) mode. In fact, Redshift is optimized to work with fast direct-attached storage (DAS), as is typical of Cloud-based clusters built on commodity hardware.

Source: Amazon Redshift Database Developer Guide

Key features


Redshift's main selling points are:

  1. Shared-nothing MPP high-performance DB that does not require building redundant data access structures. No need to create and maintain redundant structures like indexes, materialized views, physical aggregations and summary tables, ..;
  2. Suited to a broad range of schema models. No need to denormalize or build a dimensional data model for improved performance;
  3. Columnar storage offering a high compression ratio and efficient utilization of I/O. Only the data blocks needed are read, as opposed to row-oriented storage DBs (BI queries usually involve a fraction of the available attributes, which penalizes row-oriented DBs where all attributes are read for each block I/O);
  4. Leverages low-cost commodity computers with DAS storage (either mechanical HDD or onboard flash SSD) common to all Cloud-based environments.

Schema model


Redshift/PADB is supposedly schema-agnostic, in the sense that you don't need to favor one modeling technique over another. Let me comment on that a bit.

Data warehouse implementations using an RDBMS rely on a variety of techniques built solely to improve performance. We add physical structures such as indexes and materialized views, schema-level structures like denormalized dimensional models, and potentially external ones using a completely different physical schema (like MOLAP). All of these add up, both in complexity and in storage requirements. Redshift moves away from this trend by storing only the needed data (typically modeled in 3NF) without redundancy.

Most of Redshift's performance gain comes from 1) its columnar orientation and 2) its MPP features. The first point discourages the use of highly normalized models. These models tend to have narrow tables (with few attributes) and consequently will not gain much I/O performance (for each block read) compared to row-oriented databases. Furthermore, the MPP features induce locality sensitivity during join and aggregation operations. Joining rows locally is best, but performance degrades quickly once these rows get distributed across different nodes. This again discourages highly normalized models, especially for relationships/associations modeled as m-to-n structures, where rows can be distributed using only one of the foreign keys.

Personal recommendation: I would favor some level of de-normalization of the data model (especially if your model was designed for integration purposes), but without going to the extreme of defining only a few wide denormalized dimensions.

Redshift Constructs


There are numerous constructs in Redshift not commonly found in typical relational DBs. These need to be considered during physical data modeling and are determinant for performance.

Data Distribution style

As with any cluster-based MPP database, Redshift's scalability is obtained by distributing data among the cluster nodes. The goal is to distribute the workload uniformly among all nodes and to have data collocated (data from different tables participating in joins and aggregates stored on the same compute node).

Rows can be distributed in different ways:

1. EVEN Distribution: rows are distributed evenly in a round-robin fashion.
2. KEY Distribution: rows are distributed according to a specified key (known as sharding)
3. ALL Distribution: rows are simply replicated on all nodes

During query execution, the goal is to minimize the (costly) redistribution of rows needed to perform joins and aggregations. So KEY distribution should be defined on the FK/PK pair of two large tables that are frequently joined together. Smaller and rarely updated tables will benefit greatly from ALL distribution, without much impact on storage space. Volatility is critical, as propagating updates to all compute nodes can be costly. EVEN distribution is the default style, used when no clear a-priori information dictates which distribution to use.
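
To make this concrete, here is a minimal sketch (not from the Redshift docs; cluster endpoint, credentials, table and column names are all invented) of how the three styles could be declared in the DDL and submitted through the psycopg2 Python driver:

```python
# Minimal sketch: declaring the three Redshift distribution styles.
# Endpoint, credentials, tables and columns are placeholders.
import psycopg2

DDL_STATEMENTS = [
    # Two large tables frequently joined on customer_id: KEY distribution
    # collocates matching rows on the same compute node.
    """CREATE TABLE sales (
           sale_id     BIGINT,
           customer_id BIGINT,
           amount      DECIMAL(12,2))
       DISTSTYLE KEY DISTKEY (customer_id);""",
    # Small, rarely updated dimension: ALL replicates it on every node.
    """CREATE TABLE country (
           country_code CHAR(2),
           country_name VARCHAR(100))
       DISTSTYLE ALL;""",
    # No obvious join/aggregation key: default EVEN (round-robin).
    """CREATE TABLE web_log (
           event_ts TIMESTAMP,
           url      VARCHAR(2048))
       DISTSTYLE EVEN;""",
]

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin", password="***")
with conn, conn.cursor() as cur:
    for ddl in DDL_STATEMENTS:
        cur.execute(ddl)
conn.close()
```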

Column Compression

The columnar storage format implies that data belonging to the same column is stored sequentially on disk. This offers good compression potential (column data has the same format and is often similar in content) and, consequently, better read performance, as more data is read per I/O (Redshift stores data in 1 MB disk blocks).

The appropriate compression encoding must be chosen for each data column. Besides the basic raw encoding that simply stores data in raw format (no compression), one can choose among the following (a DDL sketch follows the list):
  1. Byte-dictionary encoding (BYTEDICT)
    • Uses a one-byte dictionary mapping each of up to 256 unique values per block
    • Applicable to all types except BOOLEAN
    • Ideal for long character types having fewer than 256 unique values (per block)
  2. Delta encoding (DELTA, DELTA32K)
    • Stores only the difference from the preceding row
    • Applicable to types with a natural ordering (INT, DATE, TIMESTAMP, ..)
    • Ideal for datetime or sequential-number columns, as the delta between rows can be represented more compactly
  3. LZO encoding (LZO)
    • Compresses data using the LZO scheme, frequently used with text
    • Applicable to all character types (CHAR, VARCHAR, ..)
    • Ideal for long free-form text
  4. Mostly encoding (MOSTLY8, MOSTLY16, MOSTLY32)
    • Compresses "small" values while leaving larger ones in raw format
    • Applicable to number types (INT, BIGINT, DECIMAL, ..)
    • Ideal when the data type used is larger than most of the values stored (only a small fraction needs the extra space)
  5. Run-length encoding (RUNLENGTH)
    • Stores a value as a token plus a count giving the number of consecutive repetitions
    • Applicable to all types
    • Ideal when values are often repeated in consecutive rows
  6. Text encoding (TEXT255, TEXT32K)
    • Stores a separate dictionary of unique words (up to 245 unique values) and stores other words uncompressed
    • Applicable to VARCHAR types
    • Ideal when the data frequently contains the same words
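
Here is the DDL sketch promised above: a hypothetical table where each column gets an ENCODE clause echoing the list (table, columns and encoding choices are invented for illustration):

```python
# Minimal sketch (hypothetical table): one ENCODE clause per column.
CREATE_EVENTS = """
CREATE TABLE events (
    event_id   BIGINT         ENCODE delta,      -- sequential ids, small deltas
    event_ts   TIMESTAMP      ENCODE delta32k,   -- naturally ordered values
    country    CHAR(2)        ENCODE bytedict,   -- few distinct values per block
    status     VARCHAR(10)    ENCODE runlength,  -- long runs of repeated values
    message    VARCHAR(65535) ENCODE lzo,        -- free-form text
    view_count INT            ENCODE mostly8     -- mostly small numbers in a wide type
);
"""

# The statement would be executed through a driver such as psycopg2,
# as in the distribution-style sketch above.
if __name__ == "__main__":
    print(CREATE_EVENTS)
```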

Sort key

Redshift can define one or more columns per table as sort keys; rows are sorted accordingly during the initial load. This is especially useful during reads, where data blocks can be skipped entirely to minimize I/O, typically with range-restricted predicates (e.g. WHERE population BETWEEN 1000 AND 2000). This is possible for every column, as Redshift keeps track of minimum/maximum values for each data block.

Sort keys can be either 1) Compound or 2) Interleaved. The first is useful when predicates use the same key prefix (or a subset of it): with a sort key on (K1, K2, K3), a predicate WHERE K1 = .. AND K2 = .. benefits, but WHERE K2 = .. alone does not. The second is used when any combination of keys can appear as predicates, without worrying about which one is used.

You define sort keys based on the queries that frequently use the same columns as equality/range predicates. For example, if recent rows are accessed more frequently, you would sort on the timestamp column. If some tables are very frequently joined together, you can even define their sort key to be also their distribution key and let the query optimizer use a sort merge join (very fast as sorting is skipped) instead of the slower hash join.
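
A minimal sketch of both options (all table and column names are invented); the first pair of tables also reuses the join key as both distribution and sort key, which is what enables the merge join mentioned above:

```python
# Minimal sketch of compound vs. interleaved sort keys (hypothetical tables).

# Compound sort key, also used as distribution key on both sides of a
# frequent join, so the optimizer can choose a sort merge join.
ORDERS_DDL = """
CREATE TABLE orders (
    customer_id BIGINT,
    order_ts    TIMESTAMP,
    amount      DECIMAL(12,2))
DISTKEY (customer_id)
COMPOUND SORTKEY (customer_id, order_ts);
"""

CUSTOMERS_DDL = """
CREATE TABLE customers (
    customer_id BIGINT,
    name        VARCHAR(200))
DISTKEY (customer_id)
COMPOUND SORTKEY (customer_id);
"""

# Interleaved sort key: any combination of the three columns used as a
# predicate can skip blocks, at the cost of heavier maintenance.
PAGE_VIEWS_DDL = """
CREATE TABLE page_views (
    view_date DATE,
    site_id   INT,
    user_id   BIGINT)
INTERLEAVED SORTKEY (view_date, site_id, user_id);
"""

if __name__ == "__main__":
    for ddl in (ORDERS_DDL, CUSTOMERS_DDL, PAGE_VIEWS_DDL):
        print(ddl)
```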

Best Practices

From the Redshift developer guide, there are a few important practices to keep in mind:
  • Make sure you define a Distribution style for each table, unless the default EVEN style is satisfactory
  • Make sure you define a good Sort key
  • Let the COPY command decide on the best Compression encodings (see the sketch after this list)
  • Define data relationship integrity constraints (they will not be enforced but still help the query optimizer)
  • Use the smallest format size for each data type (temporary tables created during long queries use uncompressed data with the format defined in the DDL)
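
Here is the COPY sketch referred to in the list (bucket, IAM role, cluster endpoint and credentials are placeholders): with COMPUPDATE ON and an empty target table, Redshift samples the incoming data and picks the column encodings itself.

```python
# Minimal sketch: load from S3 and let Redshift choose compression encodings.
import psycopg2

COPY_SALES = """
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
FORMAT AS CSV
COMPUPDATE ON;
"""

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin", password="***")
with conn, conn.cursor() as cur:
    cur.execute(COPY_SALES)
conn.close()
```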

Martin

Note: I’m now using Markdown for all my notes, so I can convert any part of them into HTML using a conversion tool such as a text-to-HTML converter. Writing blog posts this way is such a big time saver!

Monday, March 14, 2016

Cloud vs data warehousing


Things move quickly when software moves to the Cloud. For the software vendor, there is no longer a need to maintain x number of code versions to fit everyone's hardware preferences. Adoption can also go very fast, as the entry cost is pretty much nothing for any new client interested in trying out the offsite platform.

I'm currently spending effort getting up to speed with Cloud-provided DW solutions. They could be disruptive to a conservative industry like DW. Here are a few factors I see impacting the adoption rate.

Cloud computing trend

In recent years, the trend of cloud computing has increased both in scope and in popularity to reach near ubiquity. It started off with simple needs like storage space, evolved into more advanced offerings like computing resources letting users run their own software (PaaS), and has recently moved up the abstraction ladder with complete solution and service stacks being offered (SaaS).

In the realm of Business Intelligence and data warehousing, this trend did not pick up as rapidly (political and security worries), but things are quickly catching up with recent offerings like Redshift (Amazon AWS), SQL Data Warehouse (Microsoft Azure) and the Elastic Data Warehouse Service (Snowflake). There are many reasons to believe this will grow a lot more, but the simplest and quickest answer is: cost! Doing Business Intelligence and data warehousing projects with an on-premise platform is an expensive adventure! (link)

Data Deluge and new Tools

The Business Intelligence domain has also increased in complexity recently, following the advent of what has been called the data deluge. The appearance of a multitude of new data sources (social media, connected devices, the Internet of Things) has challenged the relevance and adaptability of traditional BI solutions. BI tools and platforms were designed mostly around data generated by operational systems, where data type (structured) and volume were still manageable.

Besides having created its own set of new buzzwords like Big Data, NoSQL, Data Science and Predictive Analytics, the current trend has clearly been disruptive in terms of data types (semi-structured and unstructured) and data volume/velocity.

The industry and the open-source community were quick to devise new tools borrowing ideas from Google’s scalability architecture (commodity hardware and the map-reduce paradigm), focused mostly on handling the sheer data volume and velocity or burst rate. These tools have been loosely defined and grouped under the NoSQL category. No matter how good these tools are at handling the 3 V's of big data, they all fall short on the most important aspect of any BI analytics solution: the data integration bit!

The introduction of these new tools significantly increases BI environment complexity and thus requires, more than ever, formal and proper architecture blueprints and principles. Otherwise the integrity of your entire BI platform is jeopardised, which will cause on-premise TCO to increase even more. The solution is to rely on a Cloud solution provider to supply solid architecture expertise and to offer a new cost model with no initial investment.

Data Integration & architecture (more relevant than ever)

Everyone today is seeking to extract useful information from raw data in order to gain knowledge and make better, informed decisions (i.e. data-driven decisions). No matter how valuable your data asset is, there will be many shortcomings if it is kept in its raw format.

Raw data is inconsistent, incomplete, plagued with errors, unconstrained, unformatted, outdated, etc. Modeling data implies applying rules/checks, standardization and harmonization across the different sources. Data integration implies doing quality assessment and data correction. Hence only data integration and modeling can lead us to the “desired” qualities of data, i.e. the five C’s of data (see Rick Sherman's BI guidebook):
  1. Clean (dirty, missing data treated) 
  2. Consistent (knowing which data version is right) 
  3. Conformed (enterprise-wide common and comparable data) 
  4. Current (data latency adapted to context) 
  5. Comprehensive (breaking departmental or section silos)

It is really hard work, time-consuming and requires a lot of thinking… but there is no alternative. Don’t believe in any magic solution or silver-bullet tool that promises to transform raw data into actionable information!

We are faced with a multi-source data integration problem: the new sources of the data deluge should be integrated with, and complement, the existing and traditional BI data layers in some form or another (consolidated/aggregated), and not merely bury us with massive and unwieldy data, i.e. the deluge!


Transforming your raw data into knowledge (the "golden" path: raw data -> information -> knowledge) is conditioned on preliminary work focused on analysing and modeling the data. This is especially relevant with the semi-structured data found in Big Data, where no explicit model or constraints exist. There are usually only implicit rules (schema-less), making raw data much harder to work with.

How can a Cloud-based solution help with that? Along with your Cloud solution provider comes a highly skilled and specialised team dedicated to data integration and to the particular Cloud technology. You can then avoid having to find and train your own team for that purpose. Everyone should focus on their own core competency and avoid building/duplicating specialised skills.

Martin

Wednesday, February 24, 2016

Neo4J, the Graph database

At work, we are confronted with the implementation of a database involving rich and densely interconnected data (.. more relationships than entities) and complex recursive queries to run. We thought it could be a good opportunity to test out a Graph database.

In a nutshell, Graph databases are a good candidate for:
  • highly connected datasets
  • data access through multi-step traversals (finding things in a network/hierarchy of nodes)
  • data access that is flexible and not known in advance (the path changes according to the nodes encountered along the way)
  • ultimate flexibility, with thousands of classes (node types) having different properties and new classes/properties defined every day!
We decided to go with Neo4J, which seems to be the most stable implementation with the longest track record. The following notes are specific to Neo4J, but some of the general advice is probably applicable to other implementations.

I've taken some notes that I share here... (I've also worked more with the Cypher language and will try to post my Python notebook experimentation later).

1) Nodes

These are our entities, characterised by attributes called properties. The database is schema-less, so entities of the same type may have different properties. Things to avoid:

  • Nodes with rich properties. E.g. a Country node with language spoken, currency, etc. (a smell indicating these properties should probably be nodes of their own)
  • Nodes should be fine-grained, so avoid nodes representing multiple concepts.
2) Relationships

Relationships are the primary means of accessing nodes. Access through a global index is not typical; instead, we traverse the graph using relationships.
  • Watch for properties on a relationship that actually represent an entity (frequent with relationships like "WorkFor", "GrantedTo", ..)
  • Relationships are used to reach neighbouring nodes: starting from one node, we traverse many others using the relationship type and properties.
3) Model

The data model is fun and easy to build with a simple whiteboard drawing: you model by example and not by class! This makes it quite suitable to do with business people and to use as a communication tool.
The model is typically normalised: any property appearing in multiple places should probably be a node.


4) Anti-Patterns

- Avoid hefty nodes: User nodes with many properties like names, email, address, job, ... much of the time, these properties should be nodes of their own.

- Avoid missing nodes: for "John emailed Tim", make it 3 nodes, with the email itself a node, rather than a relationship between the John and Tim nodes (see the sketch after this list).

- Avoid hot nodes: nodes having a very large number of incoming relationships.
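
Here is the sketch promised above for the "missing node" case (labels, relationship types, properties and connection settings are all invented), submitted through the Neo4j Python driver:

```python
# Minimal sketch: model the email itself as a node instead of a relationship.
from neo4j import GraphDatabase

CREATE_EMAIL = """
MERGE (john:Person {name: 'John'})
MERGE (tim:Person {name: 'Tim'})
CREATE (e:Email {subject: 'Status update', sent_on: '2016-02-24'})
CREATE (john)-[:SENT]->(e)
CREATE (e)-[:TO]->(tim)
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    session.run(CREATE_EMAIL)
driver.close()
```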


5) Graph Querying

Two approaches are possible for querying a Graph DB:
  1. Graph traversing
    • Start off with some nodes (the roots)
    • Spread out from these according to a given path
    • Yield some side-effects in the process (intermediate computations, ..)
  2. Graph pattern matching
    • Provide an abstract (sub)graph with any number of defined instances
    • The graph engine determines all subgraphs of the larger graph that fit the pattern

The first category (imperative) is used by traversal languages like Gremlin (which also supports pattern matching). The second (declarative, like SQL) is used by languages like Cypher and by the RDF query language SPARQL.
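
As a minimal, hypothetical example of the pattern-matching style (labels, relationship type and connection settings are invented): we describe the shape of the subgraph, including a variable-length path, and let the engine find every match.

```python
# Minimal sketch: pattern matching with a variable-length traversal.
from neo4j import GraphDatabase

REPORTS_QUERY = """
MATCH (boss:Person {name: 'John'})<-[:REPORTS_TO*1..3]-(report:Person)
RETURN DISTINCT report.name AS name
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    for record in session.run(REPORTS_QUERY):
        print(record["name"])
driver.close()
```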

6) Loading Data 

There are a few ways to load data into Neo4J: 1) transactional (live updates through its REST API), 2) batch (for high volumes such as an initial load), 3) Cypher (similar to using DML commands in SQL), or 4) through an API based on Java or another language.

Various tools can also be used for data loading, whether XLS-based, Neo4J-shell based, CLI-based (command-line programs like the batch importer), the Neo4J browser application, or ETL-based.
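
A minimal sketch of the Cypher-based option (CSV location, headers, label and credentials are made up), again submitted through the Python driver; PERIODIC COMMIT batches the commits for higher volumes:

```python
# Minimal sketch: bulk load with LOAD CSV, committing every 1000 rows.
from neo4j import GraphDatabase

LOAD_PEOPLE = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MERGE (p:Person {id: row.id})
SET p.name = row.name
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    session.run(LOAD_PEOPLE)
driver.close()
```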


Martin