Monday, March 14, 2016

Cloud vs data warehousing


Things move quickly when software migrates to the Cloud. For the software vendor, there is no longer a need to maintain multiple code versions to fit everyone's hardware preferences. Adoption can also happen very fast, since the entry cost is practically nothing for any new client interested in trying out the offsite platform.

I'm currently getting up to speed with Cloud-based DW solutions. They could be disruptive to a conservative industry like data warehousing. Here are a few factors I see as impacting the adoption rate.

Cloud computing trend

In recent years, the trend of cloud computing has grown in both scope and popularity to reach near ubiquity. It started off with simple needs like storage space, evolved into more advanced offerings like computing resources letting users run their own software (IaaS/PaaS), and has recently moved up the abstraction ladder with complete solution and service stacks being offered (SaaS).

In the realm of Business Intelligence and data warehousing, this trend did not pick up as rapidly (political and security worries), but things are quickly catching up with recent offerings like Redshift (Amazon AWS), SQL Data Warehouse (Microsoft Azure) and the Elastic Data Warehouse service (Snowflake). There are many reasons to believe this will grow a lot more, but the simplest and quickest answer is: cost! Doing Business Intelligence and data warehousing projects on an on-premise platform is an expensive adventure! (link)

Data Deluge and new Tools

The Business Intelligence domain has also increased in complexity recently, following the advent of what has been called the data deluge. The appearance of a multitude of new data sources from social media, connected devices and the Internet of Things has challenged the relevance and adaptability of traditional BI solutions. BI tools and platforms were designed mostly around data generated by operational systems, where data types (structured) and volumes were still manageable.

Besides having created its own set of new buzzwords like Big Data, NoSQL, Data Science and Predictive Analytics, the current trend has clearly been disruptive in terms of data types (semi-structured and unstructured) and data volume/velocity.

The industry and the open-source community were quick to devise new tools borrowing ideas from Google's scalability architecture (commodity hardware and the map-reduce paradigm), focused mostly on handling the sheer data volume and velocity (or burst rate). These tools have been loosely defined and grouped under the NoSQL category. But no matter how good these tools are at handling the 3 V's of big data, they all fall short on the most important aspect of any BI analytics solution: the data integration bit!
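As a reminder of what that map-reduce paradigm boils down to, here is a toy word-count sketch in plain Python; it is illustrative only and not tied to any particular NoSQL engine:

```python
from collections import defaultdict
from itertools import chain

documents = ["cloud data warehouse", "data deluge", "cloud cloud"]

# Map phase: emit (key, value) pairs independently for each input split
mapped = chain.from_iterable(
    ((word, 1) for word in doc.split()) for doc in documents
)

# Shuffle phase: group values by key (a real framework does this across nodes)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each group's values
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'cloud': 3, 'data': 2, 'warehouse': 1, 'deluge': 1}
```

Each phase needs only local information, which is what lets the work spread across cheap commodity machines.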

The introduction of these new tools significantly increases the complexity of the BI environment, and thus requires, more than ever, formal and proper architecture blueprints and principles. Otherwise the integrity of your entire BI platform is jeopardised, and your on-premise TCO will increase even more. One solution is to rely on a Cloud solution provider, which can supply solid architecture expertise and offer a new cost model with no initial investment.

Data Integration & architecture (more relevant than ever)

Everyone today is seeking to extract useful information from raw data in order to gain knowledge and make better, informed decisions (i.e. data-driven decisions). But no matter how valuable your data asset is, it will have many shortcomings if it is kept in its raw format.

Raw data is inconsistent, incomplete, plagued with errors, unconstrained, unformatted, outdated, etc. Modeling data implies applying rules/checks, standardization and harmonization across the different sources. Data integration implies doing quality assessment and data correction. Hence only data integration and modeling can lead us to the "desired" qualities of data, i.e. the five C's of Data (see Rick Sherman's BI guidebook), illustrated by a small sketch after the list:
  1. Clean (dirty, missing data treated) 
  2. Consistent (knowing which data version is right) 
  3. Conformed (enterprise-wide common and comparable data) 
  4. Current (data latency adapted to context) 
  5. Comprehensive (breaking departmental or sectional silos) 
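To make the first three C's a bit more concrete, here is a minimal Python/pandas sketch; the source systems, column names and mapping rules are all hypothetical:

```python
import pandas as pd

# Hypothetical raw customer extracts from two source systems
crm = pd.DataFrame({
    "cust_id": [1, 2, 2, 3],
    "country": ["CA", "Canada", "Canada", None],
    "revenue": [100.0, None, 250.0, 80.0],
})
erp = pd.DataFrame({
    "customer": [2, 3, 4],
    "country_code": ["CA", "CA", "US"],
    "revenue": [250.0, 80.0, 40.0],
})

# Conformed: map both sources onto one common, comparable schema
erp = erp.rename(columns={"customer": "cust_id", "country_code": "country"})

# Clean: standardize country codes and treat missing values
country_map = {"Canada": "CA", "United States": "US"}
for df in (crm, erp):
    df["country"] = df["country"].replace(country_map).fillna("UNKNOWN")

# Consistent: deduplicate so only one version of each customer survives
integrated = (
    pd.concat([crm, erp], ignore_index=True)
      .drop_duplicates(subset="cust_id", keep="first")
      .sort_values("cust_id")
)
print(integrated)
```

Real projects implement such rules in ETL tooling or SQL rather than ad hoc scripts; the point is that the rules must be made explicit somewhere.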

It is really hard work, time-consuming, and requires a lot of thinking… but there is no alternative. Don't believe in any magic solution or silver-bullet tool that promises to transform raw data into actionable information!

We are faced with a multi-source data integration problem: the new deluge-era sources should be integrated with and complement the existing, traditional BI data layers in some form or another (consolidated/aggregated), and not merely bury us under massive and unwieldy data, i.e. the deluge! A small sketch of that consolidation follows.
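As a rough illustration of what "consolidated/aggregated" can mean, this hypothetical sketch rolls raw, high-volume events up to the daily grain of an existing data mart instead of loading every raw row into it:

```python
import pandas as pd

# Hypothetical raw clickstream events (the "deluge" side)
events = pd.DataFrame({
    "ts": pd.to_datetime([
        "2016-03-14 10:00", "2016-03-14 11:30", "2016-03-15 09:15",
    ]),
    "product_id": [42, 42, 7],
    "clicks": [3, 1, 5],
})

# Aggregate to the daily grain used by the existing sales data mart,
# so the new source complements rather than overwhelms the BI layer
daily_clicks = (
    events.assign(date=events["ts"].dt.date)
          .groupby(["date", "product_id"], as_index=False)["clicks"]
          .sum()
)
print(daily_clicks)
```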


Transforming your raw data into knowledge (the "golden" path: raw data -> information -> knowledge) is conditional on preliminary work that focuses on analysing and modeling the data. This is especially relevant with the semi-structured data found in Big Data, where no explicit model or constraints exist. There are usually only implicit rules (schema-less), making raw data much harder to work with, as the sketch below tries to show.
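Here is a minimal sketch of what "making the implicit rules explicit" can look like for schema-less JSON events; the payloads are invented for illustration:

```python
import json
from collections import Counter

# Hypothetical semi-structured event log: fields vary from record to record
raw_events = [
    '{"user": "a1", "ts": "2016-03-14T10:00:00", "action": "click"}',
    '{"user": "b2", "ts": "2016-03-14T10:01:00"}',
    '{"user": "c3", "action": "view", "device": {"os": "ios"}}',
]

# Surface the implicit schema by counting which fields actually occur
field_counts = Counter()
for line in raw_events:
    field_counts.update(json.loads(line).keys())

for field, count in field_counts.most_common():
    print(f"{field}: present in {count}/{len(raw_events)} records")

# Only fields present everywhere are safe candidates for the modeled layer;
# the rest need explicit rules (defaults, rejection, late-arriving handling)
required = [f for f, c in field_counts.items() if c == len(raw_events)]
print("candidate required fields:", required)
```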

How can a Cloud-based solution help you with that? Along with your Cloud solution provider comes a highly skilled and specialised team dedicated to data integration and to the particular Cloud technology. You can then avoid having to find and train your own team for that purpose. Everyone should focus on their own core competency and avoid building/duplicating specialised skills.

Martin
