Lambda Architecture for the DWH
2 May 2016, 11:23
Recently I saw this post from Davide Mauri (basically he announced his session "Reference Big Data Lambda Architecture in Azure" at SQL Nexus). Although I would've loved to attend SQL Nexus, I'm not able to do so this year 🙁 . The next best thing I could do was order and read Marz & Warren's "Big Data" (Manning) - and I did so immediately. Boy, what a read! Especially the first few chapters: well-written, concise, more data- than business-oriented (which is okay - after all, we're talking about architectures, not BI methodologies), and they explain the Lambda Architecture really well.
All this raised a question for me: is the - primarily Big Data oriented - Lambda Architecture suitable for Data Warehouses too? A quest with a rather surprising outcome, IMHO - read on for the details!
Overview of the Lambda Architecture
The Lambda Architecture is pretty well documented - online[ref]http://lambda-architecture.net/, Applying the Big Data Lambda Architecture (dr. Dobb's), The Lambda Architecture: principles for architecting realtime Big Data systems (James Kinley)[/ref] as well as in the book I just mentioned[ref]Manning | Big Data[/ref]. For a quick overview, Lambda Architecture is basically a system where the raw data is always stored, and never thrown away. All information that's derived from this raw data is always recomputed - often stated as query = function(all data). This provides for a fool-proof architecture that's rigorously simple (compared to classic RDBMS solutions), made up of three layers:
- Batch layer - storing the raw data (as raw as you can get it). This collection of raw data is called the master dataset. All data inside is considered "immutable" (updates are not allowed)
- Serving Layer - provides "views" on the batch layer (not to be confused with RDBMS views: serving layer views are pre-computed and stored)
- Speed layer - because the recomputation of the Batch Layer takes quite a few hours, the Speed Layer provides data that arrived after the current view in the Serving Layer was recomputed[ref]This part is relatively complex, but it is allowed to contain errors, because errors will never remain inside the system for more than a few hours[/ref]
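The query = function(all data) principle can be sketched in a few lines. Below is a minimal, hypothetical example (the facts and names are made up for illustration): raw facts are only ever appended to the master dataset, and the serving-layer view is a pure function recomputed over all of it.

```python
from collections import defaultdict

# Append-only master dataset: raw, immutable facts (never updated in place).
master_dataset = [
    {"customer": "C1", "amount": 100},
    {"customer": "C2", "amount": 250},
    {"customer": "C1", "amount": 50},
]

def batch_view(all_data):
    """Recompute the serving-layer view from scratch: query = function(all data)."""
    totals = defaultdict(int)
    for fact in all_data:
        totals[fact["customer"]] += fact["amount"]
    return dict(totals)

print(batch_view(master_dataset))  # {'C1': 150, 'C2': 250}
```

Note that the view is derived and disposable: throw it away, rerun the function, and you get the same answer back from the raw data.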
All this raw-data-with-computed-views is pretty much like Power Query: you don't alter the data itself (Power Query doesn't do UPDATE statements in the database or write anything to your source CSVs), but it transforms the data and writes the result to a different location which the user can access[ref]In fact, Power Query's language is a functional language, strongly resembling OCaml and F#. Oh, and guess what? These so-called "functional languages" are based on a paradigm called Lambda Calculus - I leave it to you to connect the dots...[/ref].
Data Vault: Lambda Architecture for EDW?
As I was reading up on the Lambda Architecture, I realized that many projects in the Netherlands almost follow the pattern already, although I'm pretty sure not everyone is aware of it. This is because part of what's described as the "Lambda Architecture" was already incorporated in the "Data Vault" methodology of building Data Warehouses[ref]Please note that Data Vault is more than only architecture or modeling. However, when employing the Data Vault methodology, you'll always include the Data Vault architecture and do Data Vault modeling.[/ref]. Here's why:
The Immutable Master Dataset
During a Data Vault training I attended with Hans Hultgren a few years ago, I remember a pretty good discussion about Data Warehousing, in which the following was stated:
The Data Warehouse should always be able to reproduce two pieces of information: first, all data that has ever entered the Data Warehouse; second, all data that has ever been delivered by the Data Warehouse.
Therefore, in a Data Vault modeled EDW, you've got two main "layers": the Raw Layer (containing all data as closely as possible to the form in which it arrived in the EDW) and the Business Layer (containing the integrated data). In the Raw Layer, nothing is ever thrown away - everything is just appended as it arrives. This Raw Layer approach is basically what the Lambda Architecture calls an immutable master dataset.
The Master Dataset as a Graph
When applying the Lambda Architecture, it is considered a good practice to model the "master dataset" using a graph schema, consisting of nodes, edges and attributes.
The principles of a graph schema for the master dataset are also the foundation of the Data Vault modeling, consisting of three types of objects:
- HUB: business key containers, designating unique business entities
- LINK / LNK: business relations, only between HUBs.
- SATELLITE / SAT: attribute sets, each attached to exactly one HUB (or sometimes to a LINK)
Data Vault thus uses nodes, edges and attributes to describe business concepts inside a DWH. Even when you store it in a relational database, in essence the Data Vault model is a graph schema[ref]Note that although a Hub is a Node, a Node isn't a Hub 🙂 [/ref].
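To make the mapping concrete, here's a minimal sketch of a Data Vault model expressed as a graph schema. All entity names and attributes are hypothetical, chosen only to illustrate the node/edge/attribute roles:

```python
# HUBs act as nodes: containers for business keys.
hubs = {
    "HUB_CUSTOMER": {"business_key": "customer_number"},
    "HUB_PRODUCT": {"business_key": "product_code"},
}

# LINKs act as edges: relations between HUBs.
links = [
    {"name": "LNK_PURCHASE", "hubs": ("HUB_CUSTOMER", "HUB_PRODUCT")},
]

# SATELLITEs carry the attributes, attached to a HUB (or a LINK),
# and are loaded append-only with a load date.
satellites = [
    {"name": "SAT_CUSTOMER_DETAILS", "parent": "HUB_CUSTOMER",
     "attributes": ["name", "email"], "load_date": "2016-05-01"},
]

# Every edge endpoint and every satellite parent must be a known node.
assert all(h in hubs for link in links for h in link["hubs"])
assert all(sat["parent"] in hubs for sat in satellites)
```

Whether these structures live in a relational database or a graph store is an implementation detail; the shape of the model is the same.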
Generating "Master views" for querying
The practice of generating a "master view" translates to an EDW context whenever Data Marts (the "master views") are generated (recomputed) from the EDW (the master dataset). Indeed, this is a widespread practice in EDW projects using Data Vault: most Data Vault projects I've worked on could regenerate their Data Marts from the Data Warehouse, and some of them even did so daily[ref]While rebuilding the entire Dimensional Model every night is not the most time-efficient way, it makes the effect of changes very obvious, and it's easy to recover from human errors.[/ref]. Other projects had an incremental load, but they always had a way to regenerate the (almost always Dimensional) Data Mart from scratch.
In the Lambda Architecture, every "batch view" in the serving layer should be recomputable from the master dataset. For performance reasons, incremental loads can be added, but this makes the process more fragile to human error, so you should still be able to recompute the entire batch view from scratch. In the Manning book mentioned above, this is captured in the discussion of "recomputation algorithms vs. incremental algorithms" (section 6.3, page 88).
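The recomputation-vs-incremental trade-off can be shown with a deliberately tiny example (a running sum, purely illustrative): the incremental path is cheaper, but the recomputation path is the safety net you must always keep.

```python
def recompute(all_data):
    # Recomputation algorithm: fold over the entire master dataset from scratch.
    return sum(all_data)

def incremental(previous_view, new_data):
    # Incremental algorithm: cheap, but a human error in previous_view
    # persists until the view is recomputed from scratch.
    return previous_view + sum(new_data)

history = [1, 2, 3]
view = recompute(history)            # full recompute: 6
new_batch = [4, 5]
view = incremental(view, new_batch)  # fast path: 15

# The invariant: the incremental result must always be reproducible
# from the raw data alone.
assert view == recompute(history + new_batch)
```

As long as that final assertion holds, the incremental load is an optimization, not a dependency.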
The Gap: Data Vault towards Lambda Architecture
Although there's a lot of similarity, Data Warehouses applying Data Vault still don't have a Lambda Architecture. Where's the gap?
The Speed Layer
First and most obvious is the speed layer. The speed layer allows for human real-time delivery and reporting, but so far I've never encountered such a construct inside an EDW[ref]I have no knowledge of Data Vault 2.0 yet - so if I'm wrong here, please feel free to comment / contribute![/ref].
Secondly, by default the majority of EDW (including Data Vault) solutions are built on an RDBMS, although this is not strictly necessary. To achieve the Lambda Architecture (and the horizontal scalability accompanying it), it would be beneficial to move the Warehouse layer (the EDW) to another platform (for the Serving Layer / Data Mart, a relational database is no problem and IMHO even a good choice).
Number of attributes inside a Satellite
Within Data Vault, attributes are grouped inside satellites (SATs) based on several conditions. Within the Lambda Architecture, attributes are tracked individually as much as possible. This is reminiscent of the approach anchor modeling[ref]http://www.anchormodeling.com/[/ref] takes. In principle, Data Vault doesn't prohibit creating a satellite per attribute, but especially in an RDBMS it's seen as pretty inefficient.
Satellites in Data Vault can be end-dated. This means you have a separate column inside your satellite tables recording until when that version of the truth was "valid". This serves querying purposes, but also tracks disappeared objects. As far as I know, deletion tracking is not included in the Lambda Architecture.
The habit of end-dating satellites in Data Vault might make the requirements for your Big Data system much more complex, because it requires updates on your existing records[ref]Updating is quite an extension of the basic functionality needed: Big Data systems are generally "append-only", which makes it much easier to achieve eventual consistency after network partitions[/ref]. You could work around this, though: deletes can be registered by introducing a separate 'IsActive' attribute combined with a start date.
Please note that updating records in order to record end-dates is NOT strictly needed inside Data Vault, but it is possible - and used very often. Here's my advice: reconsider this practice whenever you think about moving your EDW to another (non-RDBMS) platform.
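Here's a minimal sketch of that workaround (all rows and column names are hypothetical): instead of UPDATE-ing an end-date into an old satellite row, you append a new row carrying an 'IsActive' flag and a load date, and derive the current state at read time.

```python
# Append-only satellite history: a disappeared object is registered by
# appending an inactive row, never by updating an existing one.
satellite_rows = [
    {"hub_key": "C1", "email": "old@example.com", "is_active": True,  "load_date": "2016-01-01"},
    {"hub_key": "C1", "email": "new@example.com", "is_active": True,  "load_date": "2016-03-01"},
    {"hub_key": "C2", "email": "x@example.com",   "is_active": True,  "load_date": "2016-01-01"},
    # C2 disappeared from the source: appended as a 'delete' row.
    {"hub_key": "C2", "email": "x@example.com",   "is_active": False, "load_date": "2016-04-01"},
]

def current_state(rows):
    """Derive the latest active version per business key from the history."""
    latest = {}
    for row in sorted(rows, key=lambda r: r["load_date"]):
        latest[row["hub_key"]] = row  # last write (by load date) wins
    return {key: row for key, row in latest.items() if row["is_active"]}
```

Here `current_state(satellite_rows)` yields only C1, with its newest e-mail address: the end-date never has to be written, it's implied by the next row.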
Conclusion: Lambda Architecture for the DWH?
Part of the Lambda Architecture is suitable for DWH's
So, to answer the question whether the Lambda Architecture is suitable for Data Warehouses: up to a point, definitely. What surprised me is how many of its concepts Data Vault already includes. On second thought, this is somewhat less strange: Dan Linstedt initially developed Data Vault inside the NSA, where he had to handle amounts of data too large for traditional database systems (yup, we now call this "Big Data" 😀 ). Here's my conclusion - one I didn't expect in advance:
Many Data Warehouse initiatives are already on their way implementing a Big Data architecture like the Lambda Architecture, because they use Data Vault.
I'm not sure if all of the Lambda Architecture is currently applicable to Data Warehouse initiatives - the speed layer introduces much complexity, and is not always needed in current BI solutions.
There's room for technical improvement in current DWH's
If these initiatives are 'on their way', what stands in the way of achieving horizontal scalability? The Lambda Architecture and experiences from within the Big Data community give some ideas about the technical requirements of a Data Warehouse platform. These batch-oriented systems containing an immutable "master dataset" may not be at their best inside a classical RDBMS. What about a horizontally scalable datastore fit for a lambda-like approach? As it turns out, DWH initiatives employing the Data Vault methodology are already set to embark on this horizontally scalable mission - and even better, this addresses some of the often-faced issues when implementing a Data Vault:
- the system needs more resources than the Dimensional DWH used before
- too much (expensive, OLTP-grade) storage is used
- as soon as the database changes its mind about a query plan, you're stuck because the ETL doesn't finish any more (heck, query plan? We're talking about pipelined batches here!)
The Speed Layer: the start of real-time BI?
Finally, concerning the speed layer: I haven't encountered the speed layer in an EDW yet, but it sounds promising. It introduces much complexity, but offers in exchange a new distinction within your data solution: there's "slow" but accurate data (arrived before the execution of the last batch load), and "fast" (even human real-time) but possibly less accurate data (arriving via the "speed layer"). You're allowed to make approximations inside the "fast" data. It isn't too bad when data isn't accurate to the fifth decimal: after a few hours it will be discarded anyway, because the truth is then offered via the batch layer, and compared to all data the numbers are small. Could this be the start of human real-time BI?
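The slow/fast distinction can be sketched as a query-time merge (a hypothetical example; the views and numbers are made up): the batch view is accurate but stale, the speed-layer view covers only what arrived since the last batch run and may be approximate.

```python
# Batch view: accurate, recomputed up to the last batch run.
batch_view = {"C1": 150, "C2": 250}

# Speed-layer view: recent arrivals only, possibly approximate;
# its contents are discarded once the next batch run absorbs them.
realtime_view = {"C1": 20, "C3": 5}

def query(customer):
    """Answer a query by combining the slow-but-accurate and fast-but-approximate views."""
    return batch_view.get(customer, 0) + realtime_view.get(customer, 0)

print(query("C1"))  # 170: batch total plus what arrived since the last load
```

Any error in `realtime_view` lives at most until the next batch run, which is exactly why approximations are tolerable there.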
One More Thing: The Dimensional EDW?
One thing that I'd like to emphasize is that the qualities of a Dimensional Model aren't thrown away or set aside: an enterprise-focused dimensional model is still desirable. If you've currently got a Dimensional Data Warehouse, think of the EDW described above as a "persisted staging" area with automated procedures able to rebuild your entire EDW. With that in place, remove all the technical stuff you don't need anymore from your EDW. What remains is a dimensional model even more focused on, and comprehensible for, the business users, separated from the EDW where integration is handled and which no business user can ever access. Separation of concerns, if you like. No data inside the EDW is ever thrown away or altered - it's just re-calculated sometimes and stored again. Let's call this construct a Vault 😉