Data Warehouse Automation vs. Data Virtualization pt. 2

I see a new pattern emerging: every time I want to write a blog post about something I've been thinking about, I discover content that's way better than anything I could ever produce. Just as last week's Wikipedia article contained everything I wanted to say, today there's an article that makes a far better point about the position of Data Warehouse Automation with regard to Data Virtualization. Anyway, I wanted to have my say so badly that here's my $0.02:

Data Warehouse Automation and Data Virtualization are both getting a lot of traction lately. I'm not going to dive into all the details, but to set the stage, here is what I mean by those two terms:

Data Warehouse Automation (DWA) refers to tool(set)s delivered by external vendors which help you automate your Data Warehouse construction in all areas. To me, DWA deserves its name when it focuses on the entire Data Warehouse (as opposed to mere "ETL generation" toolkits, even "metadata-driven ETL generation" toolkits, which focus only on the ETL part, or "Raw Data Vault generators", which only help you generate a historized copy of your source system). I've written about this earlier in more detail.

Data Virtualization refers to tools that integrate data without building a Data Warehouse. Inside the tool, you set up business rules, connect source systems, and so on. Then, a consuming party connects to the Data Virtualization tool. In principle, a Data Virtualization tool doesn't store any data itself, but fetches and integrates data as soon as a request for that data comes in.
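To make that a bit more concrete, here's a minimal Python sketch of the idea. It is not how any particular Data Virtualization product works; the two "source systems" (`crm_source`, `billing_source`), the `VirtualView` class and the sample rows are all made up for illustration. The point is only that the view holds no data and integrates at request time.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical source systems: each call returns the source's current rows.
def crm_source() -> List[Row]:
    return [
        {"customer_id": 1, "name": "Acme"},
        {"customer_id": 2, "name": "Globex"},
    ]

def billing_source() -> List[Row]:
    return [
        {"customer_id": 1, "open_amount": 250.0},
        {"customer_id": 2, "open_amount": 0.0},
    ]

class VirtualView:
    """Integrates two sources at query time; stores no data of its own."""

    def __init__(self, customers: Callable[[], List[Row]],
                 invoices: Callable[[], List[Row]]) -> None:
        # Only references to the sources are kept, never their rows.
        self._customers = customers
        self._invoices = invoices

    def query(self) -> List[Row]:
        # Both sources are read only now, when a consumer asks for data.
        amounts = {r["customer_id"]: r["open_amount"] for r in self._invoices()}
        # "Business rule" (assumed): a customer without invoices owes 0.0.
        return [
            {**c, "open_amount": amounts.get(c["customer_id"], 0.0)}
            for c in self._customers()
        ]

view = VirtualView(crm_source, billing_source)
rows = view.query()
```

Every call to `view.query()` goes back to the sources, so the consumer always sees their current state, and nothing more than that.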

Disruption anyone?

From a certain viewpoint, one could state that Data Virtualization is focused on the way the world should work: when integrating data, one shouldn't have to store it everywhere. Why not let the system decide when to store? For some, adopting this view might mean a paradigm shift: suddenly, the Data Warehouse isn't the go-to integration point any more!

From this viewpoint, DWA is a tool "from the trenches" (which is something positive according to developers): after years of struggle and hard work to build our warehouses, we've developed some smart ways to automate our warehouse-building based on abstract models.

I even heard somebody say:

In a pessimistic way, you could say that DWA is all about building an assembly line for cars, while Data Virtualization is suggesting a new way for transportation.

Woo-hoo - disruption ahead! Or wait... assembly lines for cars proved to be pretty disruptive in the early car industry, whereas new ways of transportation still have to prove themselves.

Still, I think the observation that Data Virtualization focuses on the world as it should work (thereby rethinking data integration needs) while DWA focuses on improving the current state resonates with my experiences: especially newer DWA players seem to focus on lead developers, and solving the technical problem. Data Virtualization vendors on the other hand seem to focus more on architects inside enterprises, and selling "the new world".

Now vs soon?

For many organizations, Data Virtualization is currently a bridge too far: when you're struggling just to keep the DW up and running, it's hard to focus on an entirely new way of working! In some sense, it looks like the way most organizations adopt scrum: "let's just pick some practices that fit in our organization". Nothing wrong with that - there's no such thing as a silver bullet! In this sense, Data Warehouse Automation is more feasible: it means getting better tools for managing the current (DW) process, with a clearly visible ROI.

What about Data Warehouse Virtualization?

As I mentioned earlier, before I finished writing this article I encountered one that captured the essence better than everything I had written above. Read Roelant Vos's article "Beyond ETL Generation & Data Warehouse Virtualization". Now keep in mind that Data Warehouse Virtualization (DWV) is different from Data Virtualization: it involves actually moving all data into a Persistent Staging Area, after which the DWV tool takes care of the integration. This is clearly different from Data Virtualization (where data stays inside the sources until you query it), but also different from the offerings of most DWA tools: you tell the DWV tool what the integration should look like, and the DWV tool does the plumbing, taking care of how to historize, which layers to persist in order to keep performance up, and which underlying techniques to use.

The thing that particularly struck me was the way Roelant pictured this as seven "stages", moving from manual ETL development via ETL generation towards a system where we can re-generate not only the marts, but the entire Data Warehouse whenever we like, while maintaining all the history we want. As I said, most DWA tools don't offer this (some do). But it's definitely on the DWA side.
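As a rough illustration of that last stage, here's a small Python sketch. It is my own toy example, not Roelant's design or any tool's implementation: the `staging` list, its column names and the `build_customer_history` function are assumptions. The idea it shows is that as long as the Persistent Staging Area is append-only, the historized warehouse table is just a function of it and can be thrown away and regenerated.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]

# Hypothetical Persistent Staging Area: append-only, one row per loaded version.
staging: List[Row] = [
    {"customer_id": 1, "city": "Utrecht", "loaded_at": "2016-01-01"},
    {"customer_id": 1, "city": "Amsterdam", "loaded_at": "2016-03-01"},
    {"customer_id": 2, "city": "Rotterdam", "loaded_at": "2016-02-01"},
]

def build_customer_history(staged: List[Row]) -> List[Row]:
    """Derive a historized table (valid_from / valid_to) purely from staging."""
    # Group the staged versions per business key, oldest version first.
    versions_by_key: Dict[object, List[Row]] = {}
    for row in sorted(staged, key=lambda r: (r["customer_id"], r["loaded_at"])):
        versions_by_key.setdefault(row["customer_id"], []).append(row)

    history: List[Row] = []
    for key, versions in versions_by_key.items():
        # Pair each version with its successor; the latest version stays open.
        successors: List[Optional[Row]] = [*versions[1:], None]
        for current, nxt in zip(versions, successors):
            history.append({
                "customer_id": key,
                "city": current["city"],
                "valid_from": current["loaded_at"],
                "valid_to": nxt["loaded_at"] if nxt is not None else None,
            })
    return history

warehouse = build_customer_history(staging)

# "Re-generating the Data Warehouse": drop it and rebuild it from staging.
del warehouse
warehouse = build_customer_history(staging)
```

Changing the integration logic then means changing `build_customer_history` and rebuilding; the history itself lives in staging, so nothing is lost.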

A false dichotomy?

Within companies currently looking at Data Virtualization, no one seriously suggests turning off the Data Warehouse. Even though the data integration being offered overlaps, the dichotomy DWA "vs." Data Virtualization is a false one: Data Virtualization usually doesn't provide any history tracking - which is one of the most powerful (as well as complex) features of a Data Warehouse, and part of what makes Data Warehouses as trusted as we like them to be. Merely representing the current state of a source system is, in general, more error-prone.
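A tiny Python sketch of why that history matters. The data and the `city_as_of` helper are made up for illustration: with only the current state you can't answer "what did we know on date X?", while with historized rows you can.

```python
from typing import Dict, List, Optional

# Current state only, as a source system (or a virtual view on it) exposes it.
current_state: Dict[int, str] = {1: "Amsterdam"}

# Historized rows, as a Data Warehouse would keep them (hypothetical sample data).
history: List[dict] = [
    {"customer_id": 1, "city": "Utrecht",
     "valid_from": "2016-01-01", "valid_to": "2016-03-01"},
    {"customer_id": 1, "city": "Amsterdam",
     "valid_from": "2016-03-01", "valid_to": None},
]

def city_as_of(rows: List[dict], customer_id: int, date: str) -> Optional[str]:
    """Return the city that was valid for a customer on the given ISO date."""
    for row in rows:
        if row["customer_id"] != customer_id:
            continue
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        starts_before = row["valid_from"] <= date
        still_valid = row["valid_to"] is None or date < row["valid_to"]
        if starts_before and still_valid:
            return row["city"]
    return None  # no version was valid on that date

then = city_as_of(history, 1, "2016-02-15")
now = city_as_of(history, 1, "2016-06-01")
```

Here `then` comes from the history, whereas `current_state` can only ever tell you what `now` looks like.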

To close, I'll point to another article, which discusses not so much DWA vs. Data Virtualization, but Data Warehouses in general versus Data Virtualization: James Serra's "Data Virtualization vs. Data Warehouse".