Data Warehouse Automation Thoughts

Data Vault / Data Warehouse Automation / Dimensional Modeling / Enterprise BI

Data Warehouse Automation Thoughts

I've been looking into DWA for a few months now, and might give it a try sooner or later. Here's some thoughts about it:

Not all DWA tools are created equal

Currently, I think there are two main approaches to Data Warehouse Automation

  1. Data Warehouse Generation: You provide sources, mappings, datatype mappings etc.. The tool generates code (or artifacts).
  2. Data Warehouse Automation (DWA): The tool not only generates code / artifacts, but also manages the existing Data Warehouse, by offering continuous insight in data flows, actual lineage, row numbers, etc..

The difference might seem small, but IMHO is visible most clearly whenever changes occur in the Data Warehouse - the second class of tools can handle those changes (while preserving history). With the first class of tools provide you with the new structures, but you need to handle the preservation of history yourself (as you would've without DWA).

Worth noticing here is Biml: historically a pure "generation" tool in which you could build your own generation frameworks (plain "free" BimlScript, early Mist). New developments in BimlStudio more and more support you when changing existing deployments. They're really crossing this border as of now.

Another difference I see is the freedom of modeling: does the tool require you to use a certain way of modeling (e.g. Kimball-style DWH only)? Or are you just provided with the tools to set up layers, mappings and flows without having to adhere to a certain modeling style? Both can be beneficial - the first to assure adherance to a certain modeling technique, the second gives more freedom to adapt to your existing DWH infrastructure when implementing the tooling later on.

Amount of time / money saved

Some DWA vendors calculate the amount of time saved by acting as if you wouldn't automate without their tooling. This is bullshit: years ago (even before BIML) we already automated our SSIS deployments using the SSIS Automation API, and later using EzAPI. Also, we generated SQL scripts and tables so we could extract new data from sources in an automated way is nothing new. Although the ease with which you can automate is greater using any of these tools, your primary focus shouldn't be on "I saved X hours because I didn't do everything by hand".

Having said that, DWA will in fact save you a lot of money (and time). I just don't think it is the primary selling point at the moment. Especially comparing the amount of time saved when comparing tool X with hand-crafting your entire Data Vault. I have yet to meet the first shack who considered hand-crafting a Data Vault. Maybe it's because they were out of business before they finished ;-).

Where you will save lots of time, is in not having to build the automation framework by yourself. Of course your engineers are able to build anything you like - but as long as off-the-shelves solutions exist, they'll hardly do it cheaper. So when making the "amount of money" saved argument, be sure to include the following aspects:

  • The DWA tooling will continue to improve on the experience they have from all other customers
  • It's way easier to transfer work to new developers - no in-company workarounds any more
  • It's also easier to transfer the maintenance of delivered DWH parts to administrators, if needed.
    • Documentation is provided out-of-the-box
    • The GUI provides a structured oversight of what's running
    • Default reports are often included for lineage, ongoing performance over time, etc.
  • Finally, depending on the solution the DWA tooling will monitor your process and sometimes choose other loading techniques. I know that TimeXtender (TX-DWA) does this, and other providers probably too. So the administrative workload will reduce!

Privacy and GDPR

This is where DWA tooling appliances can make a good business case: if they help you with good structures and audit trails to comply to all privacy laws. Not only in a negative way ("not sending your data to our servers" - see also Clint Huijbers' conversation with Peter Avenant on LinkedIn[ref]Clint references several articles on PrivacyShield:

  1. Benefits of Participation
  2. Privacy Shield FAQ (European Commission)
  3. Why You Can't Ignore Privacy Shield (Dark Reading)

[/ref]), but also in a positive way ("we can help you get insight[ref]For example, because the DWA framework helps you tag the privacy-sensitive fields, and ensures the right way of storage - encryption - and handling - explicit insight in who can access which data[/ref]").

I understand that adhering to the law isn't something any tool can "solve", but at places where lots of data are gathered tooling will come in handy to ensure compliance.

Update (21-07-2017 10:25): As Harm Geerlings points out in the discussion under this LinkedIn thread, the handling of sensitive data, GDPR and Privacy Shield is part of the broader data governance:

the value of automation is certainly not limited to privacy and/or GDPR. The right integrated data automation tooling supports governance in general better than any other solution.

This is certainly true -  privacy and the right handling of sensitive data is indeed part of data governance. Where possible DWA tooling should augment or support your data governance. I think it will make DWA offerings even more compelling if the ways your DWA tool supports this are highlighted more clearly.


Data Warehouse Automation tooling is gaining a lot of traction lately:

For me, the great argument for buying DWA tooling is threefold:

  1. The way of working is standardized (easier to transfer technical work)
  2. Less technical work and thus cost savings (more business intelligence)
  3. Batteries included: instead of having to allocate my team's resources to improve the technical stuff, lots of things are already provided out-of-the box

Currently, I've not yet decided when we're going to jump the wagon, and which wagon we're going to jump, but the Data Warehouse Automation Guide[ref]I'm not associated in any way to this site. I just like it :-).[/ref] is a great help in providing a good overview as well as comparing the features.

Comments (2)

  1. Hi Koos,
    Good post, nice read.
    The first point, however, about data driven and model driven tools, needs some addition. I think what you are saying is not necessarily true. Providing delta scripts to change the datawarehouse structure without data loss is not only for model driven tools. On DataWarehouseAutomation.Guide there are tools that are only data driven but still support delta deployments.
    Your argument about the time saving is spot on. Yes this is what happens, consultants build their own framework to generate code, I did so and I know others that did so too. So the time saving is in not having to build the automation framework yourself, as you say. The gain is in time (so money), quality, maintainability, standardization, to name a few.

    1. Hans, thanks for your addition! I think you're right - the ability to have delta scripts is not the same as being model-driven. Heck, "model-drivenness" of the tools might even make a greater difference than the ability to generate change scripts: the ability to generate change-scripts is not a very hard problem - you can build it into your in-company grown frameworks too.. However, getting it to a point where the tool provides for changing modeling choices is something I don't see happening quite as soon with in-company or homegrown tools.

Comments are closed.