Data Warehouse Automation - To Automate or Not?
Guest Blog by Data Warehouse Expert Paul te Braak of abaX Data
BIReady Australia
BIReady Pty Ltd
143 Aiken Road,
West Pennant Hills NSW 2125
Australia
Email info@biready.com.au
Telephone +61 (0)2 8880 5111
www.biready.com.au
ABN : 57 167 132 112
To Automate or Not? – The Pros and Cons of Data Warehouse Automation
by Paul te Braak - October 2014. Reposted here with kind permission from the author.
I have been thinking about data warehouse automation for at least 2 years now. That’s just a long enough time for me to see the
pros and cons, examine the state of affairs, consider the consequences of both camps and lastly consider my reaction to those
consequences. I’ve come to realise that the last point (my reaction) plays on my mind because the subject is filled more with
emotion than rational decisions. What I hope to achieve with this post is to materialise some of my thoughts on the matter and
present a few ideas that may be considered by others (if they were considering utilising this technology).
Before we get into the grit of discussion, it is probably worth noting that I consider the data warehouse in the more traditional
sense. That is, a relational engine that stores and delivers agreed data. Recently, the additional possibilities arising from the big
data echo system and the concepts of a data lake or swamp, the possibilities of schema on read (where data doesn’t necessary
have to conform to a model on load) will add a layer of complexity to this discussion which we just don’t need. We could also add
data virtualisation but that would also cloud the water in what could be simplified as decision to automate or not.
In my traditional view of this world, the data warehouse needs to deliver two things. Firstly, it needs to store an historical record
data (as it existed) and secondly, it needs to present information to the consumer in the manner that they require. Those two
points might seem like real no-brainers but I thought it’s better to call them out now because they create some flow on implications
which are;
•
the historical record of data may not be what the consumer currently wants
•
the ‘delivery’ requirements of the system will change
•
the system must be capable of handling changing requirements
Data Warehouse Development
In traditional development, the system is designed at a conceptual level, a data modeller designs and builds the relational framework (the schema) and then ETL developers
populate it. The last stage (ETL development) requires the largest effort on a project and has been documented to require more than 60% of total effort.
Now, let’s examine the first two steps. I suggest that the conceptual design is driven by understood practices (#1) – to the extent of a cookie cutter template. Regardless of
the methodology used, most employ a series of processes and silos that do not change from implementation to implementation. Data is extracted into a staging area,
transformed into a warehouse, deployed to a presentation server/layer and delivered via a user interface (reporting/analytic layer). We expect the modeller to use a tool to
manage and maintain the modelling process (eg ERwin, DeZign) and would think it unprofessional and prone to error if components were hand-coded manually.
In contrast, when it comes to ETL development, most processes are manual. Sure, there is a tool that is used (eg SSIS, Kettle, Talend) which acts as an IDE however, the
development process is still labour intensive. The developer links together components to perform some action or task and each task/package must be manually developed
and configured (for example, you would write 10 tasks to extract 10 tables into a staging area).
Is the process of the ETL development using an IDE similar to that of the modeller (and therefore argued as automated)? I think yes and no. ‘Yes’ because there is a layer of
abstraction between the development process and the output (today we would think it insane to manually write code in a text editor) and ‘No’ because the abstraction layer
is not high enough (meaning that still very task orientated).
Automation
Enter the realm of automation technologies. To summarise these tools, they;
1.
Generally use a model (or data) driven approach to define the conceptual layer required. For example, you point the tool source
system to reverse engineer the underlying model which can later adjust (for example at other entities, relationships etc).
2.
Define a schema based on the prior step (and generate the DDL to define an alternate schema).
3.
Map source systems to the target schema and generate the code and processes to do this.
4.
Manage the execution of the code/processes derived in step 3.
5.
Provide some presentation layer.
The Arguments
So why are Data Warehouse Automation products considered by purists and coders as second-rate citizens in the world of data
warehouse development? There are a few key arguments that I’ve come across;
•
“The tool does not produce the schema that we want or we can’t change it to the way we want”. This contention presents
itself in several ways:-
o
The modeller suggests that there is a correct way to build the schema and the tool should conform to that (or be capable of
alteration to the correct schema). I’ve italicised the word correct here because that can really be a matter of conjecture.
Consider storing related data in a different table which is directly related to the first (ie 1:1 relationship). You can convincingly argue both sides (there should be
1 table or 2). I think the key point of this argument is that the tool is not flexible for schema alteration. Sometimes this is true, sometimes it’s just nonsense and
based on hearsay or single product evaluation.
o
Another point is that the data is not retained correctly (for example, history is not recorded and is incapable of change). Most people that present this have not
completed an in depth analysis of the products available. Just like other process based software, each product is not created equally and it would be foolish to
judge the class of products on a limited scope (I am sure there’s a product that suites your needs). What is interesting about this point is how much bias is
targeted at automation software. The same scenarios can be readily found in traditional implementations where requirements have changed (or dare I suggest
an oversight in design) renders the existing schema redundant. Change and rework is inevitable in this scenario and it’s surprising how that circumstance is
eliminated from our evaluation process.
•
“The tool is a black-box and I can write better transformations”. This is perhaps the most common discussion point I see amongst developers. They argue that
the tool produces generic code which could be improved had it been created manually (by them). Perhaps this is true and there are some instances where code could
be written that outperforms and is more elegant than generated code. However the converse may also be true – that is, manual development may be more prone to
poor performance. How can this be? Well, for a start you assume that the developer can deliver the results they say (and they actually know what they’re talking
about). But this is not always the case in some developers obtain their positions through bluffs on their resume which are all too often found out too late.
o
When presented with this argument there is a natural association that the solution requires an elegant and well performing solution. Sometimes this is true and
sometimes it’s not. Performance improvements that look good on paper may have absolutely no benefit to the overall solution even if they deliver a tangible
result. For example, a generated process taking 25 minutes compared to a manually-created process of 20 minutes may show improvement however, is there
really an improvement if the entire process completes in the processing window?
•
“It’s too expensive”. When an automation tool is evaluated, its price is naturally considered as part of the evaluation equation. Let’s say the typical range of prices is
between $50K and $80K. That is evaluated against a consultant for six to ten weeks or a full-time employee the nine months. More than enough time to build it
manually right? – I’m not so sure. ETL projects are notorious for overruns and while we expect our projects to go smoothly, the chances are that there will be overruns
and these are not factored into the cost equation.
o
Also I’d like to consider in the situation where the developer(s) leave. In this situation, there is often more unplanned rework through either a lack of knowledge
or a different consultants approach.
o
Further considering the “it’s too expensive” line, I would also like to introduce the concept of the time value of data (whoops, information). I have never seen
this addressed in a cost-benefit analysis, but to my mind it must play a role. The time value of data relates to the increase in productivity, revenue or output by
having solution faster (let’s assume that an automated tool can do this). What does it mean in monetary terms (to the business) to have data now (and not in
six/nine months)? Can they utilise the data warehouse from competitive advantage to increase profitability?
The Plusses – Viva Automation!
What can Data Warehouse Automation deliver that has not been addressed above? On the ‘pro’ side of Data Warehouse Automation software we could suggest;
•
A consistent approach. Since the automation software develops everything for us, its output is delivered in a consistent manner. Of course this has good and bad
points, but the key is that there is a consistent approach. Developer peculiarities should not exist since the tools operate through meta-data (and generation).
Theoretically one developer is equivalent to another so there may be no need for sliding pay scales based on experience or special skills.
•
Since the operator manages the software, one might suggest that developers are not needed (and neither are their specialist skills). This may make resourcing a lot
simpler and remove the barriers between IT and the business (provided we accept those barriers exist). There is the potential for a business user with training the
software to build and manage the data warehouse rather than a specialist IT resource. This type of transformation has occurred with accounting software that does
not require an accountant to operate it (perhaps only a data entry clerk).
The Emotion
Every discussion I have had around the use of automation tools involves some emotion (in some cases I’d go so far to say that the conversation is sometimes dictated to by
emotion). To get straight to the point, the idea that specialist developers can be replaced by software is so confrontational that we (myself included) simply refuse to accept
it and so live in a state of denial. The denial says “I’ve been doing this for XX years, have specialist skills and no machine can take my place”.
This has happened in so many industries and is such a predictable response, I think all we need do is identify its (potential) presence.
Finally
There have been lots of points made and discussed in this post. If you think I’ve missed something, or want to add your own point or expand on something I’ve mentioned,
then just comment or drop me a line. As usual comments are open and welcome.
I also say that all comments will be published (provided they are not spam, contribute to the discussion and are vendor agnostic). If you would like to insert the comment in
a section so that the flow of the article remains intact, just let me know and I’ll add it.
The floor is now open!
****
#1 As a side note, it is time to revise these practices – perhaps more in a later post.
About the Author
Paul te Braak is an independent vonsultant and founder of abaX data. Paul has been delivering solutions to business for over ten years and focuses on data, business and
process modelling to improve the alignment of performance management and operational systems. Starting his career as a Chartered Accountant, Paul soon realised the
need to align business and information technology. He is a regular speaker on business intelligence, data modelling, data mining and design.
Website: abaX Data
Blog: http://www.paultebraak.wordpress.com/
Book: Microsoft Tabular Modeling Cookbook
Get More
Like This!
At BIReady Australia, we provide our customers with best-of-breed Information
Management solutions. As such, our solutions and services offer advanced expertise
in Data Warehouse Automation, Data Quality, Textual Analysis and Business
Intelligence. Our experience extends across Healthcare, Finance and Retail.
See our special offers or visit our online store.
Call us now!
02 8880 5111