The rise of Hadoop: Is ETL dead?
If you have mainly repeatable processes, why switch?
Published 09:22, 03 January 12
“Is ETL dead?” This was the question posed to me recently by an anxious funds manager whose investments included a significant position with an IT vendor that offers, among other things, data integration software, including ETL (that is, “extract, transform, load” software).
It seems she had been speaking with a recent Hadoop convert who told her that Hadoop would sweep away everything having to do with data movement and analysis including ETL and data warehousing. Freaking out, she called us to get another opinion. After I managed to talk her down, I paused to reflect. Is this how bad the misperceptions are getting? So, I decided to lay out the relationship between Hadoop and what you might call the “conventional” data integration world.
Let’s begin by clearly describing Hadoop’s proper role. Hadoop’s main value is its ability to take very large collections of data, which may be simple, regular data (e.g., machine-generated data such as sensor output) or complex, unstructured data, and process that data very quickly. Hadoop is an implementation of MapReduce, so processing occurs in two steps.
The Map step involves a function (usually written in Java) that sifts through the data and maps the relevant pieces to one or more lists. The Reduce step involves another function, particular to each list, that reads through the list, eliminates elements (thus “reducing” the list), and emits a result, which may be a single value, a report, or an indexed output dataset. In a pure Apache Hadoop implementation, the output dataset is kept in a key-value database called HBase for further processing.
Hadoop is able to process large amounts of data very quickly because it includes the ability to break up the input data collection into lots of small subsets, assign each subset to a map process (usually running on its own commodity server in a cluster), execute the processes in parallel, then roll up the result sets and break them out again for the reduce step, which follows the same pattern.
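The map-shuffle-reduce pattern described above can be sketched in miniature. This is not Hadoop code; it is a single-process Python illustration of the same idea, using a word count as the (hypothetical) task. In a real cluster, the map calls would run in parallel on separate commodity servers and the grouped lists would be redistributed for the reduce step.

```python
from collections import defaultdict

def map_phase(records):
    # Map step: sift through raw records and emit (key, value) pairs.
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: roll the mapped pairs up into one list per key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce step: collapse each per-key list into a single result value.
    return {key: sum(values) for key, values in groups.items()}

records = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(map_phase(records))
print(counts["the"])  # the two mapped pairs for "the" reduce to 2
```

The parallelism Hadoop adds is orthogonal to this logic: the framework partitions the input across map processes and re-partitions the intermediate lists across reduce processes, but each worker runs functions of exactly this shape.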
Although the result dataset has no schema, a facility called Hive allows the programmer to declaratively assert a structure for the purposes of query execution.
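To make the “projected” structure concrete, a Hive table definition simply overlays column names and types on files already sitting in HDFS; nothing is validated or enforced at load time. The table, columns, and path below are hypothetical:

```sql
-- Project a tabular structure onto existing HDFS files (schema-on-read).
-- Hive does not verify that the underlying data actually conforms.
CREATE EXTERNAL TABLE sensor_readings (
  sensor_id  STRING,
  reading    DOUBLE,
  read_time  TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sensors/';

SELECT sensor_id, AVG(reading)
FROM sensor_readings
GROUP BY sensor_id;
```

Dropping this table discards only the projection; the files, and any rows that never matched the declared structure, remain untouched.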
Appropriate use cases for Hadoop include searching for patterns and meaningful combinations of facts in large collections of unstructured data, detecting meaningful patterns in machine-generated data, performing statistical analysis on regularized extractions from these sources, and building a regularized set of structured data for, yes, insertion into a data warehouse. It is also useful for performing queries against a large collection of data that has been built as part of a temporary project, such as a marketing campaign.
Hadoop is not going to replace data warehouses, because data warehouses fulfil a different function: they collect well-defined data on a regular basis and support reports and queries based on that data. Such reports and queries generally revolve around routine business intelligence activities such as sales trend analysis.
Regularity is important because these reports are generally run over and over again. For this reason, the structure of the data needs to be stable and well defined. If one were to use Hadoop for this purpose, there would be no internal controls to ensure that all the data conformed to the structural rules, and the structure itself would be ephemeral, since it is not imposed or enforced, but merely “projected” (Apache’s word) onto the data.
Also, since a large data warehouse is schema-driven, it can be optimized for its structure to ensure maximum query performance at minimum cost, something that cannot be done by a system that is oblivious to the structure of the data it manages. It is important to note that the Hadoop result-set key-value database, HBase, is a framework superimposed on the Hadoop file system, HDFS, which supports only sequential files. So each query, even a key-based one, is essentially sequential in nature, resulting in a spectacular waste of I/O operations.
Some vendors, such as MapR, Oracle, IBM, and others, have developed Hadoop-based or Hadoop-compatible systems that provide substitutes for some or all the Hadoop components (note: MapR is also a Hadoop distributor), but even with these, Hadoop is less efficient in executing queries against data that has an essentially fixed and well-defined structure.
So we return to the original issue: can Hadoop replace ETL? Assuming one has all the necessary source and target interfaces, one could use it in place of ETL, but in most cases that would be a monumental mistake. This is because most ETL operations involve the repeated execution of well-defined, well-understood transformations of data from the format of the source to the format of the target.
Hadoop is simply overkill for this sort of task, and it is much more difficult to program than ETL, given the simplicity of the work. ETL is used to move data from operational databases to operational data stores and data warehouses. It is also used to coordinate databases and for certain application integration projects that involve the movement of data from one application database to another. By the way, a number of ETL products can also break up input datasets and process them in parallel to speed overall execution.
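The repeatable extract-transform-load pattern at issue can itself be sketched in a few lines. This is a deliberately minimal Python illustration, not any vendor’s product; the source layout, field names, and currency conversion are hypothetical stand-ins for a real mapping specification:

```python
def extract(source_rows):
    # Extract: parse rows from a fixed, well-known source format
    # (hypothetical "order_id,amount_in_cents,country" CSV lines).
    for row in source_rows:
        order_id, cents, country = row.split(",")
        yield {"order_id": order_id, "cents": int(cents), "country": country}

def transform(record):
    # Transform: the same well-defined conversion on every run --
    # cents to dollars, country code normalised to upper case.
    return {
        "order_id": record["order_id"],
        "amount_usd": record["cents"] / 100,
        "country": record["country"].upper(),
    }

def load(records, target):
    # Load: append to the target, here a list standing in for a warehouse table.
    target.extend(records)

warehouse = []
load((transform(r) for r in extract(["1001,2550,us", "1002,999,de"])), warehouse)
print(warehouse[0]["amount_usd"])  # 25.5
```

The point of the sketch is how little machinery the task needs: because the source and target structures never change between runs, an ETL tool can express this as declared mappings rather than hand-written MapReduce jobs.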
So, what is the proper relationship between Hadoop and ETL? In a word, it is a complementary one. For large, irregular datasets or for sets of data that require preliminary analytic processing, Hadoop is the right choice to build either a set of data to be queried, or to be loaded into a data warehouse.
For data that has a well-known and controlled structure, where the transformations are always the same and the target is similarly well known and stable, ETL is the better choice. Plus, ETL can also be used to extract subsets of data from HBase (or whatever key-value database is being used for the Hadoop result set) and load them into a data warehouse for regular query, reporting, and analysis. Vendors that provide software along these lines include IBM, Informatica, Oracle, Pervasive, and Talend. SAS Institute has also publicly stated plans along these lines.
Posted by Carl Olofson