Reflections on Hadoop World 2011
Great technology, but war looms between open source purists and proprietary software houses
Published 15:23, 22 November 11
I attended Hadoop World 2011, and found that attendance was robust and the attendees enthusiastic. Most seemed to be Hadoop neophytes, either exploring Hadoop for the first time ("Hadoop curious", as they were called), or engaged in their first pilot project using the technology.
Most of the major vendors in the area of information management were present, as well as a number of open source Big Data software vendors, both supporting Hadoop and offering complementary software. I came away with three major observations:
- Support for Hadoop is growing rapidly, and mainstream IT organisations are making tentative investments in the technology.
- A major "religious war" is brewing between the open source purists and vendors who are seeking to blend elements of Hadoop with their own proprietary software.
- Programmers are streaming to Hadoop as an area of special expertise that they can use to do contract development, and there's plenty of work out there; the operative phrase is "resume shortage".
Hadoop World 2011 was held in New York in early November. The major event organiser was Cloudera, a principal distributor of Apache Hadoop. Hortonworks, a leading contributor to the Hadoop code base, was also involved. The top sponsors were Dell, HP, Informatica, and NetApp. Conspicuous by their absence was IBM, despite their Hadoop-based analytic package, BigInsights, and the fact that their Jeopardy champion system, Watson, was running Hadoop.
For the uninitiated, Hadoop is a key element in the Big Data universe consisting of a set of open source technologies, community developed and managed by Apache, that offer a complete MapReduce development and execution environment.
MapReduce is an application architecture for ingesting very large input data sets and then analysing or organising them. It breaks the input into smaller data sets and iteratively executes map functions (putting selected data into lists) and reduce functions (boiling down those lists and generating either a report or formatted output). It is used to find useful data in large unorganised (often text-based) data collections, and for deep analysis of very large data collections. Currently, most major implementations of Hadoop are in firms that base their business on Internet services, including retail, information services, and social networking. Mainstream IT is just beginning to explore this area in earnest.
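For readers who want to see the map and reduce steps concretely, here is a minimal single-process sketch in Python that counts words. It is only an illustration of the pattern described above, not Hadoop's actual Java API; every function name in it (split_input, map_words, shuffle, reduce_counts, word_count) is hypothetical, and a real Hadoop job would distribute the splits across a cluster rather than loop over them in memory.

```python
from collections import defaultdict
from itertools import chain


def split_input(lines, chunk_size):
    """Break the input into smaller data sets (splits)."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_words(split):
    """Map step: put selected data (each word) into a list of (word, 1) pairs."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(pairs):
    """Group mapped values by key so each reduce call sees one key's list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: boil one key's list down to a single output record."""
    return key, sum(values)


def word_count(lines, chunk_size=2):
    """Run the whole pipeline in one process (an assumption for illustration)."""
    splits = split_input(lines, chunk_size)
    mapped = chain.from_iterable(map_words(s) for s in splits)
    grouped = shuffle(mapped)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    sample = ["big data big", "data hadoop", "big"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

In this sketch the map output for each split is independent of every other split, which is the property that lets a framework such as Hadoop run the map work in parallel on many machines.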
The New Wave
By my personal estimation, a majority of the attendees seemed to be either learning about Hadoop for the first time, or engaged in pilot or proof-of-concept projects for their companies. It is a testimony to the commercial interest in Hadoop that so many of their employers were willing to fly them to New York and cover their expenses in order to attend this conference. In speaking with a few of them, I found a key challenge was to find initial use cases for Hadoop that would deliver measurable value for the enterprise. In other words, they were committed to the solution before they had found a suitable problem.
The folks at Cloudera, Hortonworks, and a lot of the developers there were clearly "true believers" in the quasi-Jeffersonian open source ideal: myriad individual programmers (contract and staff) developing custom solutions for their clients or employers, building on a free and open code base. All power to the little guy. Most of the larger vendors, including most of the larger sponsors, had another view: Hadoop as a common standard for structure, APIs, and some base code, with proprietary technology interpolated into it and layered over it to provide purchasable, supportable packaged solutions.
Hadoop developers acknowledge that they have some issues around data management performance and security, but insist that the solution must be "up to the community". Vendors, on the other hand, are offering faster data loads, more efficient alternatives to HBase for key-value database support, and so on. Informatica announced HParser, which is parsing software designed to be executed within a MapReduce environment.
Oracle, of course, pitched their Big Data Appliance, which substitutes their own key-value store (based on the open source Berkeley DB) for HBase. Even MapR, a distributor of Hadoop, offers optimising technology in its premium M5 Edition. It was interesting to see Aster Data, a subsidiary of Teradata, present. Aster Data offers its own MapReduce system as an alternative to Hadoop.
Not all vendors were looking to "adulterate" Hadoop. Some simply offered complementary software and hardware. NetApp announced an optimised storage solution for Hadoop. Dell offered a packaged cluster for Hadoop, and Cisco offered a fabric of servers and networking to serve as a Hadoop platform. Talend presented software that integrates external sources with a Hadoop application and offers data quality support as well.
The Resume Shortage
One major factor attracting developers is the apparently rapid growth in demand for Hadoop programmers. The air at Hadoop World was electric with the phrase "resume shortage", meaning that firms looking to start developing Hadoop-based solutions are desperate for qualified programmers. And no wonder. Building a Hadoop application today involves a deep knowledge of Java as well as principles of clustered systems and network management, skills well beyond those of the average IT staff. Developers at the conference clearly saw developing expertise in this area as a golden ticket to a limitless series of rewarding projects.
I spoke with a variety of vendors at the conference. Some, like VoltDB (purveyors of an open source memory-based transactional RDBMS), see their technology as a natural complement to Hadoop. Others see themselves as improving on Hadoop. I believe that the reality here is this: commercial use of Hadoop will break open when packaged tools and optimised systems are introduced that make Hadoop applications easy to specify, customise, and implement. They will include optimised software that is API-compatible with elements of Hadoop. They will include development tools for doing design and code generation. They will include templates that address various use cases, mostly oriented along vertical industries. And most of that stuff will be proprietary.
In time, Hadoop will be a neutral standard, with the core technology still provided by the community, but with much of the technology that makes it perform well, keeps it secure, ensures data quality, and addresses market needs being proprietary and a basis for vendor competition.
To my Hadoop developer friends, I say enjoy the demand while you have it, and prepare for the day when you will be choosing and optimising commercial Hadoop-based packages of hardware and software. The vast majority of enterprises will not be interested in spending money on a technology in which every business solution is a one-off. They want packaged, supported products. Software is not a religion. It's a business.
Posted by Carl Olofson