"Big Data" technology: getting hotter, but still too hard
What should the industry do to help developers?
Published 09:13, 17 May 11
“Big Data” is coming up more often on the agendas of key vendors, as well as some of the more advanced users of information management technology. Although some of this increased activity reflects PR calendars - companies promote new offerings in the Spring - there’s more than that going on.
The design patterns that fall under this large umbrella are genuinely appearing in a wider range of usage scenarios, driving continuing innovation from both technology providers and users. In part because “Big Data” is so often implemented with open source technology such as Apache Hadoop, this is the type of innovation the industry most needs at this early stage of the market. A few key data points:
- I attended last week’s IBM “Big Data” Symposium at the Watson Research Labs, together with several other Forrester analysts and a number of IBM customers. Among the analysts attending was Brian Hopkins, who blogged about it last week. We saw a number of interesting examples of “Big Data,” including those cited by a users’ panel featuring Illumina, eZly, Dr. Carolyn McGregor of the University of Ontario Institute of Technology, and Acxiom. We also heard how IBM had applied Hadoop inside Watson, which recently won on Jeopardy!
- Other vendors of data analytics, warehousing, and integration technology have recently briefed Forrester analysts on their “Big Data” related capabilities, both current and planned. Many vendors embed the Apache Hadoop codebase into their solutions, and many of those also include proprietary forks of the Apache project to address requirements, such as real-time data integration and high availability, that the open-source project has yet to support.
- Noel Yuhanna and I are working on the next Forrester Wave™ of “Information as a Service,” or data services, technology. For this project, we are interviewing firms with data services apps in production. More than one of these firms is using, or plans to use, “Big Data” (via warehouse appliances or Hadoop) to manage the growing volumes of content coming from web interactions or from physical devices like oil wells. They develop insights from that content and deliver them in real time to consumers through integrated data services interfaces. They view their data services layers as a point not only of integration but also of security and governance, and most have implemented canonical models as a key part of their data services strategy. Note that I’ll be blogging soon about my day at the second annual Canonical Model Management forum (also last week).
What does it all mean?
That is the subject of much research from Forrester this year, not only from Brian and Noel, but also from Jim Kobielus, Gene Leganza, and others. Here’s my quick take based on what I know today:
- Experts place much of the focus of “Big Data” on “data-centric” use cases, as they should: advanced analytics performed by experts in data and statistics, extending the insights firms gain today beyond existing solutions like data warehouses, or pioneering newer use cases that conventional technology is less well suited to conquer.
- However, “Big Data” also matters to application developers - at least, to those who are building applications in domains where “Big Data” is relevant. These include smart grid, marketing automation, clinical care, fraud detection and avoidance, criminal justice systems, cyber-security, and intelligence.
- One “big question” about “Big Data”: What’s the right development model? Virtually everyone who comments on this issue points out that today’s models, such as those used with Hadoop, are too complex for most developers. Hadoop requires a special class of developer who understands how to break a problem down into the components a distributed architecture needs. For this model to take off, we need simpler models that are accessible to a wider range of developers - while retaining all the power of these specialized platforms.
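To see why this decomposition trips up mainstream developers, here is a minimal sketch of the map/reduce pattern Hadoop imposes, with plain Python standing in for the distributed runtime. Every problem must be recast as independent map() calls whose keyed outputs are grouped and then combined by reduce() calls - the names and structure here are illustrative, not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs -- here, one pair per word in a line of text.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Combine all values for one key into a final result.
    return (key, sum(values))

lines = ["Big Data is hot", "Big Data is hard"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # → 2
```

Even for a trivial word count, the developer has to think in phases that can run independently across a cluster - which is exactly the mental shift most application developers have not had to make.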
Making complex things more accessible to developers by evolving the development model is right in the sweet spot for our team that serves application development & delivery professionals.
We’ve already begun to address this issue, at least in a general way, by defining the emerging Elastic Application Platform (EAP), great new work from John Rymer and Mike Gualtieri that shows how “NoSQL” techniques for data management will evolve as part of a broader platform for apps built on private, public, or hybrid cloud architectures.
What is the industry doing to make “Big Data” easier for developers?
Some of the existing approaches for making “Big Data” platforms accessible to more developers work by bringing familiar APIs like SQL to bear. While this may be appropriate for some applications, SQL brings baggage, too - primarily that it can “lock” the data schema for the application, depending on how developers use it.
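The schema “lock” is easy to demonstrate with any relational store - here an in-memory SQLite table, chosen purely for illustration: a column the schema never declared cannot simply show up in new data.

```python
import sqlite3

# A fixed relational schema: only the columns declared up front exist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (author TEXT, product TEXT)")
conn.execute("INSERT INTO posts VALUES ('jane', 'phone')")

try:
    # New content carries an attribute the schema never anticipated.
    conn.execute("INSERT INTO posts (author, battery_life) VALUES ('raj', 'poor')")
except sqlite3.OperationalError as err:
    # The insert is rejected; the schema must be migrated (ALTER TABLE) first.
    print("rejected:", err)
```

Whether this rigidity is a problem depends on the application - for stable transactional data it is a feature, but for content whose shape keeps changing it becomes a bottleneck.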
But the most flexible applications that work with unstructured content need to be able to dynamically evolve the data schema, based on the data that’s showing up through the input content or streams.
For example, social web content that Marketing mines for customer insights may evolve new kinds of information about new kinds of products or services, dynamically, at any time. Marketing pros can’t predict what these topics will be ahead of time, nor what they will want to know about them - the structure evolves naturally from the content.
Applications that work with unstructured data can benefit from this kind of dynamic schema evolution, and developers can work using Agile processes in such an environment, but they need a development model that is similarly dynamic to support their efforts.
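One way to picture this kind of dynamic schema evolution - a hypothetical sketch, not any vendor’s product - is a store whose “schema” is simply the union of attributes observed so far, growing as new content arrives:

```python
class DynamicStore:
    """Hypothetical schema-less store: fields are discovered, not declared."""

    def __init__(self):
        self.records = []
        self.known_fields = set()

    def ingest(self, record):
        # No fixed schema to violate: new fields just extend the store.
        self.records.append(record)
        self.known_fields.update(record)

store = DynamicStore()
store.ingest({"author": "jane", "product": "phone", "sentiment": "positive"})
# A later post raises a topic nobody anticipated -- the schema evolves.
store.ingest({"author": "raj", "product": "tablet", "battery_life": "poor"})
print(sorted(store.known_fields))
# → ['author', 'battery_life', 'product', 'sentiment']
```

The point is not the trivial implementation but the contract: ingestion never fails because the content has a new shape, which is the property Agile teams working against evolving content need from their platform.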
From the data services / “Big Data” use cases we’ve seen so far, data services appear well suited to meeting this requirement. Developers can introspect (query) services at runtime to ask what information they have about which topics, and then access that information for dashboards or other flexible and interactive means of visualization, or to inform other processes with analytical insight.
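A hypothetical sketch of that introspection pattern (the interface names are invented for illustration, not drawn from any real data services product): consumers first ask the service what it knows, then query the topics they discover.

```python
class DataService:
    """Hypothetical data service over records mined from content."""

    def __init__(self, store):
        self.store = store

    def describe(self):
        # Introspection: which topics can this service answer about?
        topics = set()
        for record in self.store:
            topics.update(record)
        return sorted(topics)

    def query(self, topic):
        # Return every value observed for the requested topic.
        return [r[topic] for r in self.store if topic in r]

svc = DataService([
    {"product": "phone", "sentiment": "positive"},
    {"product": "tablet", "battery_life": "poor"},
])
print(svc.describe())        # topics discovered, not declared up front
print(svc.query("product"))  # → ['phone', 'tablet']
```

A dashboard built against such an interface can surface new topics the moment they appear in the content, without a developer redeploying anything.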
What do you think? Are data services potentially relevant to your use of “Big Data”?
Posted by Mike Gilpin