Blog Series – Five Challenges for Hadoop MapReduce in the Enterprise, Part 3

Challenge #3: Lack of Application Deployment Support

In my previous post, I explored the shortcomings in the resource management capabilities of the current open source Hadoop MapReduce runtime implementation. In this installment of the “Five Challenges for Hadoop MapReduce in the Enterprise” series, I’d like to take a different view of the existing open source implementation and examine the weaknesses in its application deployment capabilities. This is critically important because, at the end of the day, it is the applications that a runtime engine needs to drive. Without a sufficient support mechanism for those applications, a runtime engine will have only limited use.
To better illustrate the shortcomings in the current Hadoop implementation's application support, the diagram below shows how the current solution handles workloads.

As shown in the diagram, the current Hadoop implementation does not support multiple workloads. Each cluster is dedicated to a single MapReduce application, so a user with multiple applications has to run them serially on that same cluster or buy another cluster for each additional application. This single-purpose approach creates inefficiency, a siloed IT environment, and management complexity (IT ends up managing multiple clusters separately).

Our enterprise customers have told us they require a runtime platform designed to support mixed workloads running across all resources simultaneously, so that multiple lines of business can be served. Customers also need support for workloads that have different characteristics or are written in different programming languages. For instance, some applications are data intensive, such as MapReduce applications written in Java, while others are CPU intensive, such as Monte Carlo simulations, which are often written in C++. A runtime engine must be designed to support both simultaneously. In addition, the workload scheduling engine in this runtime has to handle many levels of fair-share scheduling priorities and also be capable of handling exceptions such as preemptive scheduling. It needs to be smart enough to detect resource utilization levels so that it can reclaim resources when they become available. Finally, a runtime platform needs to be application agnostic, so that developers do not have to change or recompile their code to adapt their applications to the runtime engine. The architecture of the current Hadoop implementation simply does not provide the enterprise-class features required in a true production environment.
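
To make the scheduling requirement a little more concrete, here is a minimal Python sketch of weighted fair sharing with preemption. It is an illustration under stated assumptions, not how Hadoop or any particular product schedules work: the `App` class, the `fair_shares` and `rebalance` functions, and the weights and slot counts are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    weight: int       # hypothetical fair-share weight (higher = larger share)
    demand: int       # slots the application currently wants
    running: int = 0  # slots the application currently holds


def fair_shares(apps, total_slots):
    """Split slots by weight, cap each app at its demand, redistribute the rest."""
    shares = {a.name: 0 for a in apps}
    active = [a for a in apps if a.demand > 0]
    remaining = total_slots
    while active and remaining > 0:
        total_weight = sum(a.weight for a in active)
        granted = 0
        still_active = []
        for a in active:
            quota = remaining * a.weight // total_weight
            give = min(quota, a.demand - shares[a.name])
            shares[a.name] += give
            granted += give
            if shares[a.name] < a.demand:
                still_active.append(a)
        if granted == 0:
            # Integer rounding left every quota at zero: hand out the last
            # few slots one at a time, highest weight first.
            for a in sorted(active, key=lambda x: -x.weight)[:remaining]:
                shares[a.name] += 1
            break
        remaining -= granted
        active = still_active
    return shares


def rebalance(apps, total_slots):
    """Per-app slot change: negative = slots to preempt, positive = slots to grant."""
    shares = fair_shares(apps, total_slots)
    return {a.name: shares[a.name] - a.running for a in apps}


if __name__ == "__main__":
    # A Java MapReduce job took the whole 100-slot cluster while it was idle;
    # a C++ Monte Carlo job then arrives and asks for 40 slots.
    apps = [
        App("mapreduce", weight=3, demand=100, running=100),
        App("montecarlo", weight=1, demand=40, running=0),
    ]
    print(rebalance(apps, 100))  # {'mapreduce': -25, 'montecarlo': 25}
```

In this toy model the scheduler preempts 25 slots from the MapReduce job so the Monte Carlo job receives its weighted share. If the Monte Carlo job only wanted 10 slots, the unused part of its share would flow back to the MapReduce job, which is the kind of utilization-aware reclaiming described above.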

