The emergence of “big data” has brought a number of new programming methodologies for collecting, processing and analyzing large volumes of often unstructured data. Although Hadoop MapReduce is one of the more promising approaches for processing and organizing results from unstructured data, the engine running underneath MapReduce applications is not yet enterprise-ready. At Platform Computing, we have identified five major challenges in the current Hadoop MapReduce implementation:
· Lack of performance and scalability
· Lack of flexible and reliable resource management
· Lack of application deployment support
· Lack of quality of service
· Lack of multiple data source support
I will take an in-depth look at each of these challenges in this blog series. To finish, I will share our vision of what an enterprise-class solution should be: one that not only addresses the five challenges customers currently face, but also goes beyond them to explore the capabilities of a next-generation Hadoop MapReduce runtime engine.
Challenge #1: Lack of performance and scalability
The open source Hadoop MapReduce programming model does not currently provide the performance and scalability needed for production environments, mainly because of its fundamental architectural design. On the performance side, to be most useful in a robust enterprise environment a MapReduce job should take less than a millisecond to start, but job startup time in the current open source MapReduce implementation is measured in seconds. This initial latency delays the final results and can cause significant financial loss to an organization. For instance, in the capital markets segment of the financial services sector, a millisecond of delay can cost a firm millions of dollars.
On the scalability front, customers are looking for a runtime solution that can not only scale a single MapReduce application as the problem size grows, but also support multiple applications of different kinds running across thousands of cores and servers at the same time. The current Hadoop MapReduce implementation does not provide such capabilities. As a result, a customer has to assign a dedicated cluster to each MapReduce job and run those applications one at a time. This lack of scalability not only adds complexity to an already complex data center and makes it harder to manage, but also creates a siloed IT environment in which resources are poorly utilized.
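To put rough numbers on both points, here is a back-of-the-envelope sketch in Python. It is not a measurement of any Hadoop release: the function names and every figure in it are hypothetical, chosen only to show how startup latency dominates short jobs and how dedicated clusters dilute utilization.

```python
from typing import Sequence


def startup_overhead(startup_s: float, work_s: float) -> float:
    """Fraction of a job's wall-clock time spent on startup."""
    total = startup_s + work_s
    if total <= 0:
        raise ValueError("total job time must be positive")
    return startup_s / total


def utilization(busy_cores: Sequence[float], provisioned_cores: Sequence[float]) -> float:
    """Average busy cores divided by provisioned cores, across all clusters."""
    provisioned = sum(provisioned_cores)
    if provisioned <= 0:
        raise ValueError("provisioned cores must be positive")
    return sum(busy_cores) / provisioned


if __name__ == "__main__":
    # Hypothetical job with 5 s of useful work.
    slow_start = startup_overhead(startup_s=5.0, work_s=5.0)    # startup in seconds
    fast_start = startup_overhead(startup_s=0.001, work_s=5.0)  # startup near 1 ms
    print(f"startup share, seconds-scale start: {slow_start:.2%}")
    print(f"startup share, millisecond start:   {fast_start:.2%}")

    # Hypothetical: three applications, each on its own 100-core cluster,
    # versus the same average load on one shared 100-core pool
    # (assumes the applications' peaks do not collide).
    siloed = utilization(busy_cores=[20, 30, 10], provisioned_cores=[100, 100, 100])
    shared = utilization(busy_cores=[60], provisioned_cores=[100])
    print(f"siloed utilization: {siloed:.0%}")
    print(f"shared utilization: {shared:.0%}")
```

With these made-up inputs, a startup measured in seconds consumes half of a short job's run time, while the siloed layout leaves most of the provisioned cores idle. Real figures depend entirely on the workload and cluster in question.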
A lack of guaranteed performance and scalability is just one of the roadblocks preventing enterprise customers from running MapReduce applications at production scale. In the next post, we will discuss the resource management shortcomings of the current Hadoop MapReduce offering and examine their impact on organizations tackling “Big Data” problems.