No, I’m not trying to be heretical. The point is that I find it more useful to get beyond the buzz to the real details and lessons behind cloud, utility, grid, or whatever we call the latest computing model.
A few high performance computing (HPC) scientific applications are already being discussed as candidates for public cloud infrastructure. The European Space Agency's Gaia project, which will attempt to map 1% of the galaxy, plans to host its back-end map/reduce processing on a cloud, and the US DOE is funding Argonne National Lab's Magellan project to run general scientific computing problems as a test case for HPC in the cloud.
Some cloud concepts are worthwhile in a private HPC setting, especially if that setting is multiple grids, or one big grid divided into many logical grids via queues or workload management. Elasticity (the ability to grow and shrink as needs dictate), self-service provisioning, tracking and billing for chargeback, and the flexibility/agility to handle multiple application requirements (such as OS and patches, reliability/availability, amount of CPU or memory, data locality awareness, etc.) can all improve data center utilization and responsiveness.
Most HPC grids or clusters already have some of these features, such as self-service and metering, but flexibility and elasticity have not been realizable goals until recently.
Today the primary driver for cloud computing is maximizing the utilization of resources. Typically that is a goal for a workload management (WLM) system as well, but too often the HPC landscape is carved into silos (either grid based or queue based), which can mean that overall utilization is only in the 30-50% range much of the time. Even when it reaches the 60-70% range, there is usually either backlogged demand that cannot get scheduled or systems running at low load while consuming expensive power and cooling, so there is definite room for improvement.
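To make that concrete, here is a toy back-of-the-envelope model. The node counts and busy figures are invented purely for illustration (they are not measurements from any real cluster); the point is only that sizing each silo for its own peak leaves idle capacity that a backlogged queue elsewhere cannot reach:

```python
# Toy model: why siloed grids sit at low overall utilization even while
# one queue is backlogged. All numbers below are made up for illustration.

# Per project silo: (nodes owned, average nodes actually busy)
silos = {
    "crash_sim":  (200, 60),   # bursty; sized for peak, idle between runs
    "cfd":        (150, 120),  # steadily busy, often backlogged
    "regression": (100, 30),   # nightly runs only
}

total_nodes = sum(owned for owned, _ in silos.values())
total_busy  = sum(busy for _, busy in silos.values())

for name, (owned, busy) in silos.items():
    print(f"{name:12s} utilization: {busy / owned:5.1%}")

print(f"{'overall':12s} utilization: {total_busy / total_nodes:5.1%}")
# In a shared, cloud-style pool, the cfd backlog could spill onto the
# idle crash_sim and regression nodes and raise the overall figure.
```

With these invented numbers the overall figure lands around 47%, squarely in the 30-50% band mentioned above, even though one of the three queues is saturated.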
Several factors commonly drive utilization below the ideal in current grid implementations. The budgeting process is typically done project by project, with each project buying new equipment for its own grid, or for its own slice of the existing grid with dedicated queues. Service level requirements, such as immediate access to critical resources without pre-emption, mean that certain machines are left idle much of the time to guarantee headroom for those jobs. And a fixed, rather than dynamically changeable, OS on each node means that applications requiring a different stack cannot reuse the same equipment.
There are also a number of potential roadblocks to getting many HPC applications into the cloud. Let's first investigate one of the core assumptions that comes up whenever the term cloud does: virtualization. Many people state that virtualization is a foundational and required stepping stone to cloud computing. I would argue, rather, that the concepts generally embodied in virtualization technology, such as agility, flexibility, and non-direct ownership of resources, are the real foundation, rather than any one tool or type of tool. These concepts underpin two of the biggest features of cloud computing, flexibility and elasticity, and it is certainly possible to achieve them using physical systems instead of virtual machines, just not as easily. But that brings up the question of why … why don't people just put their HPC applications into VMs and be done with it? The primary reasons center on performance. Most of these applications were created to scale out across hundreds or even thousands of systems in order to achieve one primary goal: get the highest possible performance at the lowest possible price. Quite often that also means very specialized networks (such as InfiniBand (IB) using RDMA instead of TCP/IP), specialized global file systems (such as Lustre), and specialized memory mapping and cache locality, all of which get somewhat or completely disrupted in a VM environment.
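One way to read that argument is as a placement policy. The sketch below is my own illustration (the job attributes and the placement function are hypothetical, not any vendor's API): jobs that depend on the things a VM tends to disrupt, such as RDMA over InfiniBand, a parallel file system like Lustre, or tight NUMA/cache locality, go to bare metal, and everything else is a candidate for virtualization.

```python
# Illustrative placement policy (a sketch, not any product's scheduler):
# route jobs with VM-hostile requirements to bare metal, the rest to VMs.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    needs_rdma: bool = False        # e.g. MPI over InfiniBand
    needs_parallel_fs: bool = False  # e.g. Lustre
    numa_sensitive: bool = False     # tight memory/cache locality

def placement(job: Job) -> str:
    if job.needs_rdma or job.needs_parallel_fs or job.numa_sensitive:
        return "bare-metal"
    return "vm"

jobs = [
    Job("mpi_weather_model", needs_rdma=True, needs_parallel_fs=True),
    Job("monte_carlo_sweep"),                 # embarrassingly parallel, VM is fine
    Job("in_memory_solver", numa_sensitive=True),
]

for job in jobs:
    print(f"{job.name:20s} -> {placement(job)}")
```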
There are several companies addressing the problem. One example is Platform Computing, which has recently announced a new capability called HPC Adaptive Clusters that applies these concepts equally to physical machines and to VMs. The physical instances would be multi-boot capable, allowing a smart workload scheduler to dynamically change the landscape as needed and by policy in order to handle various job types, whether they need a flavor of Linux, Windows, or whatever (thus our tagline, "Clusters, Grids, Clouds, Whatever"). Additionally, as technology advances, such as Intel's new Nehalem processors with tools and APIs for power capping and socket (and eventually individual core) control, these physical boxes can even be set up appropriately for the application load, saving power and cooling costs whenever possible.
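To illustrate the multi-boot idea in the abstract (and only to illustrate it: the function below is a hypothetical sketch of mine, not Platform's actual policy engine or API), a scheduler could compare pending demand per OS against the idle nodes currently booted into each OS and reboot the surplus:

```python
# Hedged sketch: rebalance idle multi-boot nodes toward the OS with the
# biggest unmet demand. Names and logic are hypothetical, for illustration.

from collections import Counter

def rebalance(pending_jobs, idle_nodes):
    """pending_jobs: list of required-OS strings, e.g. ["linux", "windows"].
    idle_nodes: dict of node name -> OS it is currently booted into.
    Returns a list of (node, new_os) reboot actions."""
    demand = Counter(pending_jobs)
    supply = Counter(idle_nodes.values())
    actions = []
    for node, current_os in idle_nodes.items():
        # Repurpose a node only if its current OS has spare idle capacity
        # and some other OS has jobs it cannot currently serve.
        if supply[current_os] > demand[current_os]:
            short = [os for os in demand if demand[os] > supply[os]]
            if short:
                target = short[0]
                actions.append((node, target))
                supply[current_os] -= 1
                supply[target] += 1
    return actions

pending = ["linux"] * 8
idle = {"node01": "windows", "node02": "windows", "node03": "linux"}
print(rebalance(pending, idle))
# -> [('node01', 'linux'), ('node02', 'linux')]: both idle Windows nodes
#    get rebooted into Linux to absorb the Linux backlog.
```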
Platform has been a leader in WLM for over a decade, and now they are adding the ability to efficiently combine resources, with dynamic control and distribution, as well as ever smarter workload management … thus, HPC Adaptive Cluster. Check it out at http://www.platform.com/Products/platform-isf/platform-isf-adaptive-cluster
Phil Morris
CTO, HPC BU