Almost one year ago, Platform’s CTO, Phil Morris, blogged about some of the primary drivers for cloud adoption. Looking back at his post, it is interesting to observe the paths many of our customers have since taken to actually implement cloud and, more specifically, HPC cloud. HPC clouds, loosely defined, are systems in which your scheduler recognizes that current HPC workloads require more resources than are physically or virtually available and then negotiates with a public cloud provider (EC2, Azure, etc.) to rent additional computing horsepower.
Almost one month ago, Phil and I attended the annual HPC Day at Stanford University, where he described in detail the three common ways our customers are implementing Platform products to benefit from high-performance cloud computing. The first way is to expand and augment your existing cluster with instances from a virtual private cloud using Amazon’s high-powered Cluster Compute Instances (CCI). In case you’re not familiar with CCIs, a cluster of them provides low-latency, full-bisection 10 Gbps bandwidth between instances. Not only are they impressively robust systems (23 GB of usable memory, quad-core “Nehalem” processors at 2.93 GHz, and 1690 GB of storage), but they can also be preconfigured to use IP addresses that are specific to your corporate domain. In other words, they can appear to be running inside your own data center. Because Platform LSF can automatically discover and add new compute instances within a specified IP range, expanding your local cluster with instances from the cloud is as simple as starting new CCIs with the correct IP values.
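To make the IP-range idea concrete, here is a minimal sketch in Python of the kind of membership check involved. It is not how Platform LSF implements discovery; the address block, the function name hosts_to_add, and the sample addresses are all assumptions chosen for illustration.

```python
import ipaddress

# Hypothetical address block reserved for cloud-hosted compute nodes (an assumption).
CLUSTER_RANGE = ipaddress.ip_network("10.20.0.0/24")

def hosts_to_add(candidate_ips, known_hosts, cluster_range=CLUSTER_RANGE):
    """Return addresses that fall inside the range and are not yet cluster members."""
    new_hosts = []
    for ip in candidate_ips:
        addr = ipaddress.ip_address(ip)
        # Only hosts inside the configured range are eligible to join.
        if addr in cluster_range and ip not in known_hosts:
            new_hosts.append(ip)
    return new_hosts
```

Given candidates 10.20.0.5, 10.20.1.5, and 10.20.0.6, where 10.20.0.6 is already a member, only 10.20.0.5 would be picked up: the second address lies outside the range and the third is already known.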
The second way to HPC cloud is very similar to the first, the major distinction being that instead of adding new virtual instances to an existing physical cluster, you add a completely new virtualized cluster. In this case it’s important to note that, depending on your organization’s security and compliance requirements, you may be allowed to run only certain jobs in the cloud while others must run inside your data center. Because our scheduler add-on product, Platform LSF MultiCluster, is smart enough to dispatch jobs in compliance with your security and sharing policies, in this second version of HPC cloud the jobs that need to run locally are run locally, while those that can run in the cloud are run there.
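The routing rule described here can be pictured with a small Python sketch. This is only an illustration of policy-based dispatch, not the product's configuration or API; the Job class, the data_class labels, and the CLOUD_ALLOWED set are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    data_class: str  # hypothetical sensitivity label, e.g. "public" or "restricted"

# Hypothetical policy: only jobs with these labels may leave the data center.
CLOUD_ALLOWED = {"public"}

def choose_cluster(job, allowed=CLOUD_ALLOWED):
    """Send cloud-eligible jobs to the cloud cluster and keep everything else local."""
    return "cloud" if job.data_class in allowed else "local"

def dispatch(jobs, allowed=CLOUD_ALLOWED):
    """Group job names by the cluster that the policy assigns them to."""
    plan = {"local": [], "cloud": []}
    for job in jobs:
        plan[choose_cluster(job, allowed)].append(job.name)
    return plan
```

In practice a policy would likely weigh more than a single label (user, project, license availability), but the shape is the same: the policy decides eligibility first, and placement follows from it.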
As you can imagine, creating, managing, and enforcing security and sharing policies across physical and virtual clusters can be a daunting and complex task. To address the management challenges that can arise, we recommend using Platform ISF to help you administer your private cloud. Because ISF can determine, based on policy, when no internal resources are available and which jobs can be run in the cloud, it can request more resources from external cloud providers and build new clusters on the fly in the public cloud. And because it works in conjunction with Platform LSF, jobs are sent to the new cluster as soon as it becomes available. This is the third way to extend your HPC resources into the cloud.
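The burst decision can be sketched as a simple calculation in Python. This is a rough illustration under stated assumptions, not Platform ISF's actual logic: the function name, its parameters, and the default of 8 job slots per cloud instance are all hypothetical.

```python
import math

def instances_to_request(cloud_ok_pending_slots, free_local_slots, slots_per_instance=8):
    """Estimate how many cloud instances would cover the local shortfall.

    cloud_ok_pending_slots: slots needed by pending jobs that policy allows in the cloud
    free_local_slots: slots currently idle in the internal cluster
    slots_per_instance: assumed slots per rented instance (hypothetical default)
    """
    shortfall = cloud_ok_pending_slots - free_local_slots
    if shortfall <= 0:
        # Internal resources suffice, so nothing is requested from the provider.
        return 0
    # Round up so that the whole shortfall is covered.
    return math.ceil(shortfall / slots_per_instance)
```

With 20 pending cloud-eligible slots and 4 free local slots, the shortfall of 16 would translate into a request for two instances under the assumed slot count.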