Probably everyone who has ever used an HPC environment has run into resource constraints. Limits on hardware, licenses, memory, or even GPUs on hybrid compute engines rank among the most common limitations users face today. The problem comes when you hit these limits in the middle of running a job. What do you do then? “Hitting the stops,” so to speak, often triggers another procurement cycle and consumes significant resources in analysis, internal meetings, and planning, as well as the purchase and deployment phases, before any additional real work can get done.
Cloud computing, or in this case cloud bursting, offers an approach to mitigate the process and the limitations that most HPC consumer corporations go through today. Certainly, using resources outside the firewall presents its own challenges for corporate users, but those aren’t the focus of this blog.
Assume for a moment that security, licensing, provisioning latency, and data access are not a problem. Of course, they’re all major issues that need to be addressed to make a cloud solution usable, but bear with me. There are still some important questions that need to be answered:
- What are the appropriate conditions to start up and provision cloud infrastructure?
- What jobs should be sent to the cloud once that infrastructure is provisioned and which should stay local and wait?
This second question is the focus here. Often in science, the hardest part of solving a tough problem is stating the question properly. In this case, the question is nicely captured by the inequality below, where each term represents a factor of elapsed time. For cloud computing to be advantageous from a performance perspective:
(Data upload to cloud) + (Cloud Pend time) + (Cloud Run time) + (Data download from cloud) < (Local Pend time) + (Local Run time)
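The inequality translates directly into a comparison of two time estimates. Here is a minimal sketch of that decision in Python; all of the names and the example durations are illustrative, not part of any real scheduler API:

```python
def should_burst_to_cloud(upload_s, cloud_pend_s, cloud_run_s, download_s,
                          local_pend_s, local_run_s):
    """Return True when the cloud path is expected to finish sooner.

    All arguments are elapsed-time estimates in seconds. The function
    simply evaluates the inequality above: total cloud time (including
    data transfer both ways) versus total local time.
    """
    cloud_total = upload_s + cloud_pend_s + cloud_run_s + download_s
    local_total = local_pend_s + local_run_s
    return cloud_total < local_total

# A short job with heavy data transfer stays local:
print(should_burst_to_cloud(3600, 60, 600, 3600, 300, 900))   # False
# A long-pending, compute-heavy job with little data bursts:
print(should_burst_to_cloud(60, 120, 1800, 60, 14400, 3600))  # True
```

Everything hinges on how well the scheduler can estimate each of those six terms, which is exactly where the challenges below come in.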
Such a statement allows us to draw a few conclusions about the conditions for when cloud bursting is advantageous for the HPC user:
- When local job pend time estimates for a job get very large
- When local elapsed run time is large. A corollary: if a job can be parallelized but there are insufficient resources locally to run it quickly, then bursting the job to the cloud may return results to the user sooner than running it on constrained local resources.
- When the job’s data transfer requirements into and out of the cloud are small
In addition to those conditions, we start to see where several of the real challenges lie for a scheduler making the right decision about which jobs get sent to the cloud and which don’t. For instance, most schedulers today do not consider the data volume associated with a job. But in a cloud scenario, the associated data transfer times could be 2-50x a job’s runtime, depending not only on file size but on the available transfer bandwidth. Schedulers will need to evolve on several levels to tackle this challenge:
- Allow users to indicate the files (both input and output) required for each job
- Estimate pend and run times across disparate infrastructures
- Estimate run times for jobs that run in parallel
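To make the first of those evolutions concrete, here is a small sketch of what user-declared file metadata might let a scheduler compute. The job dictionary shape and the sustained-bandwidth figure are assumptions for illustration only; a real scheduler would measure bandwidth and take file sizes from the submission:

```python
def transfer_time_s(file_bytes, bandwidth_bytes_per_s):
    """Estimated wall-clock seconds to move a set of files at a given
    sustained bandwidth. In practice, bandwidth would be measured and
    file sizes would come from user-declared job metadata."""
    return sum(file_bytes) / bandwidth_bytes_per_s

# Hypothetical job metadata: the user declares input and output files up front.
job = {
    "inputs_bytes":  [2 * 1024**3, 512 * 1024**2],  # 2 GiB + 512 MiB in
    "outputs_bytes": [8 * 1024**3],                 # 8 GiB out
}
bandwidth = 50 * 1024**2  # assume ~50 MiB/s sustained to the cloud

upload = transfer_time_s(job["inputs_bytes"], bandwidth)
download = transfer_time_s(job["outputs_bytes"], bandwidth)
print(f"upload ~{upload:.0f} s, download ~{download:.0f} s")
```

With these two numbers in hand, the scheduler can plug them straight into the inequality above and decide, per job, whether bursting is worth it.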