A few months ago, we did an audit of a 32-node CFD (computational fluid dynamics) cluster for a small manufacturing company. The objective was both to evaluate the efficiency of the cluster and to develop an upgrade plan. Bearing in mind the value depreciation of HPC hardware, we measured the cluster's usage over time, starting from the day the cluster arrived.
It took two months to get the first application running on the cluster and three months to reach 85% usage, where usage was measured by whether or not a node had a job running on it. After this long initial warm-up period, the average usage over the subsequent two-year production run was around 90%.
Based on the value depreciation of the cluster hardware and its usage, our calculations showed that 20% of the cluster's value was lost during the initial three-month setup and warm-up period, and another 20% was lost during the subsequent two-year production run. This translates to 18% of user productivity lost over those 27 months.
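The post does not spell out the depreciation model behind these figures, but the general method is simple: the value wasted in any month is that month's depreciation multiplied by the fraction of the cluster sitting idle. A minimal sketch of that calculation, using assumed parameters (36-month straight-line depreciation and a simplified monthly usage profile; these are illustrative assumptions, not the audit's actual inputs, so the result will not match the audit's exact percentages):

```python
# Hypothetical sketch of a wasted-value calculation. The depreciation
# period and the month-by-month usage profile below are assumptions for
# illustration; the audit's actual model and inputs are not given here.

LIFETIME_MONTHS = 36                          # assumed straight-line depreciation period
monthly_depreciation = 1.0 / LIFETIME_MONTHS  # fraction of cluster value lost per month

# Assumed usage profile: ~0% while the software stack was being set up,
# ramping to 85% in month 3, then a steady 90% across 24 production months.
usage = [0.0, 0.0, 0.85] + [0.90] * 24

# Value wasted in a month = that month's depreciation * idle fraction.
wasted = sum(monthly_depreciation * (1.0 - u) for u in usage)
print(f"Fraction of cluster value wasted over {len(usage)} months: {wasted:.1%}")
```

Under these assumptions the idle months dominate: the two fully idle setup months alone waste as much value as many months of 90%-utilized production, which is why shortening the warm-up period matters so much.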
To develop an upgrade plan that would avoid such losses, we analyzed the reasons behind those numbers. Why did it take so long to set up the cluster and to reach 85% usage?
Like many other new HPC cluster adopters, our audit subject did not realize that a cluster requires many software components in addition to the operating system and the application itself. Researching and choosing the right modules from a long list of open source packages took them a while. Once the components were chosen, they began integrating them to ensure they worked together well enough for the application to be scheduled. This was a process of trial and error, particularly because the company had no dedicated in-house HPC expert. The first application eventually took two months to get into production; adding the second application was much easier but still took another month.
Once the cluster was in production, the users continued to run into problems. First, they struggled to get the remote graphics display working. Many users were also unfamiliar with the command-line interface, so they constantly made mistakes that caused job failures. Then, when software components had problems, identifying the root cause took a long time, and the customer waited a long time for resolutions from the open source community it was relying on for help. All of this contributed to less-than-optimal usage during the later 24 months of production; in total, the company wasted 40% of the cluster's value and 18% of user productivity.
In the next blog post, I will describe their experience and results with deploying Platform HPC.