Performance and Productivity of an HPC Cluster (2)

In my last blog, I described an audit we recently performed for a small manufacturing company on their 32-node CFD (computational fluid dynamics) cluster. The objective of the audit was both to evaluate the efficiency of the cluster and to develop an upgrade plan. In this blog, I will describe the company's experience and the results they achieved after deploying Platform HPC.


As a refresher, through the auditing process we identified less-than-optimal cluster usage, with 40% of the cluster's value wasted and an 18% loss of user productivity.


The solution for making the refreshed cluster more efficient focused on shorter setup time, higher cluster utilization, and faster troubleshooting. Platform HPC Enterprise was selected for the company's upgrade. After the cluster hardware was installed and cabled, it took less than one day to install all the software components on the head node and to provision the operating system and management software on all compute nodes. It took another day to integrate ANSYS Fluent and ANSYS Mechanical with the MPI library (Platform MPI) and the Platform HPC web portal. After these steps, the cluster went into production almost immediately. Platform HPC Enterprise did the heavy lifting for the management software installation and application integrations: similar tasks that previously took two months were completed within three days.
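
The Platform HPC web portal generates the underlying scheduler submission automatically, but for readers who prefer the command line, here is a minimal sketch of an equivalent batch submission through bsub, the submission command of the LSF workload manager bundled with Platform HPC. The queue name, job name, journal file, and the Fluent flags for selecting Platform MPI and the machine file are illustrative assumptions, not this site's actual configuration; check them against your Fluent version.

    #!/bin/sh
    # Hypothetical command-line equivalent of a portal-submitted Fluent job.
    # Assumptions: a queue named "cfd" exists, and this Fluent build accepts
    # -mpi=pcmpi (Platform MPI) and -cnf (machine file) -- verify for your version.
    bsub -J wing_cfd \
         -q cfd \
         -n 32 \
         -R "span[ptile=8]" \
         -o wing_cfd.%J.out \
         fluent 3ddp -g -t32 \
                -mpi=pcmpi \
                -cnf="$LSB_DJOB_HOSTFILE" \
                -i wing.jou

Here -n 32 requests 32 slots, span[ptile=8] packs them 8 per node, and LSB_DJOB_HOSTFILE is the host list LSF provides to the job at run time.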


One of the improvements we made was in the application license configuration, where we removed group-specific license reservations, one of the main contributors to the low cluster utilization. The reservation scheme was supposed to guarantee application licenses for mission-critical user groups, but it left reserved licenses idle whenever those groups were not running jobs. Rather than relying on reservations, Platform HPC supports dynamic license preemption: high-priority jobs can suspend low-priority jobs and take over their application licenses if necessary. The low-priority jobs are not killed; they resume when the high-priority jobs complete and licenses are freed up. We saw a 15% increase in cluster usage with this configuration.
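
To make this concrete, below is a minimal sketch of how such preemption can be expressed in the lsb.queues configuration of the LSF scheduler underneath Platform HPC. The queue names and priority values are hypothetical, and a production setup would also configure license-aware scheduling for the specific ANSYS license features in use.

    # Hypothetical lsb.queues fragment: jobs in "urgent" may preempt
    # jobs in "normal" when slots or licenses run short; preempted jobs
    # are suspended, not killed, and resume when resources free up.
    Begin Queue
    QUEUE_NAME   = urgent
    PRIORITY     = 80
    PREEMPTION   = PREEMPTIVE[normal]
    DESCRIPTION  = mission-critical CFD runs
    End Queue

    Begin Queue
    QUEUE_NAME   = normal
    PRIORITY     = 30
    PREEMPTION   = PREEMPTABLE
    DESCRIPTION  = default queue for all other jobs
    End Queue
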
Another productivity boost came from the web-based interface in Platform HPC, which spares users from writing and debugging job scripts. Through the web interface they can manage their CFD jobs as well as the associated data, which dramatically reduced simulation job failure rates. Administrators also spent less time handling user problems, saving roughly 10% of their time, which they can now devote to other duties.


The new cluster also uses the Voltaire Fabric Collective Accelerator (FCA), which offloads and speeds up collective communication among all the tasks of an MPI job. With the integration between Platform MPI and Voltaire FCA, ANSYS Fluent performance increased by 10-20%.
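
FCA works transparently: it offloads MPI collective operations into the InfiniBand fabric, so the application code does not change. For context, the toy C/MPI program below shows the kind of call it accelerates, the per-iteration MPI_Allreduce of a residual norm that dominates communication in many CFD solvers. The loop and values are illustrative only, not taken from Fluent.

    /* Illustrative only: the global residual reduction that CFD solvers
     * typically perform every iteration.  With Platform MPI plus FCA,
     * the MPI_Allreduce below is offloaded to the fabric with no code
     * changes in the application. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int iter = 0; iter < 100; iter++) {
            double local_residual = 1.0 / (iter + 1 + rank);  /* dummy value */
            double global_residual;
            /* Every rank contributes its piece; every rank gets the sum back. */
            MPI_Allreduce(&local_residual, &global_residual, 1,
                          MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            if (rank == 0 && iter % 10 == 0)
                printf("iter %d residual %g\n", iter, global_residual);
        }

        MPI_Finalize();
        return 0;
    }

Built with mpicc and launched across nodes (for example, mpirun -np 32 ./residual), the MPI_Allreduce line is exactly where fabric-level collective offload pays off, because every rank must synchronize on every iteration.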


After the initial deployment and performance tuning, we saw a 25% improvement in job throughput, a 15% increase in license usage, and a 10% reduction in overall administration effort. This is a significant improvement over the previous cluster.


In the next blog, I will discuss the next steps planned for this site to further improve user productivity.
