Performance and Productivity of an HPC Cluster (2)

In my last blog, I described an audit we recently did for a small manufacturing company on their 32-node CFD (computational fluid dynamics) cluster. The objective of the audit was to both evaluate the efficiency of the cluster and develop an upgrade plan. In this blog, I will describe their experience and the results they had after deploying Platform HPC.


As a refresher, through the auditing process we identified that the company had less-than-optimal cluster usage, with 40% of the cluster value wasted and an 18% loss of user productivity.


The solution for making the refreshed cluster more efficient focused on shorter set-up time, higher cluster utilization and faster troubleshooting. Platform HPC Enterprise was selected for the company’s upgrade. After the cluster hardware was installed and connected, it took less than one day to install all the software components on the head node and to provision the operating system and management software on all the compute nodes. It took another day to integrate ANSYS Fluent and ANSYS Mechanical with the MPI library (Platform MPI) and the Platform HPC web portal. After these steps, the cluster went into production almost immediately. Platform HPC Enterprise did the heavy lifting for the management software installation and application integrations. Similar tasks that previously took two months to complete were now finished within three days.


One of the improvements we made was to the application license configuration, where we removed group-specific license reservations. These reservations were one of the factors behind the low cluster utilization: the reservation scheme was supposed to guarantee application licenses for mission-critical user groups. Rather than relying on reservations, Platform HPC allows dynamic license preemption. This means high-priority jobs can suspend low-priority jobs, preempting their application licenses if necessary. The low-priority jobs do not die; they resume when the high-priority jobs are completed and licenses are freed up. We saw a 15% increase in cluster usage with this configuration.
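To make the preemption behaviour concrete, here is a small, purely illustrative Python sketch. It is not Platform HPC or LSF configuration; the job names, priorities and license counts are invented, and in the real cluster this logic is handled by the scheduler’s queue and license policies.

```python
# Toy model of dynamic license preemption: a high-priority job may suspend
# lower-priority jobs to borrow their licenses; suspended jobs resume later.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    priority: int           # higher number = higher priority
    licenses: int           # application licenses the job needs
    state: str = "pending"  # pending -> running -> suspended / finished

class LicensePool:
    def __init__(self, total):
        self.total = total
        self.running = []

    def in_use(self):
        return sum(j.licenses for j in self.running)

    def submit(self, job):
        # Start the job outright if enough licenses are free.
        if self.in_use() + job.licenses <= self.total:
            job.state = "running"
            self.running.append(job)
            return
        # Otherwise suspend lower-priority jobs instead of keeping a
        # permanent license reservation for any one group.
        for victim in sorted(self.running, key=lambda j: j.priority):
            if victim.priority >= job.priority:
                break
            victim.state = "suspended"   # paused, not killed
            self.running.remove(victim)
            if self.in_use() + job.licenses <= self.total:
                job.state = "running"
                self.running.append(job)
                return
        job.state = "pending"            # wait for licenses to free up

pool = LicensePool(total=4)
batch = Job("overnight-sweep", priority=1, licenses=4)
urgent = Job("design-review", priority=10, licenses=2)
pool.submit(batch)
pool.submit(urgent)                      # preempts the batch job's licenses
print(batch.state, urgent.state)         # -> suspended running
```

In the real system, the suspended job resumes automatically once the high-priority job completes and its licenses are released, which is why no work is lost under this policy.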
Another productivity boost came from the web-based interface in Platform HPC, because users no longer need to deal with scripts. Using the web interface, they can manage their CFD jobs as well as their data, which dramatically reduced job failure rates for the simulations. Administrators also got relief from dealing with user problems, resulting in a 10% time saving that they can now spend on other duties.


The new cluster also uses the Voltaire Fabric Collective Accelerator (FCA), which speeds up collective communication among all the tasks of an MPI job. With the integration between Platform MPI and Voltaire FCA, ANSYS Fluent performance increased by 10-20%.
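For readers unfamiliar with MPI collectives, the short mpi4py sketch below shows the kind of all-task operation that fabric-level collective offload targets. It assumes mpi4py, NumPy and an MPI library are available; enabling FCA itself is done in the MPI and fabric configuration, not in application code.

```python
# Minimal example of an MPI collective (allreduce), the communication
# pattern that collective acceleration such as FCA is designed to speed up.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank contributes a partial result, e.g. a local residual from a
# CFD solver iteration.
local = np.array([float(rank + 1)])
total = np.zeros(1)

# Allreduce involves every rank in the job, so its cost grows with job size;
# offloading it to the fabric reduces that cost.
comm.Allreduce(local, total, op=MPI.SUM)

if rank == 0:
    print("global sum across all ranks:", total[0])
```

Run with something like `mpirun -np 4 python allreduce_demo.py`; with four ranks it prints a global sum of 10.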


After the initial deployment and performance tuning, we saw a 25% job throughput improvement, a 15% license usage increase, and a 10% reduction in administration effort overall. This is a significant improvement over the previous cluster.


In the next blog, I will discuss the next steps planned for this site to further improve user productivity.

HPC from A-Z (part 2) - B

This is our second post in the HPC ABC series. Last week we mused on the different industries beginning with the letter A that could benefit from HPC. The idea of this series is to highlight the potential HPC has to solve problems and enhance development and design across a number of industries. Predictably enough, we’re focusing on the letter B in this post.

Biology - One of our customers, The Sanger Institute, is a genome research institute primarily funded by the Wellcome Trust. It has participated in some of the most important advances in genomic research, developing new understanding of genomes and their role in biology. That type of research requires a great deal of computational power so that scientists can perform large-scale analyses, such as quickly comparing similar genomic structures.

For more information on how Sanger benefits from an HPC environment, please have a look at our video.

HPC is helping biology researchers find out what we are made of. Next week we’ll look at how HPC is helping companies develop and design better consumer products.

HPC from A-Z (Part 1) – A

Working in this industry, sometimes it’s easy to get caught up in the bytes and pieces and lose sight of the bigger picture. The bigger picture is that high performance computing (HPC) is critical for maintaining a competitive advantage in multiple industries — literally from A-Z.

At Platform Computing we thought it would be a good idea to start brainstorming the many ways that HPC could be used to help a range of industries. Are there industries in desperate need of an HPC solution? For example, the UK could certainly benefit from advanced HPC to more accurately model its weather to help avoid “snow chaos”.

We’ve started to think about an HPC alphabet, and actually have most of it covered. We have to admit that we’re struggling a little with Q and Y, but let me start with A...

Automotive – One of our customers, Red Bull Racing, is using HPC for the computer-aided design and engineering processes behind its winning Formula One cars.

Animation – HPC can be used to accelerate the design process for animated films; it boosts processing power significantly, helping to create more realistic images and effects.

We’d be very interested in hearing your ideas... Architecture? Archaeology? Aeronautics?

Performance and Productivity of an HPC Cluster (1)

In HPC, hardware value depreciates very quickly, because newer hardware keeps delivering better performance while consuming less power. The value of even the most leading-edge hardware usually peaks in its first six months.


A few months ago, we did an audit for a small manufacturing company on their 32-node CFD (computational fluid dynamics) cluster. The objective of the audit was to both evaluate the efficiency of the cluster and develop an upgrade plan. Bearing in mind the value depreciation of HPC hardware, we measured cluster usage over time, starting from the day the cluster arrived.


It took two months to get the first application running on the cluster and three months to reach 85% usage. Usage was measured by whether or not a node had a job running on it. After this long initial warm-up period, the average usage over the course of the two-year production run was around 90%.
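As an aside, a node-level usage metric like the one described above can be computed directly from scheduler accounting records. The sketch below is illustrative only; the node names and job records are invented.

```python
# Count a node as "used" for any hour in which at least one job ran on it,
# then divide busy node-hours by total node-hours for the month.
from collections import defaultdict

NODES = 32
HOURS = 24 * 30  # one month

# (node, start_hour, end_hour) for each job that ran during the month.
jobs = [("node01", 0, 72), ("node02", 0, 72), ("node01", 100, 148)]

busy = defaultdict(set)
for node, start, end in jobs:
    busy[node].update(range(start, end))

busy_node_hours = sum(len(hours) for hours in busy.values())
utilization = busy_node_hours / (NODES * HOURS)
print(f"cluster utilization this month: {utilization:.1%}")
```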


Based on the value depreciation of the cluster hardware and its usage, our calculations showed that 20% of the cluster’s value was lost during the initial three-month setup and warm-up period. Another 20% of the cluster’s value was lost during the subsequent two-year production run. This translates to an 18% loss of user productivity over the past 27 months.
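Because the post does not spell out the depreciation model, the sketch below shows just one way such numbers could be estimated: weight each month of the hardware’s life by an assumed, front-loaded value curve and count that month’s unused capacity as lost value. The lifetime, decay rate and usage profile here are all invented, so the figure it prints will not reproduce the audit’s exact results.

```python
# Rough, illustrative arithmetic only: estimate wasted hardware value from a
# monthly utilization profile under an assumed front-loaded depreciation curve.
def value_lost(monthly_usage, lifetime_months=36, decay=0.93):
    """Estimate the fraction of total hardware value wasted.

    monthly_usage -- fraction of node capacity actually used in each month
    decay         -- how quickly the value delivered per month falls off
    """
    # Front-loaded weights: earlier months carry more of the hardware's value.
    weights = [decay ** m for m in range(lifetime_months)]
    total = sum(weights)
    lost = 0.0
    for month, usage in enumerate(monthly_usage):
        lost += (weights[month] / total) * (1.0 - usage)
    return lost

# Hypothetical profile: idle while the software stack was assembled, a ramp-up
# month, then roughly 90% utilization over the two-year production run.
profile = [0.0, 0.0, 0.4] + [0.9] * 24
print(f"estimated fraction of cluster value wasted: {value_lost(profile):.0%}")
```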


To develop an upgrade plan that avoids such a loss, we analyzed the reasons behind those numbers. Why did it take so long for them to set up the cluster and reach 85% cluster usage?


Like many other new HPC cluster adopters, our audit subject did not realize that a cluster requires many software components in addition to the operating system and the application itself. Researching and choosing the right software modules from a long list of open source packages took them a while. Once the components were chosen, they started integrating them so that they worked together well enough for the application to be scheduled. This was a process of trial and error, particularly because the company had no dedicated in-house HPC expert. The first application eventually took two months to get into production. Adding the second application was much easier, but still took another month.


Once the cluster was in production, the users continued to have various problems. First, they struggled to get the remote graphics display working. In addition, many users did not know how to use the command-line interface, so they constantly made mistakes that caused job failures. And when software components had problems, it took a long time to identify the root cause, and the customer waited a long time for resolutions from the open source community it was relying on for help. All of this contributed to less-than-optimal cluster usage over the subsequent 24 months, during which 40% of the cluster’s value and 18% of user productivity were wasted.


In the next blog, I will describe their experience and the results of deploying Platform HPC.

Big Data meets Open Source meets Service Providers

Earlier this week, I had the privilege of attending 451 Group’s Client Conference in San Francisco. After a couple of days of listening to sessions on “The Data Structure of the Cloud,” I came away with some thoughts on what’s going on in our industry right now.

The trends are clear, and the camps are clear. It’s new and old. It’s not new vs. old. The real question is where, when and how to get to the new.

OK, what’s ‘new’: a) IT as a service; b) data-driven businesses; and c) reduced software licensing costs. These collectively drive technology companies toward new, creative, community-powered business models.

Well, I guess that’s not really new. These trends have been at work for years. But what is new is that it’s finally happening and at an accelerating pace.

Cloud is real. Hadoop is real. VC funding for open source models is real.

The accelerating pace of these trends makes it exciting to be in the technology business! I leave the conference reflecting on the last two decades and coming to the time-proven conclusion that the new needs to eat, even if we starve the old.

Remember token-ring networks and life without cell phones and Facebook? I honestly can’t, and my children couldn’t fold a newspaper, mail a letter or sit down for an hour to play a board game if you asked them to. My 14-year-old son recently asked me what the flag on the mailbox was for, my daughter texted me on the ski lift chair (while I was sitting next to her), and my 9-year-old is single-handedly supporting Zynga. The old stuff still exists, but the new is crowding it out.

It’s clear to me that the old relational databases, system management tools, software selling/licensing models and business applications… will run for as long as Unix is around. But it’s equally clear that the web generation and the scientific/government communities are spawning the businesses and technologies that will consume the bulk of new investment budgets.

OK, that’s still nothing new. Yep, until you think about the intersections of the new. Rachel Chalmers’ panel with Opscode (creators of the open source framework Chef), Cloudera (the Hadoop people) and an unfortunately missing Amazon EC2 GM (I don’t need to say cloud here) was both fascinating and inspiring. Fascinating to think about its ramifications for society, like trust, ownership and responsibility when, after you’ve sequenced your genome, you realize you have also just sequenced your parents and siblings (one example given by Cloudera’s Mike Olson). Inspiring in its message to go forth with the new (even when we don’t know how the railroad or Internet or cloud will turn out) once we understand why the old is limiting.

Takeaway: Find the applications at the intersection of cloud, big data and open systems models and go for it. Mindset, not technology, is the only thing holding us back.