HPC from A-Z (part 19) - S

S is for Space exploration!

Are we alone in the cosmos? What is dark matter? What is the universe expanding into? The nature of the cosmos has fascinated us earthlings since we first looked up at the sky and began to wonder. We’ve come a long way since then. We’ve ditched the loincloths, created the telescope and even set foot on the moon – but many questions remain unanswered, and HPC can help to answer them.

In 1999, a project called SETI@home was set up to search for signs of intelligent life in the universe using volunteers’ idle computers. Back then, SETI@home borrowed cycles from these computers across the Internet, using their compute resources to analyse radio telescope data. While SETI@home didn’t quite manage to find E.T., it’s a great example of how HPC can benefit space exploration. Put a little more thought into it, though, and HPC could be used for a whole lot more: everything from crunching telescope data and analysing rocket and shuttle stability to scrutinising faraway galaxies which could host alien life forms.

Whatever the next great step for mankind is, you can bet that HPC will be somehow involved.

Is Platform Cluster Manager just commercial support of Kusu?

Many people think Platform Cluster Manager is just a commercial version of the open source software Kusu. When Kusu was born a few years ago, that actually was the case. Since then, Platform Cluster Manager has evolved significantly, and a comparison of its latest release with Kusu today reveals many differences.

Kusu is open source cluster provisioning and management software developed by Platform Computing. The commercially supported open source version was initially called Open Cluster Stack (OCS), and that name was later changed to Platform Cluster Manager. In version 2.0, which was released in early 2010, we started to package proprietary code into Platform Cluster Manager for a better interface. In subsequent releases, more proprietary code has been added to Platform Cluster Manager. In the latest version, 3.0, the original Kusu code is now just a small part of the Platform Cluster Manager product. The installer, the graphical web interface, and the monitoring system are all proprietary code that has been added to the product. Although the Kusu code has gained enhancements for functionality and reliability release by release, Platform Cluster Manager 3 uses Kusu only to power its provisioning engine. The rest of the product’s functionality is no longer open source.

Today, Platform Cluster Manager as a licensed product is sold by many Platform Computing channel partners. It contains the following functional modules:

  1. High-quality, flexible open source provisioning engine developed by Platform Computing
  2. Web interface framework shared by most Platform Computing products
  3. Web interface for cluster management
  4. Monitoring framework based on reliable and scalable agent technology used in Platform LSF
  5. Installer that supports unattended or factory install

Some key new features added to Platform Cluster Manager 3 include:

  1. An intuitive web interface
  2. Management node high availability
  3. Support of NIC bonding
  4. Monitoring and alerting
  5. More flexible network interface types
  6. Enhanced kit building process

Platform Cluster Manager is also included in Platform HPC as the cluster management tool. It is used for deploying and managing the additional software modules in Platform HPC. With one package and one web interface, customers can easily expand their management functionality from Platform Cluster Manager to Platform HPC by just adding Platform HPC licenses.

With the addition of these new features over the last two years, Platform Cluster Manager has truly evolved from a community supported open source package to a commercial grade product.

Platform MapReduce: Tackling Big Data, One Enterprise at a Time

“Big Data” seems to be on the tip of everyone’s tongue in recent months, and here at Platform this has been no exception. Applying MapReduce applications to the data deluge has so much potential, but in Derrick Harris’ sage words, “Hadoop may be hot, but it needs to be useful” (source: GigaOM). With this in mind, Platform has set itself to applying its 18 years of experience in policy-driven workload scheduling to the development of a MapReduce solution that is ready for the enterprise and able to tackle its unique challenges. Back in March, Platform strongly hinted at a forthcoming product, but now it’s official.

Platform announced the launch of Platform MapReduce, the industry’s first enterprise-class, distributed runtime engine for MapReduce applications, with general availability to come at the end of July. The new solution will be able to manage MapReduce applications in a cluster (even multiple applications on a shared cluster) across an entire distributed file system. With more than 10,000 policy levels and support for up to 300,000 concurrent tasks, Platform MapReduce provides unparalleled manageability and scale, while ensuring high resource utilization to maximize ROI. Applicable to industries across multiple sectors, these key features can enable such diverse functions as compliance and regulatory reporting for financial services and government agencies; customer churn prevention for telecommunications; and genome sequencing analysis for life sciences.

Platform MapReduce also supports an open distributed file system architecture, including immediate support for Hadoop Distributed File System (HDFS) and Appistry Cloud IQ – with more to come! To ensure that open source solutions, in this case those used with the Platform MapReduce distributed runtime engine, receive the world-class support enterprise customers demand, Platform has also signed the Apache Corporate Contributor License Agreement to contribute to the development of the Apache-based, open source Hadoop Distributed File System (HDFS).

With Platform MapReduce and world-class support, the enterprise is now ready to tackle the data deluge!

The Experience of Building a Scalable Supercomputer

This week, the TOP500 ranked the system at Taiwan’s National Center for High Performance Computing as #42 in its June 2011 edition of the twice-yearly supercomputer list. This system was provided by Acer together with its technology partners, AMD, DataDirect Networks, QLogic, and Platform Computing. Platform Computing provided the management software and MPI libraries for the system, as well as services for deploying these software components.

During the period of system installation and configuration, a number of areas demonstrated the advantages of partnering with Platform Computing:

(1) Management software: Platform HPC was chosen to manage the system. The scalability and maturity of the software components simplified the installation and the configuration of the management software layer. Both the workload scheduler (based on Platform LSF) and MPI library (Platform MPI) on the system scale effortlessly.

(2) MPI expertise: To achieve maximum Linpack performance results, it is critical to ensure MPI performance is optimized. During the installation and configuration stage, the Platform MPI development team provided numerous best practices to help maximize the benchmarking results, from checking cluster health to MPI performance tuning. They collaborated closely with developers from QLogic, who provided the InfiniBand interconnects.

(3) Dynamic zoning: The system will be used by multiple research user groups. There is a separate workload management instance for each user group. Based on the workload of each user group, the size of the workload management zone will change from time to time. Each zone has its own user account management system and scheduling policies. Platform HPC was set up to easily manage such dynamic configuration changes.

The maturity of Platform HPC, as well as the expertise from Platform Computing’s development and services teams, played a key role in ensuring the success of this Acer project. The maximized performance and stability of the benchmarking runs enabled the results to be submitted in time for the June TOP500 list. But most importantly, when the system is in the hands of hundreds of users in production, the robustness of the workload management, the performance of MPI, and the support from the experts who built the software will make a difference to the quality of service this top Taiwanese supercomputer delivers.

HPC from A-Z (part 18) - R

R is for Reservoir modelling

It might just look like thin brown treacle to you and me, but crude oil is a big money business.

Millions of years’ worth of pressure under the earth’s surface has turned the tiny plants and animals of prehistoric Earth into the modern world’s most valuable resource – powering vehicles, industries and economies across the globe. As such, the financial rewards for finding and trading in oil are substantial. However, when you’re using millions of pounds’ worth of equipment, including a 30ft drill, to bore holes into the planet’s crust, so are the risks. Choosing the wrong spot to drill can be an expensive mistake.

StatoilHydro ASA, a Norway-based oil and gas company, is one of the world’s largest crude oil traders. It relies on sophisticated 3D simulation programmes to search for natural oil-wells in the Earth’s crust -- if you want to strike it rich you need to be drilling in the right place. I don’t need to tell you that this process involves vast amounts of data, large numbers of complex calculations and requires thousands of iterations to produce accurate results. To put it simply, it’s a very, very big job.

To ensure StatoilHydro had the required resources to power such colossal calculations, it installed an HPC environment, which is now invaluable to its reservoir engineers worldwide. It allows its users to run significantly more simulations, which in turn means much greater accuracy when drilling.

And accuracy is important when only a small error in location can cost many millions of dollars. This isn’t ‘pin the tail on the donkey’ – it’s an exact science.

HPC from A-Z (part 17) - Q

Q is for Quantum Physics / Mechanics

Quantum physics, also known as quantum mechanics or quantum theory, is one of those subjects which most people know little about. The infinitesimal scale of what it deals with is often too hard to comprehend. Luckily, it doesn’t normally crop up in everyday conversation, and is a subject which most people can simply forget about.

However, quantum physics lies inextricably at the origins – and future – of the universe and human life, explaining the behaviour of matter and energy at a sub-atomic scale. It is the focus of some of the most advanced academic research underway today.

While HPC isn’t directly enabled by quantum mechanics (although you could argue that quantum effects can be significant at the microchip level), HPC can lend a welcome helping hand with the huge amounts of data crunching involved. For example, Platform has been working with the University of Lancaster since 2009, equipping it with state-of-the-art HPC technology. The university’s HPC resources enable it to maintain an enviable record in cutting-edge computer-based research. This includes fundamental physics and quantum transport in carbon nanotubes.

So with the role that HPC is playing at the bleeding edge of quantum research, mind-bending technology innovations are possible. Yesterday – the laser, semiconductors and electron microscopes; today – quantum computing and cryptography; tomorrow... teleportation?

Is Platform HPC 3 Just a GUI around Linux?

Platform HPC 3 has a new skin, a brand new web interface that is modern and forward-looking. While we are emphasizing the usability of Platform HPC with this new product release, some people think Platform HPC is just a GUI around Linux. They couldn’t be more mistaken.

GNOME and KDE are graphical desktop interfaces for Linux, which are included in most Linux distributions. They make Linux easier to use. However, making Linux easier to use does not turn a bunch of servers into a cluster. A cluster that acts as a single system for users and administrators requires more capabilities from management software.

First, it requires a cluster management system. This allows administrators to easily manage a large number of nodes within a cluster without performing tasks node by node. Management tasks include installing the OS and software, patching the OS, and upgrading applications on the nodes. Cluster monitoring is also an important component of cluster management. Without cluster monitoring capabilities, administrators would have to log in to individual nodes to get cluster health and performance information, which is impractical for a cluster with more than 10 nodes. Consolidated alerts release administrators from having to closely watch the cluster all day long. Cluster management is the foundation that allows all nodes within a cluster to act like a single system from an administration perspective.

Second, a cluster requires workload management. This helps make a cluster into a single system for multiple users. Without workload management, a cluster is just many discrete servers from a user’s perspective. Workload management instead runs each user’s application instance as a “job”. It schedules jobs according to user-specified requirements and available resources. Workload management software also automates fault tolerance and load balancing, turning multiple discrete commodity servers into a single reliable system.

Finally, a reporting system is essential for management so administrators can understand how cluster capabilities have been used and whether or not additional capabilities are required. It also shows how well users are being served using performance indicators like average job wait time, job run time, etc. This is to ensure a good return on investment for the valuable computer cluster.
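
As a rough illustration of the indicators mentioned above, the following Python sketch computes average job wait time and average job run time from a handful of job records. The record layout, the job names and the timestamps are all made up for the example; they are not the format of any Platform HPC report.

```python
from datetime import datetime
from statistics import mean

# Hypothetical job records: (job id, submit time, start time, end time).
JOBS = [
    ("job-1", "2011-06-01 09:00:00", "2011-06-01 09:02:00", "2011-06-01 09:32:00"),
    ("job-2", "2011-06-01 09:05:00", "2011-06-01 09:15:00", "2011-06-01 10:15:00"),
    ("job-3", "2011-06-01 09:10:00", "2011-06-01 09:10:00", "2011-06-01 09:40:00"),
]

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def seconds_between(earlier, later):
    """Return the number of seconds between two timestamp strings."""
    delta = datetime.strptime(later, TIME_FORMAT) - datetime.strptime(earlier, TIME_FORMAT)
    return delta.total_seconds()


def summarize(jobs):
    """Compute average wait (submit to start) and run (start to end) times."""
    waits = [seconds_between(submit, start) for _, submit, start, _ in jobs]
    runs = [seconds_between(start, end) for _, _, start, end in jobs]
    return {"avg_wait_s": mean(waits), "avg_run_s": mean(runs)}


REPORT = summarize(JOBS)
print(REPORT)
```

A rising average wait time with a flat average run time is the sort of signal that would suggest the cluster needs more capacity rather than faster nodes.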

There are open source software solutions for cluster management, workload management, and reporting on the market today. However, nothing replaces a solution that integrates them with an easy-to-use web interface while keeping the command line interface available. This dramatically reduces system set-up time and the learning curve for users. It also helps users focus on their work rather than dealing with complex cluster issues. For power users, the flexible command lines are still there for customization and extension. This is what Platform HPC 3 delivers. It is far more than just a nice GUI around Linux.

HPC from A-Z (part 16) - P

P is for Predictions

In the past year we’ve seen the devastation that weather can cause - from terrible floods to tornados to forest fires caused by intense heat. The need for better weather forecasts has never been greater. When meteorologists have the means to better predict the weather, people have more time to prepare.

High performance computing can help meteorologists crunch weather and climate data from satellites to provide an accurate, real-time view of developing weather patterns. HPC can also help research climate change, a complex and controversial topic. Given all the different data sets that climate modelling must take into account, high performance computing is vital for making sense of the data and predicting climate changes. While much of our future is uncertain, HPC can at least give us some visibility of what is likely to happen to our weather and climate.

In addition to the weather, HPC can also help to predict a whole host of other matters. For example it can help identify the best location to drill for oil, or be used to create predictive analytics for financial purposes or traffic flows. Somewhere along the line I expect HPC could help predict how long we might live, by crunching our genetic and lifestyle data to make an informed calculation.

A Missed Opportunity for Cluster Users - Aggregated Memory

Clusters for doing batch workloads continue to find new areas where they are useful. In the late 90s, clusters were originally referred to as "Beowulf" (a term I believe was coined by Don Becker) and were focused on running Message Passing Interface (MPI) or Parallel Virtual Machine (PVM) based parallel applications. These applications decompose a domain into multiple pieces, distribute those pieces to member compute hosts, iteratively run a simulation and then update boundary conditions via a network interconnect.
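
To make the decompose, iterate and exchange-boundaries pattern concrete, here is a small Python sketch. It is only an illustration under simplifying assumptions: it uses a 1D grid with a simple averaging (Jacobi-style) update, and it simulates the compute hosts as list entries inside one process instead of using real MPI or PVM calls. All function names are invented for the example.

```python
def jacobi_step(u):
    """One smoothing step on a 1D grid; the two end values stay fixed."""
    interior = [(u[i - 1] + u[i + 1]) / 2.0 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def split(u, parts):
    """Decompose the grid into contiguous pieces of near-equal size."""
    size, extra = divmod(len(u), parts)
    pieces, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        pieces.append(u[start:end])
        start = end
    return pieces


def decomposed_step(pieces):
    """Update every piece after borrowing one boundary value from each neighbour."""
    new_pieces = []
    for p, piece in enumerate(pieces):
        # The "boundary exchange": on a real cluster these values cross the network.
        left = [pieces[p - 1][-1]] if p > 0 else []
        right = [pieces[p + 1][0]] if p < len(pieces) - 1 else []
        updated = jacobi_step(left + piece + right)
        # Drop the borrowed cells so each piece keeps its original size.
        new_pieces.append(updated[len(left):len(updated) - len(right)])
    return new_pieces


# A cold rod with one hot end, solved serially and in three pieces.
grid = [0.0] * 11 + [100.0]
serial = grid
pieces = split(grid, 3)
for _ in range(50):
    serial = jacobi_step(serial)
    pieces = decomposed_step(pieces)

stitched = [value for piece in pieces for value in piece]
print(stitched == serial)
```

The decomposed run matches the serial one because every piece sees the same neighbouring values it would have seen in the full grid; the cost on a real cluster is the network traffic needed to exchange those boundary values at each iteration.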

Today clusters are used more widely than ever before, especially because more application simulation algorithms have been recast in a manner friendly to using MPI for communication; however, this is just a fraction of the total use cases now commonly deployed for clusters. Other uses include: financial services institutions using clusters to calculate risk in their portfolios; electronic design firms running simulations and regression tests on chip designs to push die sizes ever smaller; movie houses creating special effects and sometimes building and visualizing entire virtual worlds with clusters; scientists using Monte Carlo style simulations to explore the range of design scenarios for a risk situation; and pharmaceutical companies probing genomes to speed their drug discovery processes.

All of these use cases are serial applications that are run in batches, each often running thousands to millions of times. Clusters provide an aggregated level of throughput at a price point that has been proven extremely advantageous. We at Platform Computing like to think we helped this market develop by supplying the leading, fastest and most scalable and robust workload management suite available.

An emerging trend with these serial applications is that the amount of memory required to run the simulations is growing faster than commodity server memory capabilities. Additionally, clusters are being used for many purposes simultaneously, so jobs large and small must compete for resources. Large jobs tend to get "starved" since many small jobs can often fit on a single server and, while running, prevent a large job from starting. Users are then forced either to compromise on simulation detail to make memory footprints smaller or to buy non-commodity "big iron" that houses 512 GB of memory or more.

The frustrating part of this trend is that there is usually plenty of memory in the cluster to service an application requirement; however, that memory is often located on a separate server.

Virtualization is often viewed as a non-HPC technology. Though I am fighting this blanket assumption, that is not the point of this blog post. Virtualization, in this case, offers mobility to a workload in a way that was not possible before. By this I am referring to migration technology that allows a virtual machine to move between hardware resources while running. Before virtualization, the only way an application could be moved from one server to another was by checkpointing it, and even then it suffered from having to shut down and restart, as well as from several restrictions on the character of the hosts used for restarting the job.

In the case of memory limitations preventing jobs from running, it is now possible to continuously "pack" a workload onto the minimum number of servers to maximize the availability of large chunks of memory. If this is done automatically and to good effect then those large memory jobs will not be starved and instead launch immediately.
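
The effect of packing can be sketched in a few lines of Python. This is only an illustration: it compares two static placements of the same small jobs and does not model live migration, and first-fit decreasing is used here as a simple stand-in packing heuristic, not as a description of how any Platform product decides placement. The server count, memory sizes and job sizes are invented for the example.

```python
def spread(jobs_gb, servers):
    """Round-robin placement, roughly what a load-balancing policy produces."""
    used = [0] * servers
    for index, job in enumerate(jobs_gb):
        used[index % servers] += job
    return used


def pack(jobs_gb, servers, capacity_gb):
    """First-fit decreasing: put each job on the first server that has room."""
    used = [0] * servers
    for job in sorted(jobs_gb, reverse=True):
        for s in range(servers):
            if used[s] + job <= capacity_gb:
                used[s] += job
                break
        else:
            raise ValueError("job of %d GB does not fit on any server" % job)
    return used


def largest_free(used, capacity_gb):
    """Largest contiguous chunk of memory left on any single server."""
    return max(capacity_gb - u for u in used)


SERVERS = 4
CAPACITY_GB = 64
SMALL_JOBS_GB = [8] * 8   # eight small jobs of 8 GB each
BIG_JOB_GB = 60           # one large job waiting in the queue

spread_used = spread(SMALL_JOBS_GB, SERVERS)
packed_used = pack(SMALL_JOBS_GB, SERVERS, CAPACITY_GB)

print("spread:", spread_used, "largest free:", largest_free(spread_used, CAPACITY_GB))
print("packed:", packed_used, "largest free:", largest_free(packed_used, CAPACITY_GB))
print("big job fits when spread:", largest_free(spread_used, CAPACITY_GB) >= BIG_JOB_GB)
print("big job fits when packed:", largest_free(packed_used, CAPACITY_GB) >= BIG_JOB_GB)
```

In both placements the cluster has the same total free memory, but only the packed layout leaves a single server with enough free memory for the large job. Migration is what would let a running cluster move from the first layout to the second without stopping the small jobs.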

Platform's Adaptive Cluster product will be able to do just this by leveraging the power of virtualization. It may be possible that the increase in throughput will more than balance the reduction in performance associated with virtualization. Time will tell.