Wednesday, August 20, 2014

GPUDirect RDMA in Action

Last year I wrote about the release of the GPUDirect RDMA technology. Simply put, this is the technology that enables direct communication between GPUs over an RDMA-capable network, which translates into much higher performance for applications using GPUs – high-performance computing, data analytics, gaming etc., basically any application that runs over more than a single GPU. In the past every data movement from the GPU had to go through the CPU memory; with GPUDirect RDMA that is no longer the case. The data goes directly from the GPU memory to the RDMA-capable network (for example InfiniBand) – data latency is reduced by more than 70%, data throughput is increased by 5-6X and the CPU bottleneck is eliminated.
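The simplest place this shows up for application developers is a CUDA-aware MPI: you can hand a GPU device pointer straight to the MPI calls and let the library (and GPUDirect RDMA underneath) handle the data path. Here is a minimal sketch, assuming an MPI library built with CUDA support (for example MVAPICH2-GDR or Open MPI with CUDA enabled) – the buffer name and sizes are just for illustration:

/* Minimal sketch: sending GPU memory directly with a CUDA-aware MPI.
 * Assumes an MPI library built with CUDA support (e.g. MVAPICH2-GDR or
 * Open MPI + CUDA); with GPUDirect RDMA the transfer can go straight
 * from GPU memory to the InfiniBand adapter, skipping a host staging copy.
 * Compile (illustrative): mpicc gpu_send.c -lcudart -o gpu_send
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;              /* 1M floats, arbitrary size */
    float *gpu_buf = NULL;
    cudaMalloc((void **)&gpu_buf, (size_t)count * sizeof(float));

    if (rank == 0) {
        /* The device pointer is passed directly to MPI; no cudaMemcpy
         * to a host buffer is needed with a CUDA-aware MPI. */
        MPI_Send(gpu_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(gpu_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1: received %d floats into GPU memory\n", count);
    }

    cudaFree(gpu_buf);
    MPI_Finalize();
    return 0;
}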


The University of Cambridge and the HPC Advisory Council have released performance results for GPUDirect RDMA with one of their applications on a system of nearly 100 servers. The application is HOOMD-blue – a general-purpose molecular dynamics simulation created by the University of Michigan that can run on GPUs. Each server includes two GPUs and two InfiniBand RDMA adapters, so each GPU connects directly to the network instead of going through the CPU and the QPI interface (each GPU and network adapter pair sits on the same PCI-Express root complex). Bottom line, the GPUDirect RDMA technology enabled Cambridge to double HOOMD-blue performance on the given system. Same system, same hardware, switching on GPUDirect RDMA, twice the performance…  Impressive.



Tuesday, August 19, 2014

New MPI (Message Passing Interface) Solution for High-Performance Applications

A good source for high-performance computing information is the HPC Advisory Council (www.hpcadvisorycouncil.com). Of course one can find many news sites – insideHPC, HPCWire and others – but the HPC Advisory Council is a good option if you want to get deeper into the details, to learn about new technologies and solutions being developed, etc. One of the recent publications was a case study on the STAR-CCM+ application (CFD) – http://www.hpcadvisorycouncil.com/pdf/STAR-CCM_Analysis_Intel_E5_2680_V2.pdf.

The publication included, for the first time, some performance information on a new MPI solution called HPC-X. In the world of high-performance computing, MPI is one of the most widely used parallel communication libraries. There are several open-source implementations, such as MPICH, MVAPICH and Open MPI, and commercial options – Platform MPI (formerly known as HP MPI, now owned by IBM) and Intel MPI. HPC-X is a new solution from Mellanox which seems to be based on Open MPI plus various accelerations. In the past I covered new releases of both MVAPICH and Open MPI, as these are the two solutions we have used most so far.
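Since MPI is a standard interface, the same benchmark source compiles unchanged against any of these implementations – only the mpicc/mpirun wrappers change – which is what makes comparing them on the same cluster straightforward. Here is a minimal ping-pong latency sketch of my own (a simplified example, not the OSU benchmark itself):

/* Minimal MPI ping-pong latency sketch. The same source compiles
 * unchanged against Open MPI, MVAPICH2, Intel MPI, Platform MPI or
 * HPC-X -- only the compiler and launcher wrappers change. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    char byte = 0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}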


The combination of an open-source base and commercial support makes HPC-X a very interesting solution for any HPC system. Of course, it must perform, as this is the most important item…  According to the new publication, HPC-X does provide a performance advantage over the other commercial options, up to around 20% on a 32-node cluster (dual-socket servers). It is definitely a good start, and I am looking forward to seeing further reports on HPC-X. Meanwhile we do plan to download it and try it ourselves.

Sunday, August 17, 2014

Cluster Topologies - Dragonfly

In the past I reviewed two cluster topologies – Fat Tree (CLOS) and Torus. The dragonfly is a hierarchical topology with the following properties: several groups are connected together using all-to-all links (i.e. each group has at least one link directly to each other group); the topology inside each group can be any topology; and it requires non-minimal global adaptive routing and advanced congestion look-ahead for efficient operation. Simply put, a dragonfly topology is a two-level (at least) topology where, at the top level, groups of switches are connected in a full graph. The internal structure of the groups may vary and can be constructed as a full graph, fat tree, torus, mesh, dragonfly and so on.
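To get a feel for the scale of that all-to-all group level, here is a toy sizing sketch (my own simplified model, not a vendor formula – it only counts the minimum number of global links implied by the full graph):

/* Toy dragonfly sizing sketch: with g groups connected all-to-all,
 * each pair of groups needs at least one global link, so the minimum
 * number of global links is g*(g-1)/2 and every group terminates at
 * least g-1 of them. Group counts below are arbitrary examples. */
#include <stdio.h>

int main(void)
{
    for (int groups = 4; groups <= 64; groups *= 2) {
        long global_links = (long)groups * (groups - 1) / 2;
        printf("%2d groups -> %4ld global links, "
               ">= %2d global ports per group\n",
               groups, global_links, groups - 1);
    }
    return 0;
}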



While many describe dragonfly as a topology with one hop between groups, that is not correct; in many cases the network traffic will go over several hops before reaching the destination. The key to making a dragonfly topology effective is to allow some pairs of end-nodes to communicate over a non-minimal route – it is the only way to distribute random group-to-group traffic. This represents a significant difference from other topologies. To support such routing, a dragonfly system needs to utilize adaptive routing and to send traffic on the longer paths only if congestion is impacting the minimal (in hops) paths.
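A toy illustration of that decision logic, under my own arbitrary assumptions about how congestion is measured, could look like this:

/* Toy illustration (not a real routing engine) of the adaptive decision
 * described above: prefer the minimal path, and divert to a longer
 * non-minimal path only when the minimal path looks congested. The
 * queue-depth inputs and the length-weighted comparison are assumptions. */
#include <stdio.h>

typedef struct {
    int hops;        /* path length in hops              */
    int queue_depth; /* congestion estimate on that path */
} path_t;

/* Returns 1 if the non-minimal path should be taken, 0 otherwise. */
static int take_non_minimal(path_t minimal, path_t non_minimal)
{
    /* Weigh congestion by path length: divert only when the minimal
     * path is clearly the worse choice despite being shorter. */
    return minimal.hops * minimal.queue_depth >
           non_minimal.hops * non_minimal.queue_depth;
}

int main(void)
{
    path_t minimal     = { .hops = 3, .queue_depth = 40 };
    path_t non_minimal = { .hops = 5, .queue_depth = 10 };

    printf("use %s path\n",
           take_non_minimal(minimal, non_minimal) ? "non-minimal"
                                                  : "minimal");
    return 0;
}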


While the dragonfly's goal was to enable higher bisectional bandwidth compared to a torus topology (or similar) while reducing the overall costs (mainly cable lengths), in practice dragonfly does not provide such an advantage – neither in performance nor in cost – over a fat tree topology, for example. On the other hand, it does add complexity with the need to enable adaptive routing. So far we prefer to use the fat tree option – either with a full bisectional bandwidth configuration or with some oversubscription – depending on the targeted applications.

Friday, August 15, 2014

Here Comes The 100 Gigabit Per Second…

For some of the compute-demanding applications, data throughput is a critical element in getting the needed performance. Throughput is of course important for storage performance and scalability, for checkpointing and other cases, which are relevant to most applications. Today the fastest solution we can find is InfiniBand FDR. It enables bandwidth of 56 gigabit per second. Taking the encoding overhead off, we are left with around 54 gigabit per second for actual data movement. Other options are QDR or 40 gigabit Ethernet. Ethernet is not what we use for our HPC systems – too much performance overhead.
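A quick back-of-the-envelope check of those numbers (my own arithmetic, based on the standard line encodings – QDR uses 8b/10b while FDR moved to 64b/66b):

/* Usable data rate = signaling rate * encoding efficiency. */
#include <stdio.h>

int main(void)
{
    /* link name, signaling rate (Gb/s), encoding payload/total bits */
    struct { const char *name; double gbps; double eff; } links[] = {
        { "QDR InfiniBand (8b/10b)",  40.0,  8.0 / 10.0 },
        { "FDR InfiniBand (64b/66b)", 56.0, 64.0 / 66.0 },
    };

    for (int i = 0; i < 2; i++)
        printf("%-26s -> ~%.1f Gb/s of actual data\n",
               links[i].name, links[i].gbps * links[i].eff);
    return 0;
}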

One can claim that there are 100 gigabit ports on some Ethernet switches, but these are for network aggregation, not for connections to the server. These ports actually use 10 lanes of 10 gigabit each – not the desired 4-lane approach.

We did see some announcements of real 100 gigabit HPC networks. The first 100 gigabit InfiniBand switch was announced back in June – not just higher throughput but also lower latency – so a win on both sides. While there is no indication yet of when the 100 gigabit InfiniBand adapter will be out, the switch announcement hints that we are getting close to the 100 gigabit era.


The higher the bandwidth, (typically) the higher the message rate. With InfiniBand FDR we already saw a much higher message rate versus all the QDR options on the market – whether from Mellanox or from Intel. The increase in message rate was greater than the bandwidth difference, so it is also due to the new architecture of the latest InfiniBand FDR adapters. We base all of our systems nowadays on FDR. Waiting for EDR…

Tuesday, July 2, 2013

GPU Direct RDMA is Finally Out!


GPUDirect RDMA is the newest technology for GPU-to-GPU communications over the InfiniBand interconnect. GPUDirect RDMA enables a direct data transfer from the GPU memory over the InfiniBand network via PCI Express peer-to-peer (P2P). This capability was introduced with the NVIDIA Kepler-class GPUs, CUDA 5.0 and the Mellanox InfiniBand solutions.

The importance of this capability is in bypassing the CPU for GPU communications (who needs the CPU…), and therefore a dramatic increase in performance. Finally, after a long wait, the two companies mentioned above have demonstrated the new capability at the recent ISC'13 conference. Prof. Dhabaleswar K. (DK) Panda, Hari Subramoni and Sreeram Potluri from the Ohio State University presented at the HPC Advisory Council their first results with GPUDirect RDMA – a 70% reduction in latency! You can see the entire presentation at http://www.hpcadvisorycouncil.com/events/2013/European-Workshop/presentations/9_OSU.pdf. It seems that GE Intelligent Platforms is already using the new technology – http://www.militaryaerospace.com/whitepapers/2013/03/gpudirect_-rdma.html – which is a great example of how the new capability can make our lives better (or faster…). You can also read more at http://docs.nvidia.com/cuda/gpudirect-rdma/index.html.
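To make the P2P idea more concrete, here is a minimal sketch of what the technology allows at the verbs level: registering a buffer allocated with cudaMalloc directly with the InfiniBand adapter, so the HCA can read and write GPU memory over PCIe peer-to-peer. This assumes a Mellanox OFED stack with the NVIDIA peer-memory kernel module loaded; error handling is trimmed and it is illustrative only:

/* Compile (illustrative): gcc gdr_reg.c -libverbs -lcudart */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Allocate the buffer in GPU memory instead of host memory. */
    void *gpu_buf = NULL;
    size_t len = 1 << 20;
    cudaMalloc(&gpu_buf, len);

    /* With GPUDirect RDMA support this registration succeeds and the
     * resulting lkey/rkey can be used in normal RDMA work requests. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    printf("GPU buffer registration %s\n", mr ? "succeeded" : "failed");

    if (mr) ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}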



In the graph: latency improvement presented by DK Panda

Wednesday, May 2, 2012

Amazon’s HPC cloud: not for HPC!


I came across an interesting article on the Amazon HPC Cloud. As was reported recently, Cycle Computing built a 50,000-core Amazon cluster for Schrödinger, which makes simulation software for use in pharmaceutical and biotechnology research. Amazon and Cycle Computing made a lot of noise around the HPC capability of EC2, how great it is for HPC applications and what a great partner Schrödinger is.

When Schrödinger actually tried to run their simulations on the cloud, the results were not that great. The Amazon EC2 architecture slows down HPC applications that require a decent amount of communication between the servers, even at small scale. Schrödinger President Ramy Farid mentioned that they had successfully run parallel jobs on Amazon's eight-core boxes, but when they tried anything more than that, they got terrible performance. Farid was using Amazon's eight-core server instances, so running a job on 16 cores simultaneously required two eight-core machines: "Slow interconnect speeds between separate machines does become a serious issue".
It is known that Amazon EC2 is not a good place for HPC applications, and it will not change unless they actually build the right solution. But instead of taking the right steps, Cycle Computing CEO Jason Stowe decided that the application is at fault… and that the application tested is an example of only 1% of HPC applications, while Amazon cares about the other 99%. Jason, wake up! The applications tested are a good indication for many other HPC applications. Don't blame the application or the user, blame yourself for building a lame solution.

Deepak Singh, Amazon's principal product manager for EC2, also had smart things to say – "We're interested in figuring out from our customers what they want to run, and then deliver those capabilities to them. There are certain specialized applications that require very specialized hardware. It's like one person running it in some secret national laboratory." Deepak, you need to wake up too and stop with these marketing responses. If you want to host HPC applications, don't call every example a "secret national laboratory" and stop calling standard solutions that everyone can buy from any server manufacturer, such as InfiniBand, "very specialized hardware".

Amazon is clearly not connected to its users, or potential users, and until they do try to understand what we need, the best thing is to avoid them. There are much better solutions for HPC clouds out there.

Sunday, February 19, 2012

My Pick for the Best High-Performance Computing Conferences March-May 2012

There are many HPC-related conferences and workshops one can choose to attend. A conference that covers multiple topics and combines real hands-on sessions with technical sessions is the one I prefer to attend. I have tried to pick the best HPC conferences for the coming months:

March: The HPC Advisory Council Switzerland Conference, March 13-15 (http://www.hpcadvisorycouncil.com/events/2012/Switzerland-Workshop/index.php). The location is the beautiful city of Lugano, but more important, of course, is the agenda… the conference will cover all the major developments and will include hands-on sessions. Definitely worth the travel!

April: Two options to pick from – the 2012 High Performance Computing Linux for Wall Street in New York or the IDC HPC User Forum in Richmond, VA. Both are more about sessions and opinions than technical content, but they are decent options if you have the time…

May: Two options again – the NVIDIA GPU Technology Conference in San Jose, CA or the IEEE International Parallel & Distributed Processing Symposium (IPDPS) in Shanghai, China. IPDPS is of course more technical and covers more subjects – so it is the better option.

June: The International Supercomputing Conference in Germany, no doubt…