Wednesday, August 20, 2014

GPUDirect RDMA in Action

Last year I wrote about the release of the GPUDirect RDMA technology. Simply put, this is the technology that enables direct communication between GPUs over an RDMA-capable network, which translates into much higher performance for applications using GPUs: high-performance computing, data analytics, gaming and so on, basically any application that runs on more than a single GPU. In the past, every data movement from the GPU had to go through CPU memory; with GPUDirect RDMA that is no longer the case. The data goes directly from GPU memory to the RDMA-capable network (for example InfiniBand), so data latency is reduced by more than 70%, data throughput is increased by 5-6X and the CPU bottleneck is eliminated.
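For those curious what this looks like from the application side, here is a minimal sketch assuming a CUDA-aware MPI library (for example MVAPICH2-GDR or an Open MPI build with GPUDirect RDMA support): the buffer handed to MPI is a device pointer, and the library, not the application, decides how to move it.

/* Minimal sketch: point-to-point exchange of GPU-resident data with a
 * CUDA-aware MPI. The device pointer is handed straight to MPI_Send /
 * MPI_Recv; there is no cudaMemcpy staging through host memory. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 20;                  /* 1M doubles, illustrative size */
    double *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(0);                       /* one GPU per rank assumed */
    cudaMalloc((void **)&d_buf, n * sizeof(double));
    cudaMemset(d_buf, 0, n * sizeof(double));

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

Whether the transfer really goes GPU-to-network directly depends on how the MPI stack was built and configured; some stacks will silently fall back to staging through host memory.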


The University of Cambridge and the HPC Advisory Council have released performance results of GPUDirect RDMA with one of their applications on a system of nearly 100 servers. The application is HOOMD-blue, a general-purpose molecular dynamics simulation created at the University of Michigan that can run on GPUs. Each server includes two GPUs and two InfiniBand RDMA adapters, so each GPU can connect directly to the network instead of going through the CPU and the QPI interface (each GPU and network adapter pair sits on the same PCI-Express root complex). Bottom line, the GPUDirect RDMA technology enabled Cambridge to double HOOMD-blue performance on the given system. Same system, same hardware, just turning on GPUDirect RDMA, twice the performance. Impressive.



Tuesday, August 19, 2014

New MPI (Message Passing Interface) Solution for High-Performance Applications

A good source for high-performance computing information is the HPC Advisory Council (www.hpcadvisorycouncil.com). Of course one can find many news sites, insideHPC, HPCWire and others, but the HPC Advisory Council is a good option if you want to get deeper into the details and learn about new technologies and solutions being developed. One of its recent publications was a case study on the STAR-CCM+ application (CFD): http://www.hpcadvisorycouncil.com/pdf/STAR-CCM_Analysis_Intel_E5_2680_V2.pdf.

The publication included, for the first time, some performance information on a new MPI solution called HPC-X. In the world of high-performance computing, MPI is one of the most widely used parallel communication libraries. There are several open-source implementations, such as MPICH, MVAPICH and Open MPI, and commercial options: Platform MPI (formerly known as HP MPI, now owned by IBM) and Intel MPI. HPC-X is a new solution from Mellanox which seems to be based on Open MPI plus various accelerations. In the past I covered new releases of both MVAPICH and Open MPI, as these are the two solutions we have used most so far.
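For readers less familiar with MPI, below is a generic sketch (nothing HPC-X specific); any of the implementations mentioned above should compile and run it unchanged.

/* Generic MPI example: each rank reports itself, then rank 0 sums a value
 * contributed by all ranks using a collective operation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, local = 1, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("rank %d of %d is alive\n", rank, size);

    /* Collective: sum 'local' across all ranks, result lands on rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum over %d ranks = %d\n", size, total);

    MPI_Finalize();
    return 0;
}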


The combination of an open-source base and commercial support makes HPC-X a very interesting solution for any HPC system. Of course, it must perform, as this is the most important item. According to the new publication, HPC-X does provide a performance advantage over the other commercial options, up to around 20% on a 32-node cluster (dual-socket servers). It is definitely a good start, and I am looking forward to seeing further reports on HPC-X. Meanwhile we do plan to download it and try it ourselves.

Sunday, August 17, 2014

Cluster Topologies - Dragonfly

In the past I reviewed two cluster topologies, Fat Tree (CLOS) and Torus. The dragonfly is a hierarchical topology with the following properties: several groups are connected together using all-to-all links (i.e. each group has at least one link directly to each other group), the topology inside each group can be any topology, and it requires non-minimal global adaptive routing and advanced congestion look-ahead for efficient operation. Simply put, a dragonfly is a two-level (at least) topology where, at the top level, groups of switches are connected in a full graph. The internal structure of the groups may vary and can be a full graph, fat tree, torus, mesh, another dragonfly and so on.
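To make the structure more concrete, here is a small sketch of canonical dragonfly sizing using the a/p/h parameterization from the original dragonfly paper (a switches per group, p end-nodes per switch, h global links per switch); the specific values below are just an illustration.

/* Sketch: size of a canonical dragonfly with a switches per group,
 * p end-nodes per switch and h global links per switch.
 * Full group-to-group connectivity needs a*h >= groups - 1; the
 * "balanced" choice from the original dragonfly paper is a = 2p = 2h. */
#include <stdio.h>

int main(void)
{
    int a = 8, p = 4, h = 4;            /* balanced example: a = 2p = 2h     */
    int groups = a * h + 1;             /* max groups with one link per pair */
    int nodes  = a * p * groups;        /* total end-nodes                   */

    printf("groups: %d, switches: %d, end-nodes: %d\n",
           groups, a * groups, nodes);
    printf("global links per group: %d (needs >= groups-1 = %d)\n",
           a * h, groups - 1);
    return 0;
}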



While many describe dragonfly as a topology with one hop between groups, that is actually not correct, and in many cases network traffic will go over several hops before reaching its destination. The key to making a dragonfly topology effective is to allow some pairs of end-nodes to communicate over a non-minimal route; it is the only way to distribute random group-to-group traffic. This represents a significant difference from other topologies. To support such routing, a dragonfly system needs to utilize adaptive routing and send traffic over the longer paths only when congestion is impacting the minimal (fewest-hops) paths.
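As a rough illustration of the decision adaptive routing has to make per packet, here is a UGAL-style sketch: take the non-minimal (Valiant) path only when the minimal path's queue, weighted by hop count, looks worse. This is only the idea, not any vendor's actual implementation.

/* UGAL-style adaptive routing decision, illustrative only: prefer the
 * minimal path unless its queue occupancy, weighted by hop count, is
 * more expensive than that of a randomly chosen non-minimal path. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int queue_depth;   /* occupancy of the candidate output queue */
    int hops;          /* path length in hops                     */
} path_t;

static bool take_nonminimal(path_t minimal, path_t nonminimal)
{
    long min_cost    = (long)minimal.queue_depth    * minimal.hops;
    long nonmin_cost = (long)nonminimal.queue_depth * nonminimal.hops;
    return nonmin_cost < min_cost;
}

int main(void)
{
    path_t min_path = { 30, 3 };   /* congested minimal route (3 hops)   */
    path_t valiant  = { 5, 6 };    /* random intermediate group (6 hops) */

    printf("take non-minimal path: %s\n",
           take_nonminimal(min_path, valiant) ? "yes" : "no");
    return 0;
}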


While the dragonfly's goal was to enable higher bisectional bandwidth compared to a torus (or similar) topology while reducing overall cost (mainly cable length), in practice dragonfly does not provide such an advantage, in either performance or cost, over a fat tree topology for example. On the other hand, it does add complexity with the need to enable adaptive routing. So far we prefer the fat tree option, either in a full bisectional bandwidth configuration or with some oversubscription, depending on the targeted applications.

Friday, August 15, 2014

Here Comes The 100 Gigabit Per Second…

For some of the compute-demanding applications, data throughput is a critical element in getting the needed performance. Throughput is of course important for storage performance and scalability, for checkpointing and for other cases which are relevant to most applications. Today the fastest solution we can find is InfiniBand FDR. It enables a bandwidth of 56 gigabit per second; taking the encoding overhead off, we are left with around 54 gigabit per second for actual data movement. Other options are QDR or 40 gigabit Ethernet. Ethernet is not what we use for our HPC systems; too much performance overhead.
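The arithmetic behind those numbers, assuming FDR's 14.0625 gigabit per lane signaling over 4 lanes with 64b/66b encoding (and QDR's 10 gigabit per lane with 8b/10b encoding):

/* Back-of-the-envelope: effective data rate of a 4x InfiniBand link.
 * FDR signals at 14.0625 Gb/s per lane with 64b/66b encoding;
 * QDR signals at 10 Gb/s per lane with 8b/10b encoding. */
#include <stdio.h>

int main(void)
{
    double lanes = 4.0;

    double fdr_raw  = lanes * 14.0625;            /* ~56.25 Gb/s             */
    double fdr_data = fdr_raw * 64.0 / 66.0;      /* ~54.5 Gb/s after coding */

    double qdr_raw  = lanes * 10.0;               /* 40 Gb/s                 */
    double qdr_data = qdr_raw * 8.0 / 10.0;       /* 32 Gb/s after coding    */

    printf("FDR 4x: %.2f Gb/s raw, %.2f Gb/s data\n", fdr_raw, fdr_data);
    printf("QDR 4x: %.2f Gb/s raw, %.2f Gb/s data\n", qdr_raw, qdr_data);
    return 0;
}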

One can claim that there are 100 gigabit ports on some Ethernet switches, but these are for network aggregation, not to the server. These ports actually use 10 lanes of 10 gigabit each, not the desired 4-lane approach.

We did see some announcements of real 100 gigabit HPC networks. The first 100 gigabit InfiniBand switch was announced back in June, with not just higher throughput but also lower latency, so a win on both sides. While there is no indication yet on when the 100 gigabit InfiniBand adapter will be out, the switch announcement hints that we are getting close to the 100 gigabit era.


The higher the bandwidth, (typically) the higher the message rate. With InfiniBand FDR we already saw a much higher message rate versus all the QDR options in the market, either from Mellanox or from Intel. The increase in message rate was greater than the bandwidth difference, so it is also due to the new architecture of the latest InfiniBand FDR adapters. We base all of our systems nowadays on FDR. Waiting for EDR….
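For reference, message rate is typically measured with a windowed stream of small non-blocking messages, in the spirit of the OSU message rate benchmark. The sketch below is a simplified illustration, not the actual benchmark; run it with exactly two ranks.

/* Simplified message-rate loop between rank 0 (sender) and rank 1
 * (receiver): post a window of small non-blocking messages, wait for
 * completion, and count messages per second. Run with exactly two ranks. */
#include <mpi.h>
#include <stdio.h>

#define WINDOW 64
#define ITERS  10000
#define MSG    8                         /* 8-byte messages */

int main(int argc, char **argv)
{
    static char buf[WINDOW][MSG];        /* one buffer per in-flight message */
    MPI_Request req[WINDOW];
    int rank, i, w;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    for (i = 0; i < ITERS; i++) {
        for (w = 0; w < WINDOW; w++) {
            if (rank == 0)
                MPI_Isend(buf[w], MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[w]);
            else
                MPI_Irecv(buf[w], MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[w]);
        }
        MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
    }

    t1 = MPI_Wtime();
    if (rank == 0)
        printf("%.2f million messages per second\n",
               ITERS * (double)WINDOW / (t1 - t0) / 1e6);

    MPI_Finalize();
    return 0;
}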