Intel has just released its new CPU platform (official name Intel Xeon E5-2600 v3, code name “Haswell”), so I wanted to share some of our performance testing. For a start, we ran the simple InfiniBand bandwidth and latency benchmarks that come as part of the InfiniBand software distribution. We measured around 6.4 gigabytes per second of bandwidth and a latency of close to 0.6 microseconds. You can see the full graphs below. More to come :)
Monday, September 15, 2014
Remote Direct Memory Access (RDMA) is the technology that allows server-to-server data communication to go directly to the user space (aka application) memory without any CPU involvement. RDMA delivers faster performance for large data transfers while reducing CPU utilization and overhead. It is used in many application segments: database, storage, cloud and of course HPC. All of the MPIs include support for RDMA for the rendezvous protocol.
There are three communication standards for RDMA: InfiniBand (the de-facto solution for HPC), RoCE and iWARP. The latter two run over Ethernet. RoCE has been standardized by the IBTA, and iWARP by the IETF.
iWARP solutions are sold by Intel (through the acquisition of NetEffect) and Chelsio. RoCE solutions are sold by Mellanox, Emulex and others. The major issues with iWARP are performance and scalability. With iWARP, the data needs to pass through multiple protocols before it can hit the wire, and therefore the performance iWARP delivers is not on par with RoCE (not to mention InfiniBand). The major RoCE limitation was the lack of support over layer 3, but this has been solved in the new RoCE v2 specification that is about to be released.
Last week Intel announced their new Ethernet NICs (“Fortville”). No iWARP support is listed for these new NICs, which leaves Intel without RDMA capability for their Ethernet NICs. It seems that the iWARP camp is shrinking… well… there is a RoCE reason for it…
Thursday, September 4, 2014
A recent release from the Texas Advanced Computing Center (TACC) sheds light on one of the research programs that is being supported, or better said enabled, by TACC's powerful supercomputer, one of the fastest machines in the world. Using supercomputer simulations on TACC's “Lonestar” system, researchers are able to model radiation in a magnetic field, which will facilitate the safe use of the MRI-linac and enable more effective cancer treatment.
The research is being done by the MD Anderson Cancer Center in Houston. According to the team working on it, the new solution they are developing unites radiation therapy and magnetic resonance imaging (MRI), allowing physicians to view the tumor in real time and in high detail during treatment. It also lets physicians adapt the radiation treatment during the procedure, sparing healthy tissue and reducing side effects.
To develop the system, the MD Anderson team uses the TACC supercomputer to run complex simulations. A great use for the supercomputing power. The TACC system was built using the most flexible architecture, a cluster: a combination of CPUs and co-processors, with InfiniBand for the connectivity. It is a great example of a standards-based system, and an example of why there is no reason to use proprietary products for supercomputers. You can read more on TACC systems at https://www.tacc.utexas.edu/resources/hpc. I enjoy using them too.
Wednesday, August 20, 2014
Last year I wrote about the release of the GPUDirect RDMA technology. Simply put, this is the technology that enables direct communication between GPUs over an RDMA-capable network, which translates into much higher performance for applications using GPUs: high-performance applications, data analytics, gaming etc., basically any application that runs over more than a single GPU. In the past every data movement from the GPU had to go through the CPU memory; with GPUDirect RDMA this is no longer the case. The data goes directly from the GPU memory to the RDMA-capable network (for example InfiniBand). Data latency is reduced by more than 70%, data throughput is increased by 5-6X and the CPU bottleneck is eliminated.
The University of Cambridge and the HPC Advisory Council have released performance information on GPUDirect RDMA with one of their applications on a system of nearly 100 servers. The application is HOOMD-blue, a general-purpose molecular dynamics simulation created by the University of Michigan that can run over GPUs. Each server includes two GPUs and two InfiniBand RDMA adapters, so each GPU can connect directly to the network instead of going through the CPU and the QPI interface (each GPU and network adapter pair is located on the same PCI-Express root complex). Bottom line, the GPUDirect RDMA technology enabled Cambridge to increase HOOMD-blue performance by 2X on the given system. Same system, same hardware, switching on GPUDirect RDMA, twice the performance…. Impressive.
Tuesday, August 19, 2014
A good source for high-performance computing information is the HPC Advisory Council (www.hpcadvisorycouncil.com). Of course one can find many news sites (insideHPC, HPCWire and others), but the HPC Council is a good option if you want to get more into the details, to learn about new technologies and solutions being developed etc. One of the recent publications was a case study on the STAR-CCM+ application (CFD) - http://www.hpcadvisorycouncil.com/pdf/STAR-CCM_Analysis_Intel_E5_2680_V2.pdf.
The publication included, for the first time, some performance information on a new MPI solution called HPC-X. In the world of high-performance computing, MPI is one of the most widely used parallel communication libraries. There are some open source solutions, such as MPICH, MVAPICH and OpenMPI, and commercial options: Platform MPI (formerly known as HP MPI, now owned by IBM) and Intel MPI. HPC-X is a new solution from Mellanox which seems to be based on Open MPI plus various accelerations. In the past I covered new releases of both MVAPICH and OpenMPI, as these are the two solutions we have used most so far.
The combination of an open source base and support does make HPC-X a very interesting solution for any HPC system. Of course, it must perform, as this is the most important item… According to the new publication, HPC-X does provide a performance advantage over the other commercial options, up to around 20% on a 32-node cluster (dual-socket servers). It is definitely a good start, and I am looking forward to seeing further reports on HPC-X. Meanwhile we do plan to download it and try it ourselves.
Sunday, August 17, 2014
In the past I reviewed two cluster topologies - Fat Tree (CLOS) and Torus. The dragonfly is a hierarchical topology with the following properties: several groups are connected together using all-to-all links (i.e. each group has at least one link directly to each other group); the topology inside each group can be any topology; and it requires non-minimal global adaptive routing and advanced congestion look-ahead for efficient operation. Simply put, a dragonfly topology is a topology of at least two levels where, at the top level, groups of switches are connected in a full graph. The internal structure of the groups may vary and can be constructed as a full graph, fat tree, torus, mesh, dragonfly and so on.
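To make the "groups connected in a full graph" idea concrete, here is a minimal Python sketch of a toy dragonfly. It is only an illustration under my own assumptions: each group is wired internally as a full graph (just one of the allowed choices), there is exactly one global link per pair of groups, and the function names `build_dragonfly` and `global_links` are made up for this example.

```python
from itertools import combinations


def build_dragonfly(num_groups, switches_per_group):
    """Return an adjacency map for a toy dragonfly.

    Switches are identified as (group, index) tuples.
    Assumptions for this sketch: every group is a full graph internally,
    and every pair of groups gets exactly one global link.
    """
    adj = {(g, s): set()
           for g in range(num_groups)
           for s in range(switches_per_group)}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    # Local links: full graph inside each group.
    for g in range(num_groups):
        for s1, s2 in combinations(range(switches_per_group), 2):
            link((g, s1), (g, s2))

    # Global links: one per pair of groups, spread round-robin
    # over the switches of each group.
    next_switch = [0] * num_groups
    for g1, g2 in combinations(range(num_groups), 2):
        a = (g1, next_switch[g1] % switches_per_group)
        b = (g2, next_switch[g2] % switches_per_group)
        next_switch[g1] += 1
        next_switch[g2] += 1
        link(a, b)

    return adj


def global_links(adj):
    """Return the set of links whose two ends sit in different groups."""
    return {frozenset((a, b))
            for a in adj for b in adj[a]
            if a[0] != b[0]}


topology = build_dragonfly(num_groups=4, switches_per_group=3)
print(len(topology), "switches,", len(global_links(topology)), "global links")
```

With G groups this wiring needs G*(G-1)/2 global links, which is why the number of groups you can build is bounded by how many global ports each group can offer.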
While many have described dragonfly as a topology with one hop between groups, this is actually not correct, and in many cases the network traffic will go over several hops before getting to the destination. The key to making a dragonfly topology effective is to allow some pairs of end-nodes to communicate over a non-minimal route. It is the only way to distribute random group-to-group traffic. This represents a significant difference from other topologies. To support such routing, a dragonfly system needs to use adaptive routing and to send traffic on the longer paths only if congestion is impacting the minimal-hop paths.
While the goal of dragonfly was to enable a higher bisectional bandwidth compared to a torus topology (or similar) while reducing the overall costs (mainly cable lengths), in practice dragonfly does not provide such an advantage, in either performance or cost, over a fat tree topology for example. On the other hand, it does add complexity with the need to enable adaptive routing. So far we prefer to use the fat tree option, either in a full bisectional bandwidth configuration or with some oversubscription, depending on the targeted applications.
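The "minimal unless congested" rule can be sketched at the group level in a few lines of Python. This is a toy model of my own, not the algorithm of any specific switch: `queue_depth` is a hypothetical map of how many packets are waiting on each global link, `threshold` is an arbitrary value, and the detour through an intermediate group is similar in spirit to Valiant-style routing.

```python
def choose_route(src_group, dst_group, queue_depth, num_groups, threshold=4):
    """Pick a group-level route from src_group to dst_group.

    queue_depth maps (from_group, to_group) to the number of packets
    queued on that global link (missing entries count as 0).
    The minimal route (direct global link) is used unless its queue
    exceeds the threshold; only then is a detour through one
    intermediate group considered.
    """
    if src_group == dst_group:
        return [src_group]

    direct_load = queue_depth.get((src_group, dst_group), 0)
    if direct_load <= threshold:
        return [src_group, dst_group]

    # Congested: look for the least loaded two-hop detour, and take it
    # only if it is better than staying on the minimal route.
    best_route = [src_group, dst_group]
    best_load = direct_load
    for mid in range(num_groups):
        if mid in (src_group, dst_group):
            continue
        load = max(queue_depth.get((src_group, mid), 0),
                   queue_depth.get((mid, dst_group), 0))
        if load < best_load:
            best_route, best_load = [src_group, mid, dst_group], load
    return best_route


queues = {(0, 1): 10, (0, 2): 1, (2, 1): 0, (0, 3): 5, (3, 1): 5}
print(choose_route(0, 1, queues, num_groups=4))
```

In this example the direct link from group 0 to group 1 is above the threshold, so the traffic is sent through group 2, paying one extra global hop to avoid the congested link.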
Friday, August 15, 2014
For some of the compute-demanding applications, data throughput is a critical element in getting the needed performance. Throughput is of course important for storage performance and scalability, for checkpointing and other cases, which are relevant to most applications. Today the fastest solution we can find is InfiniBand FDR. It enables a bandwidth of 56 gigabits per second. Taking the overhead off, we are left with around 54 gigabits per second for actual data movement. Other options are QDR or 40 gigabit Ethernet. Ethernet is not what we use for our HPC systems. Too much performance overhead.
One can claim that there are 100 gigabit ports on some Ethernet switches, but these are for network aggregation, not for connecting to the server. These ports actually use 10 lanes of 10 gigabits each, which is less attractive than the desired 4-lane approach.
We did see some announcements for real 100 gigabit HPC networks. The first InfiniBand 100 gigabit switch was announced back in June, with not just higher throughput but also lower latency, so a win on both sides. While there is no indication yet of when the 100 gigabit InfiniBand adapter will be out, the switch announcement hints that we are getting close to the 100 gigabit times.
The higher the bandwidth, (typically) the higher the message rate. With InfiniBand FDR we already saw a much higher message rate versus all the QDR options in the market, either from Mellanox or from Intel. The increase in message rate was greater than the bandwidth difference, so it is also due to the new architecture of the latest InfiniBand FDR adapters. We base all of our systems nowadays on FDR. Waiting for EDR….
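A quick back-of-the-envelope check of the "around 54 gigabit" figure, as a small Python sketch. The assumption here (mine, not stated above) is that the overhead in question is the link encoding: 64b/66b for FDR and 8b/10b for QDR. The helper name `effective_gbps` is made up for this example, and protocol headers are ignored.

```python
def effective_gbps(signal_gbps, payload_bits, coded_bits):
    """Data rate left after line encoding, in gigabits per second.

    payload_bits / coded_bits is the encoding efficiency,
    e.g. 64/66 for 64b/66b or 8/10 for 8b/10b.
    """
    return signal_gbps * payload_bits / coded_bits


# Assumed encodings: FDR uses 64b/66b, QDR uses 8b/10b.
fdr = effective_gbps(56, 64, 66)
qdr = effective_gbps(40, 8, 10)

print(f"FDR: {fdr:.1f} Gb/s of data, about {fdr / 8:.1f} GB/s")
print(f"QDR: {qdr:.1f} Gb/s of data, about {qdr / 8:.1f} GB/s")
```

Under these assumptions the gap between FDR and QDR in usable bandwidth is larger than the 56 versus 40 headline numbers suggest, because the older 8b/10b encoding gives up 20% of the signaling rate.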