Tuesday, July 2, 2013

GPU Direct RDMA is Finally Out!

GPUDirect RDMA is the newest technology for GPU-to-GPU communications over the InfiniBand interconnect. GPUDirect RDMA enables direct data transfer from GPU memory over the InfiniBand network via PCI Express peer-to-peer (P2P). This capability was introduced with the NVIDIA Kepler-class GPUs, CUDA 5.0 and the Mellanox InfiniBand solutions.

The importance of this capability lies in bypassing the CPU for GPU communications (who needs the CPU…..), which brings a dramatic increase in performance. Finally, after a long wait, the two companies mentioned above demonstrated the new capability at the recent ISC’13 conference. Prof. Dhabaleswar K. (DK) Panda, Hari Subramoni and Sreeram Potluri from the Ohio State University presented their first results with GPUDirect RDMA at the HPC Advisory Council – a 70% reduction in latency! You can see the entire presentation at http://www.hpcadvisorycouncil.com/events/2013/European-Workshop/presentations/9_OSU.pdf. It seems that GE Intelligent Platforms is already using the new technology - http://www.militaryaerospace.com/whitepapers/2013/03/gpudirect_-rdma.html - which is a great example of how the new capability can make our life better (or faster…). You can also read more at http://docs.nvidia.com/cuda/gpudirect-rdma/index.html.

In the graph: latency improvement presented by DK Panda
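To get a feel for where a reduction of that size comes from, here is a quick back-of-the-envelope sketch. The per-hop costs below are assumed values for illustration only, not measurements from the OSU presentation:

```python
# Hypothetical per-hop costs for a small message (illustrative values,
# not taken from the OSU presentation):
cuda_copy_us = 7.0   # GPU <-> host staging copy, one side (assumed)
ib_send_us   = 5.0   # host-to-host InfiniBand latency (assumed)

# Without GPUDirect RDMA: copy to host, send over IB, copy back to the GPU.
staged_us = cuda_copy_us + ib_send_us + cuda_copy_us

# With GPUDirect RDMA the HCA reads/writes GPU memory directly,
# so both staging copies disappear from the critical path.
rdma_us = ib_send_us

reduction = 1 - rdma_us / staged_us
print(f"staged: {staged_us:.1f} us, RDMA: {rdma_us:.1f} us")
print(f"latency reduction: {reduction:.0%}")
```

With these assumed numbers the reduction lands in the same ballpark as the reported 70%, which is exactly what you would expect from eliminating the two staging copies.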

Wednesday, May 2, 2012

Amazon’s HPC cloud: not for HPC!

I came across an interesting article on the Amazon HPC Cloud. As was reported recently, Cycle Computing built a 50,000-core Amazon cluster for Schrödinger, which makes simulation software for use in pharmaceutical and biotechnology research. Amazon and Cycle Computing made a lot of noise around the HPC capability of EC2, how great it is for HPC applications and what a great partner Schrödinger is.

When Schrödinger actually tried to run their simulations on the cloud, the results were not that great. The Amazon EC2 architecture slows down HPC applications that require a decent amount of communication between the servers, even at small scale. Schrödinger President Ramy Farid mentioned that they had successfully run parallel jobs on Amazon eight-core boxes, but when they tried anything more than that, they got terrible performance. Farid was using Amazon’s eight-core server instances, so running a job on 16 cores simultaneously required two eight-core machines. “Slow interconnect speeds between separate machines does become a serious issue”.
It is known that Amazon EC2 is not a good place for HPC applications, and that will not change unless they actually build the right solution. But instead of taking the right steps, Cycle Computing CEO Jason Stowe decided that the application is at fault…. and that the application tested is an example of only 1% of HPC applications, while Amazon cares for the other 99%. Jason, wake up! The applications tested are a good indication for many other HPC applications. Don’t blame the application or the user, blame yourself for building a lame solution.

Deepak Singh, Amazon’s principal product manager for EC2, also had smart things to say - “We’re interested in figuring out from our customers what they want to run, and then deliver those capabilities to them. There are certain specialized applications that require very specialized hardware. It’s like one person running it in some secret national laboratory.” Deepak, you need to wake up too, and to stop with these marketing responses. If you want to host HPC applications, don’t call every example a “secret national laboratory”, and stop calling standard solutions that everyone can buy from any server manufacturer, such as InfiniBand, “very specialized hardware”.

Amazon is clearly not connected to its users, or potential users, and until they try to understand what we need, the best thing is to avoid them. There are much better solutions for HPC clouds out there.

Sunday, February 19, 2012

My Pick for the Best High-Performance Computing Conferences March-May 2012

There are many HPC-related conferences and workshops one can choose to attend. Therefore, a conference that covers multiple topics and combines real hands-on work with technical sessions is the one I prefer to attend. I have tried to find the best HPC conferences over the next few months:

March: The HPC Advisory Council Switzerland Conference, March 13-15 (http://www.hpcadvisorycouncil.com/events/2012/Switzerland-Workshop/index.php). The location is the beautiful city of Lugano, but more important, of course, is the agenda… the conference will cover all the major developments and will include hands-on sessions. Definitely worth the travel!

April: Two options to pick from - the 2012 High Performance Computing Linux for Wall Street in New York or the IDC HPC User Forum in Richmond, VA. Both are more about sessions and opinions than about technical focus, but decent options if you have time…

May: Two options again – the NVIDIA GPU Technology Conference in San Jose, CA or the IEEE International Parallel & Distributed Processing Symposium (IPDPS) in Shanghai, China. IPDPS is of course more technical and covers more subjects – so it is the better option.

June: The International Supercomputing Conference in Germany, no doubt…

Wednesday, February 15, 2012

New MPIs released – Open MPI 1.4.5 and MVAPICH2 1.8

In very close timing, both the Open MPI group and the MVAPICH team released new versions of their open-source MPIs. The Open MPI Team announced the release of Open MPI version 1.4.5. This release is mainly a bug-fix release over the v1.4.4 release. Version 1.4.5 can be downloaded from the main Open MPI web site and it contains improved management of the registration cache, a fix for SLURM cpus-per-task allocation, as well as some other bug fixes.

The MVAPICH team announced the release of MVAPICH2 1.8a2 and OSU Micro-Benchmarks (OMB) 3.5.1. The new features include support for collective communication from GPU buffers, non-contiguous datatype support in point-to-point and collective communication from GPU buffers, efficient GPU-GPU transfers within a node using CUDA IPC, the ability to adjust the shared-memory communication block size at runtime, XRC enabled by default at configure time, a new shared-memory design for enhanced intra-node small-message performance, and SLURM integration with mpiexec.mpirun_rsh to use SLURM-allocated hosts without specifying a hostfile. To download MVAPICH2 1.8a2, OMB 3.5.1, the associated user guide and quick start guide, or to access the SVN, you can check http://mvapich.cse.ohio-state.edu.
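As a rough sketch of how the GPU features get enabled at build and run time (the install prefix, CUDA path and tuning value below are assumptions based on the MVAPICH2 user guide, not taken from the announcement):

```shell
# Build MVAPICH2 with CUDA support enabled (exact flags may vary by release)
./configure --prefix=/opt/mvapich2-1.8a2 \
            --enable-cuda \
            --with-cuda=/usr/local/cuda
make && make install

# The block size used for pipelined GPU transfers can then be adjusted
# at runtime through an environment variable, for example:
export MV2_CUDA_BLOCK_SIZE=262144
```

The runtime-adjustable block size is the interesting part: it lets you tune the pipelining of large GPU transfers per job, without rebuilding the library.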

Congrats to the teams on the new releases.

Friday, December 16, 2011

InfiniBand for the Home in Less Than $150 (10Gb Networking on the Cheap)

I came across Dave Hunt's blog - http://davidhunt.ie/wp/?p=232 - talking about using an older generation of InfiniBand (10Gb) at home. Think about your home entertainment system, Blu-ray streaming from a storage box to your PC, and other usage models…

When it comes to Ethernet at 10Gb/s, the price is sky-high, but you can get the same speed with InfiniBand for pennies. Dave built a system that passes over 700MB/sec of throughput between his PCs at home for under $150! As Dave says, that’s like a full CD’s worth of data every second…

From Dave's blog – “So, I now have an InfiniBand Fabric working at home, with over 7 gigabit throughput between PCs. The stuff of high-end datacenters in my back room. The main thing is that you don’t need a switch, so a PC to PC 10-gigabit link CAN be achieved for under $150! Here’s the breakdown: 2 x Mellanox MHEA28-XTC InfiniBand HCA’s @ $34.99 + shipping = $113 (from eBay), 1 x 3m Molex SFF-8470 InfiniBand cable include shipping = $29. Total: $142”.
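Dave's numbers check out with some quick arithmetic (the 700MB figure is the nominal capacity of an 80-minute CD):

```python
# Cost breakdown from Dave's post
hcas_with_shipping = 113  # two Mellanox MHEA28-XTC HCAs @ $34.99 + shipping, eBay
cable = 29                # 3m Molex SFF-8470 cable, shipping included
total = hcas_with_shipping + cable
print(f"total: ${total}")  # comfortably under the $150 budget

# Throughput sanity check: one full CD per second
cd_mb = 700               # nominal capacity of an 80-minute CD, in MB
throughput_mb_s = 700     # measured PC-to-PC throughput, in MB/sec
print(f"CDs per second: {throughput_mb_s / cd_mb:.1f}")
```

No switch needed for a two-node fabric is the key trick: back-to-back HCAs keep the whole bill of materials down to two cards and one cable.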

Next will be for me to bring InfiniBand into my home…

PCI-Express 3.0 is Finally Here…

Long time since my last post… You know how it is – new academic year, lots of preparations, going to Supercomputing 2011 in the freezing Seattle…

One of the new technologies that I was really waiting for is PCI Express 3.0. PCI Express 2.0 was released in 2008 and is the bottleneck whenever you use a network faster than 20Gb/s - such as InfiniBand. It is about time to get the new generation out.

While sources said that the official release of PCI Express 3.0 will be in the March 2012 time frame, the first systems based on PCI Express 3.0 (and InfiniBand FDR!!) are already out there. If you were at SC’11, or if you monitor the news from the TOP500 list, you could hear (or read) about the new systems. One of them is the Carter supercomputer at Purdue University, which is said to be the fastest US campus supercomputer. The Carter system was ranked 54th on the November TOP500.org list and was built using the latest technologies from Intel, HP and Mellanox, including the not-yet-released Intel Xeon E5 "Sandy Bridge" processors, HP ProLiant servers and the already-released InfiniBand FDR.

The folks from Purdue claim that "Carter is running twice as fast as the supercomputer we were using and is using only half of the nodes. That will allow us to scale our models for better forecasts." So higher performance at a lower operational cost. Great deal…

Saturday, October 15, 2011

New MPIs Version Announced – Open MPI and MVAPICH (OSU)

This week, one day apart, new MPI versions of Open MPI and MVAPICH were announced. On Thursday, the Open MPI team announced the release of Open MPI version 1.4.4. This release is mainly a bug-fix release over the previous v1.4.3 release. The team strongly recommends that all users upgrade to version 1.4.4 if possible.

A day after, on Friday, Prof. Dhabaleswar Panda from Ohio State University announced on behalf of the MVAPICH team the release of MVAPICH2 1.7 and OSU Micro-Benchmarks (OMB) 3.4. You can check both the Open MPI and OSU websites for details on the new releases.

As part of the release, Prof. Panda provided some performance results, and according to him, MVAPICH2 1.7 is being made available with OFED 1.5.4 and continues to deliver excellent performance. With OpenFabrics/Gen2 on Westmere quad-cores (2.53 GHz) with PCIe Gen2 and Mellanox ConnectX-2 QDR (two-sided operations), it provides 1.64 microseconds one-way latency (4 bytes), 3394 MB/sec unidirectional bandwidth and 6537 MB/sec bidirectional bandwidth. With QLogic InfiniPath support on the same Westmere quad-cores with PCIe Gen2 and QLogic QDR (two-sided operations), it provides 1.70 microseconds one-way latency (4 bytes), 3265 MB/sec unidirectional bandwidth and 4228 MB/sec bidirectional bandwidth. Prof. Panda's results clearly indicate that if you go with InfiniBand, Mellanox ConnectX-2 provides lower latency and much higher throughput. Clearly the performance winner.
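Putting the two sets of quoted numbers side by side makes the gap concrete (these are the figures from the announcement above, nothing new):

```python
# Two-sided operation results quoted above: Westmere 2.53 GHz, PCIe Gen2
mellanox = {"latency_us": 1.64, "uni_bw": 3394, "bi_bw": 6537}
qlogic   = {"latency_us": 1.70, "uni_bw": 3265, "bi_bw": 4228}

# Relative advantage of ConnectX-2 over InfiniPath on each metric
for metric in ("uni_bw", "bi_bw"):
    gain = mellanox[metric] / qlogic[metric]
    print(f"{metric}: {gain:.2f}x in favor of ConnectX-2")
```

The unidirectional numbers are close, but the bidirectional gap of roughly 1.5x is where the ConnectX-2 advantage really shows.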