Tuesday, July 5, 2011

NVIDIA GPUDirect – Here is the Complete Story

Let me start with some history here. If we follow the development of things, GPUDirect appears to have been mentioned first in the NVIDIA press release announcing the GPUDirect project as a collaboration with Mellanox Technologies – “NVIDIA Tesla GPUs To Communicate Faster Over Mellanox InfiniBand Networks”, http://www.nvidia.com/object/io_1258539409179.html. Since then you can find more press releases and numerous papers describing what it is and the performance gain you can achieve using it.

Today there are two GPUDirect versions – version 1.0 and version 2.0. Version 1.0 accelerates communications between GPUs located on separate servers over InfiniBand; it exists in CUDA 3.0 (where it requires kernel patches to work) and in CUDA 4.0 (which does not require any kernel patches, so it is easier to use). GPUDirect version 2.0 accelerates communications between GPUs in the same server attached to the same CPU chipset (it does not work if the GPUs sit behind separate CPU chipsets) and exists in CUDA 4.0. So if you use CUDA 4.0, you get both GPUDirect version 1 and GPUDirect version 2. A short peer-to-peer sketch for version 2 follows below.
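
For illustration, here is a minimal CUDA sketch of the peer-to-peer API that CUDA 4.0 exposes and that GPUDirect version 2 builds on. The device IDs (0 and 1) and the buffer size are placeholders; whether the peer check succeeds depends on your topology, i.e. both GPUs hanging off the same CPU chipset.

```
// Minimal sketch of CUDA 4.0 peer-to-peer access (GPUDirect version 2).
// Assumes two GPUs, IDs 0 and 1, attached to the same CPU chipset.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t nbytes = 64 << 20;   // 64 MB test buffer (arbitrary size)
    int can_access = 0;

    // Check whether GPU 0 can address GPU 1's memory directly.
    // This is what fails when the GPUs sit behind separate chipsets.
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    if (!can_access) {
        printf("Peer access 0 -> 1 is not available on this topology\n");
        return 1;
    }

    void *buf0, *buf1;
    cudaSetDevice(0);
    cudaMalloc(&buf0, nbytes);
    cudaDeviceEnablePeerAccess(1, 0);   // let GPU 0 access GPU 1's memory

    cudaSetDevice(1);
    cudaMalloc(&buf1, nbytes);
    cudaDeviceEnablePeerAccess(0, 0);   // and the other direction

    // Direct GPU-to-GPU copy over PCIe, no staging through host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, nbytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```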

GPUDirect version 1 enables better communication between remote GPUs over InfiniBand. Why InfiniBand? Because the data communications between the GPUs need RDMA, otherwise it does not work. Without RDMA support, the server CPU has to be involved in the data path, so there is not much “GPUDirect” left... Looking at the InfiniBand vendors – the one with RDMA support is Mellanox, and the one without it (well, they have a kind of software emulation of RDMA) is QLogic. No surprise that NVIDIA announced the GPUDirect project with Mellanox. A sketch of the data path version 1 accelerates follows below.
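
To make that data path concrete, here is a hedged sketch of the staging pattern GPUDirect version 1 is meant to speed up. The message size, the rank numbering, and the choice of an InfiniBand-capable MPI are my assumptions; the point is that with GPUDirect version 1 drivers the CUDA stack and the InfiniBand stack can share the same pinned host buffer, so the CPU no longer has to copy the data between two separate pinned regions before the RDMA transfer.

```
// Sketch of sending GPU data to a remote GPU through a pinned host buffer.
// Assumes an InfiniBand-capable MPI library and two ranks on two servers.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t nbytes = 4 << 20;    // 4 MB message (arbitrary size)
    void *d_buf, *h_buf;
    cudaMalloc(&d_buf, nbytes);
    cudaMallocHost(&h_buf, nbytes);   // page-locked host staging buffer

    if (rank == 0) {
        // Stage GPU data into the pinned buffer, then hand it to the IB HCA.
        // With GPUDirect v1 the same pinned region serves both drivers.
        cudaMemcpy(h_buf, d_buf, nbytes, cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, (int)nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_buf, (int)nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice);
    }

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```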

I have examined the GPUDirect performance numbers that both companies have published. I am sure the cluster configurations were not identical, so one should expect a somewhat “noisy” comparison. Looking at the results, I noticed that Mellanox used a single GPU per server, so their measurements clearly show the benefit of GPUDirect version 1. QLogic, on the other hand, used two GPUs per server, so their measurements could well reflect GPUDirect version 2 (and not GPUDirect version 1), which has nothing to do with the network... Comparing QLogic and Mellanox numbers is clearly meaningless given those setup details, and even when you do compare, the difference is negligible – meaning that with half the number of GPUs, Mellanox is doing almost the same as QLogic. I now know what I want to use in my GPU platform… :)

At the end of the day, if you are using GPUs, GPUDirect version 1 and version 2 can deliver real performance benefits. Just make sure you are using the right InfiniBand solution for version 1 and keep the GPUs on the same CPU chipset to take advantage of version 2, and you will be all set.
