Thursday, July 21, 2011

NVIDIA GPUDirect – QLogic TrueScale InfiniBand – Real or Not Real?

Nothing personal, but when you come across advertisements that can be misleading, you do want to express your own opinion on the matter… Since I am a big supporter of hybrid computing, I try to stay on top of things and find ways to make my GPU clusters run faster. I was happy to see the GPUDirect development – both GPUDirect 1 for remote GPU communications and GPUDirect 2 for local GPU communications. Unfortunately, GPUDirect 2 is limited right now to GPUs on the same IOH (I/O hub), but we can always hope for better…
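To make the same-IOH restriction concrete, here is a toy Python sketch – not real CUDA code; the device IDs, the IOH map, and the function name are all invented for illustration. A peer-to-peer copy is only possible when both GPUs hang off the same I/O hub; otherwise the data has to bounce through host memory.

```python
# Toy model of the GPUDirect 2 (peer-to-peer) restriction.
# All names and the topology below are invented for illustration only.

# Hypothetical topology: GPUs 0 and 1 share IOH 0, GPU 2 sits on IOH 1.
GPU_TO_IOH = {0: 0, 1: 0, 2: 1}

def copy_path(src_gpu, dst_gpu):
    """Return the hops a GPU-to-GPU copy takes in this toy model."""
    if GPU_TO_IOH[src_gpu] == GPU_TO_IOH[dst_gpu]:
        # Same IOH: direct peer-to-peer copy over PCIe.
        return ["gpu%d" % src_gpu, "gpu%d" % dst_gpu]
    # Different IOHs: the copy is staged through host memory.
    return ["gpu%d" % src_gpu, "host", "gpu%d" % dst_gpu]

print(copy_path(0, 1))  # same IOH -> direct peer-to-peer
print(copy_path(0, 2))  # different IOHs -> bounced through the host
```

The extra "host" hop in the second case is exactly the overhead GPUDirect 2 is meant to remove – and why the same-IOH limitation is worth complaining about.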

Looking into the best setup for GPUDirect 1, I noticed that there are two options on the InfiniBand side (it seems that GPUDirect 1 is supported only over InfiniBand for now) – the QLogic one and the Mellanox one. Both are listed on the NVIDIA web site. I will focus on the QLogic one in this post.

QLogic published a white paper on “Maximizing GPU Cluster Performance”, and from this paper I quote: “One of the key challenges with deploying clusters consisting of multi-GPU nodes is to maximize application performance. Without GPUDirect, GPU-to-GPU communications would require the host CPU to make multiple memory copies to avoid a memory pinning conflict between the GPU and InfiniBand. Each additional CPU memory copy significantly reduces the performance potential of the GPUs.” I completely agree with this statement; this is the reason GPUDirect 1 exists. I would prefer to see more direct connectivity between the GPU and the network, but for now this is useful as well.
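The copy-elimination argument can be sketched with another toy model – again plain Python with invented buffer names, not the actual CUDA or InfiniBand APIs. Without GPUDirect 1, the CUDA driver and the InfiniBand driver each pin their own host buffer, so the CPU has to copy the data from one to the other before it can be sent; with GPUDirect 1, the two drivers share one pinned buffer and that copy disappears.

```python
# Toy model of the host-side copies in a GPU-to-remote-GPU send.
# Buffer names are invented for illustration only.

def send_path(gpudirect1):
    """List the host-side memory copies made before data hits the wire."""
    copies = [("gpu_memory", "cuda_pinned_buffer")]  # DMA out of the GPU
    if not gpudirect1:
        # Separate pinned regions for CUDA and InfiniBand: the CPU must
        # copy between them before the HCA can read the data.
        copies.append(("cuda_pinned_buffer", "ib_pinned_buffer"))
    # With GPUDirect 1, CUDA and the IB driver share one pinned buffer,
    # so the HCA reads directly from cuda_pinned_buffer.
    return copies

print(len(send_path(gpudirect1=False)))  # 2 copies
print(len(send_path(gpudirect1=True)))   # 1 copy
```

One fewer CPU memory copy per message is precisely the saving the white paper describes.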

Continuing the quote: “QLogic’s implementation of GPUDirect takes a streamlined approach to optimizing…” wait a second… QLogic InfiniBand relies on the host CPU for the InfiniBand transport (also known as “on-loading”), therefore any data must go through the CPU before it can hit the wire. So how can you keep the CPU out of every GPU communication if you need the CPU to build the InfiniBand network packets? Well, you cannot. “Streamlined” in this case means no real GPUDirect… What QLogic does not mention in this paper is a comment they made in a presentation given before the paper was published: their test-bed system included 2 (yes, two… the number 2) GPUs per node. Therefore, in this case, the only real GPUDirect they could test is GPUDirect 2.

In the same paper, they also compare their InfiniBand (“TrueScale”) performance to their competitor – meaning Mellanox. This is a tricky situation… if you do such a comparison, do it right. I went to look for any numbers from Mellanox, and was happy to find some, but not all. Some of the QLogic results claimed to be on “Other InfiniBand” could only be found in the QLogic paper, so I can only assume that those benchmarks were done by QLogic. I compared the results in the QLogic paper with Mellanox’s own publications and guess what – QLogic published much lower performance for the Mellanox solution than Mellanox did themselves. For example, on the Amber FactorIX benchmark, QLogic claims that Mellanox can achieve 10 nanoseconds/day on 8 GPUs, while Mellanox reported nearly 19 nanoseconds/day on 8 GPUs – nearly twice as much. It would have been much better if QLogic had focused on their own solution rather than spreading FUD… food for thought.
