Tuesday, April 19, 2016

HPC interconnects: InfiniBand versus Intel Omni-Path

This is a hot topic nowadays, especially in the HPC world: the leading InfiniBand solution versus Intel's new proprietary product, Omni-Path. One can find plenty of information on what Omni-Path is (it was already announced at the 2015 supercomputing conference), but interestingly enough, Intel has not released any application performance results so far. You can find some low-level benchmarks such as network latency and bandwidth, but those numbers are actually similar between the two solutions, and do not reflect the main architectural differences between InfiniBand and Omni-Path. Just a few days ago, Mellanox published an interesting article covering in detail the main differences between InfiniBand and Omni-Path, and for the first time released some application performance comparisons. Not surprisingly (after all, Omni-Path is based on the QLogic TrueScale architecture), InfiniBand shows much higher application performance even at small cluster sizes - up to a 60% performance advantage. It is an interesting article to read, and it seems that InfiniBand will remain the top solution for HPC systems.
You can find the article at - http://www.hpcwire.com/2016/04/12/interconnect-offloading-versus-onloading/.

Tuesday, March 22, 2016

Co-Processors - Intel Knights Landing

In the past I wrote about NVIDIA GPUs, and about the GPU-Direct technology that enables direct network communication between an NVIDIA GPU and the cluster interconnect (for example InfiniBand). Since then, there have been several enhancements to the GPU-Direct technology, and I will try to cover those in future posts.

In this post I would like to give my view on the upcoming Intel Knights Landing Xeon Phi co-processor. Knights Landing is the next-generation Xeon Phi after Knights Corner. Knights Corner was not a success story, to say the least, and Intel aims to gain some traction with the next-generation Knights Landing.

Knights Landing has 72 cores, and more importantly, it is a bootable device, so one can use Knights Landing as the main CPU in a server platform. This is a nice capability, as one can build or use single-CPU (Knights Landing only) boards, for example.

On the connectivity side, Knights Landing will be provided in two versions (packages): KNL and KNL-F. KNL is the Knights Landing CPU, with two PCI-Express x16 interfaces and one PCI-Express x4 interface. The two x16 interfaces can be connected to any network solution. KNL-F is a packaged Knights Landing with two Intel Omni-Path ASICs, each connected to the KNL ASIC via PCI-Express x16. The KNL-F package also exposes PCI-Express x4 connectivity for management options.
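To put the KNL lane counts in perspective, here is a back-of-the-envelope calculation of what those interfaces can carry. It assumes the links are PCIe Gen3 (8 GT/s per lane, 128b/130b encoding) - the post does not state the generation, so treat this as an illustrative sketch, not a spec quote.

```python
# Rough per-link bandwidth math for the KNL PCI-Express interfaces.
# Assumes PCIe Gen3 (8 GT/s per lane, 128b/130b line encoding);
# illustrative only -- check the actual platform specifications.

GT_PER_LANE = 8e9       # 8 GT/s per PCIe Gen3 lane
ENCODING = 128 / 130    # 128b/130b line-encoding efficiency

def pcie_gen3_gbytes_per_s(lanes):
    """Raw unidirectional bandwidth in GB/s for a Gen3 link."""
    bits_per_s = GT_PER_LANE * ENCODING * lanes
    return bits_per_s / 8 / 1e9

x16 = pcie_gen3_gbytes_per_s(16)  # one x16 network interface
x4 = pcie_gen3_gbytes_per_s(4)    # the x4 management interface

print(f"x16: {x16:.2f} GB/s, x4: {x4:.2f} GB/s")
# Two x16 links give roughly 2 x 15.75 GB/s of aggregate I/O bandwidth
```

So a single KNL package has on the order of 31 GB/s of network-facing PCIe bandwidth under these assumptions, which is plenty for one or two host adapters.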

KNL has been tested with InfiniBand and works great using the OpenFabrics OFED distribution or the Mellanox OFED distribution. It has already been deployed at several sites.

KNL-F usage is questionable. It makes sense for Intel to try to lock users into their proprietary interconnect product (which is no more than a new version of the QLogic TrueScale product), but why would one want to be locked into Omni-Path? Especially as Omni-Path requires the KNL to spend many expensive core cycles to manage and operate the Omni-Path network, which means a loss of KNL performance. TrueScale was not used with GPUs in the past due to its overhead, and Omni-Path, which is based on the same TrueScale architecture, is no better.

Wednesday, March 9, 2016

UCX – a Unified Communication X Framework for High Performance Communications

UCX is a new, open-source, production-grade set of network APIs and their implementations for high-performance communications, targeting high-performance computing and data-centric applications. UCX solves the problem of moving data across multiple types of memory (DRAM, accelerator memories, etc.) and multiple transports (e.g. InfiniBand, uGNI, shared memory, CUDA, etc.), while minimizing latency and maximizing bandwidth and message rate. The new communications framework supports all communication libraries (MPI, PGAS, etc.) and enables a closer connection to the underlying hardware. A new community of supporters has been established behind the UCX effort, including key participants from the HPC industry, laboratories, and academia.
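The core design idea - one communication API, with transport selection hidden from the application - can be illustrated with a toy sketch. To be clear, this is not the UCX API; it is just a minimal Python illustration of the concept described above, with invented names throughout.

```python
# Toy illustration of a unified communication layer: the application
# calls one send() API, and the layer picks the best transport
# (shared memory for on-node peers, the network otherwise).
# This is NOT the UCX API -- purely a sketch of the design concept.

class Transport:
    def __init__(self, name, latency_us):
        self.name = name
        self.latency_us = latency_us

TRANSPORTS = {
    "local": Transport("shared-memory", 0.3),
    "remote": Transport("infiniband", 1.0),
}

def select_transport(dest_is_local):
    """Pick the transport that can best reach the destination,
    as a unified layer would do under the hood."""
    return TRANSPORTS["local" if dest_is_local else "remote"]

def send(buf, dest_is_local):
    transport = select_transport(dest_is_local)
    return f"sent {len(buf)} bytes via {transport.name}"

print(send(b"hello", dest_is_local=True))   # on-node: shared-memory path
print(send(b"hello", dest_is_local=False))  # off-node: network path
```

An MPI or PGAS library built on such a layer gets the fastest available path on every platform without carrying its own per-transport code.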

UCX ensures there is very little software overhead in the communication path, allowing for near native-level hardware performance. To ensure production quality, UCX is co-designed, maintained, tested, and used in production environments by leading members of the research community. In addition, the UCX community is working on the UCX specification, which will enable further extension of the UCX effort.

As an open-source project licensed under the BSD-3 license, UCX is open for contributions from anyone in the industry. More information can be found at www.openucx.com.

Sunday, February 28, 2016

Intel Omni-Path network - what happened to the performance numbers?

Intel has been talking about their new network products for high performance computing for quite some time. Omni-Path is a new network based on the old technology Intel acquired from QLogic (aka TrueScale). Since Intel officially announced Omni-Path at the last supercomputing conference, I have been waiting to see real performance data for this network, but nothing has actually been published so far (SC'15 was in November, and March 2016 is around the corner...).

So, for now all I can do is read what Intel publishes. But even this information has changed over time. In July 2015 Intel published a chart claiming Omni-Path delivers 160 million messages per second, which is a very impressive number (and higher than what InfiniBand can do today). But wait: at the recent Linley Group data center conference (February 2016), Intel presented a slide with a completely different set of numbers - they projected an Omni-Path message rate between 75 and 86 million, and claimed to have actually measured 108 million (which is higher than what they had projected originally...). What happened to the 160 million number? 108 million is much lower than InfiniBand, and is not that interesting, to say the least.

Intel July 2015 - 160 million messages per second

Intel February 2016 - 108 million messages per second
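To see what the gap between the two quoted numbers means in practice, here is the per-message time budget each rate implies - simple back-of-the-envelope arithmetic on the figures above, nothing more.

```python
# Per-message time budget implied by the two quoted message rates.
# Figures are the ones quoted in the post; this is just arithmetic.

def ns_per_message(msgs_per_sec):
    """Time available to inject one message, in nanoseconds."""
    return 1e9 / msgs_per_sec

claimed_2015 = 160e6   # Intel chart, July 2015
measured_2016 = 108e6  # Intel slide, February 2016

print(f"160M msg/s -> {ns_per_message(claimed_2015):.2f} ns per message")
print(f"108M msg/s -> {ns_per_message(measured_2016):.2f} ns per message")

drop = 1 - measured_2016 / claimed_2015
print(f"drop from the 2015 claim: {drop:.0%}")
```

In other words, the later figure is roughly a third below the original marketing claim.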

Monday, December 22, 2014

Making treatment affordable and more accurate using supercomputers

A great example of how supercomputers can help improve our lives. A new supercomputer in Denmark, dedicated to life science, will help analyze the growing amounts of data we collect, to find better, safer, and more accurate health treatments. And it is all based on standard technologies. Happy holidays!

Monday, September 15, 2014

InfiniBand Performance Over Intel Haswell – First Numbers

As Intel just released the new CPU platform - official name Intel Xeon E5-2600 v3, code name "Haswell" - I wanted to share some of our performance testing. For a start, we ran the simple InfiniBand bandwidth and latency benchmarks one can find as part of the InfiniBand software distribution. We measured around 6.4 gigabytes per second of bandwidth and latency of close to 0.6 microseconds. You can see the full graphs below. More to come!
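As a sanity check, the measured 6.4 GB/s can be compared against the theoretical bandwidth of an FDR x4 InfiniBand link. The post does not state which link speed was used, so the FDR assumption (56 Gb/s signaling, 64b/66b encoding) is mine:

```python
# Sanity-check the measured 6.4 GB/s against the theoretical bandwidth
# of an FDR x4 InfiniBand link. FDR is an assumption here -- the post
# does not state the link speed that was tested.

LINK_GBPS = 56        # FDR x4 signaling rate, Gb/s
ENCODING = 64 / 66    # FDR uses 64b/66b line encoding

theoretical_gbytes = LINK_GBPS * ENCODING / 8   # ~6.79 GB/s
measured_gbytes = 6.4
efficiency = measured_gbytes / theoretical_gbytes

print(f"theoretical FDR x4: {theoretical_gbytes:.2f} GB/s")
print(f"measured efficiency: {efficiency:.0%}")
# the measurement lands in the mid-90s percent of the theoretical link rate
```

If FDR was indeed the link, the Haswell platform is driving the adapter at close to wire speed.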

Remote Direct Memory Access - RoCE versus iWARP

Remote Direct Memory Access (RDMA) is the technology that allows server-to-server data communication to go directly to user-space (i.e. application) memory without any CPU involvement. RDMA technology delivers faster performance for large data transfers while reducing CPU utilization and overhead. It is used in many application segments - database, storage, cloud, and of course HPC. All major MPI implementations include support for RDMA in the rendezvous protocol.
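The rendezvous protocol mentioned above can be sketched as a three-step exchange: the sender announces the message, the receiver pins a buffer and replies with its address, and the sender then writes the payload straight into that buffer. The Python below is a toy simulation of this handshake - a simplified illustration, not any particular MPI's implementation.

```python
# Toy sketch of the MPI rendezvous protocol over RDMA:
# 1. sender announces the message (request-to-send, RTS)
# 2. receiver registers a buffer and replies with its address (CTS)
# 3. sender RDMA-writes the payload directly into that buffer,
#    without the receiver's CPU copying the data.
# Simplified illustration only -- not a real MPI implementation.

class Receiver:
    def __init__(self):
        self.memory = {}        # stands in for registered memory
        self.next_addr = 0x1000

    def handle_rts(self, rts):
        addr = self.next_addr            # "register" a buffer large
        self.next_addr += rts["size"]    # enough for the message
        return {"addr": addr}            # clear-to-send with the address

def rendezvous_send(payload, receiver):
    rts = {"size": len(payload)}           # step 1: RTS
    cts = receiver.handle_rts(rts)         # step 2: CTS + target address
    receiver.memory[cts["addr"]] = payload # step 3: "RDMA write", zero-copy
    return cts["addr"]

r = Receiver()
addr = rendezvous_send(b"large message", r)
print(r.memory[addr])  # the payload landed directly in the target buffer
```

The point of the handshake is that the bulk transfer in step 3 needs no intermediate copies and no CPU cycles on the receiving side - exactly what RDMA provides.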
There are three communications standards for RDMA - InfiniBand (the de-facto solution for HPC), RoCE, and iWARP. The latter two run over Ethernet. RoCE has been standardized by the IBTA organization, and iWARP by the IETF.
iWARP solutions are sold by Intel (due to the acquisition of NetEffect) and Chelsio. RoCE solutions are sold by Mellanox, Emulex, and others. The major issues of iWARP are performance and scalability. With iWARP, the data needs to pass through multiple protocol layers before it can hit the wire, and therefore the performance iWARP delivers is not on par with RoCE (not to mention InfiniBand). The major RoCE limitation was its lack of Layer 3 (routing) support, but this has been solved with the new RoCE v2 specification that is about to be released.
Last week Intel announced their new Ethernet NICs ("Fortville"). No iWARP support is listed for these new NICs, which leaves Intel without RDMA capability on their Ethernet NICs. It seems the iWARP camp is shrinking... well... there is a RoCE reason for it...