Tuesday, March 22, 2016

Co-Processors - Intel Knights Landing

In the past I wrote about NVIDIA GPUs and about the GPU-Direct technology that enables direct network communication between the NVIDIA GPU and the cluster interconnect (for example InfiniBand). Since then there have been several enhancements to GPU-Direct, and I will try to cover them in future posts.

In this post I would like to give my view on the upcoming Intel Knights Landing Xeon Phi co-processor. Knights Landing is the next-generation Xeon Phi, following Knights Corner. Knights Corner was not a success story, to say the least, and Intel aims to gain some traction with Knights Landing.

Knights Landing has 72 cores, and more importantly it is a bootable device, so it can serve as the main CPU in a server platform. This is a nice capability, as one can build or use single-CPU (Knights Landing) boards, for example.

On the connectivity side, Knights Landing will be offered in two versions (packages) - KNL and KNL-F. KNL is the Knights Landing CPU with two PCI-Express x16 interfaces and one PCI-Express x4 interface; the two x16 interfaces can be connected to any network solution. KNL-F is a packaged Knights Landing with two Intel Omni-Path ASICs, each connected to the KNL ASIC via PCI-Express x16. The KNL-F package still exposes PCI-Express x4 connectivity for management options.

KNL has been tested with InfiniBand and works well with either the OpenFabrics OFED distribution or the Mellanox OFED distribution. It has already been deployed at several sites.
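
For illustration only, here is a minimal sketch (not taken from any vendor documentation) that uses the standard libibverbs API shipped with both OFED distributions to list the InfiniBand HCAs visible on a host such as a KNL node; error handling is kept to the bare minimum.

    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num_devices = 0;

        /* Enumerate the InfiniBand devices registered with the verbs stack */
        struct ibv_device **devs = ibv_get_device_list(&num_devices);
        if (!devs) {
            perror("ibv_get_device_list");
            return 1;
        }

        for (int i = 0; i < num_devices; ++i)
            printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));

        ibv_free_device_list(devs);
        return 0;
    }

Compiling with "gcc list_hcas.c -o list_hcas -libverbs" (the file name is arbitrary) and running it should print one line per HCA; the same code runs unchanged on a KNL node and on a regular Xeon host.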

The case for KNL-F is questionable. It makes sense for Intel to try to lock users into its proprietary interconnect (which is little more than a new version of the QLogic TrueScale product), but why would anyone want to be locked into Omni-Path? Especially since Omni-Path requires the KNL to spend many expensive core cycles managing and operating the Omni-Path network, which translates into lost KNL performance. TrueScale was not used with GPUs in the past because of this overhead, and Omni-Path, which is based on the same TrueScale architecture, is no better.

Wednesday, March 9, 2016

UCX – a Unified Communication X Framework for High Performance Communications

UCX is a new, open-source, production-grade set of network APIs and implementations for high-performance communication in HPC and data-centric applications. UCX solves the problem of moving data across multiple types of memory (DRAM, accelerator memories, etc.) and multiple transports (e.g. InfiniBand, uGNI, shared memory, CUDA, etc.), while minimizing latency and maximizing bandwidth and message rate. The new communications framework supports all communication libraries (MPI, PGAS, etc.) and enables a closer connection to the underlying hardware. A community of supporters has been established behind the UCX effort, including key participants from the HPC industry, laboratories and academia.
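
To give a feel for what the UCX API looks like, the following is a minimal sketch of initializing a UCP context and worker using the public openucx headers; it only requests tag-matching semantics (the mode MPI libraries typically build on), omits error handling, and exact field names may vary between UCX releases.

    #include <ucp/api/ucp.h>

    int main(void)
    {
        ucp_config_t        *config;
        ucp_params_t         ctx_params = {0};
        ucp_worker_params_t  wrk_params = {0};
        ucp_context_h        context;
        ucp_worker_h         worker;

        /* Read the UCX runtime configuration (transports, devices) from the environment */
        ucp_config_read(NULL, NULL, &config);

        /* Ask for tag-matching send/receive, the semantics MPI implementations use */
        ctx_params.field_mask = UCP_PARAM_FIELD_FEATURES;
        ctx_params.features   = UCP_FEATURE_TAG;

        /* The context selects and opens the underlying transports (InfiniBand, shared memory, ...) */
        ucp_init(&ctx_params, config, &context);
        ucp_config_release(config);

        /* A worker is a communication progress engine, usually one per thread */
        wrk_params.field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE;
        wrk_params.thread_mode = UCS_THREAD_MODE_SINGLE;
        ucp_worker_create(context, &wrk_params, &worker);

        /* ... exchange worker addresses, create endpoints, send/receive ... */

        ucp_worker_destroy(worker);
        ucp_cleanup(context);
        return 0;
    }

A real application would go on to exchange worker addresses out of band (for example over MPI or TCP), create endpoints with ucp_ep_create, and then issue tagged or RMA operations; all of that sits on top of the context and worker created above.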

 
UCX ensures there is very little software overhead in the communication path, allowing near-native hardware performance. To ensure production quality, UCX is co-designed, maintained, tested and used in production environments by leading members of the research community. In addition, the UCX community is working on a UCX specification, which will enable further extension of the UCX effort.

 
As an open-source project under the BSD-3 license, UCX is open to contributions from anyone in the industry. More information can be found at www.openucx.com.