Tuesday, March 22, 2016

Co-Processors - Intel Knights Landing

In the past I wrote about NVIDIA GPUs and about the GPU-Direct technology, which enables direct network communication between the NVIDIA GPU and the cluster interconnect (for example InfiniBand). Since then there have been several enhancements to the GPU-Direct technology, and I will try to cover them in future posts.

In this post I would like to give my view on the upcoming Intel Knights Landing Xeon Phi co-processor. Knights Landing is the next-generation Xeon Phi after Knights Corner. Knights Corner was not a success story, to say the least, and Intel aims to gain some traction with the next-generation Knights Landing.

Knights Landing has 72 cores and, more importantly, it is a bootable device, so one can use Knights Landing as the main CPU in a server platform. This is a nice capability, as one can build or use single-CPU (Knights Landing) boards, for example.

On the connectivity side, Knights Landing will be provided in two versions (packages): KNL and KNL-F. KNL is the Knights Landing CPU, with two PCI-Express x16 interfaces and one PCI-Express x4 interface. The two x16 interfaces can be connected to any network solution. KNL-F is a packaged Knights Landing with two Intel Omni-Path ASICs, each connected to the KNL ASIC via PCI-Express x16. There will be PCI-Express x4 connectivity out of the KNL-F package for management options.
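To put the lane counts above in perspective, here is a minimal sketch of the raw per-direction bandwidth each interface could carry. It assumes PCI-Express Gen3 signaling (8 GT/s per lane with 128b/130b encoding), which the post itself does not state, and it ignores protocol overhead, so real throughput would be lower. The names used are illustrative only.

```python
# Rough PCI-Express bandwidth estimate for the KNL interfaces described above.
# ASSUMPTION: PCI-Express Gen3 (8 GT/s per lane, 128b/130b encoding).
# The post does not name the PCI-Express generation.

GEN3_GT_PER_S = 8.0            # giga-transfers per second per lane (assumed)
GEN3_ENCODING = 128.0 / 130.0  # usable fraction after line encoding (assumed)


def link_gb_per_s(lanes: int) -> float:
    """Raw one-direction bandwidth of a link in gigabytes per second."""
    return lanes * GEN3_GT_PER_S * GEN3_ENCODING / 8.0


# Lane counts taken from the post: two x16 interfaces and one x4 interface.
knl_links = [16, 16, 4]
total_lanes = sum(knl_links)

x16_gb_per_s = link_gb_per_s(16)
x4_gb_per_s = link_gb_per_s(4)

if __name__ == "__main__":
    print(f"total lanes: {total_lanes}")
    print(f"x16 link: {x16_gb_per_s:.2f} GB/s per direction")
    print(f"x4 link:  {x4_gb_per_s:.2f} GB/s per direction")
```

Under these assumptions each x16 interface offers far more headroom than the x4 interface, which is consistent with the x4 link being reserved for management in the KNL-F package.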

KNL has been tested with InfiniBand and works great using the OpenFabrics OFED distribution or the Mellanox OFED distribution. It has already been deployed at several sites.

KNL-F usage is questionable. It makes sense for Intel to try to lock users into its proprietary interconnect product (which is no more than a new version of the QLogic TrueScale product), but why would one want to be locked into Omni-Path? Especially as Omni-Path requires the KNL to spend many expensive core cycles to manage and operate the Omni-Path network (which means a loss of KNL performance)? TrueScale was not used with GPUs in the past due to its overhead, and Omni-Path, which is based on the same TrueScale architecture, is no better.

Wednesday, March 9, 2016

UCX – a Unified Communication X Framework for High Performance Communications

UCX is a new, open-source, production-grade set of network APIs and their implementations for high-performance communications in high-performance computing and data-centric applications. UCX solves the problem of moving data across multiple types of memory (DRAM, accelerator memories, etc.) and multiple transports (e.g. InfiniBand, uGNI, shared memory, CUDA, etc.), while minimizing latency and maximizing bandwidth and message rate. The new communications framework supports all communication libraries (MPI, PGAS, etc.) and enables a closer connection to the underlying hardware. A new community of supporters has been established behind the UCX effort, which includes key participants from the HPC industry, laboratories and academia.

 
UCX ensures there is very little software overhead in the communication path, allowing for near native-level hardware performance. To ensure the production quality of UCX, it will be co-designed, maintained, tested and used in production environments by leading proponents of the research community. In addition, the UCX community is working on the UCX specification, which will enable further extension of the UCX effort.

 
As an open-source project licensed under the BSD-3 license, UCX is open for contributions from anyone in the industry. More information can be found at www.openucx.com.

Sunday, February 28, 2016

Intel OmniPath network - what happened to the performance numbers?

Intel has been talking about its new network products for high performance computing for quite some time. OmniPath is a new network based on the old technology Intel acquired from QLogic (aka TrueScale). As Intel officially announced OmniPath at the last Supercomputing conference, I was waiting to see some real performance data for this network, but nothing has actually been published so far (SC'15 was in November, and March 2016 is around the corner...).

So, for now all I can do is read what Intel publishes. But even this information changes over time. In July 2015 Intel published a chart claiming that OmniPath delivers 160 million messages per second, which is a very impressive number (and higher than what InfiniBand can do today). But wait... at the recent Linley Group data center conference (February 2016) Intel presented a slide that shows a completely different set of numbers. There they said they had projected the OmniPath message rate to be between 75 and 86 million, and that they actually measured 108 million (which is higher than what they had originally projected...). What happened to the 160 million number? 108 million is much lower than InfiniBand, and is not that interesting, to say the least.


 
Intel July 2015 - 160 million messages per second

Intel February 2016 - 108 million messages per second
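The gap between the two sets of numbers is easy to quantify. The short sketch below uses only the figures quoted in this post (160 million claimed in July 2015, a 75 to 86 million projection, and 108 million measured in February 2016). The variable names are mine and purely illustrative.

```python
# Message rates in millions of messages per second, as quoted in the post.
claimed_july_2015 = 160.0
projected_low, projected_high = 75.0, 86.0
measured_feb_2016 = 108.0

# How far the measured number falls short of the July 2015 claim.
shortfall_pct = (claimed_july_2015 - measured_feb_2016) / claimed_july_2015 * 100

# How far the measured number exceeds the top of the projected range.
above_projection_pct = (measured_feb_2016 - projected_high) / projected_high * 100

if __name__ == "__main__":
    print(f"measured vs. July 2015 claim: {shortfall_pct:.1f}% lower")
    print(f"measured vs. top of projection: {above_projection_pct:.1f}% higher")
```

In other words, the measured figure sits roughly a third below the number in the July 2015 chart, even though it beats the projection shown on the February 2016 slide.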

Monday, December 22, 2014

Making treatment affordable and more accurate using supercomputers

A great example of how supercomputers can help improve our lives. A new supercomputer in Denmark, dedicated to life science, will help analyze the growing amounts of data we collect in order to find better, safer and more accurate health treatments. And it is all based on standard technologies. Happy holidays!






Monday, September 15, 2014

InfiniBand Performance Over Intel Haswell – First Numbers

As Intel has just released its new CPU platform (official name Intel Xeon E5-2600 v3, code name “Haswell”), I wanted to share some of our performance testing. For a start, we ran the simple InfiniBand bandwidth and latency benchmarks that come as part of the InfiniBand software distribution. We measured a bandwidth of around 6.4 gigabytes per second and a latency of close to 0.6 microseconds. You can see the full graphs below. More to come :)



Remote Direct Memory Access - RoCE versus iWARP

Remote Direct Memory Access (RDMA) is the technology that allows server-to-server data communication to go directly to user space (aka application) memory without any CPU involvement. RDMA delivers faster performance for large data transfers while reducing CPU utilization and overhead. It is used in many application segments: database, storage, cloud and of course HPC. All MPI implementations include RDMA support for the rendezvous protocol.
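As a rough illustration of where the rendezvous protocol fits, MPI libraries typically send small messages eagerly and switch to a rendezvous handshake for large ones, which is where RDMA carries the payload. The sketch below only models that size-based choice. The 16 KB threshold and the function name are hypothetical and not taken from any particular MPI implementation, where the limit is usually tunable.

```python
# Minimal model of the eager vs. rendezvous decision in an MPI-style library.
# ASSUMPTION: the 16 KB limit is a made-up example value; real libraries
# use their own (usually configurable) thresholds.

EAGER_LIMIT_BYTES = 16 * 1024  # hypothetical threshold


def choose_protocol(message_bytes: int, eager_limit: int = EAGER_LIMIT_BYTES) -> str:
    """Return which protocol a message of the given size would use.

    "eager": the payload is sent immediately and buffered at the receiver.
    "rendezvous": sender and receiver first exchange a short handshake,
    then the payload moves directly between application buffers (the step
    where RDMA is used).
    """
    if message_bytes < 0:
        raise ValueError("message size cannot be negative")
    return "eager" if message_bytes <= eager_limit else "rendezvous"


if __name__ == "__main__":
    for size in (64, 16 * 1024, 1024 * 1024):
        print(f"{size:>8} bytes -> {choose_protocol(size)}")
```

The design point is that the handshake costs an extra round trip, so it only pays off once the message is large enough for the direct, copy-free transfer to dominate.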
There are three communication standards for RDMA: InfiniBand (the de-facto solution for HPC), RoCE and iWARP. The latter two run over Ethernet. RoCE has been standardized by the IBTA, and iWARP by the IETF.
iWARP solutions are being sold by Intel (due to the acquisition of NetEffect) and Chelsio. RoCE solutions are being sold by Mellanox, Emulex and others. The major issues with iWARP are performance and scalability. With iWARP, the data needs to pass through multiple protocols before it can hit the wire, and therefore the performance iWARP delivers is not on par with RoCE (not to mention InfiniBand). The major RoCE limitation was the lack of support over layer 3, but this has been solved in the new RoCE v2 specification that is about to be released.
Last week Intel announced their new Ethernet NICs (“Fortville”). No iWARP support is listed for these new NICs, which leaves Intel without RDMA capability for their Ethernet NICs. It seems that the iWARP camp is shrinking… well… there is a RoCE reason for it…

Thursday, September 4, 2014

Supercomputer Simulations Help Gain Insight into New Cancer Treatment Technology

A recent release from the Texas Advanced Computing Center (TACC) sheds light on one of the research programs that is being supported, or better said, enabled, by TACC's powerful supercomputer, one of the fastest machines in the world. Using supercomputer simulations on TACC's “Lonestar” system, researchers are able to model radiation in a magnetic field, which will facilitate the safe use of the MRI-linac and enable more effective cancer treatment.

The research is being done by the MD Anderson Cancer Center in Houston. According to the team working on it, the new solution they are developing unites radiation therapy and magnetic resonance imaging (MRI), allowing physicians to view the tumor in real time and in high detail during treatment. It also permits physicians to adapt the radiation treatment during the procedure, sparing healthy tissue and reducing side effects.


To develop the system, the MD Anderson team uses the TACC supercomputer to run complex simulations. A great use of supercomputing power. The TACC system was built using the most flexible cluster architecture: a combination of CPUs and co-processors, with InfiniBand for the connectivity. It is a great example of a standards-based system, and of why there is no reason to use proprietary products for supercomputers. You can read more on TACC systems at https://www.tacc.utexas.edu/resources/hpc. I enjoy using them too.