HPC-Opinion
High Performance Computing Valued Opinion
Tuesday, April 19, 2016
This is a hot topic these days, certainly in the HPC world: the leading InfiniBand solution versus Intel's new proprietary product, Omni-Path. There is plenty of information on what Omni-Path is (it was already announced at the 2015 supercomputing conference), but interestingly enough, Intel has not released any application performance results so far. You can find some low-level benchmarks such as network latency and bandwidth, but those numbers are actually similar between the two solutions and do not reflect the main architectural differences between InfiniBand and Omni-Path. Just a few days ago, Mellanox published an interesting article covering in detail the main differences between InfiniBand's offloading approach and Omni-Path's onloading approach, and for the first time released some application performance comparisons. Not surprisingly (after all, Omni-Path is based on the QLogic TrueScale architecture), InfiniBand shows much higher application performance even at small cluster sizes - up to a 60% performance advantage. It is an interesting article to read, and it seems that InfiniBand will remain the top solution for HPC systems.
You can find the article at http://www.hpcwire.com/2016/04/12/interconnect-offloading-versus-onloading/.
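To make the offload argument concrete, the place where an offloading interconnect pays off most is when an application overlaps communication with computation. Below is a minimal MPI sketch of that pattern (the ring exchange, buffer size and compute loop are my own illustration, not taken from the article): with a full offload interconnect the NIC can progress the transfer while the CPU runs the compute loop, whereas an onloading design has to spend those same CPU cycles driving the network.

/* Illustrative MPI sketch: overlapping communication with computation.
 * With an offloading interconnect the NIC progresses the Isend/Irecv
 * while the CPU executes compute(); an onloading design spends CPU
 * cycles on the network instead. Sizes and the loop are arbitrary. */
#include <mpi.h>
#include <stdlib.h>

static void compute(double *x, int n)
{
    for (int i = 0; i < n; i++)          /* stand-in for real application work */
        x[i] = x[i] * 1.000001 + 0.5;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1 << 20;                        /* arbitrary message size */
    double *sendbuf = calloc(N, sizeof(double));
    double *recvbuf = calloc(N, sizeof(double));
    double *work    = calloc(N, sizeof(double));
    int peer = (rank + 1) % size;                 /* simple ring exchange */

    MPI_Request req[2];
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);

    compute(work, N);                             /* overlaps with the transfer */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}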
Tuesday, March 22, 2016
Co-Processors - Intel Knights Landing
In the past I wrote about NVIDIA GPUs and about the GPU-Direct technology that enables direct network communication between an NVIDIA GPU and the cluster interconnect (for example, InfiniBand). Since then, there have been several enhancements to GPU-Direct, and I will try to cover them in future posts.
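For readers who have not used GPU-Direct from application code: with a CUDA-aware MPI library, a device pointer can be handed straight to MPI, and on GPUDirect-RDMA-capable hardware the data reaches the interconnect without being staged through host memory. A minimal sketch, assuming a CUDA-aware MPI build and at least two ranks (the buffer size is arbitrary):

/* Sketch of a CUDA-aware MPI exchange (assumes an MPI library built with
 * CUDA support). The device pointer is passed directly to MPI; with
 * GPUDirect-RDMA-capable hardware no host staging copy is needed.
 * Run with at least two ranks. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;                         /* arbitrary element count */
    float *d_buf;
    cudaMalloc((void **)&d_buf, N * sizeof(float));
    cudaMemset(d_buf, 0, N * sizeof(float));

    if (rank == 0)
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* device pointer */
    else if (rank == 1)
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}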
In this post I would like to give my view on the upcoming Intel Knights Landing Xeon Phi co-processor. Knights Landing is the next generation of Xeon Phi after Knights Corner. Knights Corner was not a success story, to say the least, and Intel aims to gain some traction with the next-generation Knights Landing.
Knights Landing has 72 cores and, more importantly, it is a bootable device, so it can be used as the main CPU in a server platform. This is a nice capability, as one can build or use single-CPU (Knights Landing) boards, for example.
On the connectivity side, Knights Landing will be provided in two versions (packages) - KNL and KNL-F. KNL is the Knights Landing CPU, with two PCI-Express x16 interfaces and one PCI-Express x4 interface; the two x16 interfaces can be connected to any network solution. KNL-F is a packaged Knights Landing with two Intel Omni-Path ASICs, each connected to the KNL ASIC via PCI-Express x16. PCI-Express x4 connectivity out of the KNL-F package remains available for management options.
KNL has been tested with InfiniBand and works great using the OpenFabrics OFED distribution or the Mellanox OFED distribution. It has already been deployed at several sites.
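As a quick sanity check on such a system, the verbs API shipped with the OFED stack can be used to confirm that the node sees its InfiniBand devices; a minimal sketch (compile with -libverbs):

/* Minimal sketch: enumerate RDMA-capable devices through libibverbs
 * (part of the OFED / MLNX_OFED stack). Compile with: cc ... -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list) {
        perror("ibv_get_device_list");
        return 1;
    }

    printf("Found %d RDMA device(s)\n", num_devices);
    for (int i = 0; i < num_devices; i++)
        printf("  %s\n", ibv_get_device_name(dev_list[i]));

    ibv_free_device_list(dev_list);
    return 0;
}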
The value of KNL-F is questionable. It makes sense for Intel to try to lock users into its proprietary interconnect product (which is no more than a new version of the QLogic TrueScale product), but why would anyone want to be locked into Omni-Path? Especially as Omni-Path requires the KNL to spend many expensive core cycles managing and operating the Omni-Path network (which means a loss of KNL performance). TrueScale was not used with GPUs in the past due to its overhead, and Omni-Path, which is based on the same TrueScale architecture, is no better.
Wednesday, March 9, 2016
UCX – a Unified Communication X Framework for High Performance Communications
UCX is a new, open-source, production-grade set of network APIs and their implementations for high-performance communication, targeting high-performance computing and data-centric applications.
UCX solves the problem of moving data across multiple types of memory (DRAM, accelerator memory, etc.) and multiple transports (e.g. InfiniBand, uGNI, shared memory, CUDA, etc.), while minimizing latency and maximizing bandwidth and message rate. The new communication framework supports all communication libraries (MPI, PGAS, etc.) and enables a closer connection to the underlying hardware. A community of supporters has been established behind the UCX effort, including key participants from the HPC industry, laboratories and academia.
UCX ensures there is very little software overhead in the communication path, allowing for near native-level hardware performance. To ensure production quality, UCX is co-designed, maintained, tested and used in production environments by leading members of the research community. In addition, the UCX community is working on a UCX specification that will enable further extension of the UCX effort.
As an open-source project licensed under the BSD-3 license, UCX is open for contributions from anyone in the industry. More information can be found at www.openucx.com.
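To give a feel for the API, here is a minimal, hedged sketch of bringing up a UCP context and worker with the tag-matching feature, based on the UCP headers in recent UCX releases (endpoint creation and the actual messaging calls are omitted):

/* Minimal sketch of UCP initialization (UCX's high-level API), requesting
 * the tag-matching feature. Error handling is abbreviated; endpoint
 * creation and messaging are omitted for brevity. */
#include <stdio.h>
#include <ucp/api/ucp.h>

int main(void)
{
    ucp_config_t *config;
    ucp_context_h context;
    ucp_worker_h worker;
    ucs_status_t status;

    /* Read UCX_* environment configuration */
    status = ucp_config_read(NULL, NULL, &config);
    if (status != UCS_OK)
        return 1;

    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_TAG            /* tag-matching send/recv */
    };
    status = ucp_init(&params, config, &context);
    ucp_config_release(config);
    if (status != UCS_OK)
        return 1;

    ucp_worker_params_t wparams = {
        .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCS_THREAD_MODE_SINGLE
    };
    status = ucp_worker_create(context, &wparams, &worker);
    if (status != UCS_OK) {
        ucp_cleanup(context);
        return 1;
    }

    printf("UCP context and worker initialized\n");

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}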
Sunday, February 28, 2016
Intel OmniPath network - what happened to the performance numbers?
Intel has been talking about its new network products for high performance computing for quite some time. Omni-Path is a new network based on the old technology Intel acquired from QLogic (aka TrueScale). Since Intel officially announced Omni-Path at the last supercomputing conference, I have been waiting to see real performance data for this network, but nothing has actually been published so far (SC'15 was in November, and March 2016 is around the corner...).
So, for now, all I can do is read what Intel publishes. But even this information keeps changing over time. In July 2015, Intel published a chart claiming Omni-Path delivers 160 million messages per second, which is a very impressive number (and higher than what InfiniBand can do today). But wait: at the recent Linley Group data center conference (February 2016), Intel presented a slide with a completely different set of numbers - it projected an Omni-Path message rate between 75 and 86 million messages per second, and claimed to have actually measured 108 million (which is higher than its original projection...). What happened to the 160 million number? 108 million is much lower than InfiniBand, and not that interesting, to say the least.
Intel July 2015 - 160 million messages per second
Intel February 2016 - 108 million messages per second
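For context on how such numbers are produced: message rate is typically measured by posting a window of small non-blocking sends per synchronization, in the spirit of the OSU message-rate benchmark. A simplified sketch follows (window size, message size and iteration count are arbitrary choices of mine, not the settings behind Intel's numbers):

/* Simplified two-rank message-rate sketch, in the spirit of the OSU
 * message-rate benchmark: post a window of small non-blocking sends,
 * wait for completion, then wait for an acknowledgement. Window size,
 * message size and iteration count are arbitrary. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define WINDOW 64
#define ITERS  10000
#define MSG    8                     /* 8-byte messages */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char sbuf[WINDOW][MSG], rbuf[WINDOW][MSG], ack = 0;
    memset(sbuf, 0, sizeof(sbuf));
    MPI_Request req[WINDOW];

    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(sbuf[w], MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(rbuf[w], MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("~%.1f million messages per second\n",
               (double)ITERS * WINDOW / (t1 - t0) / 1e6);

    MPI_Finalize();
    return 0;
}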
Monday, December 22, 2014
Making treatment affordable and more accurate using supercomputers
A great example of how supercomputers can help improve our lives: a new supercomputer in Denmark, dedicated to life science, will help analyze the growing amounts of data we collect, to find better, safer and more accurate health treatments. And it is all based on standard technologies. Happy holidays!
Monday, September 15, 2014
InfiniBand Performance Over Intel Haswell – First Numbers
As Intel has just released its new CPU platform - official name Intel Xeon E5-2600 v3, code name "Haswell" - I wanted to share some of our performance testing. For a start, we ran the simple InfiniBand bandwidth and latency benchmarks one can find as part of the InfiniBand software distribution. We measured around 6.4 gigabytes per second of bandwidth and latency of close to 0.6 microseconds. You can see the full graphs below. More to come!
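For anyone who wants a rough feel for such numbers on their own Haswell nodes, a simple ping-pong is the usual starting point. The sketch below is an MPI-level illustration rather than the benchmarks shipped with the InfiniBand software distribution; half the round-trip time of a small message approximates the latency:

/* Hedged sketch of a two-rank ping-pong: half the round-trip time of a
 * small message approximates latency; repeating with large messages
 * gives a bandwidth estimate. Iteration count and message size are
 * arbitrary. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    const int size  = 8;                     /* small message -> latency */
    char buf[8];
    memset(buf, 0, sizeof(buf));

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("latency ~ %.2f usec\n", (t1 - t0) / iters / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}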
Remote Direct Memory Access - RoCE versus iWARP
Remote Direct Memory Access (RDMA) is the technology that allows server-to-server data communication to go directly to user-space (i.e. application) memory without any CPU involvement. RDMA delivers faster performance for large data transfers while reducing CPU utilization and overhead. It is used in many application segments - database, storage, cloud and of course HPC. All of the MPI implementations include support for RDMA in the rendezvous protocol.
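The "no CPU involvement" part rests on the application registering its buffers with the adapter up front, so the NIC can read and write them directly during the transfer. A minimal verbs sketch of that registration step (queue-pair setup and the actual RDMA operations are omitted; the buffer size is arbitrary; link with -libverbs):

/* Minimal sketch of the RDMA memory-registration step with libibverbs:
 * once a buffer is registered, the adapter can read/write it directly,
 * without involving the CPU in the data path. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first device, for illustration */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                /* protection domain */

    size_t len = 4096;                                    /* arbitrary buffer size */
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}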
There are three communication standards for RDMA - InfiniBand (the de-facto solution for HPC), RoCE and iWARP. The latter two run over Ethernet. RoCE has been standardized by the IBTA organization, and iWARP by the IETF.
iWARP solutions are sold by Intel (following the acquisition of NetEffect) and Chelsio. RoCE solutions are sold by Mellanox, Emulex and others. The major issues with iWARP are performance and scalability. With iWARP, the data needs to pass through multiple protocol layers before it can hit the wire, and therefore the performance iWARP delivers is not on par with RoCE (not to mention InfiniBand). The major RoCE limitation was the lack of layer-3 support, but this has been solved with the new specification that is about to be released for RoCE v2.
Last week Intel announced its new Ethernet NICs ("Fortville"). No iWARP support is listed for these new NICs, which leaves Intel without RDMA capability in its Ethernet NICs. It seems the iWARP camp is shrinking... well... there is a RoCE reason for it...