A great example of how supercomputers can help improve our lives. A new supercomputer in
Denmark, dedicated to life science, will help analyze the growing amounts of data
we collect, in order to find better, safer and more accurate health treatments. And it is
all based on standard technologies. Happy holidays!
Monday, December 22, 2014
Monday, September 15, 2014
InfiniBand Performance Over Intel Haswell – First Numbers
As Intel has just released its new CPU platform - official name
Intel Xeon E5-2600 v3, code name "Haswell" - we wanted to share some of our
performance testing. For a start, we tested the simple InfiniBand bandwidth and latency benchmarks
one can find as part of the InfiniBand software distribution. We measured around 6.4 gigabytes per second of bandwidth and latency of close to 0.6
microseconds. You can see the full graphs below. More to come :)
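For reference, this is the kind of micro-benchmark in question - not the actual benchmarks that ship with the InfiniBand software distribution, but a minimal MPI ping-pong sketch that measures the same two quantities. The message sizes and iteration count below are arbitrary choices for illustration, not the settings behind the numbers above.

/* Minimal MPI ping-pong sketch: measures point-to-point latency with a
 * small message and bandwidth with a large one between rank 0 and rank 1.
 * Illustration only - not the standard InfiniBand benchmark tools.      */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    const int small = 8;               /* bytes, for latency   */
    const int large = 4 * 1024 * 1024; /* bytes, for bandwidth */
    char *buf = malloc(large);

    /* Latency: half the round-trip time of a small message. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, small, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, small, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, small, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, small, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("latency:   %.2f usec\n", (MPI_Wtime() - t0) / iters / 2 * 1e6);

    /* Bandwidth: time per large message; a 1-byte ack keeps the ranks in step. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, large, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
            MPI_Recv(buf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, large, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("bandwidth: %.2f GB/s\n",
               (double)large * iters / (MPI_Wtime() - t0) / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}

Run it with two ranks, one per server, so the traffic actually crosses the InfiniBand link.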
Remote Direct Memory Access - RoCE versus iWARP
Remote Direct Memory Access (RDMA) is the
technology that allows server-to-server data communication to go directly
to user-space (i.e., application) memory without any CPU involvement. RDMA
delivers faster performance for large data transfers while reducing
CPU utilization and overhead. It is used in many application
segments: database, storage, cloud and of course HPC. All of the MPI libraries include
support for RDMA in their rendezvous protocol.
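To give a rough idea of what "no CPU involvement" means in practice: with the verbs API (the common low-level interface for InfiniBand and RoCE), an application registers its memory once, and from then on the adapter can move data in and out of that memory directly. The sketch below shows only the registration step - connection setup and the actual RDMA operations are omitted, and error handling is minimal.

/* Sketch: registering application memory for RDMA with the verbs API.
 * Connection setup and the actual RDMA operations are omitted.
 * Build with: gcc rdma_reg.c -libverbs                                */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Pick the first RDMA-capable device (InfiniBand or RoCE). */
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA device found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* The application buffer the remote side will read/write directly. */
    size_t len = 1 << 20;
    void *buf = malloc(len);

    /* Register the buffer: the adapter gets keys it can use to access
     * this memory without involving the CPU on the data path.          */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { fprintf(stderr, "memory registration failed\n"); return 1; }

    printf("registered %zu bytes, rkey=0x%x lkey=0x%x\n", len, mr->rkey, mr->lkey);

    /* ... exchange the rkey and buffer address with the peer, create a
     *     queue pair, and post RDMA read/write work requests here ...  */

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}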
There are three communication standards for RDMA:
InfiniBand (the de-facto solution for HPC), RoCE and iWARP. The latter two
run over Ethernet. RoCE has been standardized by the IBTA, and
iWARP by the IETF.
iWARP solutions are sold by Intel (via its
acquisition of NetEffect) and Chelsio. RoCE solutions are sold by
Mellanox, Emulex and others. The major issues with iWARP are performance and
scalability. With iWARP, the data needs to pass through multiple protocol layers
before it can hit the wire, and therefore the performance iWARP delivers is not
on par with RoCE (not to mention InfiniBand). The major RoCE limitation was
support over layer 3, but this has been solved with the new RoCE v2 specification
that is about to be released.
Last week Intel announced its new Ethernet NICs
("Fortville"). No iWARP support is listed for these new NICs, which leaves
Intel without RDMA capability on its Ethernet NICs. It seems that the iWARP
camp is shrinking… well… there is a RoCE
reason for it…
Thursday, September 4, 2014
Supercomputer Simulations Help Gain Insight into New Cancer Treatment Technology
A recent release from the Texas Advanced Computing Center
(TACC) sheds light on one of the research programs that is being supported, or rather
enabled, by TACC's powerful supercomputer, one of the fastest machines in
the world. Using supercomputer simulations on TACC's “Lonestar” system, researchers are able to model
radiation in a magnetic field, which will facilitate the safe use of the
MRI-linac and enable more effective cancer treatment.
The research is being done by the MD Anderson Cancer Center
in Houston. According to the team
working on it, the new solution they are developing unites radiation therapy and
magnetic resonance imaging (MRI), allowing physicians to view the cancer tumor
in real time and in high detail during treatment. It also permits physicians to
adapt the radiation treatment during the procedure, sparing healthy tissue and
reducing side effects.
To develop the system, the MD Anderson team utilizes the TACC
supercomputer to run complex simulations - a great use of supercomputing
power. The TACC system was built using the most flexible architecture, a cluster:
a combination of CPUs and co-processors, with InfiniBand for the connectivity. It is a
great example of a standards-based system, and of why there is no reason
to use proprietary products for supercomputers. You can read more on TACC's systems
at https://www.tacc.utexas.edu/resources/hpc.
I enjoy using them too.
Wednesday, August 20, 2014
GPUDirect RDMA in Action
Last year I wrote about the release of the GPUDirect RDMA
technology. Simply put, this is the technology that enables direct
communication between GPUs over an RDMA-capable network, which
translates into much higher performance for applications using GPUs: high-performance
computing applications, data analytics, gaming and so on, basically any application that runs over
more than a single GPU. While in the past every data movement from the GPU had to
go through CPU memory, with GPUDirect RDMA that is no longer the case. The
data goes directly from the GPU memory to the RDMA-capable network (for
example InfiniBand): data latency is reduced by more than 70%, data
throughput is increased by 5-6X and the CPU bottleneck is
eliminated.
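From the application's point of view very little changes. With a CUDA-aware MPI, and GPUDirect RDMA enabled underneath, the GPU buffer is handed to MPI directly instead of being staged through a host buffer first. A rough sketch (the buffer size and ranks are arbitrary):

/* Sketch: sending a GPU buffer directly with a CUDA-aware MPI.
 * With GPUDirect RDMA enabled, the data can move GPU -> network adapter
 * without a staging copy through host (CPU) memory.
 * Build with: mpicc gdr_sketch.c -lcudart (CUDA include/lib paths assumed set up) */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;          /* 1M floats, arbitrary size */
    float *gpu_buf;
    cudaMalloc((void **)&gpu_buf, count * sizeof(float));

    if (rank == 0) {
        /* The device pointer is passed straight to MPI - no cudaMemcpy
         * to a host buffer first. A CUDA-aware MPI recognizes it.      */
        MPI_Send(gpu_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(gpu_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats directly into GPU memory\n", count);
    }

    cudaFree(gpu_buf);
    MPI_Finalize();
    return 0;
}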
The University of Cambridge and the HPC Advisory Council
have released performance information for GPUDirect RDMA with one of their
applications on a system of nearly 100 servers. The application is HOOMD-blue, a
general-purpose molecular dynamics simulation code created by the University of
Michigan that can run on GPUs. Each server includes two GPUs and two
InfiniBand RDMA adapters, so each GPU can connect directly to the network
instead of going through the CPU and the QPI interface (each GPU and network
adapter pair is located on the same PCI-Express root complex). Bottom line: the
GPUDirect RDMA technology enabled Cambridge to double HOOMD-blue performance
on the given system. Same system, same hardware, turn on GPUDirect
RDMA, twice the performance….
Impressive.
Tuesday, August 19, 2014
New MPI (Message Passing Interface) Solution for High-Performance Applications
A good source of high-performance computing information is
the HPC Advisory Council (www.hpcadvisorycouncil.com).
Of course one can find many news sites - insideHPC, HPCWire and others - but the
HPC Council is a good option if you want to get deeper into the details and learn
about new technologies and solutions being developed. One of its recent
publications was a case study on the STAR-CCM+ application (CFD) - http://www.hpcadvisorycouncil.com/pdf/STAR-CCM_Analysis_Intel_E5_2680_V2.pdf.
The publication included, for the first time, some performance
information on a new MPI solution called HPC-X. In the world of
high-performance computing, MPI is one of the most widely used parallel communication
libraries. There are open-source implementations, such as MPICH, MVAPICH and Open MPI,
and commercial options - Platform MPI (formerly known as HP MPI, now owned by
IBM) and Intel MPI. HPC-X is a new solution from Mellanox, which seems to be
based on Open MPI plus various accelerations. In the past I have covered new
releases of both MVAPICH and Open MPI, as these are the two solutions we have used
most so far.
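Since all of these libraries implement the same MPI standard, switching between them is mostly a matter of rebuilding and rerunning. A quick way to confirm which implementation (and version) a job is actually running on is MPI_Get_library_version, available in any MPI-3 compliant build - a minimal sketch:

/* Sketch: print which MPI implementation and version a job is running on.
 * MPI_Get_library_version is part of MPI-3, so recent MPICH, MVAPICH,
 * Open MPI and Intel MPI builds should all provide it.                   */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size, len;
    char version[MPI_MAX_LIBRARY_VERSION_STRING];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_library_version(version, &len);

    if (rank == 0)
        printf("%d ranks, MPI library: %s\n", size, version);

    MPI_Finalize();
    return 0;
}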
The combination of an open-source base and vendor support does make
HPC-X a very interesting solution for any HPC system. Of course, it must perform,
as this is the most important item…
and according to the new publication, HPC-X does provide a performance advantage
over the other commercial options, up to around 20% on a 32-node cluster (dual-socket
servers). It is definitely a good start, and I am looking forward to seeing
further reports on HPC-X. Meanwhile, we plan to download it and try it ourselves.
Sunday, August 17, 2014
Cluster Topologies - Dragonfly
In the past I reviewed two cluster topologies - Fat Tree
(CLOS) and Torus. Dragonfly is a hierarchical topology with the following
properties: several groups are connected together using all-to-all links (i.e.
each group has at least one link directly to each other group); the topology
inside each group can be any topology; and it requires non-minimal global adaptive
routing and advanced congestion look-ahead for efficient operation. Simply
put, a dragonfly topology is a (at least) two-level topology where, at the
top level, groups of switches are connected in a full graph. The internal
structure of the groups may vary and can be constructed as a full graph, fat tree,
torus, mesh, dragonfly and so on.
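To get a feel for the numbers, assume the usual dragonfly parameterization: p end-nodes per switch, a switches per group and h global links per switch, with the groups fully connected. A small sketch (the parameter values below are arbitrary examples):

/* Sketch: dragonfly system size under the usual parameterization
 * (p end-nodes per switch, a switches per group, h global links per
 * switch, groups connected in a full graph). Values are examples only. */
#include <stdio.h>

int main(void)
{
    int p = 4;   /* end-nodes attached to each switch        */
    int a = 8;   /* switches in each group                   */
    int h = 4;   /* global (group-to-group) links per switch */

    int global_links_per_group = a * h;
    int max_groups = global_links_per_group + 1; /* full graph between groups */
    int nodes_per_group = a * p;
    int max_nodes = nodes_per_group * max_groups;

    printf("up to %d groups, %d end-nodes per group, %d end-nodes total\n",
           max_groups, nodes_per_group, max_nodes);
    return 0;
}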
While many describe dragonfly as a topology with one hop
between groups, that is actually not correct, and in many cases the network
traffic will go over several hops before reaching its destination. The key to
making a dragonfly topology effective is to allow some pairs of end-nodes to
communicate over a non-minimal route; it is the only way to distribute random
group-to-group traffic, and it represents a significant difference from other
topologies. To support such routing, a dragonfly system needs to utilize adaptive
routing and to send traffic on the longer paths only if congestion is
impacting the minimal (fewest-hops) paths.
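The routing decision itself can be sketched quite simply: compare the load on the minimal path with the load on a longer, non-minimal path, weighted by path length, and divert only when the minimal path is clearly congested. A toy illustration (the queue depths stand in for whatever congestion information the fabric actually exposes):

/* Toy sketch of the adaptive decision used in dragonfly-style routing:
 * prefer the minimal path, divert to a (longer) non-minimal path only
 * when congestion on the minimal path outweighs the extra hops.
 * Queue depths and hop counts are illustrative stand-ins.              */
#include <stdio.h>

typedef struct {
    int hops;        /* path length in hops                          */
    int queue_depth; /* congestion estimate on the path's first link */
} path_t;

/* Pick the path with the lower estimated delay: queue depth weighted by
 * path length, so the non-minimal path must be clearly less congested. */
static const path_t *choose_path(const path_t *minimal, const path_t *nonminimal)
{
    long min_cost = (long)minimal->queue_depth * minimal->hops;
    long non_cost = (long)nonminimal->queue_depth * nonminimal->hops;
    return (non_cost < min_cost) ? nonminimal : minimal;
}

int main(void)
{
    path_t minimal    = { .hops = 3, .queue_depth = 40 }; /* congested      */
    path_t nonminimal = { .hops = 6, .queue_depth = 5  }; /* lightly loaded */

    const path_t *chosen = choose_path(&minimal, &nonminimal);
    printf("chosen path: %s (%d hops)\n",
           chosen == &minimal ? "minimal" : "non-minimal", chosen->hops);
    return 0;
}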
While the goal of dragonfly was to enable higher bisectional bandwidth
than a torus topology (or similar) while reducing the overall cost (mainly
cable lengths), in practice dragonfly does not provide such an advantage - neither in
performance nor in cost - over a fat tree topology, for example. On the other hand, it
does add complexity with the need to enable adaptive routing. So
far we prefer to use the fat tree option, either with a full bisectional bandwidth
configuration or with some oversubscription, depending on the targeted
applications.
Friday, August 15, 2014
Here Comes The 100 Gigabit Per Second…
For some compute-demanding applications, data
throughput is a critical element in getting the needed performance. Throughput
is of course also important for storage performance and scalability, for
checkpointing and other cases, which are relevant to most applications. Today
the fastest solution we can find is InfiniBand FDR, which provides a bandwidth of 56
gigabits per second. Taking the encoding overhead off, we are left with around 54 gigabits
per second for actual data movement. Other options are QDR or 40 Gigabit
Ethernet; Ethernet is not what we use for our HPC systems - too much performance
overhead.
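The arithmetic behind that 54 gigabit figure is simple, assuming FDR's 64b/66b line encoding:

/* Sketch of the arithmetic: FDR signals at 14.0625 Gb/s per lane over
 * 4 lanes, and the 64b/66b encoding leaves 64 of every 66 bits for data. */
#include <stdio.h>

int main(void)
{
    double lane_rate_gbps = 14.0625;   /* per-lane signaling rate        */
    double lanes = 4.0;
    double encoding = 64.0 / 66.0;     /* 64b/66b line encoding          */

    double raw = lane_rate_gbps * lanes;  /* ~56 Gb/s on the wire        */
    double data = raw * encoding;         /* ~54.5 Gb/s left for data    */

    printf("raw: %.1f Gb/s, data: %.1f Gb/s (~%.1f GB/s before transport headers)\n",
           raw, data, data / 8.0);
    return 0;
}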
One can claim that there are 100 gigabit ports on some Ethernet
switches, but these are for network aggregation, not for connecting to the server. These ports
actually use 10 lanes of 10 gigabits each, rather than the desired 4-lane approach.
We did see some announcements of real 100 gigabit HPC
networks. The first 100 gigabit InfiniBand switch was announced back in June,
offering not just higher throughput but also lower latency - a win on both sides. While
there is no indication yet of when the 100 gigabit InfiniBand adapter will be out, the
switch announcement hints that we are getting close to the 100 gigabit era.
The higher the bandwidth, (typically) the higher the message
rate. With InfiniBand FDR we already saw a much higher message rate than with any of
the QDR options on the market, whether from Mellanox or from Intel. The
increase in message rate was greater than the bandwidth difference, so it is likely also
due to the new architecture of the latest InfiniBand FDR adapters. We base
all of our systems nowadays on FDR. Waiting for EDR….