Wednesday, August 31, 2011

Clustering Topologies – Overview – and Fat Trees

Building a high-performance computing cluster is not an overly complicated task today. The variety of open-source tools and commodity building blocks has made building HPC systems easier than it was years ago. One decision you need to make when building an HPC cluster is its topology, or in other words, how to connect the servers together (the network topology). Several topologies are more common than others: Fat-Tree, Mesh, Torus, Hypercube, Butterfly, Dragonfly, etc.

If the system in discussion fits a single network switch, then the decision is very simple… but when you need to use multiple network switches, the topology decision needs to be made. Fat-tree, hypercube, and the *fly topologies can be designed as oversubscribed or non-oversubscribed topologies, while a Torus is typically an oversubscribed topology.

I would say that Fat-Tree is probably the easiest solution to use. Fat-trees are a class of network topologies that have been shown to scale their performance with the networking resources. The evolution of fat-trees started with a single common root node: a tree in which each switch serves the communication between all of its branches, as shown in the figure below. Fat-trees use higher-bandwidth links when approaching the tree roots (the name stems from the fact that the links grow fatter toward the root). To make the fat-tree concept feasible for large-scale networks, a family of fat-trees named k-ary n-trees was defined. A k-ary n-tree has N levels, each made of K^(N-1) switches, with each switch having K ports going down the tree and K ports going up. This topology trades off the practical hardware aspects by using switches of the same bandwidth and port count at all levels of the tree. Fat trees probably provide the lowest latency between any given pair of compute nodes compared to other topologies, and if you do not hit the issues of spanning trees (for example, if you use InfiniBand), fat trees are very easy to build. At my scale of clusters, a fat tree is also the most economical solution. If you need help designing your own cluster topology, drop me a note. In future blogs I will discuss some other topologies.
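As a quick sanity check on the k-ary n-tree arithmetic, here is a small sketch (the function name is my own) that computes the switch and host counts for a given K and N:

```python
def kary_ntree_size(k: int, n: int) -> tuple:
    """A k-ary n-tree has n levels of k**(n-1) switches; each switch has
    k ports going down the tree and k going up. The k**(n-1) leaf-level
    switches each connect k hosts, so the tree supports k**n hosts."""
    switches_per_level = k ** (n - 1)
    total_switches = n * switches_per_level
    max_hosts = k ** n
    return total_switches, max_hosts
```

For example, with 36-port switches split 18 down / 18 up (K=18), a two-level tree already supports 18**2 = 324 hosts using 36 switches.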

Sunday, August 28, 2011

Math Libraries – Overview

I would say that the most used math libraries are the Intel Math Kernel Library (Intel MKL), the AMD Core Math Library (ACML), and, specifically for BLAS, the GotoBLAS library.

Intel MKL's latest release is 10.3. It includes support for BLAS, LAPACK, a Trust Region Solver, ScaLAPACK, and Cluster FFT. If you are using Intel CPUs, MKL is a library you would like to use. The mathematical domains supported by Intel MKL are sparse linear algebra (sparse BLAS, sparse format converters, the PARDISO direct sparse solver, iterative sparse solvers and preconditioners); fast Fourier transforms; the LINPACK benchmark; the vector math library; and statistics functions (vector random number generators and the summary statistics library). The first place to start is Intel's MKL documentation.

If you have AMD-based systems, AMD's ACML can provide higher performance. ACML provides a free set of math routines. ACML consists of the following main components: a full implementation of Level 1, 2, and 3 basic linear algebra subroutines (BLAS); a full suite of linear algebra (LAPACK) routines; a suite of fast Fourier transforms (FFTs) in single, double, single-complex, and double-complex data types; and random number generators in both single and double precision. The latest version, ACML 5.0, can be downloaded from AMD's developer site.
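All of these libraries implement the same standard interfaces; for instance, a Level 3 BLAS call such as DGEMM computes C <- alpha*A*B + beta*C. A plain-Python reference of that operation (my own naive sketch; the tuned libraries do the same math with blocking, vectorization, and threading) looks like this:

```python
def ref_dgemm(alpha, A, B, beta, C):
    """Naive reference for the BLAS DGEMM operation: C <- alpha*A*B + beta*C.
    A is m x k, B is k x n, C is m x n, all given as lists of lists."""
    m, kdim, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            acc = sum(A[i][p] * B[p][j] for p in range(kdim))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

The point of MKL, ACML, and GotoBLAS is that this triple loop, written naively, leaves most of a modern CPU idle; the libraries reorganize it around the cache hierarchy and SIMD units.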

For BLAS, GotoBLAS was one of the leading solutions. You can find the latest release on the TACC website. Goto has moved from TACC to Microsoft, and unfortunately it seems that he has stopped supporting and developing the GotoBLAS library. If you have any news on this front, drop a comment.

Monday, August 22, 2011

The End of Bash Scripts?

Bash is a command processor, typically run in a text terminal, that allows the user to type commands which cause actions: starting programs, checking files, checking machine status, etc. In the HPC world, Bash scripts have become one of the main tools administrators and users rely on to maintain, administer, and create clustered systems. The basis for these scripts is often shared in books, articles, and on public mailing lists, and tutorials on Bash programming are easy to find online.

As described in detail by Douglas Eadline on ClusterMonkey, there is a battle now over Bash scripts in a Kansas courtroom. A Linux cluster vendor, Atipa Technologies, is claiming that the Bash scripts it shipped to customers contain trade secrets and were stolen by former employees. Should this issue be decided in Atipa's favor, the fundamental idea of shared and open software could be blanketed by the simple claim of trade secrets. You can find the rest of the details (and Doug listed each one of them…) on Doug's website.

One of the driving forces behind HPC clusters is open source, or the open community. I hope that it stays that way moving forward. Open source does not mean free stuff; it means the capability to share and develop together. The end of Bash scripts? I hope not...

Friday, August 19, 2011

iWARP - Been There, Tried It and Walked (Actually Run…) Away

iWARP stands for Internet Wide Area RDMA Protocol, and it is an Internet Engineering Task Force (IETF) update of the RDMA Consortium's legacy RDMA-over-TCP specification. The main component of the iWARP protocol is the Direct Data Placement (DDP) protocol, which permits zero-copy transmission. DDP itself does not perform the transmission; TCP does. However, TCP does not respect message boundaries and sends data as a sequence of bytes without regard to protocol data units (PDUs). In this regard, DDP itself may be better suited to SCTP, and indeed a focus of the IETF has been to standardize RDMA over SCTP. Running DDP over TCP requires a tweak known as marker PDU aligned (MPA) framing to guarantee message boundaries. Furthermore, DDP is not intended to be accessed directly. Instead, a separate RDMA protocol (RDMAP) provides the services to read and write data. Therefore, the entire RDMA-over-TCP specification means RDMAP over DDP over MPA over TCP. Complex enough? Absolutely…
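To see why something like MPA is needed at all, remember that TCP delivers a featureless byte stream. A much-simplified sketch of the boundary-recovery idea (length prefixes standing in for real MPA markers and CRCs, which I am omitting) looks like this:

```python
import struct

def frame(msg: bytes) -> bytes:
    """Prefix each record with its 2-byte length so that message boundaries
    survive concatenation into a byte stream. Real MPA also inserts periodic
    markers and a CRC; this shows only the boundary-recovery idea."""
    return struct.pack(">H", len(msg)) + msg

def deframe(stream: bytes) -> list:
    """Split a concatenated byte stream back into the original records."""
    records, off = [], 0
    while off < len(stream):
        (length,) = struct.unpack_from(">H", stream, off)
        off += 2
        records.append(stream[off:off + length])
        off += length
    return records
```

Without the framing layer, a receiver pulling bytes off a TCP socket has no way to know where one DDP segment ends and the next begins.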

iWARP does not have a standard programming interface and instead has only a single communication protocol option. This is the only service that TCP and SCTP can provide, and thus it lacks some of the features that other RDMA solutions provide, such as atomic operations. Since the kernel implementation of the TCP stack is a tremendous bottleneck, a few vendors have tried to implement TCP in the networking hardware. In some cases, the error correction (a mechanism of TCP) is still performed in software, while the more frequently performed communications are handled by logic embedded in the NIC. This additional hardware is known as the TCP offload engine (TOE). TOE itself does not prevent copying on the receive side, of course, and must be combined with RDMA technology for zero-copy. Complex enough? Absolutely…

The main vendors (if not the only ones) that offer iWARP solutions today are Intel (through the acquisition of NetEffect) and Chelsio Communications. There is a reason why “Internet” is part of the iWARP name… Due to its complexity, iWARP does not provide any performance benefits for HPC workloads. Latency is higher, bandwidth is the same as any other Ethernet solution, and the high CPU overhead is still there. I am better off running directly over TCP than running my job over RDMAP over DDP over MPA over TCP…

Wednesday, August 17, 2011

NCSA Post Blue Waters – What’s Next?

As mentioned in one of my previous posts, IBM and NCSA announced the termination of the Blue Waters contract due to the higher cost associated with the project. In simple words, the NSF grant was not enough to cover the expenses of a proprietary system that was supposed to reach a specific performance goal. I cannot say that this was a surprise… So what's next? There are two obvious options: the first is for NCSA to find another vendor that can build a system that will hit the performance target and stay within the budget. The second is for NSF to take the grant back and open a new bid for the money.

As I see it, the only system that can hit both the performance goals and the budget limitation is a standards-based system. x86 (Intel or AMD) processors and an InfiniBand network might do the job. If only NCSA had been clever enough to go for it the first time, I bet the system would already be up and running, and with the change left over they could have focused on some innovative software development.

The Pleiades supercomputer at NASA/Ames Research Center is a great example of a cost-effective, high-performance system: 184 racks (11,712 nodes), 1.315 Pflop/s peak, 1.09 Pflop/s Linpack, 111,872 total cores (no GPUs), an InfiniBand interconnect, and a partial 11D hypercube topology (SGI in this case, but you can pick any other topology as well), plus 12 DDN RAIDs, 6.9 petabytes total, running Lustre. Could this be a good example of what NCSA needs to do now?
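As an aside, a hypercube topology like Pleiades' is easy to reason about: label each node with a binary ID, and its neighbors are exactly the nodes whose IDs differ in one bit. A minimal sketch (the function name is my own):

```python
def hypercube_neighbors(node: int, dims: int) -> list:
    """In an n-dimensional hypercube, node IDs run from 0 to 2**n - 1 and
    each node links to the n nodes whose ID differs in exactly one bit."""
    return [node ^ (1 << d) for d in range(dims)]
```

In an 11D hypercube each vertex therefore has 11 links, and any two vertices are at most 11 hops apart.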

Monday, August 15, 2011

What is Condor?

Condor is a freely available software package that allows unused computing resources to be utilized to run users' jobs. Unlike other resource managers or job schedulers, Condor is suited to providing compute power over long periods of time for jobs that are not critical. It is particularly suited to problems where the same analysis needs to be applied to many different sets of data, or to simulations based on stochastic analysis (such as Monte Carlo simulations). One usage example is using Condor to take advantage of an organization's PCs during the night. Some applications that have been used with Condor are modeling of electronic structures, ocean modeling, etc. Condor jobs should have the following attributes:

- They are "self-contained" programs
- They are statically linked executables (but DLLs can be accommodated)
- They do not require communication between programs running concurrently on different hosts
- They apply the same analysis/computation repeatedly.

Jobs may run at any time on the Condor pool of compute platforms (servers, PCs) while those machines are not in use, but if a job is running when a user logs into a machine, the job will be killed and scheduled again later. If users do not provide their own checkpointing option, any intermediate results will be lost.
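Jobs with these attributes are described to Condor in a submit description file. A minimal sketch (the executable and file names are placeholders of mine) for running the same analysis over many data sets might look like:

```
universe   = vanilla
executable = my_analysis
arguments  = input.$(Process).dat
output     = run.$(Process).out
error      = run.$(Process).err
log        = run.log
queue 100
```

Here `queue 100` submits 100 jobs, with `$(Process)` expanding to 0…99 so that each job reads a different input file.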

Condor is not for everyone, but if there are times when your systems are unused and your applications meet the above criteria, you can give it a try. The latest release of Condor is 7.7.0 (July 2011). The new release added performance statistics and, of course, bug fixes. More info is available on the Condor project website.

If you use Condor, leave a note. Will be interesting to get feedback.

Tuesday, August 9, 2011

The Maui Job Scheduler – How to Get Started

A job scheduler is a policy engine that gives users control over when and how resources (CPU time, memory, etc.) are allocated to jobs. It also provides mechanisms to help optimize the use of these resources.

In theory, any job that needs more than a single piece of allocated resource runs the risk of starvation unless there is a way of reserving resources in advance. Another way to solve starvation is a preemptive job scheduler, but if jobs can be preempted, it is more difficult to predict when a job will finish. When an idle job becomes eligible to run, it is assigned a priority. This priority is used to sort the jobs before the scheduler selects one to start. Many batch systems use queues to divide and classify the workload. Each queue is then assigned a priority, and sometimes each job is assigned a second priority to sort the jobs within the queue. This classification scheme is often too coarse: to take into account all the parameters that define a batch policy, you may end up with more queues than jobs.

The Maui job scheduler has the ability to allocate resources in the future (advance reservations). In addition, in Maui, "queues" have lost their importance in classification and priority calculations; instead, a quality-of-service (QoS) attribute can be used to classify the jobs. The quality of service is basically a method of setting the parameters of a job when it enters the scheduler. All jobs eligible to run remain in one common idle queue, and their priorities are compared with all the others.

Jobs in Maui can be in one of three major states:

- Running: a job that has been allocated its required resources and has started its computation; it is considered running until it finishes.
- Queued or idle: a job that is eligible to run; its priority is calculated here, and jobs in this state are sorted according to that priority.
- Non-queued: a job that is not allowed to start; jobs in this state do not gain any queue-time priority.
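The priority idea can be sketched in a few lines (the weights and field names below are made up for illustration; Maui's real priority function combines many configurable components):

```python
def priority(job: dict, now: int) -> int:
    """Toy Maui-style priority: a QoS class weight plus credit for the
    time the job has spent waiting in the idle queue."""
    qos_weight = {"high": 1000, "normal": 100, "low": 0}[job["qos"]]
    queue_time = now - job["submitted"]
    return qos_weight + queue_time

def pick_next(idle_jobs: list, now: int) -> dict:
    """All eligible jobs sit in one common idle queue; the scheduler
    starts the job with the highest calculated priority."""
    return max(idle_jobs, key=lambda j: priority(j, now))
```

Because the QoS attribute feeds directly into one shared priority calculation, there is no need for a separate queue per policy class.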

The Maui job scheduler is an open-source job scheduler, currently maintained and supported by Adaptive Computing. For more information, documentation, and downloads, check Adaptive Computing's website.