In the past I reviewed two cluster topologies: Fat Tree (Clos) and Torus. Dragonfly is a hierarchical topology with the following properties: several groups are connected together using all-to-all links (i.e. each group has at least one link directly to every other group); the topology inside each group can be any topology; and it requires non-minimal global adaptive routing and advanced congestion look-ahead for efficient operation. Simply put, a dragonfly is an (at least) two-level topology in which, at the top level, groups of switches are connected as a fully connected graph. The internal structure of the groups may vary: a full graph, fat tree, torus, mesh, another dragonfly and so on.
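To make the group structure concrete, here is a minimal Python sketch of one commonly used "canonical" dragonfly layout (not the only possible one): each group is a full graph of `a` routers, each router has `h` global links, and the `g = a*h + 1` groups are connected all-to-all with exactly one global link per pair of groups. The parameter names `a` and `h`, the helper names `build_dragonfly` and `degrees`, and the "consecutive" assignment of global ports to routers are assumptions made for illustration; real systems wire their global links in various ways. End-nodes attached to each router are left out because they do not change the switch graph.

```python
from collections import Counter
from itertools import combinations


def build_dragonfly(a: int, h: int):
    """Build the router graph of a canonical dragonfly (hypothetical helper).

    Routers are labeled (group, router). Each group is a full graph of `a`
    routers, each router owns `h` global ports, and the g = a*h + 1 groups
    are connected all-to-all with exactly one global link per group pair.
    Returns (g, local_links, global_links).
    """
    g = a * h + 1
    local_links, global_links = set(), set()

    # Intra-group links: a full graph (any other topology would also do).
    for grp in range(g):
        for r1, r2 in combinations(range(a), 2):
            local_links.add(((grp, r1), (grp, r2)))

    # Inter-group links. Group i numbers its a*h global ports 0..a*h-1 and
    # port k points to group k (k < i) or k + 1 (k >= i), skipping itself.
    # Port k belongs to router k // h ("consecutive" assignment, an assumption).
    for gi, gj in combinations(range(g), 2):  # gi < gj
        port_i = gj - 1  # in group gi, group gj comes after gi: shifted by one
        port_j = gi      # in group gj, group gi comes before gj: not shifted
        global_links.add(((gi, port_i // h), (gj, port_j // h)))

    return g, local_links, global_links


def degrees(links):
    """Count how many links touch each router."""
    count = Counter()
    for u, v in links:
        count[u] += 1
        count[v] += 1
    return count


if __name__ == "__main__":
    g, local_links, global_links = build_dragonfly(a=4, h=2)
    print(g, len(local_links), len(global_links))
```

With `a=4` and `h=2`, for example, the sketch yields 9 groups, 36 routers, 54 local links and 36 global links, and every router ends up with 3 local and 2 global links. It also hints at why "one hop between groups" is not the whole story: a minimal route between routers in different groups may need a local hop to the router that owns the right global link, the global hop itself, and a local hop inside the destination group.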
While dragonfly is often described as a topology with a single hop between groups, this is not accurate: in many cases network traffic goes over several hops before reaching its destination. The key to making a dragonfly topology effective is to allow some pairs of end-nodes to communicate over a non-minimal route. It is the only way to spread heavy group-to-group traffic, since the direct global links between any two groups are few (in the minimal all-to-all case, just one). This is a significant difference from other topologies. To support such routing, a dragonfly system needs adaptive routing that sends traffic on the longer paths only when congestion is impacting the minimal (fewest-hop) paths.
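The choice between minimal and non-minimal paths can be sketched in the style of UGAL (Universal Globally-Adaptive Load-balanced routing), one well-known adaptive scheme for dragonfly. A non-minimal (Valiant) path goes through a randomly chosen intermediate group, and it is taken only when the minimal path's estimated delay, approximated here as output-queue occupancy times hop count, is higher. This is a simplified sketch, not any vendor's implementation: `PathCandidate`, `ugal_choose`, `valiant_intermediate`, the `bias` threshold and the delay model are all hypothetical, and the hop counts used below (3 for minimal, 5 for non-minimal) assume full-graph groups.

```python
import random
from dataclasses import dataclass


@dataclass
class PathCandidate:
    kind: str   # human-readable label for the path (hypothetical)
    hops: int   # router-to-router hops on this path
    queue: int  # occupancy of the local output queue for this path's first hop


def valiant_intermediate(src_group: int, dst_group: int,
                         num_groups: int, rng: random.Random) -> int:
    """Pick a random intermediate group for a Valiant (non-minimal) route."""
    candidates = [grp for grp in range(num_groups)
                  if grp not in (src_group, dst_group)]
    return rng.choice(candidates)


def ugal_choose(minimal: PathCandidate, nonminimal: PathCandidate,
                bias: int = 0) -> PathCandidate:
    """UGAL-style choice: estimated delay = queue occupancy * hop count.

    Stay on the minimal path unless its estimate exceeds the non-minimal
    estimate by more than `bias` (a hypothetical tuning threshold).
    """
    if minimal.queue * minimal.hops <= nonminimal.queue * nonminimal.hops + bias:
        return minimal
    return nonminimal


if __name__ == "__main__":
    rng = random.Random(1)
    mid = valiant_intermediate(src_group=0, dst_group=5, num_groups=9, rng=rng)
    quiet = ugal_choose(PathCandidate("minimal", 3, 1),
                        PathCandidate(f"non-minimal via group {mid}", 5, 1))
    busy = ugal_choose(PathCandidate("minimal", 3, 10),
                       PathCandidate(f"non-minimal via group {mid}", 5, 2))
    print(quiet.kind)
    print(busy.kind)
```

With equally light queues the shorter path wins (3 × 1 ≤ 5 × 1); when the minimal path's queue grows to 10 while the non-minimal one holds 2, the estimate flips (30 > 10) and traffic is diverted. A positive `bias` keeps traffic on minimal paths unless the congestion difference is large, which is one way to limit the extra global-link load that non-minimal routing creates. Production systems may also use congestion information from beyond the local router.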
While the goal of dragonfly was to provide higher bisection bandwidth than a torus (or similar) topology while reducing overall cost (mainly cable length), in practice dragonfly does not provide such an advantage, in either performance or cost, over a fat tree topology, for example. On the other hand, it does add complexity, because adaptive routing must be enabled. So far we prefer the fat tree option, either in a full bisection bandwidth configuration or with some oversubscription, depending on the targeted applications.