Computer Architecture and VLSI Systems (CARV) Laboratory,Institute of Computer Science (ICS), FORTH, Heraklion, Crete, Greece
© copyright 2002-07 by FORTH, Crete, Greece
Buffered switching fabrics (bibliography) are an efficient method for building packet switches with a very large number of ports. To keep the cost down, buffers inside the fabric should be small, with large buffers being necessary only at the input ports. We research flow control, congestion management, and scheduling methods to make such buffered switching fabrics operate as efficiently as possible. Our focus is on the Benes network, which is the lowest-cost topology that is known to yield operation free of internal blocking.
ABSTRACT: This thesis considers buffered multistage interconnection networks (fabrics), and investigates methods to reduce their buffer size requirements. Our contribution is a novel flow and congestion control scheme that achieves performance close to that of per-flow queueing while requiring much less buffer space than what per-flow queues would need. The new scheme utilizes a request-grant pre-approval phase, as many contemporary bufferless networks do, but its operation is much simpler and its performance is remarkably better. Traditionally, the role of requests in bufferless networks is to reserve an available time slot on each link along a packet's route, where these time slots are contiguous in time along the path, so as to guarantee non-conflicting packet transmission. These requirements impose a very heavy toll on the scheduling unit of such bufferless fabrics. By contrast, our requests do not reserve links for a specific time duration, but instead only reserve space in the buffers at their entry points; effectively, the scheduling decisions that concern different links are decoupled among themselves, leading to a much simpler admission process. The proposed scheduling subsystem comprises independent single-resource schedulers, operating in a pipeline; they operate asynchronously to each other. In this thesis we show that the reservation of buffers in front of critical network links --links that are unable to carry the potential aggregate demand-- eliminates congestion, in the sense that traffic flows seamlessly through the network: it neither gets dropped, nor is excessively blocked waiting for downstream buffers to become available.
First, we apply request-grant scheduling to a single-stage switch, with small, shared output queues, which serves as a model for the more challenging multistage case. We demonstrate that, in principle, a very small number of fabric buffers suffices to reach high performance levels: with 12-cell buffer space per output, performance is better than in buffered crossbars, which consume N cells of buffer space per output, where N is the number of ports. In this single-stage setting, we study the impact of input contention on scheduler performance, and the related synchronization phenomena. During this work, we have introduced a novel scheduling scheme for buffered crossbar switches that makes buffer size independent of the round-trip-time between the linecards and the switch.
We then proceed to the multistage case. Our main motivation and our primary benchmark is an example next-generation fabric challenge: a 1024x1024, 3-stage, non-blocking Clos/Benes fabric, running with no internal speedup, made of 96 single-chip 32x32 buffered crosssbar switching elements (3 stages of 32 switch chips each). To eliminate congestion in the fabric, we carefully apply our request-grant scheduling protocol. We demonstrate that it is feasible to implement all schedulers centrally, in a single chip. Besides congestion elimination, our scheduler can guarantee 100 percent in-order delivery, using very small reorder buffers, which can easily fit in on-chip memory. Simulation results indicate very good delay performance, and throughput that exceeds 95% under unbalanced traffic. Most prominent is the result that, under hot-spot traffic, with almost all output ports being congested, the non-congested outputs experience negligible delay degradation. The proposed system can directly operate on variable-size packets, eliminating the padding overhead and the associated internal speed-up. We also discuss a possible distributed version of the scheduling subsystem. Our scheme is appropriate to deal with heavy congestion; in systems that need to provide very low latency under (uncongested) light traffic, one would apply this scheme when the load exceeds a given threshold.
Lastly, we consider some blocking network topologies, like the banyan. In a banyan network, besides output ports, internal links can cause congestion as well. We show a fully distributed scheduler for this network, that eliminates congestion from both internal and output-port links
[Previous, outdated version: July 2005, 7 pages, in pdf (250 KB)].
[Previous, outdated version: August 2005, 15 pages, in pdf (460 KB)].
ABSTRACT: Multistage buffered switching fabrics are the most efficient method for scaling packet switches to very large numbers of ports. The Benes network is the lowest-cost switching fabric known to yield operation free of internal blocking. Backpressure inside a switching fabric can limit the use of expensive off-chip buffer memory to just virtual-output queues (VOQ) in front of the input stage. This paper extends the known backpressure architectures to the Benes network. To achieve this, we had to successfully combine per-flow backpressure, multipath routing (inverse multiplexing), and cell resequencing. We present a flow merging scheme that is needed to bring the cost of backpressure down to O(N) per switching element. We prove freedom from deadlock for a wide class of multipath cell distribution algorithms. Using a cell-time-accurate simulator, we verify operation free of internal blocking, we evaluate various cell distribution and resequencing methods, we compare performance to that of ideal output queueing and the iSLIP crossbar scheduling algorithm, and we show that the delay of well-behaved flows remains unaffected by the presence of congested traffic to oversubscribed output ports.
This is a longer version of the above HPSR-2003 paper. In addition to the more detailed presentation, it also contains the proof of freedom from deadlock.
This is the full-length version of our research; it covers in more details the topics covered by the above papers.In addition, this thesis uses simplified models of the Benes fabric to analytically show that cell distribution does not result in throughput limitations, and to identify the points of the fabric where contention is resolved. This latter point may be the key to a future design of appropriate distributed scheduling algorithms that provide weighted max-min fairness for the overall system.
Acknowledgements:Part of this work was supported by an IBM PhD Fellowship to Nikos Chrysos, 2004-2006. Vasilios Siris, Dionisios Pnevmatikatos, and Paraskevi Fragopoulou helped us shape our ideas; we thank them.
Older (outdated) papers:
[Superseded by the July 2003 version, above].
[Superseded by TR-316, above].
[This (unfinished) paper presented our August 1994 work on how to use multilane backpressure in multipath (Benes) fabrics and how to merge flows so as to keep the number of lanes manageable, while guaranteeing freedom from deadlock and maintaining non-blocking operation.]