© copyright 1998-2002 by FORTH, Crete, Greece
In order to provide advanced Quality of Service (QoS) architectures in high speed networks, it is commonly agreed, nowadays, that one needs (i) per-flow queueing, and (ii) weighted round-robin (WRR) scheduling; the combination of the two is frequently referred to as Weighted Fair Queueing (WFQ). This Laboratory has been active in research and development work on both of these technologies since 1985. For our per-flow queueing work, see that link; our work on (weighted) round-robin scheduling is described below.
Our work started in 1985 with an investigation of several techniques for performing round-robin scheduling at high speed among a large number of flows. Later, in 1990, we designed a WRR scheduler for hundreds of flows, using a content-addressable memory (CAM) in CMOS VLSI. More recently, since 1996, we developed techniques for WRR scheduling among many thousands of flows at a relatively low cost, or at very high speed; these are described below.
The smoothest operation of a weighted round-robin scheduler (operation with the least amount of service time jitter) is achieved when the weight factors of the (possibly many thousands of) flows are picked from a relatively small ``menu'' of allowed factor values --e.g. say 8 or 16 different weight factor values. We call (service) class all those flows that share the same weight factor; for each class, we maintain an urgency counter. Based on the urgency counters, we perform weighted round-robin scheduling among the classes, and then use plain round-robin scheduling inside each class to select the next-to-be-serviced flow.
We simulated the performance of this scheduling algorithm, and found that the average service-time jitter never exceeded 2 % of the service time. The heart of this scheduler (the urgency counter and decision logic) fits in less than one third of an Altera 10K40 FPGA; with a 50 MHz clock, it can service 4 million events per second, which corresponds to an aggregate I/O line throughput of 1.6 Gbps in the case of scheduling ATM cells.
For more details, follow this link.
Several algorithms have been proposed in the literature for weighted round-robin scheduling in the general case of arbitrary weight factors. Their common substrate is that, for each flow or packet or cell, a priority value is maintained --oftentimes corresponding to the next time-to-service value-- and the scheduler has to choose the minimum among these values, corresponding e.g. to the closest next time-to-service. Maintaining a set of many thousands of (priority) values and quickly finding the smallest of them is best done using a Heap (Priority Queue) data structure.
We looked at methods to manage heap data structures at high speed, using specialized hardware. First, we analyzed implementations involving one FPGA and external SRAM, and found that a few tens of clock cycles per heap operation are needed for priority queues containing several thousand elements. Next, we proposed a novel algorithm for top-to-bottom heap insertions, thus enabling one to manage a heap in a pipelined fashion, in order to achieve very high speed. We designed such a pipelined heap manager in synthesizable Verilog form, as a core integratable into ASIC's. It can be configured to any heap size, and a range of cost-performance design points are possible. Performance is up to about one operation per clock cycle. For a 16K entry example in 0.18-micron CMOS technology, silicon area is approximately 15 mm2 and performance suffices for 10 to 40 Gbps line cards.
Heap-based WRR schedulers, like the above, work well when the set of eligible flows changes "slowly", e.g. at most one flow per cell time may become eligible (when the arriving cell is enqueued into an empty queue), and at most one flow per cell time may become ineligible (when the departing cell leaves its queue empty). In cases where the eligible flows change massively, though, e.g. when "multilane" backpressure selectively turns ON or OFF many flows or classes of flows at once, "brute force" comparisons are needed in order to find the minimum of the priority values among the eligible flows; this can be done using a tree of comparators.
We developed an innovative organization for the tree of comparators, where signals are propagated in parallel along the bits of each comparator as well as across the levels of the tree; in this way, the delays of the individual comparators and the delays of the tree levels are placed in parallel, rather than in series. To use this technique with the fastest comparator organization possible, we modified the traditional carry-select add/subtract concept, which proceeds from LS to MS bit, so that it applies to comparisons, and so that it operates in the reverse direction, from MS to LS bit. Using our technique, an 8-level tree of 24-bit comparators in 0.18-micron semi-custom CMOS technology finds the minimum of 256 numbers in just 4.5 ns, twice as fast as a traditional comparator tree, and fast enough for line rates in excess of 40 Gbits per second.
M. Katevenis, S. Sidiropoulos, C. Courcoubetis: ``Weighted Round-Robin Cell Multiplexing in a General-Purpose ATM Switch Chip'', IEEE Journal on Selected Areas in Communications, vol. 9, no. 8, October 1991, pp. 1265-1279.- Preprint of 15 May 1991 (formatting differs from published version) in Postscript (360 KBytes); © Copyright 1991 by IEEE.
ABSTRACT: We present the architecture of a general-purpose B-ISDN switch chip and in particular its novel feature: the cell (packet) multiplexing algorithm and its implementation in hardware. The chip is intended to be connected, via its bit-parallel links, to other similar switch chips or to link-interface chips, providing maximum flexibility in building small or large networks. Our chip implements buffering, routing, flow control, cell scheduling, and cut-through all in hardware. We present a detailed design of how our chip implements its scheduling function consisting of a weighted round-robin multiplexing scheme, using a counter, frequency weights stored in reverse bit order, and a content-addressable memory. This algorithm allocates the available bandwidth to the virtual circuits that can use it, in proportion to the prescribed weights. Other features are chip-to-chip flow control with a built-in window mechanism, a pool of shared buffers, and a set of dedicated buffers, thus offering powerful buffer management capabilities. The above features are key for our chip to successfully switch traffic with different service requirements such as real-time traffic and non-interactive data traffic; the mechanisms built into the chip can be used by the network manager to offer guaranteed service performance to the real-time traffic, and to fully utilize the spare capacity of the links by serving the lower priority traffic without allowing congestion (fairly allocating buffers and bandwidth). We are currently designing this chip for a full-custom CMOS technology; its crucial parts have been laid out and simulated, thus proving its feasibility.
M. Katevenis: ``Fast Switching and Fair Control of Congested Flow in Broad-Band Networks'', IEEE Journal on Selected Areas in Communications, vol. 5, no. 8, October 1987, pp. 1315-1326.
ABSTRACT: A new switching architecture is proposed, based on the tradeoffs of modern VLSI technology -- inexpensive memory and two-dimensional layout structures. Today, it is economically feasible to preallocate buffer space individually to each virtual circuit in every node, so that ``congestion'' ceases to have negative effects. On the contrary, when some low-priority circuits offer more traffic than the network can carry, full utilization of the link bandwidth is achieved. In this context, the allocation of bandwidth can be done automatically and in a ``fair'' way, if packets are multiplexed by circularly scanning all virtual circuits and transmitting one packet from each ``ready'' circuit. This multiplexing algorithm equally distributes all the available link BW to all the VC's that can use it (other than equal distribution is also possible), while it also guarantees an upper bound for the total packet delay through non-congested VC's (VC's that use less than their share of BW). We present methods for hardware implementation of fast such circular scans, and propose a structure for the switching nodes of such networks, consisting of a cross-bar arrangement like a systolic array that performs merge-sorting. It is ideally suited for physical layout on printed-circuit boards or with wafer-scale integration.