In order to provide advanced Quality of Service (QoS) architectures in high speed networks, Weighted Fair Queueing (WFQ) is needed. We have been working since 1985 in this area, i.e. on: (i) per-flow queueing, and (ii) weighted round-robin (WRR) scheduling. WRR scheduling is generally implemented using priority queues: for each flow, a priority value is maintained --oftentimes corresponding to the next time-to-service value-- and the scheduler chooses the minimum among these values, corresponding e.g. to the closest next time-to-service. Quickly finding the smallest number in a set of hundreds or thousands of priority values will often require hardware support. The form of such support depends on how fast the set of eligible flows changes (eligible flows are flows that have non-empty queues and are not stopped due to flow control):
When the set of eligible flows varies arbitrarily fast, and we need to quickly find the minimum among the priority values of the eligible flows, "brute force" comparisons are needed between all priority values. To avoid the N-square cost of an all-to-all comparison architecture, we use a binary tree of comparators.
We developed an innovative organization for the tree of comparators, where signals are propagated in parallel:
In this way, the delays of the individual comparators and the delays of the tree levels are placed in parallel, rather than in series. We also studied the comparison operation extensively, and we developed appropritate fast compare-and-mux circuits. Each such circuit receives the bits of the input numbers at different times, starting from the MS bit and progressing to the LS bit, and produces the bits of the smaller of the two numbers at correspondingly different times; the MS bits of the result are produced before the LS bits of the inputs are looked at. We find that a comparator organization analogous to the carry-select adder generaly yields the fastest comparison tree. To achieve this, we modified the traditional carry-select add/subtract organization, which proceeds from LS to MS bit, so that it operates in the reverse direction, from MS to LS bit.
We designed comparison trees of various types and sizes, as well as a WRR scheduler built around such trees, in synthesisable Verilog hardware-description language. We also described our designs in C code, for verification purposes. We performed synthesis, placement, and routing for a 0.18-micron CMOS process, and we present delay, area, and power consumption figures for various tree sizes and various widths of the comparators (number of bits for each priority value).
Our results show that, for example, the minimum of 256 twenty-four-bit integer numbers can be found in 4.5 ns, in the 0.18-micron semi-custom CMOS technology that we used. In the same technology and for the same carry-select-style comparator, the comparison of two 24-bit numbers takes 1.15 ns. Given that the binary tree with 256 leaves has 8 levels of comaprators, if individual comparators and tree levels were not operating in parallel, the find-minimum operation would need 8 times 1.15, i.e. 9.2 nanoseconds, as compared to 4.5 ns using our architecture.