Fast Parallel Comparison Circuits for Scheduling

Project and Result Outline:

In order to provide advanced Quality of Service (QoS) architectures in high speed networks, Weighted Fair Queueing (WFQ) is needed. We have been working since 1985 in this area, i.e. on: (i) per-flow queueing, and (ii) weighted round-robin (WRR) scheduling. WRR scheduling is generally implemented using priority queues: for each flow, a priority value is maintained --oftentimes corresponding to the next time-to-service value-- and the scheduler chooses the minimum among these values, corresponding e.g. to the closest next time-to-service. Quickly finding the smallest number in a set of hundreds or thousands of priority values will often require hardware support. The form of such support depends on how fast the set of eligible flows changes (eligible flows are flows that have non-empty queues and are not stopped due to flow control):

  1. Few eligibility changes per cell time. In this case, implementations based on the Heap data structure are often advantageous: see our work on (Pipelined) Heap Management in Hardware. An alternative is to use per-class urgency counters, when the weight factors are picked from a relatively small ``menu'' of allowed factor values. These solutions apply for example to the ingress line cards of many switches, where at most one flow per cell time may become eligible, when the arriving cell is enqueued into an empty queue, and at most one flow per cell time may become ineligible, when the departing cell leaves its queue empty.
  2. Potentially many eligibility changes per cell time. This is the harder case, where heaps cannot be taken advantage of, because each eligibility change results in one heap insertion or deletion operation. We deal with this case in the present work. This case may arise in ingress line cards that receive "multilane" backpressure from the switch, which selectively turns ON or OFF many flows or classes of flows at once. This case also arises in buffered crossbar switches, where multiple crosspoint buffers may change state at once.

When the set of eligible flows varies arbitrarily fast, and we need to quickly find the minimum among the priority values of the eligible flows, "brute force" comparisons are needed between all priority values. To avoid the N-square cost of an all-to-all comparison architecture, we use a binary tree of comparators.

We developed an innovative organization for the tree of comparators, where signals are propagated in parallel:

  • from the MS bit to the LS bit of each comparator in the tree; and
  • from the leaf level to the root level of the tree.

In this way, the delays of the individual comparators and the delays of the tree levels are placed in parallel, rather than in series. We also studied the comparison operation extensively, and we developed appropritate fast compare-and-mux circuits. Each such circuit receives the bits of the input numbers at different times, starting from the MS bit and progressing to the LS bit, and produces the bits of the smaller of the two numbers at correspondingly different times; the MS bits of the result are produced before the LS bits of the inputs are looked at. We find that a comparator organization analogous to the carry-select adder generaly yields the fastest comparison tree. To achieve this, we modified the traditional carry-select add/subtract organization, which proceeds from LS to MS bit, so that it operates in the reverse direction, from MS to LS bit.

We designed comparison trees of various types and sizes, as well as a WRR scheduler built around such trees, in synthesisable Verilog hardware-description language. We also described our designs in C code, for verification purposes. We performed synthesis, placement, and routing for a 0.18-micron CMOS process, and we present delay, area, and power consumption figures for various tree sizes and various widths of the comparators (number of bits for each priority value).

Our results show that, for example, the minimum of 256 twenty-four-bit integer numbers can be found in 4.5 ns, in the 0.18-micron semi-custom CMOS technology that we used. In the same technology and for the same carry-select-style comparator, the comparison of two 24-bit numbers takes 1.15 ns. Given that the binary tree with 256 leaves has 8 levels of comaprators, if individual comparators and tree levels were not operating in parallel, the find-minimum operation would need 8 times 1.15, i.e. 9.2 nanoseconds, as compared to 4.5 ns using our architecture.

Papers:

  • Kostas G.I. Harteros: "Fast Parallel Comparison Circuits for Scheduling", Technical Report FORTH-ICS/TR-304, Inst. of Computer Science, FORTH, Heraklio, Crete, Greece; M.Sc. Thesis, Univ. of Crete (advisor: M. Katevenis); March 2002, 78 pages. Available in Postscript (1 MByte) or gzip'ed Postscript (650 KBytes) format; © Copyright 2002 FORTH.

 


© Copyright 2002 by FORTH:
These papers are protected by copyright. Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are made for personal use, they are not made or distributed for profit or commercial advantage, the FORTH copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of the Foundation for Research & Technology -- Hellas (FORTH). To copy otherwise, in whole or in part, to republish, to post on servers, or to redistribute to lists, requires prior specific written permission and/or a fee.

Up to WRR Scheduling R&D at CARV-ICS-FORTH Last updated: March 2002, by M. Katevenis.