Hardware for Weighted Fair Queueing in High Speed Networks
In order to provide advanced Quality of Service (QoS) architectures in high speed networks, Weighted Fair Queueing is needed. We have been working since 1985 on weighted fair queueing, i.e. on: (i) per-flow queueing, and (ii) weighted round-robin scheduling. In weighted round-robin scheduling, we make a distinction between the case where the weight factors are picked from a relatively small ``menu'' of allowed factor values, and the general case of arbitrary weight factors. For the former case, we have proposed and designed a scheduler with per-class urgency counters. For the latter case, our work is as described below.
Several algorithms have been proposed in the literature for weighted round-robin scheduling in the general case of arbitrary weight factors. Their common substrate is that, for each flow or packet or cell, a priority value is maintained --oftentimes corresponding to the next time-to-service value-- and the scheduler has to choose the minimum among these values, corresponding e.g. to the closest next time-to-service. Maintaining a set of many thousands of (priority) values and quickly finding the smallest of them is best done using a Heap (Priority Queue) data structure. We looked at methods to manage heap data structures at high speed, using specialized hardware.
First, we analyzed implementations using an FPGA and external SRAM, and found that a few tens of clock cycles per heap operation are needed for priority queues containing several thousand elements. For example, a 16 K element heap needs 40 clock cycles per "increment" operation when implemented with one (32-bit wide) SRAM, or 24 clock cycles when two SRAMs are used; "insert" and "delete minimum" operations are less expensive: 16 clock cycles suffice, with either one or two SRAMs. Our detailed design was verified by simulation in Verilog, and is described in the TR-222 referenced below.
Second, we proposed (1997) a novel algorithm for top-to-bottom heap insertions, thus enabling one to manage a heap in a pipelined fashion, in order to achieve very high performance. In the traditional heap management algorithm, deletions (of the root) proceed from top to bottom, while insertions of new elements proceed in the reverse direction. In our modified algorithm, both operations proceed in the top-to-bottom direction; to achieve this, inserted elements are "steered", left or right, at each level of the tree, depending on which of the two subtrees the open slot to be occupied resides in.
This novel heap management algorithm can be used in ASIC implementation of WRR schedulers, where each level of the heap tree is stored in a separate memory, so that multiple heap operations can proceed in parallel, each of them operating on a different level of the tree. We have designed such a pipelined heap (priority queue) manager, as described below.
[Previous, outdated version: January 2001, 21 pages, in Postscript (770 KBytes) or gzip'ed Postscript (160 KBytes) format].
Other papers on the same topic:
© Copyright 2000-2001 by IEEE or FORTH:These papers are protected by copyright. Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are made for personal use, they are not made or distributed for profit or commercial advantage, the IEEE or FORTH copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of the Institute of Electrical and Electronics Engineers or of the Foundation for Research & Technology -- Hellas (FORTH). To copy otherwise, in whole or in part, to republish, to post on servers, or to redistribute to lists, requires prior specific written permission and/or a fee.
by Ioannis Mavroidis
Technical Report FORTH-ICS/TR-222,Institute of Computer Science, FORTH, Heraklion, Crete, GreeceBachelor of Science Thesis, Department of Computer Science, University of CreteJuly 1998
ABSTRACT:
Critical time priority queues require hardware instead of software management. The heap data structure is an efficient way to organize a priority queue. In this report we present the various design issues and evaluate the hardware complexity and timing requirements of different possible hardware organizations of a heap using one FPGA and external memory. We also describe the organization and the datapaths we simulated in Verilog. Finally we present some more efficient techniques applicable only to ASIC implementations.
The full Technical Report is available in:
© Copyright 1998 by FORTH.Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are made for personal use, they are not made or distributed for profit or commercial advantage, the FORTH copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of the Foundation for Research & Technology -- Hellas (FORTH). To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific written permission and/or a fee.