The provision of quality-of-service (QoS) guarantees by modern, advanced-architecture network systems requires isolation among the multiple traffic flows, so that each flow receives its own behavior characteristics and service level. A prerequisite for such isolation is that traffic belonging to different flows be placed in different queues, i.e. per-flow queueing.
Per-flow queueing typically requires the implementation of a large number of logical queues --hundreds or thousands today, possibly millions in the future-- inside one or a few physical memories. When the communication system must operate at high speed, managing these multiple queues typically requires the assistance of specialized hardware. We have worked on such multi-queue management implementations at different cost and performance levels, as described below.
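The common substrate of all the designs below is a set of logical queues kept as linked lists inside one shared physical buffer, with a free list supplying empty cells. The following is a minimal software sketch of that idea; the sizes and names (`NUM_CELLS`, `enqueue`, `dequeue`) are illustrative, not taken from any of the designs.

```c
#include <assert.h>

#define NUM_CELLS  1024      /* slots in the one shared physical buffer */
#define NUM_QUEUES 64        /* logical per-flow queues */
#define NIL        (-1)

/* One next-pointer word per buffer cell links cells into queues. */
static int next[NUM_CELLS];
static int head[NUM_QUEUES], tail[NUM_QUEUES];
static int free_head;        /* singly linked free list of empty cells */

static void init(void) {
    for (int q = 0; q < NUM_QUEUES; q++) head[q] = tail[q] = NIL;
    for (int c = 0; c < NUM_CELLS - 1; c++) next[c] = c + 1;
    next[NUM_CELLS - 1] = NIL;
    free_head = 0;
}

/* Take a cell off the free list and append it to queue q.
   Returns the cell index, or NIL if the shared buffer is full. */
static int enqueue(int q) {
    int c = free_head;
    if (c == NIL) return NIL;
    free_head = next[c];
    next[c] = NIL;
    if (tail[q] == NIL) head[q] = c; else next[tail[q]] = c;
    tail[q] = c;
    return c;
}

/* Detach the head cell of queue q and return it to the free list. */
static int dequeue(int q) {
    int c = head[q];
    if (c == NIL) return NIL;
    head[q] = next[c];
    if (head[q] == NIL) tail[q] = NIL;
    next[c] = free_head;
    free_head = c;
    return c;
}
```

Note that the per-queue state is only two pointers (head and tail), which is what makes scaling to thousands of queues feasible: the buffer space and the next-pointer memory are shared by all flows.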
(1) Managing thousands of queues in off-chip DRAM, with the next-pointers held in that off-chip DRAM, and with out-of-order transaction processing for efficient utilization of the interleaved memory banks:
Aristides Nikologiannis and Manolis Katevenis
ABSTRACT:
Modern switches and routers often use dynamic RAM (DRAM) in order to provide large buffer space. For advanced quality of service (QoS), per-flow queueing is desirable. We study the architecture of a queue manager for many thousands of queues at OC-192 (10 Gbps) line rate. It forms the core of the "datapath" chip in an efficient chip partitioning for the line cards of switches and routers that we propose. To effectively deal with bank conflicts in the DRAM buffer, we use multiple pipelined control processes, out-of-order execution, and operand renaming --techniques that originated in the supercomputers of the 1960s. To avoid off-chip SRAM, we maintain the "next" pointers in the DRAM, using free buffer preallocation and free list bypassing. We have described our architecture using behavioral Verilog, at the clock-cycle accuracy level, assuming Rambus DRAM, and we have partially verified it by short simulation runs. We estimate the complexity of the queue manager at roughly 60 thousand gates, 80 thousand flip-flops, and 4180 Kbits of on-chip SRAM, for 64 K flows.
The following papers are available:
© Copyright 2000-2001 by IEEE or FORTH: These papers are protected by copyright. Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are made for personal use, they are not made or distributed for profit or commercial advantage, the IEEE or FORTH copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of the Institute of Electrical and Electronics Engineers or of the Foundation for Research & Technology -- Hellas (FORTH). To copy otherwise, in whole or in part, to republish, to post on servers, or to redistribute to lists, requires prior specific written permission and/or a fee.
(2) Very high speed management of hundreds of queues, using full-custom VLSI, in a single-chip VLSI ATM switch with credit-based flow control:
George Kornaros, Christoforos Kozyrakis, Panagiota Vatsolaki, and Manolis Katevenis
We describe the queue management block of ATLAS I, a single-chip ATM switch (router) with optional credit-based (backpressure) flow control. ATLAS I is a 4-million-transistor 0.35-micron CMOS chip, currently under development, offering 20 Gbit/s aggregate I/O throughput, sub-microsecond cut-through latency, 256-cell shared buffer containing multiple logical output queues, priorities, multicasting, and load monitoring.
The queue management block of ATLAS I is a dual parallel pipeline that manages the multiple queues of ready cells, the per-flow-group credits, and the cells that are waiting for credits. All cells, in all queues, share one common buffer space. These 3- and 4-stage pipelines handle events at the rate of one cell arrival or departure per clock cycle, and one credit arrival per clock cycle. The queue management block consists of two compiled SRAMs, pipeline bypass logic, and multi-port CAM and SRAM blocks that are laid out in full-custom and support special access operations. The full-custom part of queue management contains approximately 65 thousand transistors in logic and 14 Kbits in various special memories; it occupies 2.3 mm², consumes 270 mW (worst case), and operates at 80 MHz (worst case), versus the 50 MHz clock frequency required to support the 622 Mb/s switch link rate.
KEYWORDS: single-chip ATM switch, VLSI router, pipelined queue management, credit-based flow control.
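The interplay between ready cells, per-flow-group credits, and cells waiting for credits can be sketched abstractly: a cell departs only if its flow group holds a credit, and an arriving credit either releases a waiting cell or is banked in a counter. This is a behavioral illustration of credit-based (backpressure) flow control in general, not a model of the ATLAS I pipelines; the counters and names are invented for the example.

```c
#include <assert.h>

#define GROUPS 4

static int credits[GROUPS];  /* per-flow-group credit counters        */
static int waiting[GROUPS];  /* cells queued while awaiting a credit  */
static int sent;             /* cells that have actually departed     */

/* A ready cell may depart only if its flow group holds a credit;
   otherwise it joins the per-group "waiting for credit" queue. */
static void cell_ready(int g) {
    if (credits[g] > 0) { credits[g]--; sent++; }
    else waiting[g]++;
}

/* A credit arrival releases one waiting cell of that group, if any;
   otherwise the credit is banked for a future cell arrival. */
static void credit_arrival(int g) {
    if (waiting[g] > 0) { waiting[g]--; sent++; }
    else credits[g]++;
}
```

In the actual chip both events are pipelined at one per clock cycle, which is why the two directions (cell arrivals/departures and credit arrivals) are handled by parallel pipelines rather than a sequential loop like the one above.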
© Copyright 1996-1997 by IEEE and FORTH: These papers are protected by copyright. Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are made for personal use, they are not made or distributed for profit or commercial advantage, the IEEE or FORTH copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of the Institute of Electrical and Electronics Engineers or of the Foundation for Research & Technology -- Hellas (FORTH). To copy otherwise, in whole or in part, to republish, to post on servers, or to redistribute to lists, requires prior specific written permission and/or a fee.
(3) General-purpose manager for thousands of queues, using FPGA or microprocessor:
Dimitrios Serpanos and Panagiotis Karakonstantis
1997-98
An important module in high-speed communication systems is the memory manager, which manages logical data structures --typically queues-- efficiently and enables high-speed data transfers between system memory and link interfaces. We have developed a high-speed, scalable and re-usable Queue Manager, suitable for a wide range of ATM systems, such as workstation adapters, switches, routers, etc. The Queue Manager is a special-purpose processor that executes memory management instructions. To achieve re-usability, we have analyzed the requirements of the most common ATM functions (flow control, segmentation-and-re-assembly, etc.) and identified an instruction set that is sufficient to implement these functions. To achieve scalability, we have identified a minimal set of instructions as well as a minimal set of data structures to support.
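The "special-purpose processor" view can be sketched as an instruction interface: the host issues memory-management instructions (here just Enqueue and Dequeue, the minimal set), and the Queue Manager executes them against its queue state. The opcode names, struct layout, and sizes below are hypothetical, chosen only to match the 1024-queue figure mentioned for the FPGA implementation.

```c
#include <assert.h>

#define NQ     1024          /* logical queues, as in the FPGA version */
#define NCELLS 8192          /* buffer cells */
#define NIL    (-1)

typedef enum { OP_ENQ, OP_DEQ } opcode_t;

typedef struct {
    opcode_t op;
    int      queue;          /* target queue number          */
    int      cell;           /* cell index (ENQ operand)     */
} qm_instr_t;

static int next[NCELLS];
static int head[NQ], tail[NQ];

static void qm_reset(void) {
    for (int q = 0; q < NQ; q++) head[q] = tail[q] = NIL;
}

/* The Queue Manager as an instruction-executing engine: decode the
   opcode, update the linked-list queue state, return the result cell. */
static int qm_execute(qm_instr_t ins) {
    switch (ins.op) {
    case OP_ENQ:
        next[ins.cell] = NIL;
        if (tail[ins.queue] == NIL) head[ins.queue] = ins.cell;
        else next[tail[ins.queue]] = ins.cell;
        tail[ins.queue] = ins.cell;
        return ins.cell;
    case OP_DEQ: {
        int c = head[ins.queue];
        if (c == NIL) return NIL;          /* queue was empty */
        head[ins.queue] = next[c];
        if (head[ins.queue] == NIL) tail[ins.queue] = NIL;
        return c;
    }
    }
    return NIL;
}
```

Framing the manager this way is what gives the re-usability claimed above: adapters, switches, and routers all drive the same instruction interface, and richer ATM functions are built by composing these primitive instructions.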
We have developed an architecture and several hardware implementations, which provide increasing performance at the cost of more complex hardware. A typical low cost implementation, using an FPGA and external SRAM, supports 1024 queues of 8192 ATM cells and performs an Enqueue or a Dequeue operation in 130 or 200 ns, respectively. Such an implementation supports ATM systems with aggregate throughput close to 1 Gbps.
Considering the processing power that exists in some systems in the form of embedded processors, we have developed a software implementation for embedded systems as well, using the CYCLONE evaluation board with the Intel i960 processor at 40 MHz. The average delays of the Enqueue and Dequeue operations in the software implementation are 0.75 and 0.95 microseconds, respectively. We have evaluated the performance, cost and scalability of all implementations, so that one can choose the solution with the desired characteristics.