Simultaneous multithreading (SMT) processors and chip multiprocessors (CMPs) dominate the roadmaps of all major processor manufacturers. They are also the building blocks of choice for modern shared-memory multiprocessors and cluster nodes. Both SMT and CMP designs apply some degree of resource sharing inside the processor: they allow more than one thread to execute simultaneously on the same chip. As a result, they yield more efficient implementations in terms of cost/performance, power/performance, and area/performance.
The first part of the talk focuses on a detailed evaluation of the execution of multithreaded codes on both SMT- and CMP-based systems. The evaluation is based on high-level metrics, such as execution time, as well as low-level metrics derived from the hardware performance counters available on most modern processors. The analysis of the results reveals the major architectural bottlenecks and advantages of each design. We then experiment with alternative strategies for exploiting the execution contexts of each SMT processor: multithreaded execution, indiscriminate speculative precomputation, and trace-driven speculative precomputation.
SMT- and CMP-based multiprocessors introduce new challenges for OS schedulers: the scheduler has to decide both on the mix of threads to co-execute on the different physical processors of the system, and on their "pairing" on the multiple execution contexts of each physical processor. The second part of the talk focuses on performance-driven, continuous-adaptation scheduling techniques for such architectures. We introduce scheduling algorithms that take into account online information from hardware performance counters, in order to gain insight into the complicated interactions among co-executing threads and their requirements on both processor-internal and processor-external shared resources. The policies target the optimal exploitation of shared resources as the primary scheduling goal.
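To make the pairing problem concrete, the following is a minimal sketch (not the talk's actual algorithm) of one plausible counter-driven heuristic: given a per-thread estimate of pressure on a shared resource (e.g., bus transactions per cycle sampled from hardware performance counters), pair the most resource-hungry thread with the least hungry one, so that co-runners on the same physical processor have complementary demands. The function name and the sample numbers are hypothetical.

```python
# Hypothetical sketch of complementary pairing for a 2-context SMT:
# match high-pressure threads with low-pressure ones so shared
# resources are not oversubscribed on any single physical processor.

def pair_threads(demand):
    """demand: dict mapping thread id -> measured shared-resource
    pressure (e.g., counter-derived bus utilization estimate).
    Returns (light, heavy) pairs: lowest remaining demand matched
    with highest remaining demand, repeatedly."""
    order = sorted(demand, key=demand.get)  # ascending by pressure
    pairs = []
    while len(order) >= 2:
        light = order.pop(0)    # least demanding thread left
        heavy = order.pop(-1)   # most demanding thread left
        pairs.append((light, heavy))
    return pairs

# Example with made-up counter readings for four threads:
sample = {"t0": 0.9, "t1": 0.1, "t2": 0.7, "t3": 0.2}
print(pair_threads(sample))  # [('t1', 't0'), ('t3', 't2')]
```

A real policy would of course refresh the counter samples continuously and re-evaluate the pairing online, as the talk's continuous-adaptation framing suggests.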
In the third part of the talk we study the mapping of parallel mesh generation, an adaptive, irregular, multilevel, multigrain application, on clusters of SMT-based SMPs. We find that the coarsest grain of parallelism scales well across cluster nodes. However, when we try to exploit SMT execution contexts for parallelization at the finest granularity, the hardware support on current SMTs proves inadequate for fine-grain, irregular applications. Our study on a simulated system, though, indicates that only minimal, realistic hardware support is required to efficiently execute fine-grained, multithreaded code on SMT processors and to outperform both a sequential and a parallel, MPI-only version of the code on a single CPU.
Christos D. Antonopoulos is a postdoctoral research associate at the Department of Computer Science of The College of William & Mary. His research interests include system software support for deep, multilevel, CMP (chip multiprocessor)- and SMT (simultaneous multithreaded processor)-based multiprocessors; continuous, online, performance-driven optimization; and programming models for parallel processing. He was a recipient of a doctoral scholarship from the Alexander S. Onassis Public Benefit Foundation and of a best paper award. He has participated in several national, European (ESPRIT / IST), and U.S. (NSF-ITR / NSF-Career) research projects. Before joining The College of William & Mary, Christos Antonopoulos was a research associate at the High Performance Information Systems Lab, University of Patras. He earned his PhD (2004), MSc (2001), and Diploma (1998) at the Computer Engineering & Informatics Department of the University of Patras.