Modern Cloud and Data Center environments are based on large-scale distributed storage systems. Diagnosing configuration errors, software bugs, and performance anomalies in such systems has become a major problem for large Web hosting sites.
As part of a larger project that endeavors to design and prototype interactive, guided modelling for such systems, I will introduce Semantic-Aware Resource Anomaly Detection (SARAD) and Program-Aware Anomaly Detection (PAAD), two low-overhead, real-time solutions for detecting runtime anomalies in storage systems. Both SARAD and PAAD are based on the key observation that most state-of-the-art storage server architectures are multi-threaded and structured as a set of repeatable modules, which we call stages, and hence provide good opportunities for statistical modelling and anomaly detection.
SARAD and PAAD leverage this observation to collect stage-level resource consumption and log summaries at runtime and to perform statistical analysis across stage instances. A stage indicates an anomaly when it generates either i) abnormal resource usage patterns, or ii) rare execution flows or unusually high durations for regular flows at runtime. Both methods make two key contributions: i) they limit the search space for root causes by pinpointing specific anomalous code stages, and ii) they reduce the compute and storage requirements of monitoring data and log analysis, while preserving accuracy, through information summarization.
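The idea of comparing instances of the same stage against each other can be illustrated with a minimal sketch. This is not the SARAD or PAAD implementation: the function name, the stage instance ids, the CPU figures, and the choice of a modified z-score (median and median absolute deviation) with a threshold of 3.5 are all assumptions made for illustration.

```python
from statistics import median


def flag_anomalous_instances(usage_by_instance, threshold=3.5):
    """Return the ids of stage instances whose resource usage is an outlier.

    Hypothetical helper: it scores each instance with the modified z-score,
    0.6745 * |x - median| / MAD, and flags scores above `threshold`.
    The talk abstract does not specify which statistical test is used.
    """
    values = list(usage_by_instance.values())
    if not values:
        return []

    center = median(values)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median(abs(v - center) for v in values)

    if mad == 0:
        # At least half of the instances sit exactly on the median,
        # so flag every instance that deviates from it at all.
        return sorted(k for k, v in usage_by_instance.items() if v != center)

    return sorted(
        k
        for k, v in usage_by_instance.items()
        if 0.6745 * abs(v - center) / mad > threshold
    )


# Made-up CPU time (ms) consumed by five instances of one stage.
cpu_ms = {
    "flush-1": 102.0,
    "flush-2": 98.0,
    "flush-3": 101.0,
    "flush-4": 99.0,
    "flush-5": 640.0,
}
print(flag_anomalous_instances(cpu_ms))  # ['flush-5']
```

Because every instance of a stage runs the same code, a per-stage baseline of this kind needs only one summary value per instance rather than the full monitoring trace, which is the sort of summarization the abstract refers to.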
We evaluated both methods on three distributed storage systems: HBase, Hadoop Distributed File System (HDFS), and Cassandra. The results show that both methods uncover various anomalies in real time with practically zero overhead.
Cristiana Amza received her B.S. degree in Computer Engineering from Bucharest Polytechnic Institute in 1991, and her M.S. and Ph.D. degrees in Computer Science from Rice University in 1997 and 2003, respectively. Her research interests are in the area of distributed and parallel systems, with an emphasis on designing, prototyping, and experimentally evaluating novel algorithms and tools for self-managing, self-adaptive, and self-healing behavior in data centers and Clouds. She joined the Department of Electrical and Computer Engineering at the University of Toronto in October 2003 as an Assistant Professor and became an Associate Professor in July 2009. She is actively collaborating with several industry partners, including Intel, NetApp, Bell Canada, and IBM through IBM T.J. Watson, Almaden, and IBM Toronto Labs.