IEEE 2016 International Conference on High Performance Computing & Simulation (HPCS 2016)
Recent advances in cloud technologies and on-demand network circuits have created an unprecedented opportunity to run complex scientific workflow applications on dynamic, networked cloud infrastructure. However, reliably executing these workflows on distributed clouds is extremely challenging because performance anomalies and faults are frequent in these systems. Hence, accurate, automatic, proactive, online detection of anomalies is essential to pinpoint the time and source of an observed anomaly and to guide the adaptation of the application and infrastructure. In this work, we present an anomaly detection algorithm that applies auto-regression-based statistical methods to online monitoring time-series data to detect performance anomalies when scientific workflows and applications execute on networked cloud systems. We present a thorough evaluation of our auto-regression (AR) based anomaly detection approach by injecting artificial, competing loads into the system. Results show that our AR-based detection algorithm can accurately detect performance anomalies for a variety of exemplar scientific workflows and applications.
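To make the general idea of AR-based detection on a monitoring time series concrete, the following is a minimal sketch and not the algorithm evaluated in this work. It assumes a fixed-order AR model fitted by ordinary least squares on an anomaly-free training window, and it flags a new sample when its one-step prediction error exceeds `k` residual standard deviations. The function names `fit_ar` and `detect_anomalies`, the model order, and the threshold `k` are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_ar(series, order):
    """Least-squares fit of an AR(order) model with intercept.

    Returns (lag coefficients, intercept, residual standard deviation).
    """
    x = np.asarray(series, dtype=float)
    # Each row holds the `order` samples preceding x[t], most recent first.
    rows = [x[t - order:t][::-1] for t in range(order, len(x))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    y = x[order:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta[1:], beta[0], resid.std(ddof=order + 1)

def detect_anomalies(train, stream, order=3, k=4.0):
    """Return indices in `stream` whose one-step prediction error exceeds k sigma.

    Assumption: `train` is anomaly-free and has many more samples than `order`.
    """
    coeffs, intercept, sigma = fit_ar(train, order)
    history = list(train[-order:])  # oldest first
    flagged = []
    for i, value in enumerate(stream):
        lags = np.array(history[::-1])  # most recent first, matching fit_ar
        predicted = intercept + coeffs @ lags
        if abs(value - predicted) > k * sigma:
            flagged.append(i)
        # Slide the window forward; the observed value is kept even if flagged.
        history = history[1:] + [value]
    return flagged

if __name__ == "__main__":
    # Synthetic metric (hypothetical data): an AR(1) process with one injected spike.
    rng = np.random.default_rng(0)
    x = np.zeros(600)
    for t in range(1, 600):
        x[t] = 0.6 * x[t - 1] + rng.normal()
    train, stream = x[:500], x[500:].copy()
    stream[50] += 25.0  # stands in for the effect of a competing load
    print(detect_anomalies(train, stream))
```

Because the flagged value is fed back into the lag window, the samples immediately after a spike may also be reported. A practical detector would have to decide whether to replace flagged samples with their predictions before continuing; that choice is left open here.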