A Data-Oriented Characterization of Scientific Machine Learning Workflows

Proceedings of the 16th Workshop on Workflows in Support of Large-Scale Science (WORKS '21), St. Louis, MO, November 2021.

Abstract

Scientific workflows are one of the well-established pillars of modern large-scale computational science. More recently, scientists have started to leverage machine learning (ML) capabilities in their workflows, leading to a new category of scientific workflows, denoted as scientific ML workflows. ML is not only about training and inference: modern ML workflows also involve complex data processing steps before training can start, and these steps are often unaccounted for in performance studies. In this work, we consider scientific ML workflows, from data pre-processing to training, inference, and model evaluation. We aim to explore (i) how scientific ML workflows differ from more traditional scientific workflows; and (ii) how we can characterize ML workflows in terms of both execution time and data movements when executing on an exemplary cloud platform. We select three representative workflows, ranging from image classification to natural language processing and image segmentation, and execute them on the Chameleon academic cloud platform. We build four realistic deployment scenarios for each workflow, which stress data movements during workflow execution. Then, we compare the performance observed under these different configurations and study how different settings impact overall workflow performance and efficiency when running on cloud infrastructures. Finally, we summarize our findings and discuss the performance impact of augmenting scientific workflows with ML techniques, as well as how traditional workflow management systems can improve their support for such workflows.