Scientific workflows support cutting-edge computational science. They automate many tedious and error-prone tasks on behalf of the scientist: capturing data at the instrument, pre-processing it, moving it to an HPC cluster for analysis, and moving the results to a visualization platform. Based on the results obtained, the current experiment or future experiments can be modified. As these workflows grow in size and complexity, and as the need for rapid experimental feedback becomes critical, they will require ever more powerful and sophisticated computing, storage, and network platforms, as well as more advanced workflow management systems, to achieve the desired performance, scalability, and reliability.
Research in workflow management systems has focused primarily on the development of scheduling and resource provisioning algorithms, as well as provenance capture capabilities, that take into account relatively small workflows operating in relatively isolated testbeds. Due to the lack of realistic data sets, production workflow management systems often use simplistic algorithms for resource selection and task scheduling. They also focus primarily on the computational aspects of the workflows rather than the end-to-end process that includes instruments and data management. In order to advance our understanding of workflow behavior when executing on leadership-class facilities, there is a pressing need to collect and analyze data from the entire workflow life cycle. These data need to represent a significant number of real executions utilizing a variety of operational modes and leveraging a variety of end-to-end infrastructures. Building on the successes of the Panorama project, which developed initial data collection tools for a subset of the scientific workflows, this project will: 1) expand the types of workflows under study (including instruments-in-the-loop, parameter sweep studies, and streaming data capture and analysis); 2) apply machine learning algorithms to analyze performance data, to detect performance bottlenecks and anomalies, and to optimize workflow performance; and 3) build a community repository that can serve as a unique resource for researchers to develop algorithms and techniques for sophisticated resource and workflow management, exploring querying and provenance management, fault tolerance, and adaptivity, to name a few. The repository will be physically distributed but logically centralized and will hold de-identified application and infrastructure data.
Panorama 360 team members will work closely with the DOE facilities to make sure that the data released does not affect infrastructure security and does not contain sensitive or proprietary information.
The proposed work will explore a number of performance data analysis, synthesis, and characterization approaches based on machine learning (ML) methods utilizing both supervised and unsupervised learning. These algorithms and analyses will detect workflow anomalies, identify performance bottlenecks, aid debugging and troubleshooting to pinpoint sources of failure, and guide adaptations and performance optimizations. The findings and best practices will be published and made available to the community.
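As a purely illustrative sketch of the unsupervised end of this spectrum, the following flags outlying task runtimes using a modified z-score based on the median absolute deviation. This is a simple statistical stand-in, not the project's method; the function name, the threshold of 3.5, and the sample runtimes are all hypothetical.

```python
from statistics import median


def flag_runtime_anomalies(runtimes, threshold=3.5):
    """Return indices of runtimes whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single extreme value does not mask itself.
    """
    med = median(runtimes)
    mad = median(abs(r - med) for r in runtimes)
    if mad == 0:
        # At least half the values equal the median; the score is undefined.
        return []
    return [
        i
        for i, r in enumerate(runtimes)
        if 0.6745 * abs(r - med) / mad > threshold
    ]


# Hypothetical runtimes (seconds) for repeated executions of one workflow task.
runtimes = [12.1, 11.8, 12.4, 12.0, 47.3, 11.9, 12.2]
anomalies = flag_runtime_anomalies(runtimes)
print(anomalies)
```

With labeled failure data, a supervised classifier could take the place of this rule; the sketch only shows the shape of the input (per-task performance measurements) and the output (flagged executions).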
Initially, Panorama 360 will focus on three DOE applications: one that performs large-scale calculations emblematic of parameter sweeps, one that combines and processes data from the light source at ANL and the Spallation Neutron Source at ORNL, and one that processes streaming data from the light source at LBNL. As the project progresses, other DOE applications and facilities will be engaged.
In order to foster collaboration with other program participants and the broader community, the project will develop and publicize data formats and APIs for uploading, downloading, and querying data. Where possible, the formats and tools will rely on existing standards and solutions. All data captured and analysis tools developed in this project will be released as open source and made available on GitHub. Repository data and supporting services will remain available for at least five years after the end of the project (in partnership with ORNL). Project members will also share their experiences and best practices so that others can leverage the data, tools, and methodologies. Additional engagement activities will include discussions, demonstrations, and tutorials at conferences such as SC.
This project is a follow-on to the Panorama project: http://nrig.renci.org/project/panorama/