Project Description
Existing adaptive management and resource partitioning strategies for resilient infrastructures are often static, based on rules crafted by experts with years of experience, and dependent on centralized control. Although online and dynamic resource management using mainstream artificial intelligence (AI) methods has received significant attention, these methods have not been shown to be effective at scale because they cannot handle the unique challenges posed by the complexity and scale of resilient infrastructures. The SWARM project explores how distributed intelligence, specifically swarm intelligence (SI), can provide robust, performant, resilient, and fault-tolerant execution of DOE scientific workflows that span a continuum of resources, from edge devices near sensors and instruments, through wide-area networks, to leadership-class systems. The goal is to design an SI-based resilient Integrated Research Infrastructure (IRI) that can quickly recover from failures, adapt to changes in the environment, maximize overall resource utilization, and minimize the execution time of workflows submitted by DOE scientists.
Partners: University of Southern California (Lead), Lawrence Berkeley National Laboratory, Argonne National Laboratory, Oak Ridge National Laboratory, RENCI
Funding: US Department of Energy
Project Publications
- (POSTER) Network Testbed for Experimenting With Decentralized Federated Learning
- Shaping the Future of Self-Driving Autonomous Laboratories Workshop
- Advancing Anomaly Detection in Computational Workflows with Active Learning
- DISTRI: Development and Integration of Simulation Tools for Resilient Infrastructure
- SWARM: Scientific Workflow Applications on Resilient Metasystem