Fair Sharing of Network Resources Among Workflow Ensembles

Cluster Computing (2021) Journal, Springer. https://doi.org/10.1007/s10586-021-03457-3, November 2021.

Abstract

Computational science depends on complex, data-intensive applications operating on datasets from a variety of scientific instruments. A major challenge is the integration of data into the scientist’s workflow. Recent advances in dynamic, networked cloud resources provide the building blocks to construct reconfigurable, end-to-end infrastructure that can increase scientific productivity, but applications are not taking advantage of them. In our previous work, we introduced DyNamo, which enabled CASA scientists to improve the efficiency of their operations and to effortlessly leverage capabilities of the cloud resources available to them that had previously remained underutilized. However, the provided workflow automation did not satisfy all the operational requirements of CASA. Custom scripts were still in production to manage workflow triggering, and multiple layer 2 connections would have to be allocated to maintain network QoS requirements. To address these issues, we enhance the DyNamo system with advanced network manipulation mechanisms, end-to-end infrastructure monitoring, and ensemble workflow management capabilities. DyNamo’s Virtual Software Defined Exchange (vSDX) capabilities have been extended, enabling link adaptation, flow prioritization, and traffic control between endpoints. These new features allow us to enforce network QoS requirements for each workflow ensemble and can lead to fairer network sharing. Additionally, to accommodate CASA’s operational needs, we have extended the newly integrated Pegasus Ensemble Manager with event-based triggering functionality that improves the management of CASA’s workflow ensembles. Apart from managing the workflow ensembles, the Pegasus Ensemble Manager can also create conditions for fairer resource usage by employing throttling techniques to reduce compute and network resource contention.
We evaluate the effects of DyNamo’s vSDX policies by using two CASA workflow ensembles competing for network resources, and we show that traffic shaping of the ensembles can lead to a fairer sharing of the network links. Finally, we study how changing the Pegasus Ensemble Manager’s throttling for each of the two workflow ensembles affects their performance while they compete for the same network resources, and we assess whether this approach is a viable alternative to the vSDX policies.