Toward a Data Lifecycle Model for NSF Large Facilities

Practice and Experience in Advanced Research Computing (PEARC 2020) Conference, Portland, OR.


National Science Foundation large facilities conduct large-scale physical and natural science research. They include telescopes that survey the entire sky, gravitational wave detectors that look deep into our universe’s past, sensor-driven field sites that collect a range of biological and environmental data, and more. The Cyberinfrastructure Center for Excellence (CICoE) pilot project aims to develop a model for a center that facilitates community building, fosters knowledge sharing, and applies best practices in consulting with large facilities about their cyberinfrastructure. To accomplish this goal, the pilot began an in-depth study of how large facilities manage their data over the course of their research. Large facilities are diverse and highly complex, from the types of data they capture, to the equipment they use, to the data processing and analysis they conduct, to their policies on data sharing and use. Because of this complexity, the pilot needed a single lens through which to frame its growing understanding of large facilities and identify areas where it could best serve them. From this research, common themes emerged that enabled the creation of a data lifecycle model that captures the data management practices of large facilities. This model has enabled the pilot to organize its thinking about large facilities and to frame its support and consultation efforts around the cyberinfrastructure used during the lifecycle stages. This paper describes the model and discusses how it was applied to disaster recovery planning for a representative large facility, IceCube.