Resilience is widely recognized as a critical challenge for high-performance computing (HPC) systems, as a result of increasing complexity at the level of individual hardware and software components, of subsystems, and of complete heterogeneous system configurations. At scale, and particularly for HPC jobs requiring a large number of heterogeneous resources, faults, errors, and failures can no longer be assumed to be uncommon events. Moreover, even more challenging failure modes have emerged, beyond the assumptions of the commonly used fail-stop model, raising concerns about the integrity of computations and of data at rest and in transit. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, are therefore essential to the success of extreme-scale HPC systems, and more broadly of data-center-scale systems such as cloud infrastructure. Further challenges arise from the interplay between resilience and energy consumption: improving resilience often relies on redundancy (replication and/or checkpointing with rollback and recovery), which consumes extra energy.
Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient computational algorithms. Facility operations and cost-management concerns must also be weighed, within a systematic risk-management framework.
This thematic session aims to bring together experts and practitioners from across the broad spectrum of computing systems technologies, to advance research and development in HPC resilience, and to foster exchange and collaboration among these diverse communities.