Failures in leadership-class accelerated HPC and AI systems have become increasingly common, and their frequency is expected to rise as these systems continue to scale. With hundreds of thousands of field-replaceable parts in such systems, automated failure management is essential. This talk introduces StabilityDB, a failure management automation framework that leverages real-time data analytics to drive failure servicing and maintenance on a per-failure-mode basis. This approach minimizes compute node downtime and keeps overall system availability high. We will provide an architectural overview of StabilityDB and present statistics on the failure characteristics that guide our automation policies. StabilityDB has been deployed on the Aurora supercomputer at Argonne National Laboratory, a system with 63,744 GPUs, and is contributing to its efficient operation.