Cattle Not Pets, but Don't Delete It Until Investigated - Masaki Kimura & Keisuke Saito, Hitachi, Ltd.

In the cloud native world, nodes are typically regarded as replaceable “cattle” rather than beloved “pets”. This means that when a node fails, it is immediately deleted and replaced with a new one. However, should a failed node be considered replaceable when further investigation is required? Identifying the root cause of a failure is important for preventing it from happening again. For these investigations, engineers, like detectives in mystery novels, need as many clues as possible. Therefore, a failed node shouldn't be deleted until all necessary clues have been collected. On the other hand, existing projects that handle Kubernetes node failures automatically delete the failed node, primarily for fencing purposes, to ensure a proper failover of the workload onto another node. This presentation will explore how Kubernetes node failures are currently handled in existing projects, propose an alternative approach for this particular use case, and present an implementation idea that leverages the External Remediation feature of MachineHealthCheck and existing fencing technologies, including fence_kdump.