Don't miss out! Join us at our next Flagship Conference: KubeCon + CloudNativeCon North America in Salt Lake City from November 12 - 15, 2024. Connect with our current graduated, incubating, and sandbox projects as the community gathers to further the education and advancement of cloud native computing. Learn more at https://kubecon.io
Efficient Multi-Cluster GPU Workload Management with Karmada and Volcano - Kevin Wang, Huawei
With the increasing adoption of Kubernetes for AI/ML workloads, many companies build their cloud native AI platforms on multiple Kubernetes clusters that span data centers and a diverse range of GPU types. Managing such a large-scale, heterogeneous GPU environment presents critical challenges, including resource fragmentation, operational costs, and scheduling workloads across different resource types. This talk will explore how these challenges are addressed using Karmada and Volcano, which together enable multi-cluster batch job management alongside other types of workloads.

This talk will cover:
• Intelligent GPU workload scheduling across multiple clusters
• Cluster failover support for seamless workload migration to clusters with available resources
• Two-level scheduling consistency and efficiency, both in-cluster and across clusters
• Balancing utilization and QoS for resource sharing among workloads with different priorities
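As a rough sketch of the multi-cluster batch-job pattern the talk describes, a Volcano Job can be declared once and distributed to member clusters by a Karmada PropagationPolicy. All names, images, replica counts, and cluster names below are hypothetical placeholders, not taken from the talk:

```yaml
# Hypothetical Volcano batch job requesting GPUs (all names/figures illustrative)
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: train-demo
spec:
  minAvailable: 2          # gang scheduling: place all workers or none
  schedulerName: volcano
  tasks:
    - name: worker
      replicas: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: example.com/trainer:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1               # one GPU per worker
---
# Hypothetical Karmada PropagationPolicy that selects the Job above and
# constrains it to clusters that have the required GPU capacity
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: train-demo-policy
spec:
  resourceSelectors:
    - apiVersion: batch.volcano.sh/v1alpha1
      kind: Job
      name: train-demo
  placement:
    clusterAffinity:
      clusterNames: [cluster-a100, cluster-v100]  # illustrative cluster names
```

Under this pattern, Volcano handles in-cluster gang scheduling of the job's pods, while Karmada decides which member cluster receives the job and, with its failover features, can reschedule it to another cluster with available resources when the original cluster becomes unhealthy.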