Advanced Model Serving Techniques with Ray on Kubernetes - Andrew Sy Kim, Google & Kai-Hsun Chen, Anyscale
With the proliferation of large language models (LLMs), Ray, an open-source distributed framework for scaling AI/ML workloads, has developed many advanced techniques for serving LLMs in a distributed environment. In this session, Andrew Sy Kim and Kai-Hsun Chen will provide an in-depth exploration of advanced model serving techniques using Ray, covering model composition, model multiplexing, and fractional GPU scheduling. They will also discuss ongoing work in Ray on GPU-native communication, which, combined with Kubernetes Dynamic Resource Allocation (DRA), offers a scalable approach to tensor parallelism, a technique for fitting large models across multiple GPUs. Finally, they will present a live demo showing how KubeRay enables the practical application of these techniques to real-world LLM deployments on Kubernetes, showcasing Ray’s capabilities for scaling, composing, and orchestrating popular open-source models across a diverse set of hardware accelerators and failure domains.