This episode is brought to you by StackGen! Don't let infrastructure block your teams. StackGen deterministically generates secure cloud infrastructure from any input - existing cloud environments, IaC, or application code. https://ku.bz/t0gBX9qQz

===

Curious about running AI models on Kubernetes without breaking the bank? This episode delivers practical insights from someone who's done it successfully at scale.

John McBride, VP of Infrastructure and AI Engineering at the Linux Foundation, shares how his team at OpenSauced built StarSearch, an AI feature that uses natural language processing to analyze GitHub contributions and surface insights through semantic queries. By using open-source models instead of commercial APIs, the team saved tens of thousands of dollars.

You will learn:

- How to deploy *vLLM* on *Kubernetes* to serve open-source *LLMs* like *Mistral* and *Llama*, including configuration challenges with *GPU drivers* and *DaemonSets*

- Why smaller models (7-14B parameters), with proper *prompt engineering*, can achieve 95% of the effectiveness of larger commercial models for many tasks

- How running inference workloads on your own infrastructure with *T4 GPUs* can cut monthly costs from tens of thousands of dollars to just a couple thousand

- Practical approaches to *monitoring GPU workloads* in production, including handling unpredictable failures and VRAM consumption issues

Find all the links and info for this episode here: https://ku.bz/wP6bTlrFs

===

Interested in sponsoring a KubeFM episode?
https://kube.fm/sponsorships

===

CHAPTERS
=========
00:00 Introduction
00:52 Sponsor
01:25 Three emerging Kubernetes tools to watch
04:21 John's background and current role
06:04 Getting into Cloud Native
08:37 Staying updated in the fast-moving Kubernetes ecosystem
10:26 Career advice for younger self
12:10 StarSearch: The problem and AI feature overview
23:21 Why Kubernetes was the right platform for vLLM deployment
28:28 GPU configuration challenges with vLLM
33:12 Cost savings compared to using OpenAI
37:03 Monitoring approaches for unpredictable AI workloads
41:53 Selecting the right open source models
48:06 The future of Kubernetes and AI workloads
51:39 Closing and contact information

LISTEN ON
=========
- Apple Podcast https://kube.fm/apple
- Spotify https://kube.fm/spotify
- Amazon Music https://kube.fm/amazon
- Overcast https://kube.fm/overcast
- Pocket Casts https://kube.fm/pocket-casts
- Deezer https://kube.fm/deezer