vLLM is a fast and easy-to-use library for LLM inference and serving. In this video, we cover the basics of vLLM, how to run it locally, and then how to deploy it in production on Kubernetes with GPU-attached nodes using a DaemonSet, including a hands-on demo of a production vLLM deployment.
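For the "run it locally" part, a minimal Python sketch of offline inference with the vLLM library looks like the snippet below (the model name is just an example; swap in the model you want to serve):

# minimal vLLM offline inference sketch (example model, adjust to your setup)
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load the model; vLLM handles batching and KV-cache management (PagedAttention) for you.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)

In the Kubernetes setup shown in the video, the same engine runs as an OpenAI-compatible server inside pods scheduled onto the GPU-attached nodes.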
Blog post: https://opensauced.pizza/blog/how-we-saved-thousands-of-dollars-deploying-low-cost-open-source-ai-technologies
John McBride (@JohnCodes)
►►►Connect with me ►►►
► Kubesimplify: https://kubesimplify.com/newsletter
► Newsletter: https://saiyampathak.com/newsletter
► Discord: https://saiyampathak.com/discord
► Twitch: https://saiyampathak.com/twitch
► YouTube: https://saiyampathak.com/youtube.com
► GitHub: https://github.com/saiyam1814
► LinkedIn: https://www.linkedin.com/in/saiyampathak/
► Website: https://saiyampathak.medium.com/
► Instagram: http://instagram.com/saiyampathak/
► Twitter: https://twitter.com/saiyampathak