
Tim Dettmers | QLoRA: Efficient Finetuning of Quantized Large Language Models

London Machine Learning Meetup · 5,969 views · 1 year ago

Sponsored by Evolution AI: https://www.evolution.ai

Abstract: Recent open-source large language models (LLMs) like LLaMA and Falcon are high-quality and provide strong performance for their memory footprint. However, finetuning these LLMs remains challenging on consumer and mobile devices, with a 32B LLaMA model requiring 384 GB of GPU memory for finetuning. In this talk, I introduce QLoRA, a technique that reduces the memory required for finetuning LLMs by roughly 17 times, making a 32B LLM finetunable on 24 GB consumer GPUs and 7B language models finetunable on mobile devices. The talk provides a self-contained introduction to quantization and discusses the critical factors that allow QLoRA to use 4-bit precision for LLM finetuning while still replicating full 16-bit finetuning performance. I also discuss the evaluation of LLMs and how we used insights from our LLM evaluation study to build one of the most powerful open-source chatbots, Guanaco.
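To make the 4-bit idea mentioned in the abstract concrete, here is a minimal sketch of blockwise absmax quantization, the basic mechanism behind compressing weights to a few bits: each block of weights is scaled by its absolute maximum and rounded to a small set of integer levels, storing one higher-precision scale per block. This is an illustrative simplification, not QLoRA's implementation; the NF4 data type used by QLoRA places its quantization levels non-uniformly rather than on the uniform integer grid used here.

```python
def quantize_block(weights, bits=4):
    """Quantize one block of floats to signed integers plus a per-block scale.

    Uniform absmax quantization: the largest-magnitude weight maps to the
    outermost integer level, so levels = 2**(bits-1) - 1 (7 for 4-bit).
    """
    levels = 2 ** (bits - 1) - 1
    absmax = max(abs(w) for w in weights) or 1.0  # avoid divide-by-zero
    scale = absmax / levels                        # one float stored per block
    q = [round(w / scale) for w in weights]        # 4-bit integers in [-7, 7]
    return q, scale


def dequantize_block(q, scale):
    """Recover approximate weights from the integers and the block scale."""
    return [v * scale for v in q]


# Hypothetical example block of weights, purely for illustration.
weights = [0.12, -0.48, 0.03, 0.91, -0.27, 0.66, -0.88, 0.05]
q, scale = quantize_block(weights)
restored = dequantize_block(q, scale)
# Each restored weight lies within half a quantization step (scale / 2)
# of the original, which is the worst-case rounding error of this scheme.
```

In QLoRA, weights quantized this way stay frozen in 4-bit while small low-rank adapter matrices are trained in 16-bit, which is what makes the memory savings compatible with full-precision finetuning quality.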

Speaker bio: Tim is a PhD student at the University of Washington advised by Luke Zettlemoyer, working on efficient deep learning to make training, finetuning, and inference of deep learning models more accessible, in particular to those with the fewest resources. Tim is the maintainer of bitsandbytes, a widely used machine learning library for 4-bit and 8-bit quantization with 200k pip installations per month. He has a background in applied math and industrial automation.
