Fine-Tuning Multimodal LLMs (LLAVA) for Image Data Parsing

Farzad Roozitalab (AI RoundTable) | 8,467 views | 8 months ago

In this video, we'll fine-tune LLAVA, an open-source multimodal LLM available on HuggingFace, to extract information from receipt images and output it as JSON. By the end, we'll deploy the model with Flask and build a Streamlit dashboard for the task.
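A model fine-tuned to emit JSON often wraps its answer in markdown fences or surrounding prose, so the parsing side needs a small extraction step. Here is a minimal sketch of such a helper (a hypothetical `extract_json` function, not code from the video):

```python
import json
import re


def extract_json(reply: str) -> dict:
    """Pull the first JSON object out of a model reply.

    The fine-tuned model may wrap its JSON in ```json fences or
    extra prose, so we strip fences and locate the outermost braces.
    """
    reply = re.sub(r"```(?:json)?", "", reply)  # drop markdown fences
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    return json.loads(reply[start : end + 1])


# Example with a cord-v2 style reply (field names are illustrative):
reply = '```json\n{"menu": [{"nm": "Latte", "price": "4.50"}], "total": {"total_price": "4.50"}}\n```'
print(extract_json(reply)["total"]["total_price"])  # 4.50
```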

🚀 Hyperstack (NexGen cloud pricing): https://www.hyperstack.cloud/?utm_source=Influencer&utm_medium=AI%20Round%20Table&utm_campaign=Video%201
🚀 GitHub Repository: https://github.com/Farzad-R/Finetune-LLAVA-NEXT
🚀 HuggingFace Hub to access the fine-tuned model: https://huggingface.co/Farzad-R/llava-v1.6-mistral-7b-cordv2

00:00 Intro
00:42 Dashboard demo
01:55 LLAVA background
02:44 LLAVA playground
04:23 Fine-tuning pipeline schema
06:21 Hardware requirements (Hyperstack GPUs)
07:59 Sample datasets (cord-v2 and docvqa)
12:09 LLAVA architecture
15:07 Project code overview
15:57 Test LLAVA 7B to 34B
23:38 This video's pipeline overview
25:12 Data preparation
37:29 Model preparation and training
45:33 Testing the fine-tuned model
48:18 Model deployment and dashboard design
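The deployment step (48:18) serves the model behind a Flask API. A minimal sketch of such an endpoint, assuming a `/parse` route and a placeholder `model_predict` in place of the real LLAVA inference call (both names are illustrative, not from the video):

```python
import io

from flask import Flask, jsonify, request

app = Flask(__name__)


def model_predict(image_bytes: bytes) -> dict:
    # Placeholder: the real service would run the fine-tuned LLAVA
    # model on the uploaded receipt image and return its JSON output.
    return {"total": {"total_price": "0.00"}}


@app.route("/parse", methods=["POST"])
def parse_receipt():
    # Expect the receipt as a multipart file field named "image".
    if "image" not in request.files:
        return jsonify({"error": "missing 'image' file field"}), 400
    image_bytes = request.files["image"].read()
    return jsonify(model_predict(image_bytes))


# To serve locally: app.run(host="0.0.0.0", port=5000)
```

A Streamlit dashboard would then POST the user's uploaded image to this endpoint and render the returned JSON.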

#hyperstack #gpu #huggingface #pytorch #streamlit
#llm #python #llava

📚 Extra Resources:
- LLAVA-NEXT models: https://huggingface.co/docs/transformers/en/model_doc/llava_next
- LLAVA-NEXT info: https://llava-vl.github.io/blog/2024-01-30-llava-next/
