This video talks about Grounding Dino - Dino's "open set" object detection sibling that can detect objects from novel categories zero-shot, as well as detect objects using referring expressions like "the lion most to the right".
This video is part of a broader series: Modern Object Detection - from YOLO to Transformers https://www.youtube.com/playlist?list=PL1HdfW5-F8AQlPZCJBq2gNjERTDEAl8v3. Check out this playlist for other object detection videos, including source code reads for Grounding Dino's predecessors - DETR, Deformable DETR, DAB DETR, DN DETR and Dino.
This is the second video about Grounding Dino, focusing on the code rather than the algorithm itself.
Important links:
- Notebook used in the video: https://github.com/adensur/blog/blob/main/computer_vision_zero_to_hero/27_reading_grounding_dino_source_code/sandbox.ipynb
- My previous video about Grounding Dino: https://youtu.be/qV4LLNoEORo
- Installation instructions: https://github.com/adensur/blog/blob/main/computer_vision_zero_to_hero/27_reading_grounding_dino_source_code/Install.md
- Original paper: https://arxiv.org/pdf/2303.05499
- Grounding Dino source code: https://github.com/IDEA-Research/GroundingDINO
00:00 - Intro
01:22 - Setup
03:51 - Sample Prediction Notebook Overview
08:03 - "Predict" function
17:32 - Backbones, Transformer Input
34:05 - Multi Modality Feature Enhancer
57:40 - Language Guided Query Selection
01:08:50 - Multi Modality Decoder
01:27:42 - Next Up
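As a taste of the "Language Guided Query Selection" chapter: the idea is to score every image token by its similarity to the text tokens and pick the top-k image tokens as decoder query positions. Below is a minimal NumPy sketch of that idea - function and variable names are mine, not the repo's, and the real implementation works on batched PyTorch tensors:

```python
import numpy as np

def language_guided_query_selection(img_feats, txt_feats, num_queries):
    """Pick the image tokens most similar to the text as decoder queries.

    img_feats: (num_img_tokens, d) image token features
    txt_feats: (num_txt_tokens, d) text token features
    Returns the indices of the top `num_queries` image tokens.
    (Illustrative sketch only - not the actual GroundingDINO code.)
    """
    # Similarity of every image token to every text token
    sim = img_feats @ txt_feats.T              # (num_img_tokens, num_txt_tokens)
    # Score each image token by its best-matching text token
    scores = sim.max(axis=1)                   # (num_img_tokens,)
    # Indices of the top-k highest-scoring image tokens
    topk = np.argsort(scores)[::-1][:num_queries]
    return topk
```

In the actual model these selected tokens initialize the positional part of the decoder queries, so the decoder starts from image locations that already "resonate" with the prompt.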