Multimodal Models
Models that process and align multiple modalities (text + image, text + video, etc.). CLIP aligns image and text encoders via contrastive learning on paired data; Flamingo interleaves visual and language inputs for few-shot visual question answering; GPT-4V adds image understanding to a large language model. These models enable zero-shot image classification, cross-modal image search, and visual question answering.
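The CLIP-style contrastive objective mentioned above can be sketched in a few lines: given a batch of paired image and text embeddings, the matched pairs sit on the diagonal of the similarity matrix, and a symmetric cross-entropy pulls them together while pushing mismatched pairs apart. A minimal NumPy sketch (the function name and temperature default are illustrative, not from any specific library):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, scaled by temperature
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # matched pairs lie on the diagonal

    def xent(l):
        # Numerically stable cross-entropy with diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2
```

The same normalized-similarity trick gives zero-shot classification: embed each class name as text (e.g. "a photo of a dog") and pick the class whose text embedding has the highest cosine similarity with the image embedding.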
Related
- Foundation Models (multimodal foundation models)
- Contrastive Learning (CLIP training method)