Multimodal Models

Multimodal models process and align multiple modalities (text + image, text + video, etc.). Examples include CLIP, which aligns text and images in a shared embedding space via contrastive learning; Flamingo, which performs few-shot visual question answering; and GPT-4V, which accepts image inputs alongside text. These models enable zero-shot image classification, image search, and visual question answering.
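
As a rough illustration of how a shared text-image embedding space enables zero-shot classification, the sketch below scores one image embedding against several text-prompt embeddings by cosine similarity and picks the closest prompt. The embeddings are hand-made toy vectors rather than outputs of a real encoder, the function names are hypothetical, and the temperature of 0.07 is an assumed value chosen for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, text_embs, labels, temperature=0.07):
    """Pick the label whose text embedding is closest to the image embedding.

    temperature is an assumed value for this sketch; smaller values
    sharpen the resulting probability distribution.
    """
    sims = [cosine_similarity(image_emb, t) for t in text_embs]
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy embeddings (assumed for illustration; a real system would obtain
# these from trained text and image encoders).
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_emb, text_embs, labels)
print(label)
```

Image search works the same way with the roles reversed: embed one text query, then rank a collection of image embeddings by their similarity to it.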


computer-vision multimodal clip