Multimodal Models
Models that process and align multiple modalities (text + image, text + video, etc.). CLIP aligns image and text encoders via contrastive learning on paired data; Flamingo interleaves visual and language inputs for few-shot visual question answering; GPT-4V adds image understanding to a large language model. These models enable zero-shot image classification, cross-modal image search, and visual question answering.
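The CLIP-style contrastive objective mentioned above can be sketched in a few lines: given a batch of paired image and text embeddings, the matched pairs sit on the diagonal of the similarity matrix, and a symmetric cross-entropy pulls them together while pushing mismatched pairs apart. A minimal NumPy sketch (the function name and temperature default are illustrative, not from any specific library):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, scaled by temperature
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # matched pairs lie on the diagonal

    def xent(l):
        # Numerically stable cross-entropy with diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2
```

The same normalized-similarity trick gives zero-shot classification: embed each class name as text (e.g. "a photo of a dog") and pick the class whose text embedding has the highest cosine similarity with the image embedding.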
Related
- Foundation Models (multimodal foundation models)
- Contrastive Learning (CLIP training method)