Building AI Agents with Multimodal Models

NVIDIA Deep Learning Institute

Paid · Intermediate · 8 hours · Self-paced · Coding required

Last updated June 18, 2026

Building AI Agents with Multimodal Models is an eight-hour, hands-on course from NVIDIA's Deep Learning Institute about teaching neural networks to work with more than one kind of data at once: text, images, video, and sensor inputs such as camera and lidar. Using PyTorch, you start with a robotics example that motivates the underlying ideas, then learn the main ways to combine modalities (early, late, and intermediate fusion) and how to turn a language model into a vision-language model through cross-modal projection. The course also covers extracting text from PDFs with OCR and ends with orchestration, where several models cooperate to answer questions about video using NVIDIA's Video Search and Summarization blueprint. Overall, it is a practical look at how multimodal perception is engineered.
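
To make the three fusion strategies concrete, here is a minimal PyTorch sketch. It is not taken from the course materials: the class names, layer sizes, the three-class output, and the choice to represent lidar as a one-channel depth map at the same resolution as the camera image are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


def make_encoder(in_channels, out_features):
    """Tiny convolutional encoder (hypothetical sizes, for illustration only)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, out_features),
    )


class EarlyFusion(nn.Module):
    """Concatenate the raw inputs along the channel axis, then run one network."""

    def __init__(self, num_classes=3):
        super().__init__()
        self.net = make_encoder(3 + 1, num_classes)  # 3 RGB channels + 1 depth channel

    def forward(self, rgb, depth):
        return self.net(torch.cat([rgb, depth], dim=1))


class LateFusion(nn.Module):
    """Run a complete model per modality and average their class scores."""

    def __init__(self, num_classes=3):
        super().__init__()
        self.rgb_model = make_encoder(3, num_classes)
        self.depth_model = make_encoder(1, num_classes)

    def forward(self, rgb, depth):
        return (self.rgb_model(rgb) + self.depth_model(depth)) / 2


class IntermediateFusion(nn.Module):
    """Encode each modality separately, concatenate the features, classify jointly."""

    def __init__(self, feature_dim=16, num_classes=3):
        super().__init__()
        self.rgb_encoder = make_encoder(3, feature_dim)
        self.depth_encoder = make_encoder(1, feature_dim)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * feature_dim, num_classes))

    def forward(self, rgb, depth):
        features = torch.cat([self.rgb_encoder(rgb), self.depth_encoder(depth)], dim=1)
        return self.head(features)


rgb = torch.randn(4, 3, 32, 32)    # batch of camera images (random stand-in data)
depth = torch.randn(4, 1, 32, 32)  # batch of depth maps (assumed lidar representation)

models = {"early": EarlyFusion(), "late": LateFusion(), "intermediate": IntermediateFusion()}
output_shapes = {name: tuple(model(rgb, depth).shape) for name, model in models.items()}
print(output_shapes)
```

All three variants return one score per class for each sample; they differ only in where the modalities meet: at the input, at the output, or at a learned feature layer in between.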

What you'll learn

  • Preparing different data types for a neural network and comparing early, late, and intermediate fusion
  • Training a contrastive model and building a vector database for retrieval
  • Converting a language model into a vision-language model via cross-modal projection
  • Orchestrating multiple models to answer questions about video content
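
As a rough illustration of the cross-modal projection item above, the sketch below maps image features into a language model's token-embedding space and prepends them to the text tokens. It shows one common design rather than the course's own implementation: `CrossModalProjector`, the dimensions, and the stand-in embedding table are hypothetical, and a real system would use a pretrained vision encoder and language model in their place.

```python
import torch
import torch.nn as nn

VISION_DIM = 64    # hypothetical feature width of the image encoder
TEXT_DIM = 128     # hypothetical width of the language model's token embeddings
VOCAB_SIZE = 1000  # hypothetical vocabulary size


class CrossModalProjector(nn.Module):
    """Small MLP that maps image features into the text embedding space."""

    def __init__(self, vision_dim, text_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features):
        # (batch, num_patches, vision_dim) -> (batch, num_patches, text_dim)
        return self.proj(image_features)


# Stand-in for the language model's embedding table (assumption, not a real LM).
token_embedding = nn.Embedding(VOCAB_SIZE, TEXT_DIM)
projector = CrossModalProjector(VISION_DIM, TEXT_DIM)

image_features = torch.randn(2, 9, VISION_DIM)      # 9 patch features per image (random stand-in)
token_ids = torch.randint(0, VOCAB_SIZE, (2, 5))    # 5 text tokens per sample

image_tokens = projector(image_features)            # image patches as pseudo text tokens
text_tokens = token_embedding(token_ids)
lm_input = torch.cat([image_tokens, text_tokens], dim=1)  # sequence the language model would read
print(lm_input.shape)
```

In designs like this, the projector is often the only part trained at first, so the language model learns to read image-derived tokens without its own weights changing; whether the course follows that training schedule is not stated in this listing.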

Frequently asked questions about Building AI Agents with Multimodal Models

Who is Building AI Agents with Multimodal Models for?

Developers with basic deep-learning knowledge and PyTorch familiarity who want to build multimodal models and agents.

Is Building AI Agents with Multimodal Models free?

No — Building AI Agents with Multimodal Models is a paid course.

What are the prerequisites for Building AI Agents with Multimodal Models?

A basic understanding of deep learning concepts and familiarity with a deep-learning framework such as TensorFlow, PyTorch, or Keras (the course uses PyTorch).

Do you need to code for Building AI Agents with Multimodal Models?

Yes — Building AI Agents with Multimodal Models involves hands-on coding.

Why we suggest this course

For developers comfortable with deep learning who want to move beyond text-only models and build systems that fuse several data types. The distinct takeaway is working through the fusion techniques and cross-modal projection that turn separate sensors and modalities into a single model.

