Multimodal

AI that understands and processes not only text but also other types of input, such as images, audio, and video.

Multimodal refers to AI that goes beyond handling text alone and can understand and generate information in multiple forms (modalities), such as images, audio, and video. Just as a person communicates by seeing with their eyes and hearing with their ears, a multimodal AI works across several kinds of input: show it a photo of the inside of a refrigerator, and it can recognize the ingredients and suggest a recipe.
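To make the idea of "modalities" concrete, here is a minimal sketch of how a mixed request, like the refrigerator photo plus a question, might be represented as a list of typed parts. The `Part` class, the `modalities_used` and `is_multimodal` helpers, and the file name `fridge.jpg` are all hypothetical names chosen for illustration; they do not correspond to any particular vendor's API.

```python
from dataclasses import dataclass
from typing import Literal

# The modalities named in this entry; a real system may support others.
Modality = Literal["text", "image", "audio", "video"]


@dataclass(frozen=True)
class Part:
    """One piece of a request (hypothetical structure for illustration)."""
    modality: Modality
    content: str  # the text itself, or a file path for non-text media (assumed convention)


def modalities_used(parts: list[Part]) -> list[str]:
    """Return the distinct modalities in a request, in first-seen order."""
    seen: list[str] = []
    for part in parts:
        if part.modality not in seen:
            seen.append(part.modality)
    return seen


def is_multimodal(parts: list[Part]) -> bool:
    """A request is multimodal when it combines more than one modality."""
    return len(modalities_used(parts)) > 1


# The refrigerator example: one image plus one text question.
request = [
    Part("image", "fridge.jpg"),
    Part("text", "What can I cook with these ingredients?"),
]

print(modalities_used(request))  # ['image', 'text']
print(is_multimodal(request))    # True
```

A text-only request built the same way would contain a single modality, so `is_multimodal` would return `False` for it.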

✅ Why it matters

Because most real-world information is not text, multimodal capabilities are essential for AI to be useful in actual work and daily life. Its uses are rapidly expanding to include organizing documents captured in photos, holding voice conversations, and analyzing video, and most major AI models now support multimodality as standard.

⚠️ Limits and debates

However, accepting images as input does not mean a model sees them as reliably as a human does. It may misread fine-grained numbers or misjudge where objects are located, so important decisions should be verified.
