Glossary · Term

Multimodal

AI that understands and processes not only text but also other types of input, such as images, audio, and video.

Multimodal refers to AI that goes beyond handling text alone and can understand and generate information in multiple forms (modalities), such as images, audio, and video. Just as a person takes in the world by seeing with their eyes and hearing with their ears, a multimodal AI can combine several kinds of input. For example, if you show it a photo of the inside of your refrigerator, it can recognize the ingredients and suggest a recipe.

Because much real-world information is not text, multimodal capabilities are essential for AI to be useful in actual work and daily life. The scope of use is rapidly expanding to include organizing photographed documents, voice conversations, and video analysis, and most major AI models now support multimodal input as a standard feature.

However, accepting images as input does not mean that a model sees them as accurately as a human does. It may misread fine-grained numbers or misjudge where objects are located, so important decisions require verification.

✅ Why it matters

It has a wide range of uses because it can handle real-world data such as photos and audio as is.
It enables new services such as organizing photographed documents, voice assistants, and video analysis.
Considering text and images together allows a more accurate understanding of context.

⚠️ Limits and debates

Recognition errors still occur, such as misreading fine details in images.
Processing costs are higher, and responses are often slower, than with text alone.
Instructions hidden in images and other inputs can become a channel for new types of security attacks.