
What is multimodal artificial intelligence and why is it important?


  • The next frontier for AI models is multimodal systems, in which users can engage with AI through several modes of input and output.
  • A chatbot, even one that can write competent poetry and pass the U.S. bar exam, hardly matches this fullness of human cognition.
  • If AI systems are to resemble the human mind as closely as possible, the natural course is multimodality.

How does multimodality work?

  • OpenAI’s text-to-image model DALL·E, on which ChatGPT’s vision capabilities are based, is a multimodal AI model that was released in 2021.
  • ChatGPT’s voice-processing capabilities are based on OpenAI’s open-source speech-to-text model, called Whisper, which was released in September 2022.
  • Whisper can recognise speech in audio and transcribe it into plain text.

Applications of multimodal AI

  • These models perform some simple but important functions, such as automatic image-caption generation.
  • In 2020, Meta was working on a multimodal system to automatically detect hateful memes on Facebook.
  • AI models that perform speech translation are another obvious segment for multimodality.
  • Google Translate uses multiple models, as do others such as Meta’s SeamlessM4T model, which was released in August 2023.
  • The model can perform text-to-speech, speech-to-text, speech-to-speech and text-to-text translations for around 100 languages.

Prelims Takeaway

  • Artificial intelligence
