
What is multimodal artificial intelligence and why is it important?


  • The next frontier for AI models is multimodal systems, in which users can engage with AI through several modes of input and output.
  • A chatbot, even one that can write competent poetry and pass the U.S. bar exam, hardly matches this fullness of human cognition.
  • If AI systems are to resemble the human mind as closely as possible, the natural course is multimodality.

How does multimodality work?

  • OpenAI’s text-to-image model DALL·E, on which ChatGPT’s vision capabilities are based, is a multimodal AI model that was released in 2021.
  • ChatGPT’s voice-processing capabilities are based on OpenAI’s open-source speech-to-text model, called Whisper, which was released in September 2022.
  • Whisper can recognise speech in audio and transcribe it into plain text.

Applications of multimodal AI

  • These models perform some simple but important functions, such as automatic image-caption generation.
  • In 2020, Meta was working on a multimodal system to automatically detect hateful memes on Facebook.
  • AI models that perform speech translation are another obvious segment for multimodality.
  • Google Translate uses multiple models, as do others such as Meta’s SeamlessM4T model, which was released in August 2023.
  • The model can perform text-to-speech, speech-to-text, speech-to-speech and text-to-text translations for around 100 languages.

Prelims Takeaway

  • Artificial intelligence
