What is multimodal artificial intelligence and why is it important?
- The next frontier for AI models is multimodal systems, in which users can engage with AI through several modes of input and output, such as text, images and speech.
- Human cognition is inherently multimodal: we perceive the world through sight, sound and language together. A text-only chatbot, even one that can write competent poetry and pass the U.S. bar exam, hardly matches up to this fullness of cognition.
- If AI systems are to be as close a likeness of the human mind as possible, the natural course is for them to become multimodal.
How does multimodality work?
- OpenAI’s text-to-image model DALL·E, on which ChatGPT’s image-generation capabilities are based, is a multimodal AI model that was released in 2021.
- ChatGPT’s voice-processing capabilities are based on OpenAI’s own open-source speech-recognition model, Whisper, which was released in September 2022.
- Whisper can recognise speech in audio across many languages, transcribe it, and translate it into English text, as in the sketch below.
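A minimal sketch of both tasks using OpenAI’s open-source whisper Python package; the audio file name here is a hypothetical example.

```python
# Minimal sketch: speech recognition and speech-to-English translation with
# OpenAI's open-source Whisper package (pip install openai-whisper).
# "speech_sample.mp3" is a hypothetical file name used for illustration.
import whisper

model = whisper.load_model("base")  # small pretrained checkpoint

# Transcribe the audio in the language it was spoken in
transcription = model.transcribe("speech_sample.mp3")
print(transcription["text"])

# Translate the same speech into English text
translation = model.transcribe("speech_sample.mp3", task="translate")
print(translation["text"])
```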
Applications of multimodal AI
- These models also perform simpler but important functions, such as automatic image caption generation.
- In 2020, Meta was working on a multimodal system to automatically detect hateful memes on Facebook.
- AI models that perform speech translation are another obvious area for multimodality.
- Google Translate uses multiple models, as do others such as Meta’s SeamlessM4T model, which was released in August 2023.
- The model can perform text-to-speech, speech-to-text, speech-to-speech and text-to-text translations for around 100 languages.
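As an illustration, here is a minimal text-to-text translation sketch, assuming the Hugging Face transformers port of SeamlessM4T (the "facebook/hf-seamless-m4t-medium" checkpoint); the example sentence and language pair are illustrative.

```python
# Minimal sketch: text-to-text translation with SeamlessM4T via the
# Hugging Face transformers port (assumes transformers >= 4.35).
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# Translate an English ("eng") sentence into Hindi ("hin")
text_inputs = processor(text="Multimodal AI is the next frontier.",
                        src_lang="eng", return_tensors="pt")
output_tokens = model.generate(**text_inputs, tgt_lang="hin",
                               generate_speech=False)
translated_text = processor.decode(output_tokens[0].tolist()[0],
                                   skip_special_tokens=True)
print(translated_text)
```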
Prelims Takeaway
- Artificial intelligence