Advancements in Multimodal AI Technology by OpenAI and Google
This week, OpenAI and Google showcased their latest advances in artificial intelligence (AI). For the past few years, tech companies have focused chiefly on making AI models smarter; the newer push is to make them multimodal, able to move fluidly between text, speech, and vision rather than handling only a single kind of input.
The Shift Towards Multimodal AI
One of the key buzzwords in the tech industry right now is “multimodal AI.” Tech companies are increasingly emphasizing the integration of multiple kinds of input into their AI models to create a more natural and engaging user experience. Since the launch of ChatGPT in late 2022, the chatbot era has evolved: companies now aim to build AI assistants that can communicate through speech and vision, not just text-based exchanges.
When executed well, multimodal AI can feel like science fiction becoming reality, blurring the line between talking to a machine and talking to a person. OpenAI and Google are at the forefront of this push, demonstrating models that can process and respond to information across different modalities seamlessly.
OpenAI’s GPT-4o
OpenAI recently unveiled GPT-4o, with the “o” standing for “omni” to reflect the model’s ability to handle text, audio, and vision natively. The model processes video and audio inputs simultaneously, letting users engage with the assistant through a combination of visual and spoken cues. In one notable demonstration, ChatGPT helped a user work through a math problem by viewing the equation through a phone camera while talking the user through the steps. This multimodal approach aims to make interactions with AI systems more intuitive and user-friendly.
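To make the idea of a single multimodal request more concrete, here is a minimal sketch of what sending text plus an image to such a model looks like, assuming the openai Python SDK and a placeholder image URL (audio and live video input were demoed on stage rather than exposed this way, so the sketch sticks to vision):

```python
# Illustrative sketch only: sends a text prompt plus an image to a multimodal
# model via the openai Python SDK. The image URL is a placeholder, and an
# OPENAI_API_KEY is assumed to be set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # Text and image parts travel in one request; the model reasons
                # over both together instead of handing the image off to a
                # separate vision system.
                {"type": "text", "text": "Walk me through solving the equation in this photo."},
                {"type": "image_url", "image_url": {"url": "https://example.com/math-homework.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```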
Google’s Project Astra
Parallel to OpenAI’s efforts, Google introduced Project Astra, a similar initiative aimed at building multimodal AI assistants. In its demo, Google showed the assistant identifying objects through a camera, including correctly identifying a bouquet of tulips. The technology is still in its early stages, and Google acknowledges the challenge of achieving real-time conversational responses with multimodal AI. Despite its current limitations, Project Astra signals Google’s commitment to advancing AI in the multimodal domain.
The Competition and Future Prospects
As the race for superior multimodal AI intensifies, OpenAI appears to have gained a significant advantage with GPT-4o. Unlike Google’s approach, OpenAI’s model natively processes audio, video, and text without relying on separate translation models to convert between them, which yields faster response times and a more integrated user experience. The rise of AI wearables, such as the Humane AI Pin and Meta’s Ray-Ban smart glasses, further underscores the industry’s shift toward multimodal AI.
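For a sense of why those separate translation models cost latency, here is a rough sketch of the cascaded pipeline earlier voice assistants used: speech is transcribed to text, a text-only model replies, and the reply is synthesized back into speech. This is not OpenAI’s or Google’s production implementation, just an illustration using the openai SDK’s transcription and text-to-speech endpoints with a placeholder audio file named question.mp3:

```python
# Rough sketch of the cascaded "translation model" pipeline described above,
# not the production implementation of any assistant. Assumes the openai
# Python SDK and a local recording named question.mp3 (placeholder path).
from openai import OpenAI

client = OpenAI()

# 1. Speech -> text: transcribe the user's spoken question.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Text -> text: a text-only model produces the answer.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Text -> speech: synthesize the answer back into audio.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=answer,
)
speech.stream_to_file("answer.mp3")

# Three sequential round trips are where the latency comes from; a natively
# multimodal model takes audio in and produces audio out in a single step.
```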
In the coming months and years, multimodal AI is expected to reshape the AI landscape, bringing more useful and efficient assistants to a wide range of applications. By letting AI systems perceive and interpret information across different modalities, these models promise more intuitive and immersive interactions with AI-powered devices. Ultimately, multimodal AI holds out the promise of relieving users of the burden of translating the world into a machine-readable format, letting AI systems perceive and make sense of the world around them on their own.
This story originally appeared on Gizmodo.