
On Monday, May 13th (US time), OpenAI Chief Technology Officer Mira Murati announced in a highly anticipated live demonstration the launch of a new flagship AI model, GPT-4o, an update to the GPT-4 model that has been available for over a year. OpenAI also launched a desktop version of ChatGPT and a new user interface (UI).
The GPT-4o model is trained on large amounts of Internet data, handles text and audio better, and supports 50 languages. Notably, GPT-4o can respond to audio input in as little as 232 milliseconds, approaching human conversational response times.
Murati stated that the new model is aimed at everyone, not just paying users, bringing "GPT-4 level intelligence to our free users." However, the GPT-4o application programming interface (API) does not yet offer the voice functionality to all customers. Given the risk of abuse, OpenAI plans to first roll out support for GPT-4o's new audio features to a small group of trusted partners in the coming weeks.
After the release of GPT-4o, netizens gave it mixed reviews. Nvidia scientist Jim Fan commented, "From a technical perspective, overall it's a data and system optimization problem." Some netizens said they feel OpenAI has become less innovative, while others believe OpenAI has further widened its gap with Apple, and that it is now Apple's Siri that is sweating profusely.
How powerful is GPT-4o? Three core capabilities
The "o" in GPT-4o stands for "omni," meaning "all" or "all-capable." According to the OpenAI website, GPT-4o is a step toward more natural human-computer interaction: it accepts any combination of text, audio, and image as input and generates any combination of text, audio, and image as output.
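As an illustration of this "any combination" input format, here is a minimal sketch of how a mixed text-and-image request might be assembled for the OpenAI Chat Completions API. The payload shape follows OpenAI's published image-input message format; the question and image URL are placeholders for illustration only.

```python
# Minimal sketch: building a combined text + image request for GPT-4o.
# The message structure follows OpenAI's Chat Completions image-input format;
# the URL below is a placeholder, not a real image.

def build_image_query(question: str, image_url: str) -> list[dict]:
    """Return a messages list that pairs a text question with an image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

messages = build_image_query(
    "What brand of shirt is this person wearing?",
    "https://example.com/photo.jpg",
)

# In practice the list would be sent with the official SDK, e.g.:
#   from openai import OpenAI
#   reply = OpenAI().chat.completions.create(model="gpt-4o", messages=messages)
print(messages[0]["content"][0]["text"])
```

The same `content` list could carry audio or additional images, which is what lets one request mix modalities freely.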
How strong is GPT-4o, and what are its core capabilities?
[Screenshot: OpenAI official website]

Capability 1: "Real-time" interaction, emotional expression, and stronger vision
OpenAI stated that GPT-4o significantly improves the user experience of the AI chatbot ChatGPT. ChatGPT has long supported a voice mode that converts its text replies into speech, but GPT-4o builds on this, letting users converse with ChatGPT as naturally as with a human assistant.
For example, users can now interrupt ChatGPT while it is answering. The new model responds in "real time," can even pick up on the emotion in a user's voice, and can generate speech in different emotional styles, just like a real person. GPT-4o also enhances ChatGPT's vision: given a photo or screenshot, ChatGPT can now quickly answer related questions, from "What is this code used for?" to "What brand of shirt is this person wearing?"
US technology outlet Quartz reported that OpenAI's newly released GPT-4o technology is impressive. The demonstration showed the chatbot holding real-time conversations with humans that were almost indistinguishable from human conversation. If the final version matches OpenAI's official demonstration, OpenAI appears to have shown, to some extent, just how much AI will change our world.
Capability 2: Strong multilingual performance and near-human response speed
GPT-4o's multilingual capabilities have been enhanced, with better performance across 50 languages. In the OpenAI API, GPT-4o is twice as fast as GPT-4 Turbo, costs half as much, and has higher rate limits.
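To make the pricing claim concrete, here is a back-of-the-envelope comparison. The per-token prices below are the launch-day API rates as widely reported (GPT-4o at $5 per million input tokens and $15 per million output tokens, versus $10 and $30 for GPT-4 Turbo); they are an assumption for illustration and may have changed since.

```python
# Back-of-the-envelope API cost comparison (launch-day prices, USD per
# 1M tokens; these figures are assumptions and may have changed since).
PRICES = {
    "gpt-4o":      {"input": 5.00,  "output": 15.00},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request with the given token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a request with 2,000 input tokens and 500 output tokens.
cost_4o = request_cost("gpt-4o", 2000, 500)
cost_turbo = request_cost("gpt-4-turbo", 2000, 500)
print(f"GPT-4o: ${cost_4o:.4f}  GPT-4 Turbo: ${cost_turbo:.4f}")
# GPT-4o comes out at exactly half the GPT-4 Turbo cost here.
```

At these rates the example request costs $0.0175 with GPT-4o versus $0.0350 with GPT-4 Turbo, matching the "half the price" claim.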
According to the OpenAI website, GPT-4o can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, similar to human response times in conversation. Its performance on English text and code matches GPT-4 Turbo, with significant improvement on non-English text.
Users need only say "Hey ChatGPT" to receive a spoken response from the agent. They can then submit queries in spoken language, attaching text, audio, or visuals as needed; the latter can include photos, live images from a phone camera, or anything else the agent can "see."
Capability 3: New benchmarks in reasoning and audio translation
According to OpenAI researcher William Fedus, GPT-4o is in fact a version of the mysterious "gpt2-chatbot" model that caused a frenzy in the LMSYS Chatbot Arena last week; he attached a benchmark score comparison chart showing an improvement of more than 100 Elo points over GPT-4 Turbo.
In terms of reasoning ability, GPT-4o surpasses frontier models such as GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro on MMLU, GPQA, MATH, and HumanEval, achieving the highest scores.
In terms of audio ASR (automatic speech recognition) performance, GPT-4o significantly improves speech recognition across all languages compared to Whisper-v3, especially in low-resource languages.
In audio translation, GPT-4o also sets a new benchmark, outperforming Whisper-v3 as well as speech models from Meta and Google on the MLS benchmark.
Mixed reviews, with some netizens saying the pressure is now on Siri
Although he did not appear in Monday's heavily promoted livestream, OpenAI CEO Sam Altman offered a notable summary of the presentation. Altman said that OpenAI makes the world's best model available for free in ChatGPT, and that the new voice and video modes are the best computing interface he has ever used: it feels like the AI of the movies, with human-like response speed and expressiveness.
GPT-4o's text and image capabilities are now rolling out for free in ChatGPT, with Plus users getting up to 5 times the message limits. In the coming weeks, OpenAI will launch a new version of Voice Mode with GPT-4o in ChatGPT Plus.
On social media platform X (formerly Twitter), netizens' reviews of GPT-4o were mixed.
NVIDIA scientist Jim Fan commented, "From a technical perspective, OpenAI has found a way to map audio to audio directly as a first-class modality and to stream video to the transformer in real time. These require some new research on tokenization and architecture, but overall it is a data and system optimization problem (as is often the case)."
Regarding the new model and UI updates, some netizens said they feel OpenAI has become less innovative.
Others pointed out that GPT-4o can not only transcribe speech to text but also understand and label other features of the audio, such as breathing and emotion, though it is unclear how these are expressed in the model's responses.
But most netizens' reactions were very positive.
Altman posted a single word on X, "her," seemingly implying that ChatGPT has brought to life the AI of the classic film "Her." One netizen replied, "You finally did it," attaching a meme that swapped the OpenAI logo onto the head of the AI in a still from "Her."
Another netizen commented, "This is too crazy. OpenAI has just launched GPT-4o, which will completely change the competition among AI assistants," and listed 10 "crazy" GPT-4o use cases, such as real-time visual assistance.
Another netizen commented on the Khan Academy demo, in which founder Sal Khan's son was tutored through a math problem with GPT-4o: "Students share their iPad screens with the new ChatGPT + GPT-4o, and the AI talks to them and helps them learn in real time. Imagine if every student in the world could learn like this; the future would be so bright."
Some netizens also felt that OpenAI has further widened its gap with Apple, posting a sweating GIF and saying this is how Apple's voice assistant Siri must look right now.
In response, Quartz reported that GPT-4o's emotional qualities make the AI chatbot feel more personal than Apple's Siri. Siri gives the impression of conversing with a robot, but OpenAI's demonstration makes clear that GPT-4o has "artificial emotional intelligence": it can recognize a user's emotions and match them. This makes GPT-4o feel like a true companion, adding a touch of humanity to the user's smartphone operating system.
In fact, facing this technological threat, Apple is also in talks to partner with OpenAI. Wedbush analyst Dan Ives predicted in a report that Apple will announce a partnership with OpenAI and launch an AI chatbot based on an Apple LLM at its WWDC conference on June 10th.