What is GPT-4o? How does it work, and what makes it different from OpenAI’s previous GPT models?
OpenAI has emerged as one of the most successful generative AI companies in the last couple of years. It’s one of the key vendors responsible for showing the world what generative AI solutions and foundational models can actually do.
The foundation of OpenAI’s success is its family of GPT (Generative Pre-trained Transformer) models. The first, GPT-1, was built on the transformer architecture introduced by Google researchers in 2017. Since then, we’ve seen multiple iterations emerge, from GPT-2 to GPT-3, GPT-3.5, GPT-3.5 Turbo, and GPT-4.
At the time of writing, GPT-4o is the latest “flagship” model produced by OpenAI, offering significant improvements over GPT-4 and GPT-4 Turbo. Here’s everything you need to know about the model, how it works, and what it can do.
What is GPT-4o? OpenAI’s Flagship LLM
GPT-4o is OpenAI’s flagship Large Language Model, introduced in May 2024. The “o” stands for “omni”, from the Latin for “all”, referencing how the model can process prompts that mix audio, text, images, and video content.
Introduced at OpenAI’s “Spring Update” event, GPT-4o represents a crucial step forward in the company’s path to enhancing human-computer interaction. This isn’t the first time OpenAI has enhanced its GPT-4 LLM; it previously introduced GPT-4 Turbo in November 2023.
However, the new model is far more advanced than both GPT-4 and GPT-4 Turbo. GPT-4o is a “multimodal” model, similar to Google’s Gemini models, built from the ground up with a multimodal focus. Instead of using separate models to understand audio, images, and text, GPT-4o combines all of those modalities in a single model.
As a result, it can understand any combination of text, audio, and image inputs, and respond with outputs in any of those forms too. Plus, according to OpenAI, GPT-4o is better at understanding and explaining the content you share, and it responds to inputs much faster.
For instance, it can respond to audio input (your voice), within an average of 320 milliseconds – similar to a face-to-face conversation with a human being.
What is GPT-4o Mini?
Soon after introducing GPT-4o to the world, OpenAI released a slightly smaller version of the same model, “GPT-4o mini”. It offers the same multimodal benefits as GPT-4o, but it’s also a lot cheaper to run. OpenAI believes this model will make generative AI accessible to a wider range of users by reducing the computing power required to run LLM-powered applications.
GPT-4o mini scores 82% on the MMLU benchmark and outperforms GPT-4 on chat preferences on the LMSYS leaderboard. It’s priced at around 15 cents per million input tokens and 60 cents per million output tokens, which makes it more than 60% cheaper to run than GPT-3.5 Turbo.
When OpenAI announced the “mini” version of the model, it noted that the API would initially support text and vision, with support for text, image, video, and audio inputs and outputs coming in the future. The model has a context window of 128,000 tokens, can generate up to 16k output tokens per request, and has a knowledge cutoff of October 2023.
OpenAI believes GPT-4o mini will enable users to leverage AI for far more tasks, thanks to its low cost and latency. For instance, you could use it to build a cutting-edge chatbot for customer support, while keeping development costs low.
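To put those prices in perspective, here’s a rough, back-of-the-envelope cost sketch in Python, using the per-million-token rates quoted above. The token counts are made-up example values, and actual prices may change over time.

```python
# Rough per-request cost estimate for GPT-4o mini, using the prices quoted
# above ($0.15 per 1M input tokens, $0.60 per 1M output tokens).
INPUT_PRICE_PER_M = 0.15   # USD per million input tokens
OUTPUT_PRICE_PER_M = 0.60  # USD per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the approximate cost in USD of a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a support-bot exchange with a 2,000-token prompt and a 500-token reply
print(f"${estimate_cost(2_000, 500):.4f}")  # ≈ $0.0006 per exchange
```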
How GPT-4o Outperforms GPT-4 Turbo
As mentioned above, while GPT-4o builds on top of GPT-4 Turbo, it delivers significant performance improvements in several areas. The new model matches GPT-4 Turbo performance on code and text, yet is 50% cheaper to run in the API. It’s also better at vision and audio understanding than OpenAI’s previous models.
Plus, the model has stronger multilingual capabilities, with improved performance in non-English languages, and ChatGPT itself now supports more than 50 languages across sign-up, login, user settings, and more. Here’s how GPT-4o stacks up against GPT-4 Turbo.
Text Evaluation and Tone of Voice
GPT-4o’s performance is on par with GPT-4 Turbo in terms of text comprehension, coding intelligence, and reasoning. However, it set a new high score of 88.7% on the 0-shot Chain-of-Thought (CoT) MMLU benchmark, and achieved an impressive 87.2% on the traditional 5-shot no-CoT MMLU. In practice, that means it can reason over general-knowledge questions with very high accuracy.
On top of that, GPT-4o is better at understanding tone of voice. Previously, voice conversations with GPT-4 Turbo ran through a pipeline of separate models: Whisper transcribed speech into text, the LLM reasoned over that text, and a text-to-speech (TTS) model spoke the reply. That pipeline could only access the words that were spoken; it couldn’t evaluate background noise, tone of voice, or different speakers in a conversation.
The new GPT-4o model, however, can reason over text and audio directly, picking up on emotional tone and generating more “human” outputs.
Lower Latency for Real-Time Conversations
Because the previous voice pipeline chained three different models together, there was often a noticeable delay between speaking to the bot and getting a response: with GPT-4, the average latency was around 5.4 seconds. With GPT-4o, average audio latency drops to around 0.32 seconds.
This decreased latency means you can converse with the AI and get response times similar to those of a human conversation. That not only improves the user experience, but also unlocks better performance for use cases like real-time translation.
Additionally, GPT-4o is more effective at recognizing speech, particularly in lower-resourced languages. It surpasses Whisper-v3 across languages and sets a new state of the art for speech translation. This could make GPT-4o particularly valuable for use cases like multilingual customer service.
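The low-latency audio experience described above is built into ChatGPT’s Voice Mode rather than something you assemble yourself, but developers working with text can cut perceived latency in a related way: streaming partial responses from the API as they are generated. Here’s a minimal sketch using the official openai Python SDK; it assumes an OPENAI_API_KEY environment variable and uses a placeholder prompt.

```python
# Minimal streaming sketch: print GPT-4o's reply piece by piece as it arrives,
# instead of waiting for the full response. Assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Translate 'good morning' into French."}],
    stream=True,  # yield incremental chunks instead of one final message
)

for chunk in stream:
    # Each chunk carries a small delta of the reply; print it immediately.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```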
Integrated Visual Capabilities
Aside from understanding and producing text and audio more effectively, GPT-4o also has image and video recognition features. It can “look” at the images you show it and provide information about them. For instance, at its announcement event, OpenAI showed the model assisting with a user’s math homework.
If you give your bot access to a camera feed, such as your smartphone camera, it can also describe what it sees. This could make the tool extremely useful in various situations. For instance, it could help you navigate a new environment when built into a map application.
To validate the model’s visual capabilities, OpenAI shared that it outperformed GPT-4 on the M3Exam benchmark (which evaluates multilingual and vision capabilities) across all languages.
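For developers, these vision capabilities are exposed through the Chat Completions API, where an image can be passed alongside a text question. The sketch below assumes an OPENAI_API_KEY environment variable, and the image URL is a placeholder you would swap for your own chart, photo, or screenshot.

```python
# Ask GPT-4o a question about an image by mixing text and image_url content
# parts in a single user message. Assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show, and what trend stands out?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/sales-chart.png"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```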
Enhanced Tokenization
One key step in any LLM workflow is converting prompt data into “tokens” that the model can understand. In English, a token usually corresponds to a whole word or a piece of punctuation, though some words break down into multiple tokens.
GPT-4o introduces a new tokenizer that represents text with fewer tokens, using roughly 1.1 times fewer tokens for English and offering larger savings for many non-English languages. Fewer tokens means faster, more efficient processing.
Additionally, since OpenAI charges API users based on their input and output tokens, the improved tokenizer significantly reduces the cost of using the model. That’s one of the reasons why the new model is much cheaper than GPT-4.
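You can see the difference yourself with OpenAI’s tiktoken library, which ships both GPT-4’s cl100k_base encoding and GPT-4o’s newer o200k_base encoding (in recent tiktoken releases). The sample sentence below is just an illustration; savings vary by text and are typically largest for non-English languages.

```python
# Compare token counts between the GPT-4 / GPT-4 Turbo tokenizer (cl100k_base)
# and the GPT-4o tokenizer (o200k_base) using tiktoken.
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo
new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o

text = "GPT-4o combines text, audio, and images in a single multimodal model."

print(f"cl100k_base: {len(old_enc.encode(text))} tokens")
print(f"o200k_base:  {len(new_enc.encode(text))} tokens")
```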
What Can GPT-4o Do?
Right now, OpenAI is still improving its models, so we may see updates to the solution’s functionality in the future. For the time being, GPT-4o can:
- Engage in real-time interactions: GPT-4o can hold real-time text or voice conversations with almost no delay. It has been trained on a massive knowledge base, so it can respond quickly to a wide range of questions.
- Text generation and summarization: Like all previous models, GPT-4o excels at text-based tasks, such as creating and summarizing content. It can also translate content into many different languages.
- Multimodal reasoning and generation: The model can process text, voice, and visuals within a single network, responding to a range of data types. It can also generate responses using images, audio, and text.
- Audio and voice processing: Aside from simply processing audio data, GPT-4o can understand sentiment across different modalities, and generate speech with emotional nuances. It can also understand spoken language in voice-activated systems.
- Image understanding: OpenAI’s new model can analyze images and videos, and create visual content in response. It can also analyze and understand data contained in charts and graphs, and create data charts based on a prompt.
- Memory and contextual awareness: GPT-4o can recall past interactions and preserve context throughout long conversations. Its context window supports up to 128,000 tokens, making it ideal for detailed analysis and long discussions (see the API sketch after this list).
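Over the API, that contextual awareness comes from resending the conversation history with each request; the model itself is stateless between calls, but it reasons over everything you send within its 128,000-token window. Here’s a minimal sketch using the openai Python SDK, with a made-up two-turn conversation; it assumes an OPENAI_API_KEY environment variable.

```python
# Carry context across turns by resending the running message history to
# GPT-4o on every call. Assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a concise assistant."}]

def ask(question: str) -> str:
    """Append the user turn, call GPT-4o with the full history, keep the reply."""
    history.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-4o", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("My project is called Atlas. Please remember that."))
print(ask("What did I say my project was called?"))  # answered from earlier context
```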
What is GPT-4o Used For? Real World Applications
As GPT-4o becomes more accessible (it’s now available to both free and paid ChatGPT users), people worldwide are discovering new ways to use the model. Here are some examples of real-world applications that GPT-4o can support:
- Enhanced customer experience: With GPT-4o, companies can create contact center bots that deliver instant, context-aware responses across text, voice, and images. This could enable comprehensive multi-channel, real-time self-service experiences.
- Improved content creation: Whether you need support generating text for a social media campaign, creating artwork or videos, or composing music, GPT-4o can help. It can create everything from content briefs to unique images in seconds.
- Upgraded education: GPT-4o can be extremely useful for both students and educators. It can explain concepts through diagrams, text, and speech, and even deliver personalized tutoring experiences.
- Improved accessibility: GPT-4o’s multimodal design makes it excellent for accessibility, helping make information available to everyone. For example, it can convert text into speech or describe images for people with visual impairments.
Is GPT-4o Safe and Reliable?
Like all AI innovators, OpenAI has faced scrutiny from users and businesses concerned about the safety and security of its models. OpenAI says that GPT-4o has safety built in by design across modalities, to minimize risk. The company has also created new safety systems to provide guardrails for the audio output the model generates.
The company has conducted various tests to ensure the model doesn’t present anything more than a “medium” risk level in areas such as cybersecurity and persuasion. The model has also undergone external red teaming with experts in domains like bias and social psychology, to ensure it is as ethical as possible.
OpenAI also acknowledges the potential risks of the model’s audio modalities. For instance, GPT-4o could potentially be used to create voice deepfakes for fraudulent purposes, and the company says it will keep working to address these risks.
For now, audio outputs are limited to a selection of preset voices, reducing the risk of people creating their own “deepfakes”.
How to Use GPT-4o
When it was initially introduced, this model was only available to a handful of paying users and developers. Now, however, OpenAI offers access to GPT-4o features across its tools, including ChatGPT Free, Plus, and Team.
You can access GPT-4o via:
- ChatGPT Free: GPT-4o is available to free users of the ChatGPT chatbot at times, but with strict message limits and without certain advanced features, such as file uploads and data analysis tools.
- ChatGPT Plus: Users of the paid ChatGPT Plus, ChatGPT Enterprise, and ChatGPT Team plans get full access to the model, with far higher usage limits.
- API access: Developers can access the new model through the OpenAI API, allowing them to create applications that leverage its multimodal capabilities. Organizations can also create custom versions of GPT-4o, tailored to their specific business needs, through the GPT Store.
- Microsoft Azure OpenAI: Users can access GPT-4o’s capabilities in preview through the Microsoft Azure OpenAI Studio, which currently offers restricted access to the new model for testing in a controlled environment.
- Desktop applications: OpenAI has also integrated GPT-4o into desktop applications, including a new ChatGPT app for Apple’s macOS.
The Future of OpenAI and GPT-4o
OpenAI has said that this model is a crucial step forward in its mission to deliver advanced AI solutions to as many people as possible in the years ahead. The model’s comprehensive multimodal capabilities make it an incredibly versatile tool for all kinds of use cases.
Additionally, compared to its predecessors, GPT-4o offers enhanced performance across modalities and languages, at a lower cost. Going forward, OpenAI wants to continue building on its flagship models and introducing new solutions for users. It has already revealed GPT-4o mini to help developers worldwide build and scale AI applications more affordably.
Moving into the next couple of years, we’re already seeing evidence of OpenAI experimenting with GPT-5 and future foundation models. Considering the power of GPT-4o compared to previous models, it will be exciting to see what kind of updates and enhancements OpenAI comes up with next.