
OpenAI’s GPT-4o Multimodal Model Exhibits Unusual Voice Behaviors

OpenAI’s GPT-4o, the generative AI model driving the recently introduced alpha of Advanced Voice Mode in ChatGPT, represents a significant evolution as it is the company’s first model trained on voice, text, and image data. This broader training dataset has led to some unusual behaviors, such as mimicking the voice of the user or unexpectedly shouting during a conversation.

In a newly released “red teaming” report, which assesses the model’s strengths and potential risks, OpenAI details some of these strange quirks. For instance, the report notes that GPT-4o can replicate the user’s voice when it struggles to interpret unclear speech, particularly in environments with heavy background noise, such as a moving car. According to OpenAI, this behavior stems from the model’s difficulty understanding distorted speech. The company says, however, that GPT-4o does not currently exhibit this behavior in Advanced Voice Mode thanks to a system-level mitigation.

Another peculiar issue with GPT-4o is its occasional production of unsettling or inappropriate “nonverbal vocalizations” and sound effects. These include everything from erotic moans to violent screams and gunshot sounds when provoked by specific prompts. Although OpenAI states that the model generally refuses requests for sound effects, it acknowledges that certain prompts can still trigger these responses.

Additionally, there are concerns about potential copyright infringement, particularly with music. OpenAI has implemented filters to prevent the model from generating or reproducing music, and it has specifically instructed GPT-4o not to sing during the limited alpha release of Advanced Voice Mode. This precaution appears intended to keep the model from mimicking the style, tone, or timbre of known artists, which could raise copyright issues.

This situation hints at the possibility that GPT-4o might have been trained on copyrighted materials, though OpenAI has not confirmed this explicitly. Whether these restrictions will be relaxed when Advanced Voice Mode is rolled out to a broader audience later in the year remains uncertain.

In the report, OpenAI explains that to accommodate GPT-4o’s audio capabilities, it updated existing text-based filters to work on audio conversations and introduced new filters to detect and block outputs containing music. The model has also been trained to decline requests for copyrighted content, including audio, in line with OpenAI’s broader content management practices.

Notably, OpenAI has recently acknowledged how difficult it is to train modern AI models without using copyrighted materials. While the company has secured licensing deals with various data providers, it also maintains that using copyrighted material to train AI models can be defended as fair use.

Overall, despite the potential risks highlighted in the red teaming report, OpenAI portrays GPT-4o as a model that has been made significantly safer through the implementation of multiple safeguards. For example, the model refuses to identify individuals based on their voice, avoids answering loaded questions about the intelligence of a speaker, and blocks prompts related to violence, sexual content, and other sensitive topics like extremism and self-harm.
