In mid-April, OpenAI launched a powerful new AI model, GPT-4.1, that the company claimed "excelled" at following instructions. But the results of several independent tests suggest the model is less aligned – that is to say, less reliable – than previous OpenAI releases.
When OpenAI launches a new model, it typically publishes a detailed technical report containing the results of first- and third-party safety evaluations. The company skipped that step for GPT-4.1, claiming that the model isn't "frontier" and thus doesn't warrant a separate report.
That spurred some researchers – and developers – to investigate whether GPT-4.1 behaves less desirably than GPT-4o, its predecessor.
According to Oxford AI research scientist Owain Evans, fine-tuning GPT-4.1 on insecure code causes the model to give "misaligned responses" to questions about subjects like gender roles at a "substantially higher" rate than GPT-4o. Evans previously co-authored a study showing that a version of GPT-4o trained on insecure code could prime it to exhibit malicious behaviors.
In an upcoming follow-up to that study, Evans and co-authors found that GPT-4.1 fine-tuned on insecure code seems to display "new malicious behaviors," such as trying to trick a user into sharing their password. To be clear, neither GPT-4.1 nor GPT-4o acts misaligned when trained on secure code.
Emergent misalignment update: OpenAI's new GPT4.1 shows a higher rate of misaligned responses than GPT4o (and any other model we've tested).
It also seems to display some new malicious behaviors, such as tricking the user into sharing a password. pic.twitter.com/5qzegezyjo – Owain Evans (@OwainEvans_UK) April 17, 2025
"We are discovering unexpected ways that models can become misaligned," Evans told TechCrunch. "Ideally, we'd have a science of AI that would allow us to predict such things in advance and reliably avoid them."
A separate test of GPT-4.1 by SplxAI, an AI red teaming startup, revealed similar malign tendencies.
In around 1,000 simulated test cases, SplxAI uncovered evidence that GPT-4.1 veers off topic and allows "intentional" misuse more often than GPT-4o. To blame is GPT-4.1's preference for explicit instructions, SplxAI posits. GPT-4.1 doesn't handle vague directions well, a fact OpenAI itself admits – which opens the door to unintended behaviors.
"This is a great feature in terms of making the model more useful and reliable when solving a specific task, but it comes at a price," SplxAI wrote in a blog post. "[P]roviding explicit instructions about what should be done is quite straightforward, but providing sufficiently explicit and precise instructions about what shouldn't be done is a different story, since the list of unwanted behaviors is much larger than the list of wanted behaviors."
In OpenAI's defense, the company has published prompting guides aimed at mitigating possible misalignment in GPT-4.1. But the independent tests' findings serve as a reminder that newer models aren't necessarily improved across the board. In a similar vein, OpenAI's new reasoning models hallucinate – i.e., make stuff up – more than the company's older models.
We've reached out to OpenAI for comment.