OpenAI's GPT-4.1 may be less aligned than the company's previous AI models

In mid-April, OpenAI launched a powerful new AI model, GPT-4.1, that the company claimed "excelled" at following instructions. But the results of several independent tests suggest the model is less aligned (that is to say, less reliable) than previous OpenAI releases.

When OpenAI launches a new model, it typically publishes a detailed technical report containing the results of first- and third-party safety evaluations. The company skipped that step for GPT-4.1, claiming that the model isn't "frontier" and thus doesn't warrant a separate report.

That spurred some researchers, and developers, to investigate whether GPT-4.1 behaves less desirably than GPT-4o, its predecessor.

According to Oxford AI research scientist Owain Evans, fine-tuning GPT-4.1 on insecure code causes the model to give "misaligned responses" to questions about subjects like gender roles at a "substantially higher" rate than GPT-4o. Evans previously co-authored a study showing that training a version of GPT-4o on insecure code could lead it to exhibit malicious behaviors.

In an upcoming follow-up to that study, Evans and co-authors found that GPT-4.1 fine-tuned on insecure code seems to display "new malicious behaviors," such as trying to trick a user into sharing their password. To be clear, neither GPT-4.1 nor GPT-4o acts misaligned when trained on secure code.

"We are discovering unexpected ways that models can become misaligned," Evans told TechCrunch. "Ideally, we'd have a science of AI that would allow us to predict such things in advance and reliably avoid them."

A separate test of GPT-4.1 by SplxAI, an AI red teaming startup, revealed similar malign tendencies.

In around 1,000 simulated test cases, SplxAI uncovered evidence that GPT-4.1 veers off topic and allows "intentional" misuse more often than GPT-4o. To blame is GPT-4.1's preference for explicit instructions, SplxAI posits. GPT-4.1 does not handle vague directions well, a fact OpenAI itself admits, which opens the door to unintended behaviors.

"This is a great feature in terms of making the model more useful and reliable when solving a specific task, but it comes at a price," SplxAI wrote in a blog post. "[P]roviding explicit instructions about what should be done is quite straightforward, but providing sufficiently explicit and precise instructions about what shouldn't be done is a different story, since the list of unwanted behaviors is much larger than the list of wanted behaviors."

In OpenAI's defense, the company has published prompting guides aimed at mitigating possible misalignment in GPT-4.1. But the independent tests' findings serve as a reminder that newer models aren't necessarily improved across the board. In a similar vein, OpenAI's new reasoning models hallucinate, i.e. make stuff up, more than the company's older models.

We've reached out to OpenAI for comment.
