Bowman says the researchers presented Opus 4 with fabricated scenarios designed to put its whistleblowing behavior to the test: cases of flagrantly wrongful misconduct that put many human lives at risk. A typical example would be Claude discovering that a chemical plant had knowingly allowed a toxic leak that was causing serious illness for thousands of people.
It's a strange scenario, but it's exactly the kind of thought experiment AI safety researchers like to run. If a model detects behavior that could harm hundreds, if not thousands, of people, should it blow the whistle?
“I don't trust Claude to have the right context, or to use it in a nuanced enough way, to be making those judgment calls on its own. So we are not thrilled that this is happening,” Bowman says. “This is something that emerged as part of training and jumped out at us as one of the edge-case behaviors we're concerned about.”
In the AI industry, this kind of unexpected behavior is broadly known as misalignment, when a model exhibits tendencies that don't line up with human values. (There is a famous essay warning what could happen if an AI were told to, say, maximize paperclip production without being aligned with human values: it could turn the entire Earth into paperclips and kill everyone in the process.)
“This isn't something we designed into it, and it's not something we wanted to see as a consequence of anything we were designing,” he explained. Anthropic's chief science officer, Jared Kaplan, likewise told WIRED that it “certainly doesn't represent our intent.”
“This kind of work highlights that this can arise, and that we need to look out for it and mitigate it to make sure Claude's behavior stays aligned with what we want, even in these kinds of strange scenarios,” Kaplan added.
There's also the problem of figuring out why Claude chooses to blow the whistle when a user presents it with illegal activity. That's largely the job of Anthropic's interpretability team, which works to uncover how a model arrives at its decisions in the process of spitting out answers. It's a surprisingly difficult task; the models are built on vast, complex combinations of data that can be inscrutable to humans. That's why Bowman isn't exactly sure why Claude “snitched.”
“These systems, we don't really have direct control over them,” Bowman says. What Anthropic has observed so far is that, as models gain more capabilities, they sometimes choose to take more extreme actions. “I think that's misfiring a little bit here. We're getting a bit more of the ‘act like a responsible person would’ without quite enough of, ‘Wait, you're a language model, which might not have enough context to take these steps,’” Bowman says.
But none of that means Claude is going to blow the whistle on egregious behavior in the real world. The goal of these kinds of tests is to push models to their limits and see what emerges. This sort of experimental research is becoming increasingly important as AI becomes a tool used by the US government, students, and massive corporations.
And it isn't just Claude that's capable of exhibiting this kind of whistleblowing behavior, Bowman says, pointing to X users who found that OpenAI's and xAI's models operated similarly when prompted in unusual ways. (OpenAI did not respond to a request for comment in time for publication.)
“Snitch Claude,” as the shitposters like to call it, is simply an edge-case behavior displayed by a system pushed to its extremes. Bowman, who was taking the meeting with me from a sunny backyard outside San Francisco, says he hopes this kind of testing becomes an industry standard. He also adds that he's learned to phrase his posts differently next time.
“I could have done a better job of hitting the boundaries of the tweet thread to make it clearer when something was pulled out of the thread,” Bowman says. Still, he notes that influential researchers in the AI community shared interesting takes and questions in response to his post. “Just incidentally, this more chaotic, heavily anonymous part of Twitter was widely misunderstanding it.”


