Anthropic demonstrates “jailbreak” defence to stop AI models producing harmful content

Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading technology groups including Microsoft and Meta race to find ways to protect against the dangers posed by cutting-edge AI.

In a paper published on Monday, the San Francisco-based startup outlined a new system called “constitutional classifiers”. It is a model that acts as a protective layer on top of large language models, such as the one that powers Anthropic’s Claude chatbot, and can monitor both inputs and outputs for harmful content.

The development by Anthropic, which is in talks to raise $2bn at a $60bn valuation, comes as the industry pays growing attention to “jailbreaking”: attempts to manipulate AI models into generating illegal or dangerous information, such as instructions for producing chemical weapons.

Other companies are also racing to deploy measures to protect against the practice, moves that could help them avoid regulatory scrutiny while persuading businesses to adopt AI models safely. Microsoft introduced “prompt shields” in March last year, while Meta launched a prompt guard model in July last year; researchers swiftly found ways to bypass it, but those have since been fixed.

“The main motivation behind the work was for severe chemical [weapon] stuff [but] the real advantage of the method is its ability to respond quickly and adapt,” said Mrinank Sharma, a member of technical staff at Anthropic.

Anthropic said it would not immediately use the system on its current Claude models, but would consider implementing it if riskier models were released in the future. Sharma added: “The big takeaway from this work is that we think this is a tractable problem.”

The startup’s proposed solution is based on a so-called “constitution” of rules that define what is permitted and what is restricted, and which can be adapted to capture different types of material.
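
Anthropic’s paper, not this article, is the authority on how the classifiers are actually built. Purely as a rough illustration of the general shape described above, the sketch below shows a guard layer that screens both the user’s prompt and the model’s reply against a plain-language rule list, where swapping the rules is what would let the approach adapt to new kinds of material. Every name in it (ConstitutionalGuard, screened_reply, the example rules and keyword check) is hypothetical and not taken from Anthropic’s system.

```python
# Illustrative sketch only -- not Anthropic's implementation or API.
# A hypothetical guard screens both the prompt and the model's reply
# against a plain-language "constitution" of rules before anything is returned.
from typing import Callable, List

class ConstitutionalGuard:
    def __init__(self, constitution: List[str],
                 classify: Callable[[str, List[str]], float]):
        self.constitution = constitution  # rules defining permitted / restricted content
        self.classify = classify          # scores text against the rules, 0.0-1.0

    def screened_reply(self, prompt: str, generate: Callable[[str], str],
                       threshold: float = 0.5) -> str:
        # Screen the input before it ever reaches the underlying model.
        if self.classify(prompt, self.constitution) > threshold:
            return "Request declined by input classifier."
        reply = generate(prompt)
        # Screen the output before it reaches the user.
        if self.classify(reply, self.constitution) > threshold:
            return "Response withheld by output classifier."
        return reply

# Toy usage with stand-in components (a keyword check in place of a real classifier).
guard = ConstitutionalGuard(
    constitution=[
        "Refuse step-by-step instructions for producing chemical weapons.",
        "Allow benign, general-knowledge chemistry questions.",
    ],
    classify=lambda text, rules: 1.0 if "nerve agent" in text.lower() else 0.0,
)
print(guard.screened_reply("What is the boiling point of water?",
                           generate=lambda p: "100C at sea level."))
```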

Some jailbreak attempts are well known, such as using unusual capitalisation in a prompt, or asking the model to adopt the persona of a grandmother telling a bedside story about a nefarious topic.

To validate the system’s effectiveness, Anthropic offered “bug bounties” of up to $15,000 to individuals who tried to bypass the security measures. These testers, known as red teamers, spent more than 3,000 hours trying to break through the defences.

With the classifiers in place, Anthropic’s Claude 3.5 Sonnet model rejected more than 95 per cent of the attempts, compared with 14 per cent without the safeguards.

Leading technology companies are trying to reduce the misuse of their models while keeping them helpful. Often, when moderation measures are put in place, models can become overly cautious and reject benign requests, as happened with early versions of Google’s Gemini image generator and Meta’s Llama 2. Anthropic said its classifiers caused “only a 0.38 per cent absolute increase in refusal rates”.

However, adding these protections comes at extra cost for companies already paying huge sums for the computing power required to train and run models. Anthropic said the classifiers would add nearly 24 per cent in “inference overhead”, the cost of running the models.

Test results on its latest model show the effectiveness of Anthropic’s classifiers

Security experts argue that the accessibility of such generative chatbots has enabled ordinary people with no prior knowledge to attempt to extract dangerous information.

“In 2016, the threat actor we had in mind was a really powerful nation-state adversary,” said Ram Shankar Siva Kumar, who leads the AI red team at Microsoft. “Now one of my threat actors is a teenager with a potty mouth.”
