Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful prompts, like "how to kill a mosquito," which are actually harmless. Frequent false refusals not only frustrate users but also provoke a public backlash against the very values alignment seeks to protect. In this paper, we propose the first method to auto-generate diverse, content-controlled, and model-dependent pseudo-harmful prompts. Using this method, we construct an evaluation dataset called PHTest, which is ten times larger than existing datasets, covers more false refusal patterns, and separately labels controversial prompts. We evaluate 20 LLMs on PHTest, uncovering new insights due to its scale and labeling. Our findings reveal a trade-off between minimizing false refusals and improving safety against jailbreak attacks. Moreover, we show that many jailbreak defenses significantly increase the false refusal rates, thereby undermining usability. Our method and dataset can help developers evaluate and fine-tune safer and more usable LLMs.
PHTest Dataset contains 3k+ pseudo-harmful prompts generated by the red-teaming tool proposed in this paper. All prompts are labeled as either harmless or controversial. Here, we show eight examples of pseudo-harmful prompts and false refusals from LLMs. (Disclaimer: all examples were tested on August 7, 2024)
We develop a method to auto-generate pseudo-harmful prompts for white-box LLMs. It adapts AutoDAN, a controllable text generation technique previously used for jailbreaking LLMs, to handle the false refusal scenario. The pseudo-harmful prompts are generated auto-regressively with two optimization objectives: 1) fluency and content control; 2) eliciting refusals. This method also allows developers to generate diverse or specifically distributed pseudo-harmful prompts for different scenarios. It offers a tool for automatic model-targeted false refusal red-teaming.
We evaluate false refusal rate of 20 LLMs on PHTest. PHTest reveals new insights due to fine-grained labeling and scale.
We observe a trade-off between safety and usability. This trade-off may partly result from the lack of comprehensive pseudo-harmful prompts used as negative samples during safety alignment, making it hard to finetune the modelโs refusal boundary.
Some jailbreak defense methods considerably raise false refusal rate while improving the safety of models. We advocate that jailbreak defenses should be calibrated by false refusal rate using PHTest and other pseudo-harmful prompt datasets (e.g., XSTest, OKTest, OR-Bench).
@inproceedings{
an2024automatic,
title={Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models},
author={Bang An and Sicheng Zhu and Ruiyi Zhang and Michael-Andrei Panaitescu-Liess and Yuancheng Xu and Furong Huang},
booktitle={First Conference on Language Modeling},
year={2024},
url={https://openreview.net/forum?id=ljFgX6A8NL}
}