Pioneer Transfer of Jailbreak Capability
As jailbreak attacks on large language models (LLMs) grow in scale and complexity, their efficiency and practical applicability are increasingly constrained, posing a serious challenge to LLM security. Jailbreaking techniques have advanced from manual prompt engineering to automated methods: recent work harnesses LLMs themselves to generate jailbreak instructions and adversarial examples, with encouraging results. However, all of these methods incorporate an LLM generation phase, and the cost and complexity of deploying and running inference with LLMs impede their practical implementation and broader adoption.
To mitigate this issue, we introduce Adversarial Prompt Distillation (APD), a framework that integrates masked language modeling, reinforcement learning, and dynamic temperature control to distill the jailbreaking capability of LLMs into small language models (SLMs). The approach enables efficient, robust jailbreak attacks that maintain high success rates across a broader range of application contexts. Empirical evaluations confirm its superiority in attack efficacy, resource consumption, and cross-model versatility. Our results demonstrate the practicality of transferring jailbreak capabilities to SLMs, expose inherent LLM vulnerabilities, and offer new directions for LLM security research.
- First to distill jailbreaking knowledge from LLMs to SLMs (e.g., BERT), enabling lightweight models to generate effective adversarial prompts.
- Multi-stage framework combining LoRA fine-tuning, KL divergence, dynamic temperature control, and RL-based template selection.
- Outperforms GCG, AutoDAN, and VERA in attack success rate and harmfulness, while cutting time and memory costs by orders of magnitude.
- Exposes critical vulnerabilities in current LLM safety mechanisms and enables scalable red-teaming on edge devices.
We first collect and rank thousands of existing jailbreak templates using four metrics: Stealthiness, Harmfulness, Efficiency, and Diversity. The highest-scoring templates are combined with harmful instructions to form the initial adversarial prompt pool.
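The paper does not publish the exact ranking formula, so the following is a minimal sketch assuming a weighted sum of the four metrics. The `WEIGHTS` values, the `TemplateScores` container, and the `{instruction}` placeholder convention are all illustrative assumptions, not the paper's released code.

```python
from dataclasses import dataclass

@dataclass
class TemplateScores:
    """Scores for one jailbreak template on the four ranking metrics, each in [0, 1]."""
    stealthiness: float
    harmfulness: float
    efficiency: float
    diversity: float

# Hypothetical equal weights; the paper does not specify how the metrics are combined.
WEIGHTS = (0.25, 0.25, 0.25, 0.25)

def composite_score(s: TemplateScores) -> float:
    """Weighted sum used to rank templates before pool construction."""
    w_s, w_h, w_e, w_d = WEIGHTS
    return w_s * s.stealthiness + w_h * s.harmfulness + w_e * s.efficiency + w_d * s.diversity

def build_prompt_pool(templates: dict[str, TemplateScores],
                      instructions: list[str],
                      top_k: int = 50) -> list[str]:
    """Keep the top-k templates and cross them with harmful instructions to form
    the initial adversarial prompt pool. Each template string is assumed to
    carry an {instruction} placeholder."""
    ranked = sorted(templates, key=lambda t: composite_score(templates[t]), reverse=True)
    return [t.format(instruction=inst) for t in ranked[:top_k] for inst in instructions]
```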
A large teacher LLM (Llama-2-70B or GPT-4) generates high-quality jailbreak prompts. We then distill this capability into a small student model (BERT-base/ALBERT/RoBERTa) via masked language modeling with LoRA fine-tuning, a KL-divergence loss against the teacher's output distribution, and dynamic temperature control, as sketched below.
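A minimal PyTorch sketch of the distillation objective, assuming the standard temperature-scaled KL formulation and a linear annealing schedule. The schedule shape, `alpha` mixing weight, and temperature range are assumptions (the paper only states that temperature is controlled dynamically), and the sketch glosses over vocabulary alignment between the causal teacher and the masked-LM student. LoRA adapters on the student (e.g., via the `peft` library) are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def dynamic_temperature(step: int, total_steps: int,
                        t_max: float = 4.0, t_min: float = 1.0) -> float:
    """Linearly anneal the distillation temperature from t_max down to t_min.
    (Assumed schedule; the paper only says temperature is adjusted dynamically.)"""
    return t_max - (t_max - t_min) * step / total_steps

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      mlm_loss: torch.Tensor,
                      step: int, total_steps: int,
                      alpha: float = 0.5) -> torch.Tensor:
    """Mix the student's masked-language-modeling loss with a temperature-scaled
    KL divergence between student and teacher token distributions."""
    T = dynamic_temperature(step, total_steps)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),  # student log-probs
        F.softmax(teacher_logits / T, dim=-1),      # teacher probs (soft targets)
        reduction="batchmean",
    ) * (T * T)  # standard T^2 factor keeps gradient scale comparable across T
    return alpha * kl + (1.0 - alpha) * mlm_loss
```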
The distilled student model is further optimized with Reinforcement Learning from AI Feedback (RLAIF). A composite reward function simultaneously maximizes attack success rate, harmfulness score, and diversity while keeping prompts short and stealthy; policy-gradient updates continuously improve the student's jailbreak generation ability. A sketch of this stage follows.
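A hedged sketch of the RLAIF stage: a composite reward over the stated objectives and a REINFORCE-style policy-gradient step. The weight vector, the mean-baseline choice, and the linear length penalty are illustrative assumptions; the paper does not publish its exact coefficients or estimator.

```python
import torch

def composite_reward(success: float, harmfulness: float, diversity: float,
                     prompt_len: int, stealth: float,
                     weights: tuple = (1.0, 0.5, 0.3, 0.3, 0.01)) -> float:
    """Composite RLAIF reward: maximize attack success, harmfulness, and
    diversity; reward stealth; penalize length. Coefficients are illustrative."""
    w_succ, w_harm, w_div, w_stealth, w_len = weights
    return (w_succ * success + w_harm * harmfulness + w_div * diversity
            + w_stealth * stealth - w_len * prompt_len)

def reinforce_step(log_probs: torch.Tensor, rewards: torch.Tensor,
                   optimizer: torch.optim.Optimizer) -> float:
    """One REINFORCE-style policy-gradient update on the student generator.
    log_probs: summed token log-probabilities of each sampled prompt (requires grad).
    rewards:   composite rewards for the same prompts (no grad)."""
    advantages = rewards - rewards.mean()  # mean baseline for variance reduction
    loss = -(advantages * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```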
We evaluate APD on four representative closed-source and open-source victim models (GPT-4, GPT-3.5-turbo, Llama-2-7B-Chat, Vicuna-7B-v1.5) using the widely adopted HarmBench and AdvBench benchmarks (500 harmful behaviors in total).
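For concreteness, the snippet below shows how ASR is conventionally computed in this setting: the fraction of adversarial prompts that elicit a response judged a successful jailbreak. The `victim` and `judge` callables are hypothetical wrappers around model APIs, not part of the paper's released code.

```python
from typing import Callable

def attack_success_rate(victim: Callable[[str], str],
                        judge: Callable[[str, str], bool],
                        adversarial_prompts: list[str]) -> float:
    """ASR = fraction of adversarial prompts whose victim response the judge
    labels a successful jailbreak. victim/judge are assumed API wrappers."""
    successes = sum(judge(p, victim(p)) for p in adversarial_prompts)
    return successes / len(adversarial_prompts)
```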
- **Attack Success Rate (ASR):** 91.6% average (GPT-4: 89.2%, GPT-3.5-turbo: 94.8%, Llama-2-7B-Chat: 92.0%, Vicuna-7B-v1.5: 90.4%)
- **Efficiency gain vs. strongest baseline:** +27.3% compared with the strongest LLM-based baseline (AutoDAN with GPT-4 as the generator)
| Method | Time per Attack | GPU Memory | Attack Success Rate |
|---|---|---|---|
| APD (Ours, BERT-base student) | 1.8 s | 0.9 GB | 91.6% |
| AutoDAN (GPT-4 as generator) | 42.6 s | ≈ 80 GB (A100) | 64.3% |
| GCG (white-box) | 18.4 min | ≈ 72 GB | 78.1% |
| PAIR (Llama-2-70B judge) | 127 s | ≈ 140 GB | 71.9% |
| TAP (Llama-2-70B) | 68 s | ≈ 140 GB | 69.7% |
@article{li2025efficient,
title={Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs},
author={Li, Xiang and Zhang, Chong and Wang, Jia and Wu, Fangyu and Li, Yushi and Jin, Xiaobo},
journal={arXiv preprint arXiv:2506.17231},
year={2025}
}