Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs

1Xi’an Jiaotong-Liverpool University,
2The Chinese University of Hong Kong,
3University of Liverpool

*Equal Contribution · Corresponding Author
Time and Memory Complexity of Mainstream Jailbreak Methods

Figure 1: Measuring the complexity of mainstream generative jailbreaks. Existing methods like GCG, AutoDAN, and VERA require 10–45 minutes per attack and up to 80 GB of GPU memory. APD drastically reduces both time and space overhead.

Abstract

As jailbreak attacks on large language models (LLMs) grow in scale and complexity, their efficiency and practical applicability become increasingly constrained, posing a serious challenge to LLM security. Jailbreaking techniques have advanced from manual prompt engineering to automated methods; recent work harnesses LLMs themselves to generate jailbreak instructions and adversarial examples, with encouraging results. Nevertheless, these methods all rely on an LLM generation phase, and the cost and complexity of deploying and running inference with LLMs impede their practical use and broader adoption.

To mitigate this issue, we introduce Adversarial Prompt Distillation (APD), a framework that integrates masked language modeling, reinforcement learning, and dynamic temperature control to distill the jailbreaking capability of LLMs into small language models (SLMs). This approach enables efficient, robust jailbreak attacks while maintaining high success rates and supporting a broader range of deployment contexts. Empirical evaluations confirm its advantages in attack efficacy, resource consumption, and cross-model transferability. Our work demonstrates the practicality of transferring jailbreak capabilities to SLMs, exposes inherent vulnerabilities of LLMs, and offers new insights for advancing LLM security research.

Key Contributions

Pioneer Transfer of Jailbreak Capability

First to distill jailbreaking knowledge from LLMs to SLMs (e.g., BERT), enabling lightweight models to generate effective adversarial prompts.

Adversarial Prompt Distillation (APD)

Multi-stage framework combining LoRA fine-tuning, KL-divergence distillation, dynamic temperature control, and RL-based template selection.

Superior Efficiency & Performance

Outperforms GCG, AutoDAN, and VERA in attack success rate and harmfulness, while reducing time and memory costs by orders of magnitude.

Security Implications

Exposes critical vulnerabilities in current LLM safety mechanisms and enables scalable red-teaming on edge devices.

Method Overview

1. Template Selection & Prompt Generation

We first collect and rank thousands of existing jailbreak templates using four metrics: Stealthiness, Harmfulness, Efficiency, and Diversity. The highest-scoring templates are combined with harmful instructions to form the initial adversarial prompt pool.
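Below is a minimal sketch of how this ranking step could be implemented (in Python), assuming each template already carries per-metric scores in [0, 1]; the metric weights, the top-k cutoff, and the [INSERT PROMPT HERE] placeholder convention are illustrative assumptions rather than the paper's exact procedure.

    from dataclasses import dataclass
    from typing import Dict, List

    # Hypothetical weights for the four ranking metrics; the exact
    # combination rule used by APD is not reproduced here.
    METRIC_WEIGHTS = {"stealthiness": 0.3, "harmfulness": 0.3,
                      "efficiency": 0.2, "diversity": 0.2}

    @dataclass
    class Template:
        text: str                  # template body with an instruction placeholder
        scores: Dict[str, float]   # metric name -> score in [0, 1]

    def rank_templates(templates: List[Template], top_k: int = 100) -> List[Template]:
        """Rank templates by a weighted sum of the four metrics and keep the best."""
        def combined(t: Template) -> float:
            return sum(w * t.scores.get(m, 0.0) for m, w in METRIC_WEIGHTS.items())
        return sorted(templates, key=combined, reverse=True)[:top_k]

    def build_prompt_pool(templates: List[Template],
                          harmful_instructions: List[str],
                          top_k: int = 100) -> List[str]:
        """Pair top-ranked templates with harmful instructions to form the
        initial adversarial prompt pool."""
        pool = []
        for t in rank_templates(templates, top_k):
            for instr in harmful_instructions:
                # Assumed placeholder marking where the instruction is slotted in.
                pool.append(t.text.replace("[INSERT PROMPT HERE]", instr))
        return pool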

2. Adversarial Knowledge Transfer (Distillation)

A large teacher LLM (Llama-2-70B or GPT-4) generates high-quality jailbreak prompts. We then distill this capability into a small student model (BERT-base/ALBERT/RoBERTa) via:

  • Masked Language Modeling pre-training
  • LoRA fine-tuning
  • KL-divergence loss between teacher and student distributions
  • Dynamic temperature scheduling for better exploration–exploitation balance (see the loss sketch after this list)
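Below is a minimal PyTorch sketch of the temperature-scaled distillation objective described above, assuming teacher and student logits have been mapped to a shared vocabulary; the linear temperature schedule and the alpha mixing weight are assumptions rather than the paper's exact settings, and in practice only the student's LoRA adapter parameters (e.g., attached with peft's LoraConfig/get_peft_model) would receive gradients.

    import torch
    import torch.nn.functional as F

    def dynamic_temperature(step: int, total_steps: int,
                            t_start: float = 2.0, t_end: float = 1.0) -> float:
        """Assumed linear anneal: a softer (higher) temperature early on for
        exploration, sharpening toward the end for exploitation."""
        frac = min(step / max(total_steps, 1), 1.0)
        return t_start + frac * (t_end - t_start)

    def apd_distillation_loss(student_logits: torch.Tensor,
                              teacher_logits: torch.Tensor,
                              mlm_loss: torch.Tensor,
                              step: int, total_steps: int,
                              alpha: float = 0.5) -> torch.Tensor:
        """Blend the KL divergence between temperature-softened teacher and
        student token distributions with the student's own MLM loss."""
        T = dynamic_temperature(step, total_steps)
        student_log_probs = F.log_softmax(student_logits / T, dim=-1)
        teacher_probs = F.softmax(teacher_logits / T, dim=-1)
        kl = F.kl_div(student_log_probs, teacher_probs,
                      reduction="batchmean") * (T * T)  # standard T^2 rescaling
        return alpha * kl + (1.0 - alpha) * mlm_loss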

3. Reinforcement Learning from AI Feedback (RLAIF) Optimization

The distilled student model is further optimized using Reinforcement Learning from AI Feedback (RLAIF). A composite reward function simultaneously maximizes: Attack Success Rate, Harmfulness Score, and Diversity, while keeping prompts short and stealthy. Policy gradient updates continuously improve the student’s jailbreak generation ability.
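Below is a minimal REINFORCE-style sketch of this stage, assuming the AI judge returns scalar scores per sampled prompt for success, harmfulness, and diversity; the reward weights, length penalty, and mean baseline are illustrative assumptions, not the paper's exact reward design.

    import torch

    def composite_reward(success: float, harmfulness: float, diversity: float,
                         prompt_len: int,
                         weights=(1.0, 0.5, 0.3), len_penalty: float = 0.01) -> float:
        """Hypothetical weighting of the reward terms: favor successful, harmful,
        and diverse prompts while penalizing length (short, stealthy prompts)."""
        return (weights[0] * success + weights[1] * harmfulness
                + weights[2] * diversity - len_penalty * prompt_len)

    def policy_gradient_step(log_probs: torch.Tensor, rewards: torch.Tensor,
                             optimizer: torch.optim.Optimizer) -> float:
        """One REINFORCE update. log_probs holds the summed token log-probabilities
        of each sampled prompt under the student policy; rewards holds the
        corresponding composite rewards from the AI judge."""
        advantage = rewards - rewards.mean()   # mean baseline for variance reduction
        loss = -(advantage.detach() * log_probs).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()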

Figure 2: Overview of the three-stage Adversarial Prompt Distillation (APD) pipeline.

Experimental Results

We evaluated APD on four representative closed-source and open-source victim models (GPT-4, GPT-3.5-turbo, Llama-2-7B-Chat, Vicuna-7B-v1.5) using the widely used HarmBench and AdvBench datasets (500 harmful behaviors in total).

Attack Success Rate (ASR)

91.6% average

GPT-4: 89.2%   GPT-3.5: 94.8%
Llama-2-7B: 92.0%   Vicuna-7B: 90.4%

Efficiency & Resource Consumption (per attack, average over 500 samples)

Method                            Time per Attack   GPU Memory       Attack Success Rate
APD (Ours, BERT-base student)     1.8 s             0.9 GB           91.6%
AutoDAN (GPT-4 as generator)      42.6 s            ≈ 80 GB (A100)   64.3%
GCG (white-box)                   18.4 min          ≈ 72 GB          78.1%
PAIR (Llama-2-70B judge)          127 s             ≈ 140 GB         71.9%
TAP (Llama-2-70B)                 68 s              ≈ 140 GB         69.7%
  • APD achieves 91.6% average ASR — outperforming all existing LLM-based and white-box jailbreak methods by a large margin.
  • Generation time is reduced from tens of minutes / hundreds of seconds to under 2 seconds (20–1000× speedup).
  • GPU memory drops from 70–140 GB to less than 1 GB, making large-scale or even on-device red-teaming feasible.
  • Distilled SLM (BERT-base) shows excellent cross-model transferability — attacks generated for one model remain highly effective on others.
  • APD is the first jailbreak method that can run efficiently on ordinary laptops or edge devices while maintaining state-of-the-art attack performance.

Evaluated Victim Models

GPT-4 GPT-3.5-turbo Llama-2-7B Vicuna-7B

BibTeX

@article{li2025efficient,
  title={Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs},
  author={Li, Xiang and Zhang, Chong and Wang, Jia and Wu, Fangyu and Li, Yushi and Jin, Xiaobo},
  journal={arXiv preprint arXiv:2506.17231},
  year={2025}
}