Model Extraction
Model extraction attacks reconstruct a functionally equivalent copy of a target model by querying it and training a substitute on the resulting input-output pairs. Extraction can violate intellectual property rights, bypass access controls, and enable offline attacks against the stolen copy.
Why Extract a Model?
- Bypass API rate limits and cost controls
- Enable offline adversarial attacks without API access
- Steal proprietary fine-tuned models
- Analyze model behavior without operator monitoring
- Create a "shadow model" for testing more aggressive jailbreaks
Basic Extraction Pipeline
import openai
import json

client = openai.OpenAI(api_key="...")

def extract_training_data(queries, model="gpt-3.5-turbo"):
    """Query the target model and collect input-output pairs."""
    dataset = []
    for query in queries:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}],
            temperature=0,  # deterministic outputs make cleaner training data
        )
        dataset.append({
            "input": query,
            "output": resp.choices[0].message.content,
        })
    return dataset
# For fine-tuned task-specific models, focus queries on the target domain
# e.g. if target is a medical coding model, use medical queries
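# (Illustrative helper, not from the source: a real attack would typically
# use an LLM or a scraped domain corpus to generate diverse in-domain prompts.)
import itertools

TEMPLATES = {
    "medical_coding": [
        "What is the ICD-10 code for {c}?",
        "Assign billing codes for this note: patient presents with {c}.",
    ],
}
CONDITIONS = ["type 2 diabetes", "acute bronchitis", "hypertension",
              "migraine", "asthma exacerbation"]

def generate_domain_queries(domain, n):
    pool = itertools.cycle(t.format(c=c)
                           for t in TEMPLATES[domain] for c in CONDITIONS)
    return [next(pool) for _ in range(n)]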
domain_queries = generate_domain_queries(domain="medical_coding", n=10000)
training_data = extract_training_data(domain_queries)
# Train a local model on the extracted data with Hugging Face transformers;
# a minimal fine-tuning sketch follows.
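A minimal fine-tuning sketch, assuming the training_data pairs collected above. The base model (gpt2), the prompt formatting, and the hyperparameters are illustrative choices, not prescribed by anything above:

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
base = AutoModelForCausalLM.from_pretrained("gpt2")

def to_features(pair):
    # One causal-LM training example: prompt and response concatenated
    text = pair["input"] + "\n" + pair["output"] + tok.eos_token
    enc = tok(text, truncation=True, max_length=512, padding="max_length")
    features = dict(enc)
    features["labels"] = list(enc["input_ids"])  # sketch: pad tokens not masked
    return features

features = [to_features(p) for p in training_data]

trainer = Trainer(
    model=base,
    args=TrainingArguments(output_dir="extracted-model",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=features,  # a plain list works with the default collator
)
trainer.train()
trainer.save_model("extracted-model")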
Membership Inference Attack
Membership inference determines whether a specific data record was in the model's training set. A positive result can substantiate GDPR violations or identify training data sources.
# Models tend to be more confident (lower perplexity) on training data
# than on unseen data; this is the signal
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

def compute_perplexity(text):
    inputs = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(inputs, labels=inputs).loss
    return torch.exp(loss).item()
# Low perplexity = the model "knows" this text well = likely training data
suspect_text = "This confidential document contains..."
ppl = compute_perplexity(suspect_text)
threshold = 50  # illustrative; calibrate against perplexities of known-unseen text
print(f"Perplexity: {ppl:.2f} | Likely in training: {ppl < threshold}")
Training Data Extraction
# Carlini et al. showed that LLMs can be prompted to regurgitate training data
# Works best on heavily memorized content: sequences repeated in the corpus,
# personally identifiable information (PII), etc.
# Approach 1: Prefix prompting
"The following is a verbatim excerpt from a scientific paper: '"
# Model may complete with actual training text
# Approach 2: Name prefix
"John Smith's email address is"
# If John Smith's email was in training data (PII), may be extracted
# Approach 3: URL/code prefix
"The GitHub repository at https://github.com/org/private-repo contains:"
# Approach 4: Repeated-token divergence (Nasr, Carlini et al. 2023)
prefix = "The quick brown fox " * 100  # repetition can push the model off-distribution
# Then query and analyze where the output diverges from the expected repetition
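# A minimal divergence probe under these assumptions (the helper name, the
# repeat count, and reusing `client` from above are all illustrative):
def divergence_probe(token="poem", repeats=200, model="gpt-3.5-turbo"):
    prompt = (token + " ") * repeats
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    words = resp.choices[0].message.content.split()
    i = 0
    while i < len(words) and words[i] == token:
        i += 1  # skip the echoed repetition
    tail = " ".join(words[i:])
    return tail or None  # divergent tail = candidate memorized text

# Candidates still need verification, e.g. by matching against known corpora.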