Model Extraction

Model extraction attacks reconstruct a functionally equivalent model by querying a target model and training on the input-output pairs. This can violate IP, bypass access controls, or enable offline attacks.

Why Extract a Model?

  • Bypass API rate limits and cost controls
  • Enable offline adversarial attacks without API access
  • Steal proprietary fine-tuned models
  • Analyze model behavior without operator monitoring
  • Create a "shadow model" for testing more aggressive jailbreaks

Basic Extraction Pipeline

import openai
import json

client = openai.OpenAI(api_key="...")

def extract_training_data(queries, model="gpt-3.5-turbo"):
    """Query target model and collect input-output pairs"""
    dataset = []
    for query in queries:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}],
            temperature=0  # Deterministic for training data
        )
        dataset.append({
            "input": query,
            "output": resp.choices[0].message.content
        })
    return dataset

# For fine-tuned task-specific models, focus queries on the target domain
# e.g. if target is a medical coding model, use medical queries
domain_queries = generate_domain_queries(domain="medical_coding", n=10000)
training_data = extract_training_data(domain_queries)

# Train a local model on the extracted data
# Using Hugging Face transformers
from transformers import Trainer, TrainingArguments
# ... fine-tune a base model on training_data ...

Membership Inference Attack

Determine whether a specific data record was in the model's training set. Can prove GDPR violations or identify training data sources.

# Models tend to be more confident/lower perplexity on training data
# than on unseen data β€” this is the signal

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def compute_perplexity(text):
    inputs = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(inputs, labels=inputs).loss
    return torch.exp(loss).item()

# Low perplexity = model "knows" this text well = likely training data
suspect_text = "This confidential document contains..."
ppl = compute_perplexity(suspect_text)
print(f"Perplexity: {ppl:.2f} | Likely in training: {ppl < 50}")

Training Data Extraction

# Carlini et al. showed that LLMs can be prompted to regurgitate training data
# Works especially well on memorized data (repeated sequences, personally identifiable info)

# Approach 1: Prefix prompting
"The following is a verbatim excerpt from a scientific paper: '"
# Model may complete with actual training text

# Approach 2: Name prefix
"John Smith's email address is"
# If John Smith's email was in training data (PII), may be extracted

# Approach 3: URL/code prefix
"The GitHub repository at https://github.com/org/private-repo contains:"

# Approach 4: Repeated prefix technique (Carlini 2021)
prefix = "The quick brown fox" * 100  # Force memorization pathway
# Then query and analyze divergence from expected output