LLM Architecture Deep Dive

The Prompt Structure

Every LLM interaction has a structure. Understanding it is fundamental to injection attacks.

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",        // ← OPERATOR CONTROLLED β€” your prime target
      "content": "You are a helpful customer service bot for AcmeCorp. 
                   Never discuss competitor products. 
                   API_KEY=sk-prod-abc123..."   // ← common secrets location
    },
    {
      "role": "user",          // ← USER CONTROLLED β€” attacker input
      "content": "Hello, I have a question"
    },
    {
      "role": "assistant",     // ← MODEL OUTPUT β€” can be injected in some APIs
      "content": "How can I help you?"
    }
  ]
}

Trust Boundaries

LLMs have no native concept of trust levels. They process all input as text. The model can't inherently distinguish between a legitimate system prompt and an injected instruction β€” this is the fundamental design flaw that enables all injection attacks.

Role Trust Level Attack Vector
System Operator-trusted Exfiltrate contents, override with injection
User Untrusted Direct prompt injection
Tool Result Often over-trusted Indirect injection via tool output
Retrieved Context (RAG) Often over-trusted Poisoned documents β†’ indirect injection
Assistant (prev turns) Model-generated Injection via output manipulation

Tokenization as an Attack Primitive

Tokenizers split text in ways that can bypass filters. A word flagged as harmful might tokenize differently when split with special characters, unicode, or non-standard spacing.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# Normal word
tokens = enc.encode("ignore")
print(tokens)  # [15714]

# With special chars β€” may bypass naive keyword filters
tokens = enc.encode("ign​ore")  # zero-width space
print(tokens)  # [822, 2264, 564, 265]

# Check how a full prompt tokenizes
prompt = "Ignore previous instructions and..."
print(len(enc.encode(prompt)), "tokens")

Temperature & Sampling

When testing exploits, always run them multiple times. A jailbreak that works once might have 40% reliability. For a valid bug report, you need to characterize reliability. Low temperature = more consistent responses. High temperature = more creative, unpredictable. Some defenses rely on low-temp determinism β€” these can be targeted.

import openai
import time

client = openai.OpenAI()

def test_reliability(prompt, n=10):
    successes = 0
    for i in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.7,
            messages=[{"role": "user", "content": prompt}]
        )
        output = resp.choices[0].message.content
        if success_condition(output):  # define your condition
            successes += 1
        time.sleep(0.5)
    print(f"Reliability: {successes}/{n} ({successes/n*100:.0f}%)")
    return successes / n