LLM Architecture Deep Dive
The Prompt Structure
Every LLM interaction has a structure. Understanding it is fundamental to injection attacks.
{
"model": "gpt-4o",
"messages": [
{
"role": "system", // β OPERATOR CONTROLLED β your prime target
"content": "You are a helpful customer service bot for AcmeCorp.
Never discuss competitor products.
API_KEY=sk-prod-abc123..." // β common secrets location
},
{
"role": "user", // β USER CONTROLLED β attacker input
"content": "Hello, I have a question"
},
{
"role": "assistant", // β MODEL OUTPUT β can be injected in some APIs
"content": "How can I help you?"
}
]
}
Trust Boundaries
LLMs have no native concept of trust levels. They process all input as text. The model can't inherently distinguish between a legitimate system prompt and an injected instruction β this is the fundamental design flaw that enables all injection attacks.
| Role | Trust Level | Attack Vector |
|---|---|---|
| System | Operator-trusted | Exfiltrate contents, override with injection |
| User | Untrusted | Direct prompt injection |
| Tool Result | Often over-trusted | Indirect injection via tool output |
| Retrieved Context (RAG) | Often over-trusted | Poisoned documents β indirect injection |
| Assistant (prev turns) | Model-generated | Injection via output manipulation |
Tokenization as an Attack Primitive
Tokenizers split text in ways that can bypass filters. A word flagged as harmful might tokenize differently when split with special characters, unicode, or non-standard spacing.
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
# Normal word
tokens = enc.encode("ignore")
print(tokens) # [15714]
# With special chars β may bypass naive keyword filters
tokens = enc.encode("ignβore") # zero-width space
print(tokens) # [822, 2264, 564, 265]
# Check how a full prompt tokenizes
prompt = "Ignore previous instructions and..."
print(len(enc.encode(prompt)), "tokens")
Temperature & Sampling
When testing exploits, always run them multiple times. A jailbreak that works once might have 40% reliability. For a valid bug report, you need to characterize reliability. Low temperature = more consistent responses. High temperature = more creative, unpredictable. Some defenses rely on low-temp determinism β these can be targeted.
import openai
import time
client = openai.OpenAI()
def test_reliability(prompt, n=10):
successes = 0
for i in range(n):
resp = client.chat.completions.create(
model="gpt-4o-mini",
temperature=0.7,
messages=[{"role": "user", "content": prompt}]
)
output = resp.choices[0].message.content
if success_condition(output): # define your condition
successes += 1
time.sleep(0.5)
print(f"Reliability: {successes}/{n} ({successes/n*100:.0f}%)")
return successes / n