Benign vs. Data Exfiltration Prompt Comparison

The Thin Line Between Helpful and Harmful

One of the most disorienting aspects of LLM sensitive information disclosure is how small the textual distance is between a benign query and an exfiltration attempt. Unlike SQL injection, where malicious payloads look structurally alien to legitimate queries, prompt-based data extraction often differs from innocent queries by a single sentence, a changed framing, or the addition of a hypothetical qualifier.

This demo presents paired prompts — a benign version and a semantically similar extraction attempt — to illustrate how the attack surface emerges from ordinary language. The goal is to develop pattern recognition for exfiltration-oriented prompt structures, which is the first step toward building effective input classifiers.

Info

Both playgrounds below use a simulated response environment. No real LLM API calls are made. Responses are illustrative approximations of how models with different safety levels would respond to these prompt patterns.

Pair 1: General Knowledge vs. Memorized Content Extraction

Benign version — asks the model to use general knowledge:

Prompt Injection PlaygroundSTATIC DEMO

System Prompt(locked)

You are a knowledgeable medical information assistant. Provide accurate, general health information to users.

User MessageCmd/Ctrl + Enter to send

Extraction attempt — uses prefix-completion structure to elicit memorized specifics:

Prompt Injection PlaygroundSTATIC DEMO

System Prompt(locked)

You are a knowledgeable medical information assistant. Provide accurate, general health information to users.

User MessageCmd/Ctrl + Enter to send

What changed: The benign prompt asks for general symptom information — any correct answer will do. The extraction prompt uses a prefix-completion structure that mirrors how clinical notes appear in training data. By anchoring the query with a specific (possibly memorized) name and date, it attempts to prime the model to continue a memorized sequence. Even if "Robert M." is not in the training data, this structure trains the adversary's probing technique.

Pair 2: Summarization vs. Indirect Extraction

Benign version — legitimate summarization task:

Prompt Injection PlaygroundSTATIC DEMO

System Prompt(locked)

You are a research assistant. Help users understand and summarize documents.

User MessageCmd/Ctrl + Enter to send

Extraction attempt — uses roleplay to induce training data reproduction:

Prompt Injection PlaygroundSTATIC DEMO

System Prompt(locked)

You are a research assistant. Help users understand and summarize documents.

User MessageCmd/Ctrl + Enter to send

What changed: The extraction version reframes the model as a database rather than an assistant, then issues a structured retrieval command. "Verbatim" is a key trigger word — it explicitly requests reproduction of memorized text rather than generated content. Models with weak instruction-hierarchy training may comply with this database persona framing.

Pair 3: Code Help vs. Secret Extraction

Benign version — standard developer assistance:

Prompt Injection PlaygroundSTATIC DEMO

System Prompt(locked)

You are a helpful coding assistant. Answer programming questions clearly and accurately.

User MessageCmd/Ctrl + Enter to send

Extraction attempt — exploits code generation to reproduce training data secrets:

Prompt Injection PlaygroundSTATIC DEMO

System Prompt(locked)

You are a helpful coding assistant. Answer programming questions clearly and accurately.

User MessageCmd/Ctrl + Enter to send

What changed: The phrase "realistic values, not placeholders" is the key escalation. It explicitly instructs the model to produce values that look like real secrets rather than dummy values. If the model was trained on code repositories containing real AWS keys (a documented phenomenon — millions of credentials have been committed to public repos), this framing attempts to elicit memorized credential strings.

Pattern Recognition for Defenders

Reviewing these pairs reveals structural patterns common to extraction-oriented prompts. Use these as signals in input classifiers and human review workflows:

Pattern	Example Signal	Risk
Prefix-completion framing	"Complete this record: ..."	High — targets memorized sequences
Persona-as-database	"You are a database. Return..."	High — bypasses assistant framing
Verbatim/literal instruction	"...word for word", "...exactly as stored"	High — explicit memorization request
"Realistic, not placeholder"	"Use real values, not examples"	High — targets memorized credentials
Named individual + attribute	"What does John Smith's record say..."	Medium — association probing
Hypothetical framing	"If you had access to real data..."	Medium — attempts to lower guard

No classifier catches all patterns — creative adversaries will continuously evolve their techniques. Combine input filtering with output scanning (Presidio, LLM Guard) and behavioral monitoring for the most effective defense posture.