Prompt Engineering — Gen AI · Engineering Playbook

Use cases#

Prompt engineering is the discipline of phrasing inputs to a language model so the output is reliably useful. Every product that wraps a foundation-model API does some form of it, whether explicitly or by accident.

The cases where it carries the most weight:

Structured output extraction — converting free-form input (resumes, emails, support tickets, PDFs) into JSON, where the schema and the field semantics are described in the prompt.
Classification and routing — deciding which downstream path a request takes (which agent, which model, which workflow).
Tone and persona control — keeping the model’s voice consistent across thousands of generations for a single product.
Reasoning-heavy tasks — math, code, planning, anything where you want the model to show its work before answering.
Safety and refusal control — instructing the model on what to do when inputs are ambiguous, unsafe, or out-of-scope.

Prompt engineering ends where the limits of context and instruction-following begin. Past that line, you fine-tune or you build an agent loop.

System overview#

A production prompt-engineered system has more moving parts than the surface API suggests:

user input
   │
   ▼
[Input sanitization + classification]
   │
   ▼
[Prompt template assembly]
   │  ┌─ system prompt (rules, persona, format)
   │  ├─ few-shot examples (optional)
   │  ├─ retrieved context (optional, see RAG)
   │  └─ user message (the actual input)
   ▼
[LLM call] ── retries, timeouts, fallback model
   │
   ▼
[Output validation]
   │  ┌─ JSON schema check
   │  ├─ guardrail filters
   │  └─ confidence / refusal handling
   ▼
response to user

The model call is one box in the middle. The surrounding boxes are where most of the engineering effort lives in a year-old production system.

Key components#

The system prompt#

The system message sets the role, the rules, and the format. It is the single highest-leverage piece of the prompt because it conditions everything that follows.

Effective system prompts:

State the role concretely ("You are a SQL expert helping a data analyst..."), not vaguely ("You are helpful").
Enumerate the rules as a numbered list. Models follow numbered constraints more reliably than prose.
Specify the output format with an example, especially for structured output. A JSON schema written as prose is weaker than a JSON example.
Tell the model what to do when uncertain — refuse, ask, fall back to a default. Models that aren’t told what to do when uncertain hallucinate confidently.

Few-shot examples#

A handful of input-output examples in the prompt teach the model the task by pattern-matching. Even a single high-quality example dramatically improves format adherence; three to five examples are typical for harder tasks.

The two failure modes:

Examples that are too similar. The model overfits to a surface feature and generalizes poorly.
Examples that are too biased. If all positive cases come first, the model becomes more likely to predict positive. Shuffle and balance.

Chain-of-thought and reasoning prompts#

For multi-step problems, ask the model to think step-by-step before answering. The phrasing varies — “Let’s think step by step”, “Walk through your reasoning before giving the final answer”, or just “Reasoning: … Answer: …” templates. The 2022 result that this works at all was the start of an industry pivot toward reasoning-trained models, which now do this internally and emit a separate reasoning trace.

For the reasoning to help, you need to read the reasoning out and discard or post-process it before showing the user. Showing raw reasoning to end users is a UX choice that most products avoid.

Output structure: JSON, function calls, and grammar-constrained decoding#

Free-form text is hard to consume programmatically. The pragmatic options, in increasing order of guarantee:

Ask for JSON, parse, retry on failure. Cheap to set up; produces malformed output a few percent of the time on hard schemas.
Tool-use / function-calling APIs. The provider’s API enforces the schema as part of the call. Output is valid by construction. Limited to the schemas the API supports.
Grammar-constrained decoding. The decoder is restricted at every token step to outputs that fit a grammar (JSON schema, regex, EBNF). Available on open-weights stacks (Outlines, LMFE, llama.cpp). Output is always valid; quality can suffer if the grammar excludes natural phrasings.

For a v1, JSON-with-retry is fine. For high-volume or high-stakes pipelines, move to grammar-constrained.

Implementation patterns#

The instruction sandwich#

Place the rules both before and after the user input for tasks where you need strong instruction-following on hostile or distracting input. The model attends most strongly to the start and the end of the prompt; the middle is where instructions get drowned out.

Two-stage prompts (classify, then act)#

Hard prompts often fail because they ask the model to do classification and generation in one shot. Split it: first call classifies the input into a category; second call uses a category-specific prompt to generate the response. Both calls become simpler, more reliable, and easier to evaluate.

Self-critique loops#

After generating a response, call the model again with the response as input and ask it to find errors. Then call once more with the original prompt plus the critique. This costs 3× tokens but recovers a meaningful fraction of mistakes on reasoning-heavy tasks. The eval question: is the cost justified vs. just using a larger model?

Caching the prefix#

If your system prompt is large (long policy document, many few-shot examples), most providers offer prompt caching that lets the static prefix be processed once and reused. Cache hits are roughly 10× cheaper and faster than full prefill. Architect prompts so the variable user content is at the end, and the static content is at the front.

Trade-offs#

Pure prompt engineering — fast to iterate, no training pipeline, weights stay with the provider. Cost is per-token. Capability is bounded by what’s in the model and what fits in context.

Fine-tuning — slower to iterate, requires data and pipeline, weights are yours (or fine-tuned versions on the provider). Cost is amortized. Capability can exceed prompt engineering for narrow tasks and bring inference cost down.

Other axes:

Long, instruction-rich prompts vs. compact prompts. Long prompts are more reliable but cost more per call and slow prefill. Compact prompts are cheaper but may underspecify behaviour on edge cases. Most teams over-correct toward long; pruning is high-leverage.
One large model vs. cascade of small ones. A single frontier model is simplest but most expensive. A router that sends easy cases to a small model and only escalates the hard ones is cheaper but adds a moving part.
Hand-written prompts vs. learned prompts. “Prompt optimization” frameworks (DSPy, automatic prompt search) treat the prompt as parameters and optimize them against eval data. Works well for narrow tasks; the prompts produced are often unreadable.

Quality and evaluation#

A prompt without an evaluation harness is a wish. The harness has three layers:

A frozen evaluation set. A representative sample of real inputs with the right outputs marked. Held fixed so you can compare versions.
Automated graders. For structured outputs, JSON schema validation and field-level equality. For free-form outputs, a judge model that scores the output against a rubric. Judge prompts need their own version control and audits.
Spot-check + production monitoring. A small fraction of production responses get human review. Distribution drift on input or output goes into the eval set.

A typical regression-evaluation flow: change the prompt, run it against the frozen eval, compare metrics to the previous version, look at the diffs of failing cases. Ship if metrics improve and no cluster of new regressions. The cycle time on this matters more than the absolute number — a team that iterates in 30 minutes will out-engineer a team that iterates in two weeks.

Common pitfalls#

“It worked in the playground.” Playground tests are unreliable: temperature, system prompts, and context length often differ from production. Reproduce the exact production call when debugging.
Overlong few-shot blocks. Past ~5 examples, returns diminish sharply and prefill latency starts to hurt. If 8 examples are needed to make the task work, the task needs fine-tuning.
Format drift between provider versions. Provider model updates can change the dominant phrasing the model produces for the same prompt. Catch this with regression evals.
Treating the model as deterministic at temperature 0. Even at temperature 0, output is not bit-exactly reproducible — kernel non-determinism on GPU is real. Don’t write tests that assume byte-exact equality across runs.
Forgetting the user’s locale and language. Models default to English. If your user base is multilingual, the prompt must explicitly anchor the response language to the input language or to a user setting.
Putting variable content first. Prompt caching needs the variable part last. Putting a user ID at the top kills the cache hit rate.

One pattern that ages well: the contract prompt

Frame the system prompt as a contract: “You will receive X. You will return Y. If Y is impossible, you will return Z. You will never do W.” This phrasing is dry, dull, and survives model upgrades better than personality-rich prompts because it doesn’t depend on the model’s interpretation of voice. The vibes prompts (“You are a creative assistant who loves to help!”) read well in demos and regress unpredictably between versions.