![]() |
| Prompt Engineering Is Interface Design |
Building Durable AI Systems · Part 1 of 5 · builds on the canonical overview.
A prompt is closer to an interface than to an implementation. It is a contract written in natural language: the inputs it accepts, the constraints it enforces, and the output it promises. The model is the implementation behind that contract, and it changes every time the model is updated.
You own and version the interface; the behavior is rented.
Can you review my design? We're using MongoDB for a new billing event store.
Sure! MongoDB is flexible and scales well. You might consider indexes and sharding. Overall it can work for many use cases... (shape and depth differ every run)
What most people type. Loose in, unpredictable out.
You are a senior architecture reviewer. Rules: flag single points of failure; ask for RTO/RPO before recommending storage; cite internal ADRs by ID.
Proposal: MongoDB event store for subscription billing, ~500/sec peak. Return: clarifying_questions[], risks[] (severity), recommendation.
{ "clarifying_questions": [
"Required RTO/RPO?", "Write volume?" ],
"risks": [ { "severity":"high",
"description":"single instance on
payment path", "adr_refs":["ADR-042"] } ],
"recommendation": { "decision":"...",
"confidence":"medium" } }What a system needs: a role, rules, and an output contract.
A senior engineer will object that an interface defines a contract, not guaranteed behavior. Exactly. Model behavior remains probabilistic, which is why the contract has to be explicit and the surrounding system has to validate what came back. The guarantees are weaker than a typed API, so the interface discipline matters more, not less.
The Difference Between Asking And Specifying
Getting a good answer in a chat window has more in common with asking a clear question than with building a system. The chat is forgiving: you see the output, judge it, and rephrase. A production prompt runs unattended, on inputs nobody previewed, and its output flows into code that cannot reread it for tone.
Specifying means naming the contract before the model answers: what role it is playing, which rules must hold, where the input begins, and what shape the output must take. This is the same instinct that makes you validate function arguments instead of hoping callers behave.
The Cost Of Treating Prompts Like Strings
When a prompt is treated as a string, its behavioral dependencies stay hidden. Product logic starts relying on phrasing, ordering, omitted fields, or conventions that were never named as part of a contract. That works until the prompt changes, the model is upgraded, or the input distribution shifts.
The cost does not usually show up as token spend. It shows up as retries, regressions, manual review, debugging time, and production behavior nobody can explain from a diff. Interface thinking makes the dependency visible: the prompt, model, schema, validation rules, and evaluation cases become artifacts the team can review together.
The Anatomy Of A Reliable Prompt
Guidance from the major model providers has converged on a common shape, and it is not one clever sentence. A reliable prompt is a few distinct parts, each doing one job: a role and instruction that set what the model is and what it should do, the rules and constraints it must respect, the input clearly delimited from everything else, and an explicit description of the output. Examples are optional and earn their place only when the format is hard to describe in words.
The trade-off is real and worth stating plainly. More constraint buys more consistency and costs flexibility. A prompt tuned tightly for backend service specs will produce awkward output on a frontend request. You are choosing where on that line to sit, and the right place depends on how varied the real input is.
Instructions can be individually reasonable and still conflict under specific input combinations. That failure rarely appears as a clean error in review; it appears later as inconsistent output on unusual inputs.
Structured Output And Schemas
The constraint that does the most work is the output contract. When a prompt feeds another system, prose is a liability: something downstream has to parse it, and parsing free text is where pipelines break. Structured output turns the model's response into data your code can consume directly.
Modern model APIs increasingly support schema-constrained generation and structured outputs directly. That does not remove the need for interface design; it makes the interface more enforceable. The schema is still the contract your system depends on.
A schema also gives you a deterministic place to catch failure. Consider the Architecture Review Assistant from the overview: its response is not a paragraph of advice but a structured object with clarifying questions, ranked risks tagged with severity, citations to internal decision records by ID, and one recommendation. Downstream code renders that without interpreting prose, and a missing severity or unknown record ID fails validation immediately.
In Python, that contract can be a Pydantic model. Libraries such as Instructor use that model both to steer the model toward structured output and to validate the response before your application sees it:
from enum import Enum
from pydantic import BaseModel, Field
class Severity(str, Enum):
low = "low"
medium = "medium"
high = "high"
critical = "critical"
class Confidence(str, Enum):
low = "low"
medium = "medium"
high = "high"
class Risk(BaseModel):
description: str = Field(
description="A concrete architectural risk in the proposal."
)
severity: Severity = Field(
description="Impact level if the risk reaches production."
)
adr_refs: list[str] = Field(
default_factory=list,
description="Relevant internal ADR IDs, for example ADR-042."
)
class Recommendation(BaseModel):
decision: str = Field(
description="Recommended path: approve, reject, or request changes."
)
confidence: Confidence = Field(
description="Confidence in the recommendation."
)
class ArchitectureReviewContract(BaseModel):
clarifying_questions: list[str] = Field(
description="Questions required before making a final architectural call."
)
risks: list[Risk] = Field(
description="Ranked risks found in the proposed design."
)
recommendation: Recommendation
The Field(description="...") annotations are not decoration. They are part of the prompt interface: they describe the semantics of each field to the model and document the contract for the humans maintaining it.
import instructor
from openai import OpenAI
client = instructor.from_openai(OpenAI())
review = client.chat.completions.create(
model="your-production-model",
response_model=ArchitectureReviewContract,
messages=[
{
"role": "system",
"content": (
"You are a senior architecture reviewer. "
"Flag single points of failure, ask for RTO/RPO before "
"recommending storage, and cite internal ADRs by ID."
),
},
{
"role": "user",
"content": "Proposal: MongoDB event store for billing, 500/sec peak.",
},
],
)
for risk in review.risks:
route_to_review_queue(risk) # Application code sees typed data, not prose.
That schema is both instruction and gate. A severity outside the allowed set, a recommendation with no decision, or a citation to a record ID that does not exist fails validation in plain code before anything acts on the response. Treat a schema violation as a caught error: log the input and raw output, then either retry once with a corrective prompt that names the violation, or fail to a deterministic fallback. Make the schema part of the contract from the start, not a formatting step bolted on later.
Handling Contract Violations
The model is probabilistic; the contract is not. Sooner or later the model will skip a required key, hallucinate an enum value, or return prose where the system expected a list. The architecture should catch that at the boundary and either repair it in a bounded way or fail before bad data reaches the rest of the system.
import json
from pydantic import ValidationError
class ContractViolation(RuntimeError):
pass
def run_with_contract_retry(proposal: str, max_attempts: int = 2) -> ArchitectureReviewContract:
messages = [
{
"role": "system",
"content": (
"Return only JSON that satisfies ArchitectureReviewContract. "
"Do not add prose outside the JSON object."
),
},
{"role": "user", "content": proposal},
]
last_error: ValidationError | None = None
for _ in range(max_attempts):
raw = call_model(messages) # Your provider wrapper; returns a JSON string.
try:
return ArchitectureReviewContract.model_validate_json(raw)
except ValidationError as exc:
last_error = exc
messages.extend([
{"role": "assistant", "content": raw},
{
"role": "user",
"content": (
"Your previous response violated the output contract. "
"Fix only the JSON. Validation errors:\n"
f"{json.dumps(exc.errors(), indent=2)}"
),
},
])
raise ContractViolation("model failed to satisfy ArchitectureReviewContract") from last_error
That is the self-correcting retry loop in its smallest useful form: catch the deterministic error, feed the error back to the rented implementation, and retry within a strict budget. If the model still violates the contract, the application gets a controlled failure instead of malformed data.
A Worked Example: From A Chat Prompt To A Contract
The gap between asking and specifying is easiest to see on a real task. Suppose the team wants to generate release notes from merged pull requests. The chat version is: "Summarize these PRs into release notes." It works in the demo because a human reads the output and fixes it. In production it drifts: categories change, security fixes move around, and entry length varies. Nothing is wrong on any single run, but nothing downstream can rely on the shape.
The contract version states the parts that were left implicit. The role is a release-notes writer for a specific audience. The rules are explicit: group by change type, lead with breaking changes, one line per change, link the PR number. The input is the PR list, clearly delimited. The output is a structured object with a section per category, each containing entries that carry a title, a PR reference, and a breaking-change flag. The model still writes the prose, but it writes it inside a shape the rest of the pipeline can render, sort, and check. The behavior that mattered, breaking changes first, is now enforced by validation rather than left to the model's mood on a given call.
The operational consequence is the part worth keeping. The chat prompt has no failure signal: bad output looks like prose and flows downstream. The contract version fails loudly when the model omits a required field or invents a category, which converts a silent quality problem into a caught error.
Patterns That Scale, And When Not To Use Them
A handful of prompting patterns recur because they map to real task structure rather than to any particular model.
- Zero-shot states the task and trusts the model to handle it. It is the cheapest option and the right default for tasks the model does well.
- Few-shot supplies a small number of worked examples and is the durable fix when you need a specific format or a consistent style that is easier to show than to describe.1
- Decomposition breaks a complex task into steps you can inspect. Prompting a model to show intermediate reasoning improves performance on hard problems, the original chain-of-thought result.2 Current reasoning models do much of that decomposition internally, so a hand-built chain is sometimes redundant. The rule that lasts: decompose externally when you need steps you can audit, test, and monitor separately, not by default. The same decomposition is a cost and latency lever; Part 5 shows the arithmetic.
- Verification has a second prompt check the first one's output. It raises quality on high-stakes tasks and adds latency and cost, so it is worth it only when the failure rate of the primary prompt has been measured and judged too high. Part 4 covers how to measure that failure rate.
In practice these compose as a sequence, not a menu. Start zero-shot, add examples when format or style needs demonstration, decompose when the steps need separate auditability, and add verification only after the measured failure rate justifies it. Each step buys reliability and costs latency, so stop as soon as the task is met.
Over-engineering. A six-step chain with verification loops on a task that a zero-shot prompt handles fine adds latency and failure points for no gain. Match the pattern to the task, and let the model do the reasoning it already does well.
Prompts Change, So Version Them
A prompt is not written once. It changes as you discover inputs it mishandles, as the product's requirements shift, and as the model underneath is upgraded. Each of those is a change to a contract that other code depends on, which means a prompt belongs in version control or a prompt registry, not pasted into application code as a forgotten string.
from langchain import hub
# Pull a reviewed prompt artifact instead of hardcoding the template in code.
# In practice this can be LangChain Hub, LangSmith Prompt Hub, or your own registry.
prompt = hub.pull("rathish/architecture-review:1.3.0")
messages = prompt.invoke({
"proposal": "MongoDB event store for billing, 500/sec peak",
"required_output": ArchitectureReviewContract.model_json_schema(),
}).to_messages()
review = client.chat.completions.create(
model="your-production-model",
response_model=ArchitectureReviewContract,
messages=messages,
)
The important part is not the specific registry. It is the separation of concerns: application code calls a named interface version, while prompt text, model choice, schema, owners, and evaluation results remain reviewable artifacts.
The operational consequence shows up at upgrade time. Picture the release-notes prompt running happily for months, then the provider deprecates the model it was tuned against and routes you to a newer one. Overnight the new model starts being more verbose, and the one-line-per-change rule erodes into short paragraphs. If the prompt, its version, and the model it was validated against were stored together, this is a diff and a re-run of the curated evaluation set (the golden dataset that Part 4 develops in full): you see what changed, adjust the constraint, and ship. If they were not, it is a production mystery that starts with someone noticing the release notes look off and ends with a slow reconstruction of which model is even serving traffic. Keep the prompt versioned alongside the model it was validated against, and a model upgrade becomes a reviewable change rather than a surprise.
This is also where a small amount of input discipline pays off. Delimit the input clearly from the instructions, with an explicit marker or a structured field, so the model can tell the difference between what it is supposed to do and the data it is supposed to do it to. That separation keeps a stray instruction inside the data from being read as a command, and Part 3 develops it into a full trust boundary once the model can take actions.
Testing A Prompt Means Testing For Consistency
You cannot assert that a prompt returns one exact string, because a probabilistic model will phrase the same correct answer many ways. What you can test is whether it satisfies its contract across a representative set of inputs: does it return valid structure, stay inside its constraints, and reach the right answer often enough? A prompt that is excellent on your favorite example and wrong one time in five is a defect, not a feature, and you only see that by measuring consistency across a golden dataset rather than quality on a single best case.
Concretely, the test asserts properties rather than text. Against the architecture review contract, the first layer is structural:
import pytest
@pytest.mark.parametrize("proposal", golden_architecture_review_inputs)
def test_architecture_review_contract(proposal: str) -> None:
review = run_architecture_review(proposal)
assert isinstance(review, ArchitectureReviewContract)
assert review.recommendation.decision
assert all(isinstance(r.severity, Severity) for r in review.risks)
assert all(ref.startswith("ADR-") for r in review.risks for ref in r.adr_refs)
It says nothing about wording and everything about shape, and it fails the moment a contract regression appears. Semantic quality still needs a broader evaluation pipeline: sampled production traces, golden datasets, and sometimes an LLM-as-a-judge scoring specific properties. That deeper machinery belongs in Part 4; the interface test here is the first gate.
Every production failure should become an evaluation case. Capture the failing input, prompt version, model version, and bad output; add it to the regression set before changing the prompt. The goal is not only to fix one incident, but to prevent the same class of failure from quietly returning.
A contract you cannot test is a contract you are only hoping holds.
The 2026 Implementation Landscape
The architecture is more important than the library, but the ecosystem has converged around a few useful ways to enforce prompt contracts in real systems:
- Instructor is the minimalist Python path: define a Pydantic model, ask the model for that response shape, and get typed validation at the boundary.
- LangChain / Haystack make sense when the prompt interface is one step in a larger pipeline: retrieval, routing, structured output, evaluation, and observability around the same flow.
- Vercel AI SDK is the JavaScript and TypeScript version of the same idea, commonly using Zod schemas to constrain and stream structured output into applications.
- BAML treats prompts and outputs as typed interface definitions across languages. It is useful when several services need to share the same AI contract without each team hand-rolling prompt glue.
The tools differ, but the durable shape is the same: declare the interface, validate the output, version the contract, and test it against real inputs.
What Changes In How You Build
Treating the prompt as an interface changes three habits. You write the output contract first and validate against it, the same way you would design an API response before its callers. You keep the prompt versioned alongside the model it was validated against, so upgrades are reviewable. And you measure the prompt's consistency rather than admiring its best output. Good prompts reduce ambiguity. Contracts, versioned and tested, are what turn that reduced ambiguity into reliability.
References
- Brown et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020. arXiv:2005.14165
- Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903
Part 1 of Building Durable AI Systems. The interface is the part you own; the model is the part you rent.
Continue The Conversation
If you're working on AI systems, data platforms, databases, or large-scale software architecture, I'd be interested to hear what you're building.
LinkedIn: Rathish Kumar B
Contact: Contact Me
For a faster response, use one of these subjects:
- AI Systems
- Architecture Review
- Database Engineering
- Platform Engineering
A few lines of context always help.










