Beyond Prompt Engineering: Building AI Systems That Outlast the Model

Building Durable AI Systems · Canonical overview. Five deep dives follow, linked at the end.

Every few months a new model arrives, tops the benchmarks, and resets the conversation. The model is the fastest-changing layer in your stack. This article is about the part that does not change: the engineering that keeps an AI feature working after the model under it has been replaced.

If you have shipped an AI feature, you have already felt the consequence: a prompt that worked well on one model returns subtly different output after a routine upgrade, and the behavior you tuned for is gone. The model will be replaced several times over the life of a system you build today, and each replacement is a test of every architectural decision you made before it arrived. Architecture strongly shapes the economics of each replacement: the latency, cost, and reliability profile of an AI feature tends to be decided when it is designed, not when the invoice arrives or the incident fires.

That fact is the whole argument of this series. The work that survives a model swap is ordinary engineering: the contracts you put around the model, the context you assemble for it, the boundaries you place on what it can do, the way you measure its output, and the way you operate it under cost and failure pressure.

Prompt quality decides whether a demo impresses. System quality decides whether the feature is still working six months and two model versions later.

Silent degradation is not unique to AI systems, but it becomes unusually difficult to observe in them: a stale document in the retrieval corpus, a model that agrees with a flawed premise, an input shaped by injected content, each one producing well-formatted, confident output that existing monitoring tends not to catch. Three of the layers below exist as the response to exactly this. Context management keeps bad inputs from reaching the model, evaluation detects bad outputs leaving it, and observability captures what happened in between.

This article lays out the complete mental model. The five layers below are not a framework to adopt or a vendor stack to buy. They are the parts of any system built on a probabilistic component, and each one maps to an assumption that breaks when that component enters a system which traditionally exposed clearer execution boundaries. The interface is a contract that used to be implicit in a typed signature and now has to be written and owned. Knowledge is state a conventional component accesses on its own but that here must be assembled at request time. Action is intent and execution, usually coupled, that now have to be separated deliberately. Evaluation covers failures that elsewhere surface as error codes and here surface as silent degradation. Operations is the cost and reliability profile that emerges predictably elsewhere but here is shaped by upstream architectural decisions.

Five Layers Around A Replaceable Model

  1. Interface: The Contract You Own (the model lives here)
     A natural-language contract: inputs, constraints, output shape. The model is the implementation behind it, a replaceable component whose behavior you rent, not own.
  2. Knowledge: What The Model Sees
     What you assemble into the context window at request time: retrieved documents, prior turns, environment facts. Decided per request, not baked into the model.
  3. Action: What It Is Allowed To Do
     Tool calls that trigger effects in other systems. The model proposes intent; your infrastructure authorizes and executes. Two different jobs.
  4. Evaluation: How You Measure Output
     The layer that makes a probabilistic component operable. Without it you cannot detect a regression, justify an upgrade, or see silent degradation.
  5. Operations: Cost, Routing, Reliability
     Latency, cost, and reliability under real load. Where the architectural choices above become visible on the invoice and in the incident channel.

The five layers remain the right engineering concerns regardless of which model sits inside them. That is what makes them durable.

One Running Example

To keep this concrete rather than abstract, one system runs through every section: an Architecture Review Assistant. An engineer submits a design proposal, and the assistant reviews it the way a senior reviewer with an SRE's instincts would. It flags single points of failure, checks the design against the company's own architecture decision records, looks up the services the proposal names, asks for the numbers it needs before recommending anything, and produces a ranked risk assessment with its reasoning logged.

It is a useful example because it is realistic and unforgiving, and because it stresses every layer in the model above. The output feeds engineering decisions, so a confident wrong answer is expensive - that is an evaluation problem. It needs private company knowledge the model was never trained on - that is a knowledge problem. It has to take actions, not just talk - that is an action and trust boundary problem. And it has to keep working when the model underneath it changes - that is what the interface contract is for. Each section below is one of those problems made concrete.

Interface: The Contract You Own

The durable idea is that a prompt is closer to an interface than to an implementation. It is a contract written in natural language: the inputs, the constraints, and the output you expect. The model is the implementation behind that contract, and the implementation changes every time the model is updated. You own and version the interface; the behavior is rented and can shift underneath you. That is why the contract has to be explicit and the output has to be checked rather than trusted.

In the Architecture Review Assistant, the interface is the system prompt and the output schema. The prompt fixes the reviewer's role and its non-negotiable rules: flag single points of failure, ask for recovery-time and recovery-point objectives before recommending a storage technology, and cite internal decision records by ID. The output is structured into clarifying questions, a ranked risk list, and a recommendation, so downstream code can consume it without parsing prose.
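As a rough illustration of "checked rather than trusted", the sketch below parses a model response into the three parts named above and rejects anything that does not fit. The field names (`clarifying_questions`, `risks`, `adr_ids`, `recommendation`) and the `parse_review` helper are assumptions made for this example, not a schema taken from the assistant itself.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    rank: int
    summary: str
    adr_ids: tuple[str, ...]  # decision records cited for this risk


@dataclass(frozen=True)
class Review:
    clarifying_questions: tuple[str, ...]
    risks: tuple[Risk, ...]
    recommendation: str


def parse_review(raw: str) -> Review:
    """Parse model output; raise ValueError if it breaks the contract."""
    try:
        data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
        questions = tuple(str(q) for q in data["clarifying_questions"])
        risks = tuple(
            Risk(int(r["rank"]), str(r["summary"]), tuple(r.get("adr_ids", ())))
            for r in data["risks"]
        )
        recommendation = data["recommendation"]
    except (KeyError, TypeError, AttributeError) as exc:
        raise ValueError(f"output does not match the review schema: {exc}") from exc
    if not isinstance(recommendation, str) or not recommendation.strip():
        raise ValueError("recommendation must be a non-empty string")
    # The contract asks for a ranked list, so ranks must run 1..n in order.
    if [r.rank for r in risks] != list(range(1, len(risks) + 1)):
        raise ValueError("risks must be ranked 1..n in order")
    return Review(questions, risks, recommendation)
```

Downstream code then depends on `Review`, not on the wording of whatever model produced it, which is what lets the model change without the consumers noticing.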

The failure that compounds quietly is prompt drift. Instructions accumulate, an edge case gets handled by adding a clause, then another, until the system prompt is several thousand tokens that no one fully understands or owns. No individual instruction is wrong, but they conflict under specific input combinations in ways that only surface in production. The operational response is prompt ownership: treat the system prompt as a versioned, reviewed artifact with a clear owner, not a running document anyone can append to. A wording change is a contract change, and it deserves the same review and regression testing you would give a change to an API signature. In one case, a prompt change that improved benchmark scores in staging regressed a specific class of production queries the benchmark did not cover. Part 1 develops this into structured output, schemas, and prompt testing.
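One possible way to make prompt ownership enforceable is to fingerprint the prompt text and fail a build when the text changes without a reviewed version bump. This is a minimal sketch under assumptions: `PromptVersion`, `check_unreviewed_change`, and the `approved` mapping are hypothetical names, and a real setup would keep the approved fingerprints wherever reviews are recorded.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    version: str
    owner: str
    text: str

    @property
    def fingerprint(self) -> str:
        # Short content hash; any wording change produces a different value.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


def check_unreviewed_change(current: PromptVersion, approved: dict[str, str]) -> None:
    """`approved` maps version -> fingerprint recorded when the prompt was reviewed."""
    expected = approved.get(current.version)
    if expected is None:
        raise ValueError(f"prompt version {current.version} has not been reviewed")
    if expected != current.fingerprint:
        raise ValueError("prompt text changed without a version bump")
```

Run as a CI step, a check like this turns "someone appended a clause" into a failed build that routes through the same review as any other contract change.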

Knowledge: What The Model Sees At Request Time

A model knows what its training captured. It does not know your architecture decision records, last week's incident, or the service your team shipped yesterday. The durable distinction is between what the model learned during training and what it needs for this specific request, and the second is something you assemble at request time rather than something the model already has. This assembly, deciding what goes into the context window and in what order, is the layer that most directly determines output quality.

In the assistant, a request for a new subscription-billing event store retrieves two pieces of company knowledge: the decision record listing approved databases, and the incident report from the time a single database instance on the payment path breached its recovery objective during an availability-zone outage. Those retrieved passages shape the recommendation more than the wording of the question does.

The characteristic failure is retrieval pollution: a document that is on-topic but not actually relevant gets pulled into context, the model reasons confidently from it, and the wrong output is blamed on the model when the fault was retrieval. It is the most common context failure and the most consistently misdiagnosed. The operational lesson is to evaluate retrieval on its own, before any end-to-end test: given a known query, does retrieval return the right documents? Most of what looks like model inconsistency is context construction. A retrieval pipeline that looked correct in testing can silently serve a corpus that has not been updated in months - the model performs as designed on inputs that no longer reflect the system. Part 2 covers the full failure surface: knowledge sources, staleness, assembly order, and retrieval evaluation.
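Evaluating retrieval on its own can be as small as a recall check over a handful of labeled queries. The sketch below assumes a `retrieve` callable that returns document IDs in ranked order; the function name, the `labeled` mapping, and the value of `k` are illustrative choices, not part of any particular retrieval library.

```python
from typing import Callable


def recall_at_k(
    retrieve: Callable[[str], list[str]],
    labeled: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Mean fraction of expected documents that appear in the top-k results."""
    if not labeled:
        raise ValueError("need at least one labeled query")
    total = 0.0
    for query, expected in labeled.items():
        if not expected:
            raise ValueError(f"no expected documents for query: {query!r}")
        returned = set(retrieve(query)[:k])
        total += len(expected & returned) / len(expected)
    return total / len(labeled)
```

Tracking this number separately from end-to-end quality makes it possible to tell a retrieval regression from a model regression, which is the misdiagnosis described above.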

Action: What The Model Is Allowed To Do

When a model can call tools, it stops only producing text and starts triggering effects in other systems. The durable principle is a separation of powers: the model expresses intent by proposing a call, and the infrastructure decides whether to authorize and execute it. The model is an untrusted caller that happens to speak your API, and it should be treated with the same suspicion you would apply to any input crossing a trust boundary.

The Architecture Review Assistant has a tool that queries the internal service registry for owners, service-level agreements, and deploy history, so it can look up a service the proposal names instead of guessing from the description. The model proposes the lookup; the surrounding code validates the parameters against a schema, checks that this caller is allowed to read the registry, runs the query, and returns the result.
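The propose, validate, authorize, execute sequence might look like the following sketch. Everything named here is an assumption for illustration: the tool name `registry_lookup`, the `registry:read` scope, the allowed field names, and the `registry` callable standing in for the real service registry client.

```python
from dataclasses import dataclass
from typing import Any, Callable

ALLOWED_FIELDS = {"owner", "sla", "deploy_history"}  # assumed registry fields


@dataclass(frozen=True)
class ToolCall:
    """A call proposed by the model. Nothing has been executed yet."""
    name: str
    arguments: dict[str, Any]


class ToolCallRejected(Exception):
    pass


def validate_registry_lookup(call: ToolCall) -> tuple[str, list[str]]:
    if call.name != "registry_lookup":
        raise ToolCallRejected(f"unknown tool: {call.name}")
    args = call.arguments
    if set(args) != {"service", "fields"}:
        raise ToolCallRejected("arguments must be exactly 'service' and 'fields'")
    service, fields = args["service"], args["fields"]
    if not isinstance(service, str) or not service:
        raise ToolCallRejected("service must be a non-empty string")
    if (
        not isinstance(fields, list)
        or not fields
        or not all(isinstance(f, str) for f in fields)
        or not set(fields) <= ALLOWED_FIELDS
    ):
        raise ToolCallRejected("fields must be a non-empty list of allowed field names")
    return service, fields


def execute_tool_call(
    call: ToolCall,
    caller_scopes: set[str],
    registry: Callable[[str, list[str]], dict],
) -> dict:
    service, fields = validate_registry_lookup(call)  # schema check first
    if "registry:read" not in caller_scopes:          # then authorization
        raise ToolCallRejected("caller may not read the registry")
    return registry(service, fields)                  # only now execute
```

The model never holds the registry client; it only produces a `ToolCall` value, and the surrounding code decides whether that value turns into a query.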

Those execution guardrails earn their place: a missing field, an unauthorized caller, a call outside the schema - all caught before execution. The failure they cannot catch is a model that produces a valid call because the content it was reading contained instructions designed to produce exactly that call. Correct tool, correct parameters, every authorization check passing - and the manipulation already complete at the retrieval stage, before the execution layer saw anything. This is why the trust boundary belongs at the context layer, not the execution layer. Part 3 traces a complete request lifecycle through authorization, orchestration, and bounded execution.

Evaluation: How You Measure Output

A probabilistic component cannot be tested with equality assertions, because the same input can produce different valid outputs. The durable idea is that evaluation is the layer that makes a non-deterministic component operable at all. If you cannot measure output quality, you cannot detect a regression, justify a model upgrade, or tell whether yesterday's change helped. This is the layer teams skip first and regret most.

The assistant is checked by a separate evaluator that asks structural questions of each response: was a single point of failure flagged, was a relevant decision record cited, did it ask about traffic volume before recommending storage? Mechanical checks such as valid structure and required fields run in ordinary deterministic code; only the genuinely semantic judgments use a model as the judge.
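A sketch of that split between mechanical and semantic checks follows. The `judge` parameter is a placeholder for whatever model-backed evaluator is in use, the `ADR-<number>` ID format is an assumption, and skipping the judge when mechanical checks fail is one possible design choice rather than a requirement.

```python
import re
from typing import Callable

ADR_ID = re.compile(r"ADR-\d+")  # assumed decision-record ID format
REQUIRED_FIELDS = ("clarifying_questions", "risks", "recommendation")


def mechanical_checks(response: dict) -> dict[str, bool]:
    """Deterministic checks: no model involved, same answer every run."""
    risks = response.get("risks")
    if not isinstance(risks, list):
        risks = []
    cited = [
        adr_id
        for risk in risks
        if isinstance(risk, dict)
        for adr_id in risk.get("adr_ids", [])
    ]
    return {
        "has_required_fields": all(f in response for f in REQUIRED_FIELDS),
        "has_ranked_risks": len(risks) > 0,
        "cites_decision_record": any(
            isinstance(a, str) and ADR_ID.fullmatch(a) for a in cited
        ),
    }


def evaluate(response: dict, judge: Callable[[str, dict], bool]) -> dict[str, bool]:
    results = mechanical_checks(response)
    # Only spend a judge call on responses that already pass the cheap checks.
    if all(results.values()):
        results["flags_single_point_of_failure"] = judge(
            "Does the review flag a single point of failure?", response
        )
    return results
```

Keeping the deterministic checks separate means a schema regression shows up as a plain failing test, without any judge variance mixed into the signal.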

One failure mode is structural: models tend to agree with a confidently stated plan, and that behavior is measured and traceable to how they are trained, so it survives across model generations and has to be designed around rather than wished away. [1] A second is invisible to classic monitoring: the confidently wrong answer that throws no error and fires no alert looks identical to a correct one from the outside.

The failure that bites hardest at scale is different in kind: evaluation passes while production degrades. A golden dataset built from hand-selected examples will systematically miss the inputs real users send, so quality metrics stay green in CI while users experience something different. The only reliable fix is continuous sampling from production traffic to expand the evaluation set, an ongoing operational process rather than a one-time exercise. The operational lesson across all three is to separate generation from evaluation, keep mechanical checks deterministic, and treat a measured drop in output quality as an incident rather than a tuning task. Part 4 covers golden datasets, trace collection, metrics, and service-level objectives for quality.
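Continuous sampling from production can start with something as simple as reservoir sampling over the trace stream, which keeps a fixed-size uniform sample without knowing the stream length in advance. This is a generic sketch of that algorithm (Algorithm R), offered as one way to do the sampling; the function name and the idea of feeding it trace records are assumptions, and sampled traces would still need labeling and review before joining the evaluation set.

```python
import random
from typing import Iterable, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> list[T]:
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    sample: list[T] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = item     # replace with probability k / (i + 1)
    return sample
```

Passing an explicit `random.Random` instance keeps the sampling reproducible when a seed is supplied, which helps when an evaluation run needs to be repeated.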

Operations: Cost, Routing, And Reliability

Every design choice in the layers above resolves into three operational quantities: latency, cost, and reliability. These pull against each other, and pushing hard on one usually costs another, but this is a tension to engineer around rather than a fixed law that forces you to pick two. Better engineering moves the whole frontier. This is where those architectural choices become visible in production.

The assistant applies the standard levers. A small, fast model handles routing and simple classification, and the larger model is reserved for the review step that genuinely needs it. [2] Stable parts of the prompt are cached so only the per-request tail is recomputed. When a model call times out or returns unparseable output, a defined fallback path runs instead of failing the request. Cost here is more than the metered token bill; the larger cost is owning the system - the cognitive load of every added component, the ramp time for new engineers, and the maintenance burden of debugging non-deterministic behavior.
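Two of those levers, routing by task and a defined fallback path, can be sketched together. The model identifiers, the `call_model` callable, and the assumption that it raises `TimeoutError` on a timeout are all hypothetical; a real client library will have its own exception types that would need mapping.

```python
import json
from typing import Callable

SMALL_MODEL = "small-fast"      # hypothetical model identifiers
LARGE_MODEL = "large-reviewer"


def choose_model(task: str) -> str:
    """Reserve the large model for the step that needs it."""
    return LARGE_MODEL if task == "review" else SMALL_MODEL


def review_with_fallback(
    proposal: str,
    call_model: Callable[[str, str], str],
    fallback: Callable[[str], dict],
) -> dict:
    """Return the model's review, or a clearly marked degraded result."""
    try:
        raw = call_model(choose_model("review"), proposal)
        result = json.loads(raw)  # unparseable output raises ValueError
        if not isinstance(result, dict):
            raise ValueError("expected a JSON object")
    except (TimeoutError, ValueError):
        return {**fallback(proposal), "degraded": True}
    return {**result, "degraded": False}
```

Marking the result as `degraded` lets callers and dashboards distinguish a fallback answer from a full review, so the fallback rate can be tracked like any other reliability metric.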

The failure to plan for is the pipeline that looks free in testing and is unaffordable at production scale. An equally avoidable failure is a pipeline with no degraded mode: a single provider timeout propagates into a user-facing outage because no one decided in advance what the system should do when a model call fails. Resilience requires a decision, made at design time, about what acceptable degraded behavior looks like. The operational lesson is to build a cost model and a failure budget before you build the pipeline. Part 5 turns these trade-offs into concrete decisions about routing, caching, decomposition, and degraded modes.
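A cost model does not need to be elaborate to be useful before the pipeline exists. The sketch below multiplies per-call token counts by per-million-token prices and by request volume. The class and function names are made up for this example, and every number in the usage lines is a placeholder, not a real price or a measured token count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallProfile:
    """Token usage and prices for one model call in the pipeline."""
    input_tokens: int
    output_tokens: int
    input_price_per_mtok: float   # currency units per million input tokens
    output_price_per_mtok: float  # currency units per million output tokens

    def cost(self) -> float:
        return (
            self.input_tokens * self.input_price_per_mtok
            + self.output_tokens * self.output_price_per_mtok
        ) / 1_000_000


def monthly_cost(calls: list[CallProfile], requests_per_day: int, days: int = 30) -> float:
    """Cost of running every call in the pipeline once per request."""
    return sum(call.cost() for call in calls) * requests_per_day * days


# Placeholder numbers for illustration only.
routing = CallProfile(500, 20, 0.25, 1.25)
review = CallProfile(6_000, 800, 3.0, 15.0)
print(round(monthly_cost([routing, review], requests_per_day=10_000), 2))
```

Running the same function with test-scale volume and then with expected production volume is a quick way to see the gap between a pipeline that looks free and one that is not.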

None of this is maintainable without clear ownership, and the questions are concrete. Who reviews a change to the system prompt? Who owns rollback when output quality degrades? Who is paged when a quality objective is breached? Who approves adding a tool to an agent's available actions? These do not need a new process; they fit the ones you already run. A prompt change is a change to a contract, so it goes through code review and change management. Quality degradation is an objective with an owner and an on-call rotation. Tool access is an authorization decision, reviewed like any other access grant. Without that ownership, scope creeps, quality degrades unnoticed, and changes ship unreviewed. These are not hypothetical risks - they are the sequence of events in most teams that built something fast and discovered six months later that nobody could explain why the output had changed.

What To Take From This

Prompt quality matters, and a well-crafted prompt is still the cheapest reliability improvement available. But the prompt is one layer of five, and the other four are where production systems are won or lost. The teams shipping AI features that keep working are not the ones who found the best phrasing. They are the ones who put a contract around the model, controlled what it sees, bounded what it can do, measured what it produces, and operated it under real cost and failure pressure. Those five disciplines are ordinary engineering applied to a new and unusually unpredictable component, and they remain true when the model underneath them is replaced.

The Five Deep Dives

References & Notes

  1. Sharma et al. (2023). Towards Understanding Sycophancy in Language Models. Anthropic. arXiv:2310.13548
  2. Chen, Zaharia & Zou (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176

Canonical overview of the series Building Durable AI Systems. The models keep changing. The engineering is the part you keep.

Disclaimer: The views and opinions expressed here are my own and are shared for educational and discussion purposes. They do not represent the views of any past, present, or future employer, client, or organization.

Continue The Conversation

If you're working on AI systems, data platforms, databases, or large-scale software architecture, I'd be interested to hear what you're building.

LinkedIn: Rathish Kumar B
