When NOT to Use AI in an AI System

A vendor who calls everything AI-powered may be adding cost without adding capability. The question is how to tell the difference.

This is not a hypothetical concern. AI system design has an incentive problem: LLM calls look impressive in demos, generate more billable complexity, and are easy to show in a slide. A system built by someone optimizing for demo impressiveness looks very different from a system built by someone optimizing for production cost and reliability.

OpsForge is a 61-agent platform built with that production constraint in mind. 70% of its execution path is deterministic code with no LLM calls at all. Most of those 61 agents are schedulers, validators, routers, and formatters. They run if/else and for loops. Zero AI. The same principle runs through a construction estimation system that keeps AI in the judgment layer while deterministic code handles all data routing and formatting.

This is not a design oversight. It is the design, and the economics behind it are not subtle.

The Counter-Intuitive Design Principle

The 61 agents in OpsForge include a lot of things that do not require intelligence. Routing a structured JSON record to the correct processor requires reading one field and applying a condition. Validating that a required field is non-null before passing data downstream requires a single check. Formatting an output record to match a schema requires string substitution and field mapping. None of this needs a language model.
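To make that concrete, here is a minimal sketch of a validation step and a formatting step. The field names (`event_id`, `service_name`, `severity`) and the output keys are assumptions for illustration; the article does not publish OpsForge's actual schema.

```python
from typing import Any

# Hypothetical required fields; the real schema is not given in this article.
REQUIRED_FIELDS = ("event_id", "service_name", "severity")


def validate(record: dict[str, Any]) -> list[str]:
    """Return the names of required fields that are missing or null."""
    return [name for name in REQUIRED_FIELDS if record.get(name) is None]


def format_output(record: dict[str, Any]) -> dict[str, str]:
    """Map input fields onto an assumed downstream schema."""
    return {
        "id": str(record["event_id"]),
        "service": record["service_name"].strip().lower(),
        "summary": f"[{record['severity'].upper()}] {record['service_name']}",
    }


record = {"event_id": 17, "service_name": "Billing-API", "severity": "warning"}
missing = validate(record)
formatted = format_output(record) if not missing else None
```

Both functions are pure: the same record always produces the same result, and there is nothing in either one for a model to add.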

The cost difference is not abstract. A deterministic routing step costs essentially nothing: server CPU, fractions of a cent per thousand executions. The same routing logic implemented as an LLM call costs $0.10 to $0.20 per execution at current GPT-4 pricing. At 1,000 executions per day, that is $100 to $200 per day, or $3,000 to $6,000 per month, for a decision that a single if/else handles with zero hallucination risk and zero latency overhead.

OpsForge processes approximately 2,400 operational events per day across its 61-agent system. If every routing and validation step used an LLM call, the daily cost would land between $240 and $480. With deterministic logic handling 70% of steps, actual cost stays under $30 per day for identical throughput. That is roughly a 10x cost difference: at least 8x at the low end of that range and 16x at the high end.
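The arithmetic behind these figures is simple enough to check yourself. The sketch below works in whole cents to avoid floating-point rounding, and it assumes one LLM call per event and a 30-day month, neither of which the text states outright.

```python
def daily_cost_cents(calls_per_day: int, cents_per_call: int) -> int:
    """Daily spend in cents if every execution makes one LLM call."""
    return calls_per_day * cents_per_call


def dollars(cents: float) -> str:
    """Format a cent amount as whole dollars."""
    return f"${cents / 100:,.0f}"


LOW_CENTS, HIGH_CENTS = 10, 20   # $0.10 to $0.20 per call, from the text
DAYS_PER_MONTH = 30              # assumption: 30-day month

# Generic routing step: 1,000 executions per day.
routing_day = [daily_cost_cents(1_000, c) for c in (LOW_CENTS, HIGH_CENTS)]
routing_month = [d * DAYS_PER_MONTH for d in routing_day]

# OpsForge volume: 2,400 events per day, assuming one call per event.
all_llm_day = [daily_cost_cents(2_400, c) for c in (LOW_CENTS, HIGH_CENTS)]

ACTUAL_CEILING_CENTS = 3_000     # "under $30 per day", from the text
ratio = [d / ACTUAL_CEILING_CENTS for d in all_llm_day]

print("routing per day:  ", [dollars(d) for d in routing_day])
print("routing per month:", [dollars(d) for d in routing_month])
print("all-LLM per day:  ", [dollars(d) for d in all_llm_day])
print("ratio vs. actual: ", ratio)
```

Swap in your own volume and per-call price. Because the $30 figure is a ceiling, the ratios printed here are lower bounds.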

The right question to ask any AI agent systems designer is not "where did you use AI?" It is: "where did you decide NOT to use AI, and why?"

If they cannot answer the second question specifically, they have not designed the system. They have assembled a demo.

When evaluating a proposal, ask for cost projections at the volume you actually run. "AI-powered routing" is a feature description. "$0.15 per routing decision at 5,000 decisions per day" starts a different conversation.

The Economics in Practice

Given those numbers, the question is why anyone would route every step through an LLM. Often the answer is that they were not thinking about production cost at all. They were thinking about demo impressiveness.

The incentive structure for AI vendors points toward overuse. LLM calls look impressive in demos. They are easy to show in a slide. They generate more billable complexity. A vendor who describes every step as "AI-powered" is not necessarily building a better system. They may be building a more expensive one.

A vendor who cannot separate the steps that genuinely need inference from the steps that do not has not thought carefully about your production environment. They have thought about their own revenue and their demo.

Decision Framework: LLM vs Deterministic

The boundary is not arbitrary. There is a consistent pattern that separates tasks that need a language model from tasks that do not.

Use LLM when:

The input is unstructured natural language (email, document, support ticket, chat message)
The output requires understanding context, nuance, or intent that cannot be expressed as rules
The task cannot be fully specified in advance because the edge cases are too numerous or too unpredictable
The acceptable output space is large and difficult to enumerate
Judgment is required, not pattern-matching

Use deterministic code when:

The input is structured data: JSON, a database record, a CSV row
The output is a routing decision, a validation check, or a transformation with a known output space
The same input should reliably produce the same output every time
The task can be fully expressed as conditions and rules
Speed or cost requirements make LLM latency unacceptable

The working heuristic: if a competent developer could write a 20-line function to handle this case reliably, write the function. Reserve LLM calls for tasks where that 20-line function would need to be 2,000 lines and would still miss edge cases.
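One way to keep this checklist honest is to write it down as code and run every proposed step through it. The sketch below is one possible encoding, not a canonical one: the `TaskProfile` fields, the `recommend` function, and the "review" outcome for conflicting signals are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Answers to the checklist questions for one pipeline step."""
    structured_input: bool          # JSON, database record, CSV row
    enumerable_outputs: bool        # small, known output space
    expressible_as_rules: bool      # a short function could handle it reliably
    must_be_repeatable: bool        # same input must always give same output
    latency_or_cost_critical: bool  # LLM latency or price is unacceptable


def recommend(task: TaskProfile) -> str:
    """Return 'deterministic', 'llm', or 'review' for one step."""
    if task.expressible_as_rules or (task.structured_input and task.enumerable_outputs):
        return "deterministic"
    if task.must_be_repeatable or task.latency_or_cost_critical:
        # LLM-shaped task with deterministic-shaped constraints: needs a human decision.
        return "review"
    return "llm"


severity_routing = TaskProfile(True, True, True, True, True)
free_text_classification = TaskProfile(False, True, False, False, False)
stakeholder_summary = TaskProfile(True, False, False, False, False)

for name, task in [
    ("severity routing", severity_routing),
    ("free-text classification", free_text_classification),
    ("stakeholder summary", stakeholder_summary),
]:
    print(f"{name}: {recommend(task)}")
```

The value is less in the function than in the exercise: filling in a `TaskProfile` for each step forces the design conversation that an "AI-powered" label skips.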

Most routing and validation steps in most systems fall cleanly on the deterministic side of this line. Most systems built by AI-first vendors do not respect that line.

A Concrete Example: Two Steps from OpsForge

Walk through two adjacent steps in OpsForge to see how this decision gets made in practice.

Step 1: Receiving an alert from an infrastructure monitor.

An upstream monitoring service sends a JSON payload with severity (one of three values: critical, warning, info), service_name, and description. Based on severity, the alert routes to a different processing queue.

This is deterministic. The input is structured. The output space is three options. The routing rule fits in four lines of code. An LLM call here costs money, adds latency, and introduces the possibility of a wrong route. A switch statement costs nothing, resolves in microseconds, and produces the same result every time.
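A routing rule of this kind can be sketched as a lookup table. The three severity values come from the payload described above; the queue names and the dead-letter default are hypothetical.

```python
# Queue names are hypothetical; the three severity values come from the text.
QUEUE_BY_SEVERITY = {
    "critical": "queue.page_oncall",
    "warning": "queue.triage",
    "info": "queue.archive",
}


def route_alert(alert: dict) -> str:
    """Pick a queue from the severity field; anything unexpected goes to a dead-letter queue."""
    return QUEUE_BY_SEVERITY.get(alert.get("severity"), "queue.dead_letter")


print(route_alert({"severity": "critical", "service_name": "prod-db-02"}))
```

The explicit default matters: an unknown severity lands somewhere visible instead of being silently guessed at.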

Step 2: Classifying the alert's meaning.

The description field is a free-text string generated by various monitoring tools in various formats. One tool writes "Disk utilization exceeded 90% threshold on prod-db-02." Another writes "WARN: storage nearing capacity (prod-db-02, 91%)." Same incident. Two different phrasings, and there are dozens of monitoring tools with dozens of conventions.

Routing by keyword matching requires a brittle ruleset that breaks every time a monitoring tool changes its format. That 20-line function becomes 200 lines and still misses edge cases. This is where the LLM call earns its cost: reading the description, identifying the incident type, and returning a structured classification that downstream agents can process reliably.
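The brittleness is easy to reproduce with the two alert strings quoted above. In this sketch, `keyword_classify` is a deliberately naive rule, and `llm_classify` takes a `complete` callable as a stand-in for whatever model client a real system would use. The prompt wording, the `incident_type` and `host` keys, and `fake_model` are all assumptions. `fake_model` returns a canned response so the example runs offline, so it shows the shape of the step, not a model's actual behavior.

```python
import json
from typing import Callable

ALERT_A = "Disk utilization exceeded 90% threshold on prod-db-02."
ALERT_B = "WARN: storage nearing capacity (prod-db-02, 91%)."


def keyword_classify(description: str) -> str:
    """Naive rule: only works for tools that happen to say 'disk'."""
    return "disk_capacity" if "disk" in description.lower() else "unknown"


def llm_classify(description: str, complete: Callable[[str], str]) -> dict:
    """Ask a model for a structured label.

    `complete` takes a prompt and returns the model's text response.
    """
    prompt = (
        "Classify this monitoring alert. Reply with JSON only, shaped as "
        '{"incident_type": "<snake_case label>", "host": "<hostname or null>"}.\n'
        f"Alert: {description}"
    )
    return json.loads(complete(prompt))


def fake_model(prompt: str) -> str:
    """Canned response so the sketch runs without network access; not a real model."""
    return '{"incident_type": "disk_capacity", "host": "prod-db-02"}'


print(keyword_classify(ALERT_A))  # matches the keyword
print(keyword_classify(ALERT_B))  # same incident, different wording, rule misses it
print(llm_classify(ALERT_B, fake_model))
```

Adding "storage" to the keyword list fixes this one case and leaves the next monitoring tool's phrasing unhandled, which is how the ruleset grows without ever becoming reliable.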

The boundary between these two steps is the practical application of the framework. Step 1 has a fully specifiable input-output mapping. Step 2 does not. Both steps are "AI system" steps because they live inside a 61-agent platform. Only one of them actually uses AI.

The Hard Part: Why the 30% Requires Real Engineering

The "70% deterministic" framing occasionally leads to a follow-on assumption: if most of this is just code, maybe it is not that hard to build. That assumption is wrong in a specific and costly way.

The deterministic 70% is straightforward once designed. The hard problem is the boundary itself. Deciding which 30% genuinely needs an LLM, and which tasks only look like they need one, is an engineering judgment that requires production experience. Most teams only learn it by getting it wrong first and paying for the mistake.

Beyond the classification decision, the handoff architecture between deterministic and LLM components carries its own complexity. A deterministic step feeding into an LLM step requires the data formatted, validated, and scoped so the model gets what it needs without excess tokens inflating cost. An LLM step returning output to a deterministic step requires schema enforcement: the model's response has to be parsed, validated, and often repaired before downstream processing. These handoff layers are where production systems break. They are also where demos never break, because demos use clean inputs.
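The LLM-to-deterministic handoff described here can be sketched as a single parse, repair, and validate function. The label set in `ALLOWED_TYPES` and the `incident_type` key are hypothetical, and the repair step shown (extracting the outermost JSON object from surrounding prose) is just one common tactic, not a description of OpsForge's implementation.

```python
import json
import re

# Hypothetical closed set of labels that downstream code knows how to handle.
ALLOWED_TYPES = {"disk_capacity", "high_latency", "service_down", "other"}


def parse_model_output(raw: str) -> dict:
    """Parse, repair, and validate a model response before it goes downstream."""
    # Repair: models often wrap JSON in prose or code fences; keep the outermost {...}.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError is a subclass of ValueError, so callers catch one type.
    data = json.loads(match.group(0))

    # Validate: enforce the field type and the closed label set.
    incident_type = data.get("incident_type")
    if not isinstance(incident_type, str):
        raise ValueError("incident_type must be a string")
    normalized = incident_type.strip().lower()
    if normalized not in ALLOWED_TYPES:
        normalized = "other"  # unknown labels never reach downstream routing
    return {"incident_type": normalized}


print(parse_model_output('Here you go: {"incident_type": " Disk_Capacity "}'))
```

Everything after this function can stay deterministic, because it only ever sees one of four known strings or an exception.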

Then there is the failure mode problem. Deterministic code fails loudly and predictably: an exception fires, a condition falls through to a default handler, an error log writes. LLM components fail differently. The model returns a confident-sounding response that does not match the expected schema. The model interprets an ambiguous prompt in an unexpected direction. The API rate-limits at peak load and the retry logic cascades into a queue backup. Each of these requires specific handling: schema validation, confidence thresholds, circuit breakers, fallback paths. Building those layers is the engineering work that separates a production system from a demo.
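As a rough illustration of two of those layers, the sketch below combines a circuit breaker with a fallback path. It is deliberately minimal: the `CircuitBreaker` class, its `max_failures` parameter, and the fallback message are assumptions, and a production breaker would also need a timed reset (a half-open state) that is omitted here.

```python
from typing import Callable


class CircuitBreaker:
    """Stop calling a failing dependency after `max_failures` consecutive errors."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.max_failures

    def call(self, primary: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.is_open:
            return fallback()  # skip the model entirely while the breaker is open
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            return fallback()
        self.consecutive_failures = 0  # any success closes the breaker again
        return result


def rate_limited_model() -> str:
    """Simulates an LLM API that is rate-limiting or timing out."""
    raise TimeoutError("simulated rate limit")


def human_triage() -> str:
    """Deterministic fallback path."""
    return "unclassified: route to human triage"


breaker = CircuitBreaker(max_failures=2)
results = [breaker.call(rate_limited_model, human_triage) for _ in range(4)]
print(results, "breaker open:", breaker.is_open)
```

Once the breaker opens, the failing API stops receiving retries, which is what keeps a rate limit from turning into a queue backup.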

The 30% that uses LLMs is not the complicated part because AI is magic. It is the complicated part because the failure modes are different from anything a deterministic system encounters, and handling them correctly requires building infrastructure that most vendors skip.

Where AI Belongs in OpsForge (30% of the System)

The 30% of OpsForge that does use LLM calls uses them for tasks where deterministic logic genuinely cannot do the job.

Incident classification. Incoming alerts arrive as unstructured text from multiple monitoring sources. Categorizing them by severity, type, and affected system requires reading context across a variable-length description. A keyword matching rule set would miss edge cases constantly and require ongoing maintenance as alert formats changed. The LLM call earns its cost here.

Runbook generation. When a new incident type occurs without a documented resolution path, an LLM drafts a candidate runbook based on similar past incidents and the system's documentation. This is a genuine generation task. There is no deterministic path from "novel incident type" to "structured response procedure."

Post-incident summaries. Synthesizing execution logs, actions taken, and resolution timeline into a readable stakeholder summary is exactly what language models are good at. The input is structured. The output requirement is unstructured. LLM call justified. Another clean example of scoped AI use: a tax deed pipeline using Claude Vision only for document parsing, with all routing and validation handled deterministically.

Anomaly explanation. When a metric deviates from baseline, a deterministic alert fires first (that is Layer 1 monitoring, not AI). A secondary LLM step then generates a plain-language explanation of what the anomaly likely means, pulling context from historical incident patterns. Humans can act on the explanation without reading raw log data.
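The two-layer shape of that step can be sketched as follows. The z-score test is an assumption (the article does not say how OpsForge defines deviation from baseline), and `explain` is a stand-in for a model call; the real step would also pass in historical incident context, which is left out here.

```python
from statistics import mean, stdev
from typing import Callable, Optional


def is_anomalous(history: list[float], value: float, z_threshold: float = 3.0) -> bool:
    """Deterministic layer: flag values far from the recent baseline.

    `history` needs at least two points for a standard deviation.
    """
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


def check_metric(
    name: str,
    history: list[float],
    value: float,
    explain: Callable[[str], str],
) -> Optional[str]:
    """Call the model only after the deterministic check has already fired."""
    if not is_anomalous(history, value):
        return None  # no alert, no LLM call, no cost
    prompt = (
        f"Metric {name} is {value}; recent baseline mean is {mean(history):.1f}. "
        "Explain the likely cause in plain language."
    )
    return explain(prompt)


def stub_explainer(prompt: str) -> str:
    """Placeholder for a real model call so the sketch runs offline."""
    return "stub explanation"


baseline = [50.0, 52.0, 49.0, 51.0, 50.0]
print(check_metric("cpu_percent", baseline, 51.0, stub_explainer))
print(check_metric("cpu_percent", baseline, 95.0, stub_explainer))
```

The ordering is the point: the cheap check decides whether anything happened, and the model is only asked to explain it.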

Everything else in OpsForge runs deterministically. Routing, validation, formatting, scheduling, cost tracking, status updates, retry logic. All of it is code. None of it needs inference.

The pattern: LLMs handle the inputs that cannot be pre-specified. Deterministic code handles everything that can be.

How to Evaluate an AI Proposal for Over-Engineering

Five signals that a proposed AI system is putting LLM calls where deterministic logic would work better:

1. Every step is described as "AI-powered." If the vendor cannot tell you which steps are deterministic and which use LLMs, and why, they have not designed the system. They have assembled a demo and named everything AI.

2. The proposal does not include cost projections at scale. Any proposal that cannot show per-step cost at your production volume has not been designed for production. "AI-powered routing" sounds good at a kickoff meeting. "$0.15 per routing decision multiplied by 5,000 decisions per day" starts a different conversation. Ask specifically: what does each LLM-powered step cost per execution, and how many executions happen per day?

3. No fallback logic for LLM failures. What happens when the API is rate-limited? When the model returns an invalid response? Deterministic systems fail predictably and recover cleanly. LLM-heavy systems can cascade. The cascade risk matters more than it looks: a system with no fallback for LLM failures is a system where one vendor outage becomes your outage. Ask specifically what happens at each LLM call point when the call fails, and follow up by asking how many consecutive failures it takes before the system stops processing.

4. No debugging story. With deterministic routing, you can trace exactly why a specific record went to a specific processor. The audit trail is the code. With LLM-heavy systems, "the model decided" is not an audit trail. Ask how they would explain a routing decision to you or to a regulator six months from now. If the answer is "we would look at the logs," ask what the logs actually contain.

5. The demo uses clean, curated data. Production systems encounter malformed records, unexpected field encodings, missing required values, and edge cases that were not in any training distribution. Ask to see the system handle a bad input. Specifically: what happens when a required field is null, when a date field contains a string, when an upstream API returns a 200 with an empty body? Vendors who have built for production can answer these questions with specifics. Vendors who have built for demos will redirect to the roadmap.

If a proposal passes all five of these checks, the LLM-to-deterministic ratio will probably be defensible. If it fails two or more, the system is likely over-engineered with AI calls that add cost and fragility without adding capability.
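The bad-input questions in signal 5 translate directly into test cases you can ask a vendor to run. This sketch shows what a production-minded ingest step might do with each one. The function name `parse_event` and the fields `service_name` and `occurred_at` are hypothetical.

```python
import json
from datetime import datetime


def parse_event(status_code: int, body: str) -> dict:
    """Turn an upstream HTTP response into a validated event, or raise ValueError."""
    if status_code != 200:
        raise ValueError(f"upstream returned {status_code}")
    if not body.strip():
        raise ValueError("200 response with an empty body")
    event = json.loads(body)  # JSONDecodeError is a ValueError subclass
    if not isinstance(event, dict):
        raise ValueError("payload is not a JSON object")
    if event.get("service_name") is None:
        raise ValueError("required field service_name is null or missing")
    try:
        event["occurred_at"] = datetime.fromisoformat(str(event.get("occurred_at")))
    except ValueError as exc:
        raise ValueError("occurred_at is not an ISO 8601 timestamp") from exc
    return event


probes = [
    (200, ""),                                                        # 200 with empty body
    (200, '{"service_name": null, "occurred_at": "2024-05-01T12:00:00"}'),  # null required field
    (200, '{"service_name": "db", "occurred_at": "next tuesday"}'),   # string in a date field
]
for status, body in probes:
    try:
        parse_event(status, body)
    except ValueError as err:
        print("rejected:", err)
```

Each probe fails with a specific, loggable reason. A system that cannot produce that kind of answer for its own inputs has probably only ever seen demo data.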

What This Means for Founders Evaluating AI Vendors

The central question to bring to any AI system evaluation is not "does it use AI?" It is: "does it use AI where AI is actually required?"

A system that routes structured JSON records through an LLM is spending your money on inference for a decision that a lookup table handles reliably. A system that uses an LLM to classify unstructured incident text is spending that money correctly.

The vendor who builds the second way is probably more expensive upfront. The design takes more thought. Separating LLM-appropriate tasks from deterministic-appropriate tasks requires understanding both what the system needs to do and what the cheapest reliable path to each outcome actually is.

The evaluation comes down to three questions, each with a clean answer. Ask which steps are deterministic and which use LLMs. Ask the cost per execution at your volume. Ask what happens when each LLM call fails. A well-designed system has precise answers to all three. A demo-optimized system does not, because those questions only matter in production.

Want the decision framework with cost modeling? The Production AI Agent Architecture Template includes the full LLM/deterministic boundary map, cost modeling templates for your specific workflow volume, and the 4-layer monitoring system used in OpsForge. Book a free diagnostic call to get the template and scope your specific situation.
