Every CI/CD pipeline I built before working on LLM-powered apps relied on a simple contract: given the same input, the code produces the same output, and I can assert on that output. That contract breaks the moment an LLM is in the loop.

LLM outputs aren’t deterministic. Even at temperature=0, minor changes in model versions, system prompts, or tokenizer updates can shift the output. Testing strategies that work fine for a REST API or a data transformation job fall apart when “correct” is a spectrum rather than a boolean.

Here’s how I adapted.

The Core Problem

A traditional test looks like this:

```python def test_format_address(): result = format_address(“123 main st, springfield, il”) assert result == “123 Main St, Springfield, IL” ```

Deterministic. Binary pass/fail. Works great.

An LLM-based equivalent might be:

```python def test_summarize_article(): result = summarize(long_article_text) assert result == ??? # What do you even assert here? ```

You can’t assert exact string equality. The LLM might produce a slightly different summary every run, and both versions might be perfectly correct. But you also can’t just skip testing — a bad prompt change, a model update, or an accidental context truncation can silently produce garbage.

The answer is to split your tests into two categories: things you can test deterministically, and things you need to evaluate differently.

What You Can Still Test Deterministically

More than you’d think. I carve out a deterministic test suite that runs in the standard CI gate:

Input validation and preprocessing. The code that cleans, truncates, and formats input before it hits the LLM is pure logic. Test it normally.

Output schema validation. If your LLM is supposed to return structured JSON, you can assert that the output is valid JSON and matches your schema. Tools like Pydantic make this easy.

```python from pydantic import BaseModel, ValidationError

class SummaryResponse(BaseModel): summary: str key_points: list[str] confidence: float

def test_output_schema(): raw = call_llm(prompt) try: parsed = SummaryResponse.model_validate_json(raw) assert 0.0 <= parsed.confidence <= 1.0 assert len(parsed.key_points) > 0 except ValidationError as e: pytest.fail(f“LLM returned invalid schema: {e}”) ```

Latency and error rate bounds. Your pipeline should complete within a reasonable time window and not error out more than some threshold.

Prompt construction. The function that assembles the prompt is deterministic. Test that it produces the right structure and doesn’t accidentally truncate required context.

This layer catches a large class of real bugs: broken JSON parsing, prompt template regressions, API timeout handling, and invalid output shapes.

Eval-Driven Testing

For the non-deterministic parts, I use an eval-driven approach. The idea is to maintain a golden dataset — a set of representative inputs with expected output characteristics — and run it against the system periodically.

The evaluator doesn’t check for exact match. It checks properties:

Is the output on-topic?
Does it contain the required information?
Is it free of hallucinated facts?
Does it match the expected tone/format?

For automated evaluation at scale, I use an LLM-as-judge pattern: a second LLM call that scores the first one’s output against a rubric. I run this against the golden dataset on every PR that touches the prompt or model configuration. If any score drops by more than 0.5 points compared to baseline, the eval step fails and blocks the merge.

Deployment Strategies

Because model behavior can drift between versions, I treat LLM-powered features like any other risky change: gradual rollout.

Canary deployments work well here. Route 5% of traffic to the new model/prompt version, monitor the eval metrics and user feedback signals in production, and promote or roll back based on what you see.

Feature flags for model versions let you switch between model configurations without a code deploy. I store the model name, system prompt, and temperature as runtime configuration, not as hardcoded values in the source.

```bash

Feature flag check at runtime

MODEL_VERSION=$(get_feature_flag “llm_model_version” –default “gpt-4o”) SYSTEM_PROMPT=$(get_feature_flag “llm_system_prompt_v” –default “v3”) ```

Shadow mode is also useful for major changes: run the new version in parallel with the old, log both outputs, and diff them offline before routing real user traffic.

Monitoring in Production

Once the app is live, I track a handful of signals:

p95 latency — LLM calls have high variance. Mean latency looks fine while p95 is quietly climbing.
Token usage per request — a sudden spike often means a prompt template is including too much context.
Error rate by error type — distinguish between rate limit errors, timeout errors, and schema validation failures.
User feedback signals — thumbs down and regeneration requests are the ground truth that eval metrics can’t replace.

The Mindset Shift

The biggest change isn’t technical — it’s accepting that “does it work?” is no longer a binary question for the AI layer. You’re managing a distribution of outputs, not a deterministic function. Your job as a DevOps engineer is to make that distribution observable, to define acceptable bounds, and to catch when it shifts.

The deterministic parts of the system — the infrastructure, the APIs, the data pipelines — still deserve traditional testing. Don’t let the fuzzy parts infect the crispy parts. Keep those concerns separated, test each layer appropriately, and you’ll build something you can actually operate with confidence.