AI Agents for Infrastructure: Speed Without Dropping Prod


Where AI earns its keep in infrastructure work – and where it will quietly delete your database if you let it.
Most AI-and-software talk is about application code, where you write self-contained logic over ground you control. Infrastructure is the opposite. You're an integrator: GitHub, AWS, a dozen third-party APIs, each with its own spec, its own documented behavior, and its own actual behavior – and the gap between the last two is where infra work actually lives. The documentation is a map; the territory has eventual consistency, rate limits, undocumented ordering constraints, and silent failure modes the map never mentions. This is a field report on using AI against that territory – the version that ships, survives an audit, and doesn't drop production. First half: what AI unlocks. Second half: keeping it from hurting you. Both are the job.
Part I – What AI unlocks
The job changes shape
Infra work is defined by breadth and depth. The breadth is the visible part: in one afternoon you touch Elasticsearch, write Python, drop into Bash, write Terraform, read three sets of docs you've never seen, and field requests arriving from every direction. But it doesn't stop there. The same week you might be debugging lock contention in a database, hooking eBPF onto syscalls to chase a performance regression, or cornering a race condition – diving deeper than you wanted to, more often than you'd like.
It's demanding work, spread across many domains at once. The skill is going as deep as the problem needs, then turning to something unrelated an hour later and doing that just as well. The hard part is holding a dozen systems at once and keeping the bar high on each. Every context switch carries a tax: reload the mental model, recall the API's quirks, remember which of three auth schemes this service uses.
That tax is what an LLM removes. It has effectively read all the documentation already, so the switching cost collapses toward zero – you describe the problem and review a proposal instead of paying the "spin up on Elasticsearch's query DSL" tax every time.
But notice what this does and doesn't do. The model becomes a fast first-draft generator across every system at once – but no judge of whether the solution fits your system. The judgment stays with you, because the model's failure mode is producing something locally correct (valid Terraform, valid Python) and globally wrong (doesn't match your network topology, ignores a constraint it couldn't know about).
So the team's center of gravity shifts. The old model – hire deep single-system expertise, and since no one holds all of it, hire many people over years – gives way to a smaller team whose scarce skill is the part no manual contains: systems thinking, knowing which questions to ask, orchestrating pieces into something coherent. The model does the legwork and fills skill gaps; only the human can tell when a locally-correct answer is globally wrong.
Test-driven development gets a second life
TDD has a reputation as slow and not worth it. That reputation was always a complaint about a single cost – writing the tests – and that cost just collapsed. But the deeper reason TDD and LLMs are a natural pair isn't economic, it's epistemic: a test is how you hold a non-deterministic collaborator accountable. When you ask a model to "fix this bug," you have no way to know if it actually did; when you ask it to make a specific failing test pass, the test is an objective, executable judgment that doesn't care how confident the model sounds. The test is the contract.
For bug fixes the loop is now nearly mechanical, and the order is load-bearing:
(1) A bug surfaces, say in an interaction with another API.
(2) The model writes a test that reproduces it, and the milestone is that the test fails. This is the step people skip, and skipping it is the mistake: a test you've never seen fail proves nothing, because a green test might be green only because it asserts nothing. You force the red first to prove the test actually has teeth, then you trust the green.
(3) The model writes the fix.
(4) The same test goes green, and "fixed" becomes a fact you can rerun on demand.
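The red-then-green order can be enforced mechanically. The sketch below is illustrative only: the bug (an API response that omits a `Retry-After` header) and every function name are hypothetical, invented to show the shape of the check rather than taken from a real incident.

```python
from typing import Callable


def parse_retry_after(headers: dict) -> int:
    """Hypothetical buggy version: assumes the header is always present."""
    return int(headers["Retry-After"])


def parse_retry_after_fixed(headers: dict, default: int = 1) -> int:
    """Hypothetical fix: fall back to a default when the header is missing."""
    value = headers.get("Retry-After")
    return int(value) if value is not None else default


def reproduces_bug(parse: Callable[[dict], int]) -> bool:
    """The regression test. True means the test passes, False means it fails."""
    try:
        return parse({}) == 1
    except KeyError:
        return False


def red_then_green(
    test: Callable[[Callable[[dict], int]], bool],
    before: Callable[[dict], int],
    after: Callable[[dict], int],
) -> bool:
    """Accept a fix only if the test failed before it and passes after it."""
    was_red = not test(before)
    is_green = test(after)
    return was_red and is_green
```

Calling `red_then_green(reproduces_bug, parse_retry_after, parse_retry_after_fixed)` accepts the fix. Passing the fixed function as both `before` and `after` is rejected, because a test that was never red proves nothing.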
Consistency is the whole game in infra, and it's why this matters more here than in app code. Even "managed" AWS is enormous in aggregate, and built by hand in the console it can't be reproduced – the next person, or you in six months, can't recreate what you clicked.
The sharpest version is environment parity. A sandbox is only useful if it's faithful to production. The whole point is to catch there what would break here, and a sandbox that diverges in some crucial way lies to you: it passes what production would fail. Code makes the two alike; tested code keeps them alike as both drift. This is why infrastructure-as-code matters from day one – at TinyFish nothing is configured by hand. Terraform declares the resources, Python codifies the logic and transitions, configs capture behavior – it's all code, and all of it can be tested. AI deepens that discipline: the binding constraint on test coverage was always author time, and that constraint is gone.
A worked example: the Lambda module
Abstract claims are cheap, so here's a concrete one. The pain was AWS Lambda: we needed functions all over the place for routine work – react to a lifecycle event, update DNS when an instance comes up – and we wanted each one to be a production workload, not just a deployed function. General-purpose modules exist, but they're built for every runtime and every combination – a big surface area and a lot of knobs. We wanted the opposite: a small, opinionated module that did three things well – simple, well-tested, compliant. Python on a current runtime, monitoring and compliance built in, the whole thing covered by real integration tests, and not much else to configure.
So "here's a piece of Python, deploy it as a Lambda function" was trivial to state and surprisingly hard to do consistently to that bar: packaging into a ZIP, wiring IAM tightly, tracking source changes so it redeploys, managing dependencies reproducibly, and standing up the monitoring stack – each a small decision that, made inconsistently across a codebase, becomes the thing that breaks at 2am.
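To make one of those small decisions concrete, here is a sketch of source-change tracking: a deterministic fingerprint of a source directory that changes whenever any file's path or contents change. It illustrates the general idea under assumptions of my own. The function name is hypothetical, and this is not a description of how the module itself detects changes.

```python
import hashlib
from pathlib import Path


def source_fingerprint(src_dir: str) -> str:
    """Return a SHA-256 hex digest over every file's relative path and bytes.

    Files are visited in sorted order so the result does not depend on
    filesystem listing order.
    """
    digest = hashlib.sha256()
    root = Path(src_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between path and contents
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

If the fingerprint of the current source matches the one recorded at the last deploy, nothing needs to be repackaged. If it differs, the function should be redeployed.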
The method that worked was running the model like a project, in phases, not firing a one-shot prompt:
- Spec first. Inputs, outputs, interfaces, edge behavior. The model drafted a plan; we reviewed, adjusted, and argued a few parts before converging. This phase exists to force scope disagreements to the surface before any code exists, when they're free to resolve. The document is a byproduct.
- Tests next, before the implementation. We test Terraform modules as integration tests that stand up real infrastructure and prove the pieces work together – not unit tests with everything mocked, because in infra the mocks hide exactly the integration behavior that breaks. Writing these before the implementation also kept the model honest: it couldn't quietly reshape the spec to match whatever code was easiest to generate.
- The tests earned their keep – this is the payoff. They surfaced behavior invisible from the console: in a single apply that stands up the bucket, uploads the package, and creates the execution role and its policies all at once, the just-written package isn't reliably readable for a second or two afterward. Whether it's S3 or a freshly-created permission still propagating, the error doesn't say – CreateFunction just fails intermittently, so a deploy that "worked" on a fast network breaks under load, the worst kind of bug. An integration test that deploys for real and invokes immediately hits the race every time and turns a future 2am page into a caught defect. The fix is unglamorous – wait until the object is actually readable before handing it to Lambda – and the kind you only trust once a test has proved it. In theory the API's contract and its behavior are the same; in practice they aren't. No prompt would have surfaced this; only real infrastructure under test does.
- Docs last. With the full arc in context – spec, implementation, tests – the model wrote documentation that actually explains intent: each variable's purpose, valid values, why a constraint exists. It can do this precisely because it carries the whole reasoning chain, which the human author usually no longer remembers by the time they get to the README.
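The "wait until the object is actually readable" fix from the testing phase can be sketched as a bounded polling loop. This is a minimal illustration, not the module's actual code. `wait_until_readable` and its parameters are hypothetical names, and the probe is injected as a callable so the sketch runs without AWS. In practice the probe might wrap an S3 HEAD request, for example boto3's `head_object`, which is an assumption on my part.

```python
import time
from typing import Callable


def wait_until_readable(
    is_readable: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll is_readable() until it returns True; return the attempt count.

    Raises TimeoutError if another wait would run past the deadline.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        if is_readable():
            return attempts
        if clock() + interval_s > deadline:
            raise TimeoutError(f"object not readable after {attempts} attempts")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the helper testable: a test can simulate an object that becomes readable on the third probe without actually waiting.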
Now the honest accounting, because this is what most people get wrong about AI productivity. The module took two to three days over a weekend – about the same wall-clock time as hand-writing similar modules before. If your metric is speed on the first build, AI did nothing.
The win was on two other axes:
- First, quality at equal cost: more robust, tested against edge cases manual work misses, documented to a standard nobody hits under deadline.
- Second, and larger, amortization: that was the last time anyone thought about Lambda packaging. Every subsequent function is "use the module," and ZIP archives, runtimes, dependency tracking, and change detection are solved once and retired from the team's attention budget forever.
That second axis quietly rewrites which work is worth doing. Infra requests arrive ad hoc – a one-off from another team, hard to rank against the roadmap – and the old calculus was rational: don't build the proper reusable thing for a single ask, hack it and move on, because the proper version cost a week you couldn't justify.
Lower the cost of "proper" toward the cost of "hack" and the calculus inverts: you build the real, tested, reusable version even for the one-off, because it's no longer meaningfully more expensive and it compounds. A team's velocity comes down to how much solved work it can stop thinking about. That's the real unlock.
Part II – Keeping it from hurting you
Everything in Part I assumes the model is trying to help you and mostly succeeding. The second half of the job is built on the opposite assumption: that the model will confidently do the wrong thing, and that your environment, not your trust, is what has to catch it.
The model will fabricate – unless you make fabrication impossible
We once had a transient bug in our self-hosted CI runners: a suspicious error, no obvious source. The first attempt to debug it with an LLM failed in the most instructive way possible. Asked what caused the bug, the model produced a clean, plausible, well-written root-cause story – and it was completely false. It had no access to what actually happened and, asked for an answer, it made one up. An LLM with no evidence doesn't say "I don't know." It says something that sounds like knowing.
That's the failure mode that matters in infra, because a confident wrong root cause is worse than no answer – it sends you fixing the wrong thing.
The fix is structural: deny it the option to speculate by making evidence the only available path. We gave the model read access to ground truth – CloudTrail, the live instances, the actual Terraform and Lambda code managing the runners – and changed the demand from explain this to prove it: make a claim, then show me the trail in the data that supports it. The behavior changed completely.
With real data reachable, the model stopped inventing and started querying – it knows how to pull events out of CloudTrail – and found the actual sequence: an instance was being deprovisioned, a Lambda function missed it, the orphaned instance still pulled a job from GitHub, tried to run it, and failed. A true root cause, reachable only because the data was reachable.
The general principle is worth stating carefully, because it inverts the usual complaint about these models. The instinct people have is to fight the model's eagerness to please – it's a liability when it invents answers. The better move is to redirect it.
That same eagerness, pointed at a standard of "nothing counts unless the data backs it," becomes the engine of correctness: the model wants to satisfy you, so make "satisfying you" mean producing evidence. Be a harder grader, and give it the materials to actually pass.
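One way to be the harder grader is to check claims mechanically: every statement in a proposed root cause must cite evidence, and every cited record must actually exist in the data. The sketch below assumes a simple structure I have invented for illustration. The `Claim` type, the function name, and the event IDs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One statement in a proposed root cause, plus the event IDs backing it."""
    statement: str
    evidence_ids: list[str] = field(default_factory=list)


def unsupported_claims(claims: list[Claim], known_event_ids: set[str]) -> list[Claim]:
    """Return claims that cite no evidence, or cite events not in the data."""
    rejected = []
    for claim in claims:
        cited = set(claim.evidence_ids)
        if not cited or not cited <= known_event_ids:
            rejected.append(claim)
    return rejected
```

A root-cause story is accepted only when `unsupported_claims` returns an empty list. A check like this only proves that the cited events exist, not that they support the statement, so a human still reads the trail.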
"A human is not a root cause" – and neither is an agent
There's an old incident-response principle that the AI era makes newly urgent. When a junior engineer drops the production database, the mature postmortem does not conclude "the engineer was careless." It concludes that a system permitted a routine human error to become a catastrophe – the missing permission boundary, the absent confirmation, the backup that was never tested. A human is not a root cause, because a root cause is something you can fix, and you cannot patch a person the way you patch a system. Blame ends the investigation exactly where it should begin.
Point this principle at AI and the conclusion is immediate and slightly uncomfortable: an agent is precisely the junior engineer who will, eventually, run the destructive command. Give a model root credentials and enough time and it will terminate the wrong instances, drop the database, or empty the bucket. It's fallible and fast – the most dangerous combination.
The stories you've seen about an agent deleting a database and its backups are about environments that granted a fallible actor irreversible power and were surprised when it used it. The lesson: engineer the blast radius down until a mistake can't be fatal – exactly as you would for a human you don't yet trust with prod.
Concretely, that means least privilege applied without the usual shortcuts. When the model was investigating the CI bug, it needed to read CloudTrail and read an S3 bucket – so that's all it got. No write. No delete. Investigation is a read-only activity, and the credentials should say so.
The seductive alternative – mint one static key, attach an admin policy, hand it to the agent so you never hit a permissions wall – is seductive precisely because it removes the friction that was protecting you. Every permission you grant for convenience is blast radius you've signed up for. Scope to the task, prefer read-only, and make irreversible actions require a human in the loop. The model is capable enough to do real damage at the speed of a tool call.
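As an illustration of "scope to the task, prefer read-only," the sketch below builds an IAM policy document for an investigation like the one described and flags any action that does not look read-only. The bucket name, function names, and prefix list are assumptions. The prefix check is a rough heuristic of my own, not an AWS rule, and some read actions are sensitive in their own right.

```python
# Heuristic only: action-name prefixes that usually indicate read access.
READ_ONLY_PREFIXES = ("Get", "List", "Describe", "Lookup", "Head")


def investigation_policy(bucket: str) -> dict:
    """Build a read-only policy: query CloudTrail, read one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["cloudtrail:LookupEvents"],
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            },
        ],
    }


def non_read_actions(policy: dict) -> list[str]:
    """Return allowed actions whose names do not start with a read prefix."""
    flagged = []
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            name = action.split(":", 1)[-1]  # "s3:GetObject" -> "GetObject"
            if not name.startswith(READ_ONLY_PREFIXES):
                flagged.append(action)
    return flagged
```

Run against the investigation policy, `non_read_actions` returns an empty list. Run against an admin-style policy with `"Action": "*"`, it flags the wildcard, which is the point at which a human should be asked.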
Standards stop being documentation and become guardrails
The two ideas above – demand evidence, constrain blast radius – are easy for a senior engineer who already has the instincts. The hard problem is organizational: you want the whole team wielding agents, and "junior engineer plus powerful agent" is the exact combination most likely to produce a confident, fast, well-formatted mistake. Worse, not every team even has senior infra leadership on hand to set the guardrails. So the real question is structural: how do you make the safe path the path of least resistance, for everyone, by default?
The answer is to write your standards as runtime instructions for agents. They've outgrown the wiki page. Every team already has a way it does things – how Terraform is tested, how Python is structured, which module creates a secret and how that secret's read permission is scoped. Historically that knowledge lived in senior people's heads and in wiki pages nobody opened.
The shift is to write it down as explicit, agent-readable standards and to ship those standards inside the repo, so they travel with the code and get loaded into context automatically whenever anyone works there. We manage these standard files with Terraform like everything else, which means provisioning a new repo provisions its guardrails in the same motion – there is no version of the repo that exists without them.
The effect is that correctness stops depending on who's holding the keyboard. "Create a secret and let this service read it" has a million possible implementations, most of them subtly insecure; with the standard in context, the agent uses the blessed module, scopes the permission correctly, and produces a pull request that looks like a senior engineer wrote it – because, encoded in the standard, one did.
The guardrails now live in the environment instead of in people's heads, which is the only version that scales: the wiki page nobody read becomes an executable constraint nobody can skip. This is the deepest shift in the whole picture: the reader of your standards is now an agent that does exactly what they say, every time – something no human team ever managed.
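What "loaded into context automatically" might look like, as a minimal sketch: gather the repo's standards files into one block of text for the agent, and refuse to proceed if none exist. The `standards/` directory, the `.md` extension, and the function name are all hypothetical. The real layout depends on the agent tooling in use.

```python
from pathlib import Path


def load_standards(repo_root: str, standards_dir: str = "standards") -> str:
    """Concatenate every standards file in the repo into one context string.

    Fails closed: a repo without standards raises instead of returning "".
    """
    root = Path(repo_root) / standards_dir
    if not root.is_dir():
        raise FileNotFoundError(f"no standards directory at {root}")
    sections = []
    for path in sorted(root.glob("*.md")):
        body = path.read_text(encoding="utf-8").strip()
        sections.append(f"## {path.name}\n{body}")
    if not sections:
        raise FileNotFoundError(f"no standards files in {root}")
    return "\n\n".join(sections)
```

Raising on a missing directory mirrors the idea that no version of the repo should exist without its guardrails: an agent session that cannot find the standards does not start.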
Where the two halves meet: compliance
Compliance is where productivity and safety stop being separate stories, because passing an audit requires both being secure and proving it – and an agent operating under the standards from Part II generates that proof as a byproduct of the work from Part I.
It shows up in three places:
- First, the agent builds compliant infrastructure by default: because secure configuration is encoded as the standard, encryption is the path the agent already takes, not a step someone has to remember – we've stood up services and found encryption already enabled. The default was correct.
- Second, it assembles audit evidence on demand: auditors run on artifacts – show that backups are encrypted at rest and in transit, that retention is set correctly – and walking a live environment to collect and summarize that is exactly the read-only, evidence-gathering task these models are best at.
- Third, and most valuable, the act of producing evidence finds the gaps: asked once to generate an encryption-posture report for a customer, the agent surfaced unencrypted volumes in a subsystem nobody had flagged. We fixed it, re-ran the report, and the second pass confirmed the fix – the same loop that proves compliance also closes the holes it finds.
So the loop closes on itself: build it secure, prove it's secure, and discover what isn't – each step feeding the next. Secure, reliable, compliant infrastructure, with the proof attached.
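The evidence-and-gaps loop can be sketched as a small report function. It assumes volume records shaped like the entries EC2's DescribeVolumes returns (a `VolumeId` and a boolean `Encrypted` field). The function name and the sample IDs are hypothetical, and the source does not say how the actual report was produced.

```python
def encryption_report(volumes: list[dict]) -> dict:
    """Summarize encryption posture and list every unencrypted volume.

    A record with no "Encrypted" field is treated as unencrypted.
    """
    unencrypted = sorted(
        v["VolumeId"] for v in volumes if not v.get("Encrypted", False)
    )
    return {
        "total": len(volumes),
        "unencrypted": unencrypted,
        "compliant": not unencrypted,
    }


# Hypothetical sample data standing in for a live inventory.
SAMPLE_VOLUMES = [
    {"VolumeId": "vol-aaa", "Encrypted": True},
    {"VolumeId": "vol-bbb", "Encrypted": False},
    {"VolumeId": "vol-ccc"},
]
```

Running the report, fixing what it lists, and running it again is the loop described above: the second pass is both the confirmation of the fix and the artifact handed to the auditor.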
The through-line
The unifying idea is that AI in infrastructure is a multiplier on a multiplier. Infrastructure is already leverage – it's the layer the rest of the company's productivity rests on – and pointing AI at the infrastructure itself multiplies that leverage again. But the two halves of this report are inseparable, and that's the actual lesson. The productivity is only real because the safety is engineered: an agent that fabricates root causes, or runs unconstrained against production, or quietly diverges from your standards, is a fast liability. What makes it a teammate is the structure around it – evidence as the only path to an answer, blast radius scoped below catastrophe, standards encoded as guardrails that travel with the code.
Nobody multiplies large numbers by hand anymore; we decided that was the computer's job and moved our attention up a level. Infrastructure is now making the same move. The mechanical work – reading the docs, writing the boilerplate, gathering the evidence – goes to the machine. What stays human is the judgment about what's worth building, the systems thinking to know when a locally-correct answer is globally wrong, and the discipline to build the environment that makes a powerful, fallible collaborator safe to work alongside. And the guardrails. Always the guardrails.



