How We Built a Modern Observability Pipeline in Three and a Half Weeks


For most of this industry's history, observability at scale meant a choice between a six-figure vendor bill and a dedicated team to run it yourself. The unspoken rule was buy it or suffer.
That rule has quietly stopped being true. We built a fully owned, production-grade observability pipeline — logs, metrics, and traces — in-house. One engineer, three and a half weeks, from zero lines of code to a fully migrated, team-trained system. It now runs our entire fleet at 100% sampling, no downsampling, around 5 billion samples a month, on three small containers — at a fraction of what the vendor equivalent cost us.
This post is how we did it, how well it works, and what you get when you own the whole thing. The short version: this is now within reach of a small team. The fear that an observability pipeline is inherently a massive undertaking is mostly a myth, carried over from when it genuinely was.
Why we built it
After a couple of years of growth, our system had become a web of services in constant conversation, and the questions we most needed to answer — why is this request slow? where does the time go? — were the ones our setup couldn't answer. Plain-text logs with no structured querying couldn't connect activity across services, and the external tooling we leaned on for the rest grew more expensive with every service we added. The last straw was simple: someone asked why a particular run had taken so long, and we had no answer. That's not a position any company wants to be in.
Why this is possible now
Nothing in what we built is exotic. The shift is that the hard parts have been packaged into components a small team can assemble in weeks rather than build from scratch over quarters.
The building blocks are mature, battle-tested open source. They ship as Docker containers that handle the genuinely difficult parts — ingestion, storage, indexing, querying at scale — so the work that's left is assembly and configuration, not invention. Commodity cloud compute makes the runtime cheap. And modern AI coding tools compress the research-and-wiring time dramatically: most of the implementation was AI-assisted under close supervision, with the engineering judgment — which components, what topology, which tradeoffs — staying firmly human. The starting point wasn't "teach me everything," it was "here's a reference architecture I've seen, let's refine it." That alone saved days.
Put those together and a job that used to require a platform team and a quarter becomes one engineer and three and a half weeks.
How it's built
An observability pipeline has three independent pillars: logs, metrics, and traces. You can run any subset; the more telemetry you collect, the better decisions everything downstream can make. Here's the shape of ours.
We standardized on the Victoria ecosystem — VictoriaLogs for logs, VictoriaMetrics for metrics — with one deliberate exception: for OpenTelemetry traces we chose Tempo. Victoria does have a traces component, but at the time we built, it was still in alpha and not production-tested, so we took the battle-tested option. Because the architecture is modular, swapping that backend later is a backend-only change with zero application changes — exactly the kind of reversibility you want when you pick the safe option today.
Logs: the application's only job is stdout
This is the single most important design decision, and the one most worth stealing.
In a typical setup, each application uses a shared library to push its own logs to the backend. That works, but it puts a lot on the application — managing destinations, handling shipping, carrying a dependency on the logging backend — and that surface area is where problems tend to accumulate.
The rule we settled on instead: an application's only responsibility is to print logs to stdout. Everything after that belongs to the pipeline. A sidecar daemon — we use the Vector agent, the recommended choice in this ecosystem — watches stdout across every application on the machine, batches the lines, and ships them to VictoriaLogs. The app has no idea any of this is happening.
The payoff is that onboarding becomes nearly free. Anything that writes to stdout can join the pipeline with no code changes. Internally we keep collection opt-in, so the entire integration cost for a developer is setting a single flag.
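To make the stdout rule concrete, here is a minimal sketch of what the application side can look like, using only the Python standard library rather than structlog. The `fields` convention, the logger name, and the key names in the JSON payload are assumptions made for illustration, not our actual schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Structured context passed as extra={"fields": {...}} (a convention of this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, default=str)


def get_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger whose only destination is the given stream (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if not logger.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = get_logger("checkout")  # hypothetical service name
    log.info("order placed", extra={"fields": {"order_id": "o-123", "duration_ms": 42}})
```

Nothing here knows about a log backend: no destination URL, no shipping client, no retry logic. That absence is the point of the design.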
That sidecar layer is also the right place for PII handling. Something like an API key should be stripped as early as possible, ideally before it ever leaves the secured environment next to the app, so redaction lives in the agent. Because our set of sensitive fields is finite and enumerable (keys, customer info, names, passwords), we apply those rules universally, and in the large majority of cases application developers never think about PII at all.
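In our setup the redaction rules run inside the agent, not in application code. The Python below is only a sketch of the logic such rules express: mask a fixed set of field names and scrub anything that looks like a key from free text. The field names and the `sk_` token shape are hypothetical placeholders, not our real rules.

```python
import re
from typing import Any

# Hypothetical field names covering the categories named above
# (keys, customer info, names, passwords).
SENSITIVE_KEYS = {"api_key", "password", "customer_name", "customer_email"}

# Hypothetical token shape: "sk_" followed by 16 or more alphanumerics.
API_KEY_PATTERN = re.compile(r"\bsk_[A-Za-z0-9]{16,}\b")

REDACTED = "[REDACTED]"


def redact(event: Any) -> Any:
    """Return a copy of the event with sensitive fields and key-shaped strings masked."""
    if isinstance(event, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else redact(value)
            for key, value in event.items()
        }
    if isinstance(event, list):
        return [redact(item) for item in event]
    if isinstance(event, str):
        return API_KEY_PATTERN.sub(REDACTED, event)
    return event


if __name__ == "__main__":
    print(redact({"msg": "auth with sk_abcdefghijklmnop1234", "password": "hunter2"}))
```

Because the rule set is a short, enumerable list, it can be applied to every event from every application without per-service configuration.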
Metrics and traces: pushed directly, configured once
Metrics and traces work differently from logs: there's no sidecar, and the application pushes them to the backend directly. That wasn't a design preference so much as the natural ingestion path for the backends we chose, and it was never a problem. To keep it painless, we abstracted the repetitive setup into a shared library that bakes in our conventions for how traces and metrics should look. Onboarding is "use the library"; doing it manually is possible too, just a bit more configuration. For structured logging on the application side, we use structlog (Python) and Pino (JavaScript).
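As a rough idea of what "bakes in our conventions" can mean, here is a sketch of a tiny metrics helper. It is not our library: the mandatory `service` label and the function names are assumptions for illustration. It renders samples in the Prometheus text exposition format and posts them to VictoriaMetrics' `/api/v1/import/prometheus` endpoint; whether your applications push this way or through another protocol is a separate choice.

```python
import re
import urllib.request
from typing import Dict, List, Optional

# Metric names allowed by the Prometheus data model.
METRIC_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def _escape(value: str) -> str:
    """Escape a label value for the Prometheus text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(
    service: str, name: str, value: float, labels: Optional[Dict[str, str]] = None
) -> str:
    """Render one sample, always stamping the `service` label (a convention of this sketch)."""
    if not METRIC_NAME.match(name):
        raise ValueError(f"invalid metric name: {name!r}")
    # The service label is applied last so callers cannot override it.
    merged = {**(labels or {}), "service": service}
    label_str = ",".join(f'{k}="{_escape(str(v))}"' for k, v in sorted(merged.items()))
    return f"{name}{{{label_str}}} {value}"


def push(lines: List[str], base_url: str) -> None:
    """POST rendered samples to a VictoriaMetrics instance at base_url."""
    body = ("\n".join(lines) + "\n").encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/v1/import/prometheus", data=body, method="POST"
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()


if __name__ == "__main__":
    print(render_metric("checkout", "request_duration_seconds", 0.25, {"route": "/pay"}))
```

The value of a helper like this is less the HTTP call than the guarantee that every metric from every service carries the same labels and naming rules.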
How well it works
Here's where the system earns its keep.
We run 100% of logs, metrics, and traces with no downsampling whatsoever — currently around 5 billion samples a month, roughly 2,000 operations per second across the entire fleet. The whole thing rides on three backend containers, each on a t3.medium — a tiny instance by modern standards. They don't choke; spot-checking the dashboards, we're barely scratching the surface of the available headroom. Sampling is there if we ever want it, but we're nowhere near needing it.
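The two figures agree with each other, which a quick back-of-the-envelope check shows. This sketch assumes a 30-day month and treats "operations per second" as the average sample ingest rate; real traffic will peak well above the average.

```python
SAMPLES_PER_MONTH = 5_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds, assuming a 30-day month


def average_rate(samples: int, seconds: int) -> float:
    """Average samples per second over the period."""
    return samples / seconds


rate = average_rate(SAMPLES_PER_MONTH, SECONDS_PER_MONTH)
print(f"{rate:,.0f} samples/second on average")  # prints "1,929 samples/second on average"
```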
The cost makes the case plainly:
| | Self-hosted (ours) | What we replaced |
|---|---|---|
| Monthly cost | ~$200–300 | ~$10,000–15,000 |
| Coverage | Logs, metrics, traces | Errors + some traces/metrics |
| Scope | Entire fleet | A subset of services |
| Sampling | 100%, no downsampling | n/a |
To be fair to the comparison: the vendor we replaced for this is a genuinely good product, we still use it for parts of our system, and its price reflects real value. The point isn't that it's overpriced — it's that usage-priced observability grows with you, every new service and every traffic spike, until one day it's a line item large enough that you stop ignoring it. Self-hosting flips that curve: our cost tracks compute, not how much we observe.
What you get by owning it
The economics are the obvious win, but ownership pays off in ways the bill doesn't show.
- Full control and customization. It behaves exactly the way we want, captures exactly what we want, and routes data wherever we need. One example: we needed a downstream path from logs into Databricks. Because we'd designed for it from day one, when the "what about Databricks?" question came up after launch, the answer was already yes — and applications don't even know their output is being fanned out to multiple destinations. Zero application changes.
- Trivial onboarding. Log to stdout, set one flag, and you're in. New services join the pipeline with a configuration change rather than a project.
- 100% sampling as a baseline. Because we're not paying per event, we keep everything. That completeness is what makes the next benefit possible.
- AI as a first-class consumer. This is the one that compounds. The honest weakness of any in-house pipeline is consumption — vendors win on polished dashboards and friendly query UX.
Our answer was to skip building that for humans and point AI agents directly at the backends. Ask "why was this run slow?" and an agent pulls the relevant logs, traces, and metrics and analyzes them faster than a person could. The reason this works with no custom tooling is precisely that we used standard components — VictoriaLogs and LogsQL, Tempo, VictoriaMetrics — which models already know how to query out of the box. "AI-native" here didn't mean inventing something clever; it meant being boring enough that the AI already understood it. That's the opposite of bolting an integration layer onto a system that was never built for it.
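To show how little glue this needs, here is a sketch of the kind of call an agent (or a person) can make against VictoriaLogs' HTTP query endpoint, `/select/logsql/query`, which returns one JSON object per line. The `run_id` field name, the one-hour window, and the base URL are assumptions for illustration; substitute whatever fields your logs actually carry.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, run_id: str, window: str = "1h", limit: int = 1000) -> str:
    """Build a LogsQL query URL for recent logs mentioning one run."""
    if '"' in run_id or "\\" in run_id:
        raise ValueError("run_id must not contain quotes or backslashes")
    # LogsQL: a time filter plus a filter on the (hypothetical) run_id field.
    logsql = f'_time:{window} run_id:"{run_id}"'
    return f"{base_url}/select/logsql/query?{urlencode({'query': logsql, 'limit': limit})}"


def parse_json_lines(body: str) -> List[Dict]:
    """Parse a JSON-lines response body into a list of log entries."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_run_logs(base_url: str, run_id: str, window: str = "1h") -> List[Dict]:
    """Fetch and parse matching log entries from a VictoriaLogs instance."""
    with urlopen(build_query_url(base_url, run_id, window), timeout=10) as response:
        return parse_json_lines(response.read().decode("utf-8"))


if __name__ == "__main__":
    # 9428 is the default VictoriaLogs port; the run id is made up.
    print(build_query_url("http://localhost:9428", "r-42"))
```

Because the endpoint and query language are standard and publicly documented, a model can write queries like this unaided, which is what lets us skip a custom integration layer.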
The honest limits
We'd rather you build this with clear eyes than oversell it.
Scale is not our concern. The path to more throughput is obvious — bigger instances, more compute — and at current volume we have enormous headroom. If this pipeline ever pushes us back toward a vendor, the reason won't be that it couldn't keep up.
It'll be consumption: the data is all there, but querying it well takes knowing the query language, and not everyone wants to learn one the way our agents already have. Even then, the fix we'd reach for first is making what we have more usable — not throwing it away. And because everything we collect is standard and we keep 100% of it, dropping in a better consumption layer later is a change we're well positioned to make.
There's also a timing caveat worth stating plainly: this is the right move at scale. Very early on, a low-tier vendor plan gets you observability for almost nothing and lets you focus on your actual product. The signal to build is when the bill — or the lack of control — starts to bite. For us, that moment came when we simply couldn't answer a basic question about our own system. If you're there too, the answer you've been avoiding is more achievable than you think.



