Industry & Platforms

Repricing Retrieval: AWS Just Moved the Build-vs-Buy Line for Agent Workloads

May 31, 2026

When non-human traffic overtakes human traffic in the first half of 2027, AI platform teams still running self-managed vector search on always-on clusters will be carrying a structural cost penalty.

When non-human traffic overtakes human traffic in the first half of 2027 (a forecast that comes from Cloudflare rather than from any vendor's marketing), the AI platform teams still running self-managed vector search on always-on clusters will be carrying a structural cost penalty that their scale-to-zero competitors avoid. That gap, rather than model quality, is what decides which internal AI platforms survive their 2027 budget review. AWS's OpenSearch Serverless rearchitecture, launched May 28, is the first piece of major cloud infrastructure priced for that inversion. It will not be the last, and any team treating it as a routine version bump is misreading the moment.

The numbers behind that claim are specific. Bots accounted for 31% of all HTTP traffic over the last six months, according to Cloudflare, and AI crawlers, search engines, and assistants make up roughly a quarter of that bot volume. Machine traffic is the fastest-growing segment, and Cloudflare's senior product team expects the crossover (non-human exceeding human) within the next four quarters. For a Head of AI Platform, that figure is not ambient industry color. It describes the shape of the load your retrieval layer will absorb, and the shape is where the trouble starts.

The load profile is the problem

Human traffic is what every existing database was tuned for: steady, diurnal, predictable enough to provision against. Agent traffic behaves nothing like it. An agent fans out into sub-agents that hit hundreds of databases, run document searches, and call APIs within seconds, then disappears. You get spikes with no warning and idle stretches with no notice. Provision for the peak and you pay for empty compute most of the day. Provision for the average and you fall over during the burst that mattered, the one where the agent was actually doing its job.

AWS's technical change targets that profile directly. The prior Serverless generation coupled storage and compute, so at least one instance always ran, which meant idle compute you reserved and paid for whether or not anything queried it. The new generation separates the two. Compute scales up within seconds to meet a burst and scales to zero when agents go quiet, so idle periods cost $0. AWS frames it as the move from renting a permanent parking space to paying a meter, and the framing fits. For a RAG workload whose query pattern tracks agent activity, that is the difference between paying for 24 hours and paying for the three you used.
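The 24-hours-versus-three arithmetic can be sketched in a few lines. This is a minimal illustration, not AWS's billing model: the hourly rate, the hourly billing granularity, and the sample traffic trace are all assumptions made up for the example, and real metered billing is typically finer-grained than whole hours.

```python
def always_on_cost(hourly_rate: float, hours: int = 24) -> float:
    """Provisioned model: every hour is billed, busy or idle."""
    return hourly_rate * hours


def scale_to_zero_cost(hourly_rate: float, queries_per_hour: list[int]) -> float:
    """Metered model (simplified): only hours that served a query are billed."""
    active_hours = sum(1 for q in queries_per_hour if q > 0)
    return hourly_rate * active_hours


# Hypothetical day: agents are active in three separate hours, silent otherwise.
trace = [0] * 24
trace[9], trace[13], trace[22] = 4000, 12000, 800

RATE = 1.0  # illustrative price per compute-hour, not an AWS figure

print(always_on_cost(RATE))             # 24.0
print(scale_to_zero_cost(RATE, trace))  # 3.0
```

The sketch deliberately uses the same rate on both sides. In practice metered capacity usually carries a higher unit price, which is where the build-vs-buy question comes in.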

What it does to build-vs-buy

This is the point where the launch becomes a forcing function. The standard case for self-hosting a vector database, whether Weaviate, Milvus, pgvector on your own Postgres, or a self-managed OpenSearch cluster, has always rested on sustained high utilization. Run hot around the clock at predictable QPS and owning the cluster comes in cheaper than renting managed capacity, with the operational overhead paying for itself. That math is sound. It simply no longer describes agent workloads.

Once your dominant traffic source is spiky, idle-heavy, and machine-generated, sustained high utilization stops being your profile, and the premise under the self-host decision gives way. You end up running a fleet sized for peaks you hit a few hours a day and eating the idle cost as a line item that grows with every agent you ship. In that scenario, utilization-based pricing wins on more than convenience. It wins on the unit economics the self-host case was built to defend. The build-vs-buy line moves, and it moves against build for exactly the workload everyone is racing to deploy.
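One way to locate that line is as a break-even utilization. The sketch below assumes a deliberately simple model: a self-hosted cluster costs a flat amount per wall-clock hour, and managed capacity costs a higher amount per busy hour and nothing when idle. The function names and both rates are hypothetical, and the model ignores operations staffing, storage, and data transfer.

```python
def break_even_utilization(self_host_hourly: float, managed_hourly: float) -> float:
    """Busy fraction above which self-hosting becomes the cheaper option."""
    if self_host_hourly <= 0 or managed_hourly <= 0:
        raise ValueError("rates must be positive")
    return min(1.0, self_host_hourly / managed_hourly)


def cheaper_option(utilization: float, self_host_hourly: float,
                   managed_hourly: float) -> str:
    """Compare cost per wall-clock hour at a given busy fraction (0.0 to 1.0)."""
    self_host = self_host_hourly              # paid whether busy or idle
    managed = managed_hourly * utilization    # paid only while busy
    return "self-host" if self_host < managed else "managed"


# Illustrative rates only: managed capacity priced at 2.5x the self-hosted hour.
SELF_HOST, MANAGED = 1.0, 2.5

print(break_even_utilization(SELF_HOST, MANAGED))  # 0.4
print(cheaper_option(0.90, SELF_HOST, MANAGED))    # self-host
print(cheaper_option(3 / 24, SELF_HOST, MANAGED))  # managed
```

Under these made-up rates, a cluster that runs hot 90% of the time still favors self-hosting, while one that is busy three hours a day does not. The real threshold depends on prices this sketch does not know.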

The native integrations make the buy side harder to walk away from. OpenSearch Serverless ships connected to Vercel and Kiro, so a developer can stand up a production search and vector backend for an agent without touching infrastructure. That takes out the second pillar of the build case, control, for the large middle of teams who self-hosted mainly because the managed options were awkward to wire in.

The competitive read

This is bigger than an AWS story. It is a repricing of the entire retrieval layer, and AWS is simply first to attach a number to it. Databricks and Snowflake are both repositioning as AI memory and retrieval systems for enterprise data. Microsoft has shipped Azure updates built to handle agent bursts and share memory across agents. Cloudflare introduced infrastructure giving agents persistent environments and instant scale a month ahead of AWS. Five of the largest infrastructure providers reached the same conclusion in the same quarter: the retrieval and memory layer breaks first under agent load, and it has to be rebuilt around utilization rather than provisioned capacity.

That convergence validates one platform bet, managed and consumption-priced retrieval, and undercuts another, the assumption that a self-hosted vector store is a durable architectural moat. It stops being a moat once the hyperscalers are competing to drive idle retrieval cost to zero. Teams that standardized on self-hosting in 2024 made a reasonable call for 2024's traffic. That call is now a depreciating asset.

The strongest counter-argument

The honest objection is latency. Scale-to-zero implies cold starts, and an agent waiting on a retrieval backend to spin up is an agent blowing its latency budget, which for interactive agent chains is not negotiable. If the cold-start penalty turns out to be material, scale-to-zero suits batch and asynchronous agent work well and synchronous user loops poorly. That carve-out is real, and AWS has not published the cold-start numbers that would let anyone size it. Until those exist, treat the latency-sensitive synchronous path as unproven on this architecture.

It still does not move the thesis, for one reason. The workload mix is shifting toward the asynchronous, bursty, machine-to-machine pattern that scale-to-zero serves best and away from the steady synchronous pattern always-on was built for. Enterprises are deploying agents internally and for their customers, generating machine-to-machine traffic behind the scenes, and behind-the-scenes traffic is precisely the part that tolerates a cold start. The synchronous human-facing path stays on warm capacity. The growing majority of your retrieval volume does not. Most teams will run both, and the budget question is which side is growing.

What to do Monday, and what to watch

For a Head of AI / ML Platform, the action is not "migrate to OpenSearch Serverless." It is narrower: pull the utilization curve on your current vector and search infrastructure and work out what share of that spend is idle compute. If idle is a thin slice, your workload is still steady and you can sit this cycle out. If idle is large and climbing, which it will be for any team scaling agent deployments, you are funding a parking space, and the procurement decision needs to happen before your 2027 budget locks rather than after.
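A rough version of that utilization check might look like the following. It assumes you can export per-interval utilization samples (for example, hourly CPU or query-load fractions between 0 and 1) from your monitoring system. The 5% idle threshold, the sample data, and the monthly spend figure are all placeholders for illustration.

```python
def idle_share(utilization: list[float], idle_below: float = 0.05) -> float:
    """Share of sampled intervals in which the cluster was effectively idle."""
    if not utilization:
        raise ValueError("need at least one sample")
    idle = sum(1 for u in utilization if u < idle_below)
    return idle / len(utilization)


# Hypothetical day of hourly samples: 18 near-idle hours, 6 busy hours.
samples = [0.01] * 18 + [0.70] * 6

share = idle_share(samples)
monthly_spend = 40_000.0  # placeholder figure for an always-on cluster

# On an always-on cluster every interval costs the same, so the idle share
# of intervals approximates the idle share of spend.
print(share)                  # 0.75
print(monthly_spend * share)  # 30000.0
```

Run over several weeks of samples rather than one day, and track how the share moves as agent deployments grow, since the trend matters as much as the level.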

Three signals will tell you whether the thesis holds. The first is published cold-start latency for OpenSearch Serverless, the number that sets the boundary between where scale-to-zero applies and where it does not. The second is whether Snowflake and Databricks match consumption-to-zero pricing or hold higher floors, since their response shows whether this is now table stakes or still an AWS edge. The third is per-query cost benchmarks against a self-hosted baseline at realistic agent traffic profiles, not at the sustained-load profile the self-host case was always argued on, because that is not the load you are about to run.

The internet's traffic mix is inverting on a timeline measured in quarters. Your retrieval layer is the first thing that breaks when it does, and the first thing being repriced. Work out whether you are paying the meter or the parking space before the bill lands.