Are the costs of AI agents also rising exponentially? (2025)

(tobyord.com)

148 points | by louiereederson 3 days ago

28 comments

thelastgallon 4 hours ago
> On many task lengths (including those near their plateau) they cost 10 to 100 times as much per hour. For instance, Grok 4 is at $0.40 per hour at its sweet spot, but $13 per hour at the start of its final plateau. GPT-5 is about $13 per hour for tasks that take about 45 minutes, but $120 per hour for tasks that take 2 hours. And o3 actually costs $350 per hour (more than the human price) to achieve tasks at its full 1.5 hour task horizon. This is a lot of money to pay for an agent that fails at the task you’ve just paid for 50% of the time — especially in cases where failure is much worse than not having tried at all.
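A rough sketch of the arithmetic behind the quoted figures, not the article's own method. It assumes the quoted per-hour rate applies to the full task length, that failed attempts cost as much as successful ones, and that you simply retry independently until one attempt succeeds (so the expected number of attempts is 1 / success rate). The function name and parameters are hypothetical.

```python
def expected_cost_per_success(rate_per_hour: float,
                              task_hours: float,
                              success_rate: float) -> float:
    """Expected spend to get one successful run, retrying independently.

    Assumption: every attempt costs rate_per_hour * task_hours, whether it
    succeeds or not, and a failure has no cost beyond the wasted attempt.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = rate_per_hour * task_hours
    # Expected attempts until the first success is 1 / success_rate.
    return cost_per_attempt / success_rate


# Figures quoted above, each at a 50% success rate.
gpt5_two_hour_task = expected_cost_per_success(120, 2, 0.5)
o3_full_horizon = expected_cost_per_success(350, 1.5, 0.5)
print(gpt5_two_hour_task, o3_full_horizon)
```

Under these assumptions the effective price per completed task is double the price per attempt, and it gets worse if a failed attempt leaves damage that has to be cleaned up.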
- boxedemp 19 minutes ago
  If you gave me an agent that succeeded 50% of tasks I gave it, I could take over the world in a week. Faster if I wasn't so lazy.
  I think you're overestimating, or oversimplifying. Maybe both.
dang 7 hours ago
Related ongoing thread:
Measuring Claude 4.7's tokenizer costs - https://news.ycombinator.com/item?id=47807006 (309 comments)
quicklywilliam 5 hours ago
Interesting read. I don't know if I quite buy the evidence, but it's definitely enough to warrant further investigation. It also matches up with my personal experience, which is that tools like Claude Code are burning through more and more tokens as we push them to do bigger and bigger work. But we all know the frontier model companies are burning through money in an unsustainable race to get you and your company hooked on their tools.
So: I buy that the cost of frontier performance is going up exponentially, but that doesn't mean there is a fundamental link. We also know that benchmark performance of much smaller/cheaper models has been increasing (as far as I know METR only looks at frontier models), so that makes me wonder if the exponential cost/time horizon relationship is only for the frontier models.
- esperent 2 hours ago
  > But we all know the frontier model companies are burning through money in an unsustainable race to get you and your company hooked on their tools.
  Do we? Because elsewhere in the thread there's people claiming they are profitable in API billing and might be at least close to break even on subscription, given that many people don't use all of their allowance.
agentifysh 4 hours ago
Until there is some drastic new hardware, we are going to see a situation similar to proof of work, where a small group hoards the hardware and can collude on prices.
The difference is that current prices are heavily subsidized by OPM.
Once the narrative changes to something more realistic, I can see prices increasing across the board. Forget $200/month for codex pro; expect $1000/month or something similar.
So it's a race between new hardware supply and paradigm shifts reaching the market vs. the tide going out in the financial markets.
- jiggawatts 14 minutes ago
  > Until there is some drastic new hardware
  For inference, there is already a 10x improvement possible over a setup based on NVIDIA server GPUs, but volume production, etc... will take a while to catch up.
  During inference the model weights are static, so they can be stored in High Bandwidth Flash (HBF) instead of High Bandwidth Memory (HBM). Flash chips are being made with over 300 layers and they use a fraction of the power compared to DRAM.
  NVIDIA GPUs are general purpose. Sure, they have "tensor cores", but that's a fraction of the die area. Google's TPUs are much more efficient for inference because they're mostly tensor cores by area, which is why Gemini's pricing is undercutting everybody else despite being a frontier model.
  New silicon process nodes are coming from TSMC, Intel, and Samsung that should roughly double the transistor density.
  There are also algorithmic improvements like the recently announced Google TurboQuant.
  Not to mention that pure inference doesn't need the crazy fast networking that training does, or the storage, or pretty much anything other than the tensor units and a relatively small host server that can send a bit of text back and forth.
- colechristensen 4 hours ago
  Doubtful, local models are the competitive future that will keep prices down.
  128GB is all you need.
  A few more generations of hardware and open models will find people pretty happy doing whatever they need to on their laptop locally with big SOTA models left for special purposes. There will be a pretty big bubble burst when there aren't enough customers for $1000/month per seat needed to sustain the enormous datacenter models.
  Apple will win this battle and NVIDIA will be second when their goals shift to workstations instead of servers.
  - hypercube33 3 hours ago
    Weird how you're leaving stuff like Strix Halo out. Also weird that you think 128GB is the future, given all the research and papers now targeting something around 12GB. I assume we'll end up with fewer general-purpose models and more small, specific ones swapped in for whatever work you are asking them to do.
    - MrBuddyCasino 27 minutes ago
      Strix Halo hasn't got nearly enough bandwidth; it's just 256-bit.
  - lookaround 3 hours ago
    > 128GB is all you need.
    My guy, look around.
    They are coming for personal compute.
    Where are you going to get these 128GBs? Aquaman? [0]
    The ones who make RAM are inexplicably attaching their fate to the future being all LLMs only everywhere.
    [0] https://www.youtube.com/watch?v=0-w-pdqwiBw
    - naveen99 3 hours ago
      Cloud can’t make money off of you and pay more than you for the hardware at the same time.
      - adrianN an hour ago
        Batch inference is much more efficient. Using the hardware round the clock is much more efficient. Cloud can absolutely pay more for hardware and still make money off you.
      - bitwize 2 hours ago
        Cloud can pay more for RAM until all the RAM producers withdraw from the consumer market, then prices will go back down.
        End users will still get access to RAM. The cloud terminal they purchase from Apple, Google, Samsung, or HP will have all the RAM it will ever need directly soldered onto it.
        - xantronix an hour ago
          I was really fucking hoping we weren't at the part where "cloud terminals" doesn't seem farfetched and paranoid and yet here we are. Jesus Christ.
        - seanmcdirmid an hour ago
          Doesn't Apple place RAM directly into the SoC package? We aren't even talking about soldering it to motherboards anymore; it comes in the package with the CPU, like it would with a GPU.
    - foota 3 hours ago
      More like RAM producers are providing supplies to the highest bidder, no? If this doesn't peter out supply will normalize at a higher but less insane price eventually.
greenmilk 6 hours ago
Are any inference providers currently making a profit (on inference, I know Google makes money)?
- wsun19 5 hours ago
  Pretty much every major American inference provider claims to make a profit on API-based inference. Consumer plans might be subsidized overall, but it's hard to say since they're a black box and some consumers don't fully use their plans.
- raincole 2 hours ago
  All of them. It's simply impossible to sell tokens by usage at a loss now. You'll be arbitraged to death in a few days. It only makes sense to subsidize cost if you're selling a subscription.
- henry2023 3 hours ago
  Third parties selling open-weight inference on OpenRouter are surely selling at a profit. Zero reason to subsidize it.
- wavemode 4 hours ago
  Selling inference is not fundamentally different from selling compute - you amortize the lifetime cost of owning and operating the GPUs and then turn that into a per-token price. The risk of loss would be if there is low demand (and thus your facilities run underutilized), but I doubt inference providers are suffering from this.
  Where the long-term payoff still seems speculative is for companies doing training rather than just inference.
  - Gigachad 4 hours ago
    There’s a lot of debate over what the useful lifespan of the hardware is though. A number that seems very vibes based determines if these datacenters are a good investment or disastrous.
    - hypercube33 3 hours ago
      I specifically remember this debate coming up when the H100 was the only player on the table and AMD came out with a card that was almost as fast, at least in benchmarks, but at about half the cost. I haven't seen a follow-up with real-world use though, and as a home labber I know that in the last three weeks the support for AMD stuff has gotten impressively useful, covering even CUDA if you enjoy pain and suffering.
      What I'm curious about is the other stuff out there, such as the ARM and tensor chips.
- jagged-chisel 5 hours ago
  Google definitely makes money in other areas. Do they make money on inference?
matt3210 4 hours ago
I took a month-long break and my side project took 2x as many tokens.
siliconc0w 2 hours ago
Working on an OSS tool to help orgs identify where they can save on token costs: https://repogauge.org
Happy to run it on your repos for a free report: hi@repogauge.org
noosphr 2 hours ago
Yet again: Transformers are fundamentally quadratic.
If they can do a task that takes 1 unit of computation for 1 dollar they will cost 100 dollars for a 10 unit task and 10,000 for a 100 unit task.
Project costs from Claude Code bear this out in the real world.
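To make the scaling in that comment concrete, here is a minimal sketch that takes the quadratic claim at face value: cost grows with the square of the task size, anchored at $1 for a 1-unit task. The function name and the `base_cost` parameter are hypothetical, and real agent bills also depend on pricing, caching, and how the work is split up.

```python
def quadratic_cost(units: float, base_cost: float = 1.0) -> float:
    """Cost of a task if cost scales with the square of its size.

    Assumption taken from the comment above: a 1-unit task costs base_cost
    dollars, and cost is proportional to units squared.
    """
    return base_cost * units ** 2


for units in (1, 10, 100):
    print(f"{units:>3} units -> ${quadratic_cost(units):,.0f}")
```

Under this model a 10x larger task costs 100x more, which is where the $100 and $10,000 figures come from.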