This is an industry we're[0] in. Owning is at one end of the spectrum, cloud is at the other, and there are broadly a couple of options in between:
1 - Cloud - This minimises cap-ex, hiring, and risk, while largely maximising operational costs (it's expensive) and cost variability (usage based).
2 - Managed Private Cloud - What we do. Still minimal-to-no cap-ex, hiring, or risk, and a medium-sized operational cost (around 50% cheaper than AWS et al). We rent or colocate bare metal, manage it for you, handle software deployments, deploy only open source, etc. Only really makes sense above €5k/month spend.
3 - Rented Bare Metal - Let someone else handle the hardware financing for you. Still minimal cap-ex, but with greater hiring/skilling needs and risk. Around 90% cheaper than AWS et al (plus your time).
4 - Buy and colocate the hardware yourself - Certainly the cheapest option if you have the skills, scale, and cap-ex, and if you plan to run the servers for at least 3-5 years.
A good provider for option 3 is someone like Hetzner. Their internal ROI on server hardware seems to be around the 3-year mark, after which I assume it is either still running with a client or goes into their server auction system.
Options 3 & 4 generally become more appealing either at scale, or when infrastructure is part of the core business. Option 1 is great for startups who want to spend very little initially, but then grow very quickly. Option 2 is pretty good for SMEs with baseline load, regular-sized business growth, and maybe an overworked DevOps team!
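To make the spectrum concrete, here's a rough back-of-the-envelope sketch in Python using the ballpark multipliers above (~50% off for managed private cloud, ~90% off for rented bare metal). The colocation multiplier and the extra ops cost are assumptions to replace with your own numbers, not measurements.

    # Rough cost comparison of the four options above.
    # The 0.50 and 0.10 multipliers come from the ballpark figures in this
    # comment; the 0.05 colocation multiplier and the extra ops cost are
    # assumptions - plug in your own numbers.
    def monthly_costs(cloud_spend, extra_ops_cost):
        return {
            "1. cloud":                 cloud_spend,
            "2. managed private cloud": cloud_spend * 0.50,
            "3. rented bare metal":     cloud_spend * 0.10 + extra_ops_cost,
            "4. colocated hardware":    cloud_spend * 0.05 + extra_ops_cost,
        }

    for option, cost in monthly_costs(20_000, 7_000).items():
        print(f"{option:26} ~{cost:>9,.0f} / month")

With a small bill, the extra ops cost dominates and options 3/4 lose; as the bill grows, the multipliers dominate and they win.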
I think the issue with this formulation is that what drives the cost at cloud providers isn't necessarily that their hardware is too expensive (which it is), but that they push you towards overcomplicated and inefficient architectures that cost too much to run.
At the core of this are all the 'managed' services - if you have a server box, it's in your financial interest to squeeze as much performance out of it as possible. If you're using something like ECS or serverless, AWS gains nothing by optimizing the servers to make your code run faster - their hard work results in fewer billed infrastructure hours.
This 'microservices' push usually means that instead of having an on-server session where you can serve stuff from a temporary cache, all the data that persists between requests needs to be stored in a DB somewhere, all the auth logic needs to re-check your credentials, and something needs to direct the traffic and load balance these endpoints - and all this stuff costs money.
I think if you have 4 Java boxes as servers with a redundant DB with read replicas on EC2, your infra is so efficient and cheap that even paying 4x for it rather than going for colocation is well worth it because of the QoL and QoS.
These crazy AWS bills usually come from using every service under the sun.
The complexity is what gets you. One of AWS's favorite situations is
1) Senior engineer starts on AWS
2) Senior engineer leaves because our industry does not value longevity or loyalty at all whatsoever (not saying it should, just observing that it doesn't)
3) New engineer comes in and panics
4) Ends up using a "managed service" to relieve the panic
5) New engineer leaves
6) Second new engineer comes in and not only panics but outright needs help
7) Paired with some "certified AWS partner" who claims to help "reduce cost" but who actually gets a kickback from the extra spend they induce (usually 10% if I'm not mistaken)
Calling it ransomware is obviously hyperbolic, but there are definitely some parallels one could draw.
On top of it all, AWS pricing is about to massively go up due to the RAM price increase. There's no way it can't since AWS is over half of Amazon's profit while only around 15% of its revenue.
The end result of all this is that the percentage of people who know how to implement systems without AWS/Azure will be a single digit. From that point on, this will be the only "economic" way, no matter what the prices are.
That's not a factual statement about reality, but more of a normative judgement to justify resignation. Yes, professionals who know how to actually do these things are not abundantly available, but they are available enough to achieve the transition. The talent exists and is absolutely passionate about software freedom, and hence highly intrinsically motivated to work on it. The only thing lacking so far is demand; the available talent will skyrocket when the market starts demanding it.
It's all anecdotal, but in my experience it's usually the opposite: a bored senior engineer wants to use something new and picks a bespoke AWS service for a new project.
I am sure it happens a multitude of ways but I have never seen the case you are describing.
I've seen your case more than the ransom scenario too. But also, even more often: an early-to-mid-career dev saw a cloud pattern trending online, heard it was a new "best practice," and so needed to find a way to move their company to using it.
Just this week a friend of mine was spinning up some AWS managed service, complaining about the complexity and how any reconfiguration took 45 minutes to reload. It's a service you can just install with apt, and the default configuration is fine. Not only are many services no longer cheaper in the cloud, the management overhead also exceeds that of on-prem.
I'd gladly use (and maybe even pay for!) an open-source reimplementation of AWS RDS Aurora. All the bells and whistles with failover, clustering, volume-based snaps, cross-region replication, metrics etc.
As far as I know, nothing comes close to Aurora functionality. Even in vibecoding world. No, 'apt-get install postgres' is not enough.
Serverless v2 is one of the products I was skeptical about, but it is genuinely one of the most robust solutions out there in that space. It has its warts, but I usually default to it for fresh installs because you get so much out of the box with it.
What managed service? Curious - I don't use the full suite of AWS services, but I'm wondering what would take 45 minutes. Maybe it was a large cluster of some sort that needed rolling changes?
My observation is that all these services are exploding in complexity, and they justify it by saying that there are more features now, so everyone needs to accept spending more and more time and effort for the same results.
It's basically the same dynamic as hedonic adjustment in the CPI calculations: cars may cost twice as much, but now they have USB chargers built in, so inflation isn't really that bad.
Great comment. I agree it's a spectrum, and for those of us who are comfortable with (4), like yourself and probably us at Carolina Cloud [0] as well, (4) seems like a no-brainer. But there's a long tail of semi-technical users who are more comfortable in 2-3 or even 1, which is what ultimately traps them in the ransomware-adjacent situation that is a lot of the modern public cloud. I would push back on "usage-based". Yes, it is technically usage-based, but the base fee also goes way up, and there are sometimes retainers on these services (i.e. minimum spend). So "usage-based" is not wrong, but what it usually means is "more expensive and potentially far more expensive".
The problem is that clouds have easily become 3 to 5 times the price of managed services, 10x the price of option 3, and 20x the price of option 4. To say nothing of the fact that almost all businesses can run fine in "PC under the desk" type situations.
So in practice cloud has become the more expensive option the second your spend goes over the price of 1 engineer.
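A quick sketch of that threshold claim, under the assumption that self-hosting costs roughly 10% of the equivalent cloud bill plus about one additional engineer to run it (both numbers are assumptions, not data):

    # Break-even check for "cloud is more expensive once your spend exceeds
    # the price of one engineer". Assumes self-hosting = ~10% of the cloud
    # bill + one extra engineer; tweak both to taste.
    engineer = 100_000  # assumed fully-loaded annual cost
    for cloud_bill in (50_000, 100_000, 250_000, 1_000_000):  # annual cloud spend
        self_hosted = 0.10 * cloud_bill + engineer
        winner = "self-hosting" if self_hosted < cloud_bill else "cloud"
        print(f"cloud {cloud_bill:>9,} vs self-hosted {self_hosted:>9,.0f} -> {winner}")

With those assumptions the crossover lands a little above the engineer's salary, which is roughly the claim being made here.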
Hetzner is definitely an interesting option. I'm a bit scared of managing the services on my own (like Postgres, Site2Site VPN, ...), but the price difference makes it so appealing. From our financial models, Hetzner can win over AWS when you spend over 10-15K per month on infrastructure and you're hiring really well. It's still a risk, but a risk that can definitely be worth it.
> I'm a bit scared of managing the services on my own
I see it from the other direction: if something fails, I have complete access to everything, meaning that I have a chance of fixing it - down to the hardware, even. When I run stuff in the cloud, things get abstracted away, hidden behind APIs, and data lives beyond my reach.
Security and regular mistakes are much the same in the cloud, but I then have to layer whatever complications the cloud provider comes with on top. The cost has to be much, much lower if I'm going to trust a cloud provider over running something in my own data center.
You sum it up very neatly. We've heard this from quite a few companies, and that's kind of why we started ours.
We figured, "Okay, if we can do this well, reliably, and de-risk it; then we can offer that as a service and just split the difference on the cost savings"
(plus we include engineering time proportional to cluster size, and also do the migration on our own dime as part of the de-risking)
I've just shifted my SWE infrastructure from AWS to Hetzner (literally in the last month). My current analysis looks like it will be about 15-20% of the cost - £240 vs 40-50 euros.
Expect a significant exit expense, though, especially if you are shifting large volumes of S3 data. That's been our biggest expense. I've moved this to Wasabi at about 8 euros a month (vs about $70-80 a month on S3), but I've paid transit fees of about $180 - and it was more expensive because I used DataSync.
Retrospectively, I should have just DIYed the transfer, but maybe others can benefit from my error...
Extremely useful information - unfortunately I just assumed this didn't apply to me because I am in the UK and not the EU. Another mistake, though given it's not huge amounts of money I will chalk it up to experience.
Hopefully someone else will benefit from this helpful advice.
I'm wondering if it makes sense to distribute your architecture so that the workers that do most of the heavy lifting are in Hetzner, while the other stuff is in costly AWS. On the other hand, this means you don't have easy access to S3, etc.
I'm curious to know the answer, too. I used to deploy my software on-prem back in the day, and that always included an installation of Microsoft SQL Server. So, all of my clients had at least one database server they had to keep operational. Most of those clients didn't have an IT staff at all, so if something went wrong (which was exceedingly rare), they'd call me and I'd walk them through diagnosing and fixing things, or I'd Remote Desktop into the server if their firewalls permitted and fix it myself. Backups were automated and would produce an alert if they failed to verify.
It's not rocket science, especially when you're talking about small amounts of data (small credit union systems in my example).
No it was not. 15 years ago Heroku was all the rage. Even the places that had bare metal usually had someone running something similar to DevOps, and at least core infra was not being touched. I am sure places existed, but 15 years ago, while far away, was already pretty far along from what you describe. At least in SV.
Heroku was popular with startups who didn't have infrastructure skills, but the price was high enough that anyone who wasn't in that triangle of "lavish budget, small team, limited app diversity" wasn't using it. Things like AWS IaaS were far more popular due to the lower cost and greater flexibility, but even that was far from a majority service class.
I am not sure if you are trying to refute my lived experience or what exactly the point is. Heroku was wildly popular with startups at the time, not just those with lavish budgets. I was already touching RDS at this point, and even before RDS came around, no organization I worked at had me jumping on bare metal to provision services myself. There was always a system in place where someone helped engineering deploy systems. I know this was not always the case, but the person I was responding to made it sound like 15 years ago all engineers were provisioning their own databases and doing other types of dev/sys ops on a regular basis. It's not true, at least in SV.
5 - Datacenter (DC) - Like 4, except also take control of the space/power/HVAC/transit/security side of the equation. Makes sense either at scale, or if you have specific needs. Specific needs could be: specific location, reliability (higher or lower than a DC), resilience (conflict planning).
There are actually some really interesting use cases here. For example, reliability: If your company is in a physical office, how strong is the need to run your internal systems in a data centre? If you run your servers in your office, then there's no connectivity reliability concerns. If the power goes out, then the power is out to your staff's computers anyway (still get a UPS though).
Or perhaps you don't need as high reliability if you're doing only batch workloads? Do you need to pay the premium for redundant network connections and power supplies?
If you want your company to still function in the event of some kind of military conflict, do you really want to rely on fibre optic lines between your office and the data center? Do you want to keep all your infrastructure in such a high-value target?
I think this is one of the more interesting areas to think about, at least for me!
When I worked IT for a school district at the beginning of my career (2006-2007), I was blown away that every school had a MASSIVE server room (my office at each school - the MDF): 3-5 racks filled (depending on school size and connection speed to the central DC - data closet). 50-75% was networking equipment (5 PCs per class, hardwired), 10% was the Novell NetWare server(s) and storage, and the other 15% was application storage for app distributions on login.
Personally I haven't seen a scenario where it makes sense beyond a small experimental lab where you value the ability to tinker physically with the hardware regularly.
Offices are usually very expensive real estate in city centers and with very limited cooling capabilities.
Then again the US is a different place, they don't have cities like in Europe (bar NYC).
If you are a bank or a bookmaker or similar you may well want to have total control of physical access to the machines. I know one bookmaker I worked with had their own mini-datacenter, mainly because of physical security.
If you have less than a rack of hardware, if you have physical security requirements, and/or your hardware is used in the office more than from the internet, it can make sense.
5 was a great option for ML work last year, since rented colo didn't come with a 10kW cable. With RAM, SSD and GPU prices the way they are now, I have no idea what you'd need to do.
Thank goodness we did all the capex before the OpenAI RAM deal, when expensive Nvidia GPUs were the worst we had to deal with.
> 4 - Buy and colocate the hardware yourself - Certainly the cheapest option if you have the skills, scale, and cap-ex, and if you plan to run the servers for at least 3-5 years.
Is it still the cheapest after you take into account that skills, scale, cap-ex and long term lock-in also have opportunity costs?
An interesting question, so time for some 100% speculation.
It sounds like they probably have revenue in the €500mm range today. And given that the bare metal cost of AWS-equivalent bills tends to be a 90% reduction, we'll say a €10mm+ bare metal cost.
So I would say a cautious and qualified "yes". But I know even for smaller deployments of tens or hundreds of servers, they'll ask you what the purpose is. If you say something like "blockchain," they're going to say, "Actually, we prefer not to have your business."
I get the strong impression that while they naturally do want business, they also aren't going to take a huge amount of risk on board themselves. Their specialism is optimising on cost, which naturally has to involve avoiding or mitigating risk. I'm sure there'd be business terms to discuss, put it that way.
Why would a client who wants to run a blockchain be risky for Hetzner? I'm not a fan, I just don't see the issue. If the client pays their monthly bill, who cares if they're using the machine to mine Bitcoin?
They are certain to run the machines at 100% continually, which will cost more than a typical customer who doesn't do this, and leave the old machines with less second-hand value for their auction thing afterwards.
I'd bet the main reason would be power. Running machines at 100% doesn't wear them down much extra, but a server running hard for 24 hours a day uses more power than a bursty workload.
Netflix might be spending as much as $120m (but probably a little less), and I thought they were probably Amazon's biggest customer. Does someone (single-buyer) spend more than that with AWS?
Hetzner's revenue is somewhere around $400m, so it's probably a little scary taking on an additional 30% of revenue from a single customer, and Netflix's shareholders would probably be worried about the risk of relying on a vendor that is much smaller than they are.
Sometimes, if the companies are friendly to the idea, they could form a joint venture, or maybe Netflix could just acquire Hetzner (and compete with Amazon?), but I think it unlikely Hetzner could take on a Netflix-sized customer, for nontechnical reasons.
However, increasing PoP capacity by 30% within 6 months is pretty realistic, so I think they'd probably be able to physically service Netflix without changing too much, if management could get comfortable with the idea.
A $120M spend on AWS is equivalent to around a $12M spend on Hetzner Dedicated (likely even less, the factor is 10-20x in my experience), so that would be 3% of their revenue from a single customer.
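Spelling out the arithmetic behind these two comments (all three inputs are the rough guesses from this thread, not real figures):

    # Guesses from the comments above: ~$120M/yr Netflix AWS spend,
    # ~$400M/yr Hetzner revenue, and a conservative 10x bare-metal discount.
    aws_spend = 120e6
    hetzner_revenue = 400e6
    bare_metal = aws_spend / 10          # ~$12M equivalent on dedicated servers
    print(f"~${bare_metal/1e6:.0f}M on bare metal "
          f"= ~{bare_metal / hetzner_revenue:.0%} of Hetzner's revenue")
    # prints: ~$12M on bare metal = ~3% of Hetzner's revenue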
How is this reasonable? At what point do they pull a Dropbox and de-AWS? I can't think of what they would gain with AWS over in-house hosting at that point.
I'm not surprised, but you'd think there would be some point where they would decide to build a data center of their own. It's a mature enough company.
Can someone explain 2 to me? How is a managed private cloud different from full cloud? Like, are you still using AWS or Azure, but keeping all your operations in a bundled, portable way, so you can leave that provider easily at any time rather than becoming very dependent on them? Is it like staying provider-agnostic but still cloud-based?
To put it plainly: We deploy a Kubernetes cluster on Hetzner dedicated servers and become your DevOps team (or a part thereof).
It works because bare metal is about 10% the cost of cloud, and our value-add is in 1) creating a resilient platform on top of that, 2) supporting it, 3) being on-call, and 4) being or supporting your DevOps team.
This starts with us providing a Kubernetes cluster which we manage, but we also take responsibility for the services run on it. If you want Postgres, Redis, Clickhouse, NATS, etc, we'll deploy it and be SLA-on-call for any issues.
If you don't want to deal with Kubernetes then you don't have to. Just have your software engineers hand us the software and we'll handle deployment.
Everything is deployed on open source tooling, you have access to all the configuration for the services we deploy. You have server root access. If you want to leave you can do.
Our customers have full root access, and our engineers (myself included) are in a Slack channel with your engineers.
And, FWIW, it doesn't have to be Hetzner. We can colocate or use other providers, but Hetzner offer excellent bang-per-buck.
Edit: And all this is included in the cluster price, which comes out cheaper than the same hardware on the major cloud providers
You give customers root but you're on call when something goes tits up?
You're a brave DevOps team. That would cause a lot of friction in my experience, since people with root or other administrative privileges do naughty things, but others are getting called in on Saturday afternoon.
Instead of using the cloud's own Kubernetes service, for example, you just buy the compute and run your own Kubernetes cluster. At a certain scale that is going to be cheaper, if you have the know-how. And since you are no longer tied to which services are provided and you just need access to compute and storage, you can also shop around for better prices than Amazon or Azure, since you can really go to any VPS provider.
I am using something in between 2 and 3: a hosted website and database service with excellent customer support. On shared hardware it is 22 €/month. A managed server on dedicated hardware starts at about 50 €/month.
Been using Hetzner Cloud for Kubernetes and generally like it, but it has its limitations. The network is highly unpredictable: you at best get 2 Gbit/s, but at worst a few hundred Mbit/s.
Is that for the virtual private network? I heard some people say that you actually get higher bandwidth if you're using the public network instead of the private network within Hetzner, which is a little bit crazy.
> Buy and colocate the hardware yourself - Certainly the cheapest option if you have the skills
Back then this type of "skill" was abundant. You could easily get sysadmin contractors who would take a drive down to the data center (probably rented facilities in real estate that belonged to a bank or an insurer) to exchange some disks that had died for some reason. Such a person was full stack in the sense that they covered backups, networking, and firewalls, and knew how to source hardware.
The argument was that this was too expensive and the cloud was better. So hundreds of thousands of SMEs embraced the cloud - most of them never needed Google-type scale, but they got sucked into the "recurring revenue" grift that is SaaS.
If you opposed this mentality you were basically saying "we as a company will never scale this much" which was at best "toxic" and at worst "career-ending".
The thing is, these ancient skills still exist. And most orgs simply do not need AWS-type scale. European orgs would do well to revisit these basic ideas. And Hetzner or Lithus would be a much more natural (and honest) fit for these companies.
I wonder how much companies pay yearly in order to avoid having an employee pick up a drive from a local store, drive to the data center, pull the disk tray, unscrew the failing hard drive and put in the new one, add it to the RAID, verify the rebuild has started, and then return to the office.
I don't think I've ever seen a non-hot-swap disk in a normal server. The oldest ones I dealt with had 16 HDDs per server, and only 12 were accessible from the outside, but the 4 internal ones were still hot-swap after taking the cover off.
Even some really old (2000s-era) junk I found in a cupboard at work was all hot-swap drives.
But more realistically in this case, you tell the data centre "remote hands" person that a new HDD will arrive next-day from Dell, and it's to go in server XYZ in rack V-U at drive position T. This may well be a free service, assuming normal failure rates.
In the Bay Area there are little datacenters that will happily colocate a rack for you and will even provide an engineer who can swap disks. The service is called "remote hands". It may still be faster to drive over.
It baffles me that my career trajectory somehow managed to insulate me from ever having to deal with the cloud, while such esoteric skills as swapping a hot swap disk or racking and cabling a new blade chassis are apparently on the order of finding a COBOL developer now. Really?
I can promise you that large financial institutions still have datacenters. Many, many, many datacenters!
We had two racks in our office of mostly developers. If you have an office, you already have a rack for switches and patch panels. Adding a few servers is obvious.
Software development isn't a typical SME however. Mike's Fish and Chips will not buy a server and that's fine.
If someone on the DevOps team knows Nix, option 3 becomes a lot cheaper time-wise! Yeah, Nix flakes still need maintenance, especially on the `nixos-unstable` branch, but you get the quickest disaster recovery route possible!
plus, infra flexibility removes random constraints that e.g. Cloudflare Workers have
Indeed! We've yet to go down this route, but it's something we're thinking on. A friend and I have been talking about how to bring Nix-like constructs to Kubernetes as well, which has been interesting. (https://github.com/clotodex/kix, very much in the "this is fun to think about" phase)
There are a bunch of ways to manage bare metal servers apart from Nix. People have been doing it for years: Kickstart, Foreman, MAAS, etc. [0]. Many to choose from according to your needs and the layers you want them to manage.
The reality is that these days you just boot a basic image that runs containers.
Option 4 as well, that's how we do it at work and it's been great. However, it can't really be "someone on the team knows Nix", anyone working on Ops will need Nix skills in order to be effective.
I would suggest to use both on-premise hardware and cloud computing. Which is probably what comma is doing.
For critical infrastructure, I would rather pay a competent cloud provider than be responsible for reliability issues. Maintaining one server room in the headquarters is something, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.
For running many Slurm jobs on good servers, cloud computing is very expensive, and you sometimes save money in a matter of months. And who cares if the server room is a total loss after a while; worst case, you write some more YAML and Terraform and deploy a temporary replacement in the cloud.
Another thing in between is colocation, where you put hardware you own in a managed data center. It's a bit old-fashioned, but it may make sense in some cases.
I can also mention that research HPCs may be worth considering. In research, we have some of the world's fastest computers at a fraction of the cost of cloud computing. It's great as long as you don't mind not being root and having to use Slurm.
I don't know about the USA, but in Norway you can run your private company's Slurm AI workloads on research HPCs, though you will pay quite a bit more than universities and research institutions do. But you can also have research projects together with universities or research institutions, and everyone will be happy if your business benefits a lot from the collaboration.
> but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO
I worked at a company with two server farms (a main one and a backup one, essentially) in Italy, located in two different regions, and we had a total of 5 employees taking care of them.
We didn't hear about them, we didn't know their names, but we had almost 100% uptime and terrific performance.
There was one single person out of 40 developers whose main responsibility was deploys, and that's it.
It cost my company €800k per year to run both server farms (hardware, salaries, energy), and it spared the company around €7-8M in cloud costs.
Now I work for clients that spend multiple millions on cloud for a fraction of the output and traffic, and that I think employ around 15+ DevOps engineers.
> I would rather pay a competent cloud provider than being responsible for reliability issues.
Why do so many developers and sysadmins think they're not competent to host services? It is a lot easier than you think, and it's also fun to solve the technical issues you may have.
The point was about redundancy / geo spread / HA. It's significantly more difficult to operate two physical sites than one. You can only be in one place at a time.
If you want true reliability, you need redundant physical locations, power, networking. That's extremely easy to achieve on cloud providers.
You can just rent rack space in a datacenter and have that covered. It's still much cheaper than running it in the cloud.
It doesn't make sense if you only have a few servers, but if you are renting the equivalent of multiple racks of servers from the cloud and running them for most of the day, on-prem is staggeringly cheaper.
We have a few racks and we do a "move to cloud" calculation every few years, and without fail it comes out at least 3x the cost.
And before the "but you need to do more work" whining I hear from people who have never done it: it's not much more than navigating the forest of cloud APIs and dealing with random black-box issues in the cloud that you can't really debug, only work around.
On cloud it's out of your control when an AZ goes down. When it's your server you can do things to increase reliability. Most colos have redundant power feeds and internet. On prem that's a bit harder, but you can buy a UPS.
If your head office is hit by a meteor your business is over. Don't need to prepare for that.
Maybe you find it fun. I don't; I prefer building software, not running and setting up servers.
It's also nontrivial once you go past some level of complexity and volume. I have made my career out of building software, and part of that requires understanding the limitations and specifics of the underlying hardware, but at the end of the day I simply want to provision and run a container. I don't want to think about the security and networking setup; it's not worth my time.
> Why do so many developers and sysadmins think they're not competent to host services? It is a lot easier than you think, and it's also fun to solve the technical issues you may have.
It is a different skillset. SRE is also under-valued/under-paid (unless one is in FAANG).
It's all downside. If nothing goes wrong, then the company feels like they're wasting money on a salary. If things go wrong, they're all your fault.
At a previous job, the company had its critical IT infrastructure in its own data centers. It was not in the IT industry, but it was large and rich enough to justify two small data centers. They notably had batteries, diesel generators, 24/7 teams, and some advanced security (for valid reasons).
I agree that solving technical issues is very fun, and hosting services is usually easy, but having resilient infrastructure is costly, and I simply don't like being woken up at night to fix stuff while the company is bleeding money and customers.
> Why do so many developers and sysadmins think they're not competent to host services?
Because those services solve the problem for them. It is the same thing with GitHub.
However, as predicted half a decade ago when GitHub became unreliable [0], and as price increases begin to happen, you can see that self-hosting begins to make more sense: you have complete control of the infrastructure, it has never been easier to self-host, and you keep control over costs.
> it's also fun to solve the technical issues you may have.
What you have just seen is going to have the same effect with coding agents: "developers" will see a decline in skills the moment they become over-reliant on them, and won't be able to write a single line of code to fix a problem they don't fully understand.
> Maintaining one server room in the headquarters is something, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.
Speaking as someone who does this, it is very straightforward. You can rent space from people like Equinix or Global Switch for very reasonable prices. They then take care of power, cooling, cabling plant etc.
Unfortunately we experienced an issue where our Slurm pool was contaminated by a misconstrained Postgres Daemon. Normally the contaminated slurm pool would drain into a docker container, but due to Rust it overloaded and the daemon ate its own head. Eventually we returned it to a restful state so all's well that ends well.
(hardware engineer trying to understand wtaf software people are saying when they speak)
The reason companies don't go with on-premises even if cloud is way more expensive is the risk involved in on-premises.
You can see quite clearly here that there are so many steps to take. Now, a good company would concentrate risk on their differentiating factor, or the specific part they have a competitive advantage in.
It's never about "is the expected cost of on-premises less than cloud"; it's about the risk-adjusted costs.
Once you've spread risk not only on your main product but also on your infrastructure, it becomes hard.
I would be wary of a smallish company building their own Jira in-house in a similar way.
Software companies have higher margins so these decisions are lower stakes. Unless on premises helps the bottom line of the main product that the company provides, these decisions don't really matter in my opinion.
Think of a ~5000 employee startup. Two scenarios:
1. If they win the market, they capture something like ~60% margin
2. If that doesn't happen, they just lose; the VC fund runs out and then they leave
In this dynamic, costs associated with infrastructure don't change the bottom line of profitability. The risk involved with rolling out their own infrastructure can hurt their main product's very existence.
I'm not disputing that there are situations where it makes sense to pay a high risk premium. What I'm disputing is that price doesn't matter. I get the impression that companies are losing the capability to make rational pricing decisions.
>Unless on premises helps the bottom line of the main product that the company provides, these decisions don't really matter in my opinion.
Well, exactly. But the degree to which the price of a specific input affects your bottom line depends on your product.
During the dot com era, some VC funded startups (such as Google) made a decision to avoid using Windows servers, Oracle databases and the whole super expensive scale-up architecture that was the risk-free, professional option at the time. If they hadn't taken this risk, they might not have survived.
[Edit] But I think it's not just about cloud vs on-premises. A more important question may be how you're using the cloud. You don't have to lock yourself into a million proprietary APIs and throw petabytes of your data into an egress jail.
SME and "a server" are doing some heavy lifting here.
If you want a custom server, one or a thousand, it's at least a couple of weeks.
If you want a powerful GPU server, that's rack + power + cooling (and a significant lead time). A respectable GPU server means ~2kW of power dissipation and considerable heat.
If you want a datacenter of any size, now that's a year at least from breaking ground to power-on.
If you want something at all customized, it takes longer than that to receive the server. That being said, you can buy a server that will outperform anything the cloud can give you at much better cost.
I think it wins because opex is seen as a stable recurring cost, and capex is seen as the money you put into your primary differentiator for long-term gains.
For mature enterprises, my understanding is that the financial math works out such that the cloud makes sense for market validation, before moving to a cheaper long-term solution once revenue is stable.
Scale up, prove the market, and establish operations on the credit card; if it doesn't work, the money moves on to more promising opportunities. If the operation is profitable, you transition away from the too-expensive cloud to increase profitability, and use the operation's incoming revenue to pay for it (freeing up more money to chase more promising opportunities).
Personally I can't imagine anything outside of a hybrid approach, if only to maintain power dynamics with suppliers on both sides. Price increases and forced changes can be met with instant redeployments off their services/stack, creating room for more substantive negotiations. When investments come in the form of saving time and money, it's not hard to get everyone aligned.
Which is incredibly difficult in the public sector. Yes, there are various financing instruments available for capital purchases but they're always annoying, slow and complicated. It's much easier to spend 5k per month than 500k outright.
At scale (like comma.ai), it's probably cheaper. But until then it's a long term cost optimization with really high upfront capital expenditure and risk. Which means it doesn't make much sense for the majority of startup companies until they become late stage and their hosting cost actually becomes a big cost burden.
There are in between solutions. Renting bare metal instead of renting virtual machines can be quite nice. I've done that via Hetzner some years ago. You pay just about the same but you get a lot more performance for the same money. This is great if you actually need that performance.
People obsess about hardware but there's also the software side to consider. For smaller companies, operations/devops people are usually more expensive than the resources they manage. The cost to optimize is that cost. The hosting cost usually is a rounding error on the staffing cost. And on top of that the amount of responsibilities increases as soon as you own the hardware. You need to service it, monitor it, replace it when it fails, make sure those fans don't get jammed by dust puppies, deal with outages when they happen, etc. All the stuff that you pay cloud providers to do for you now becomes your problem. And it has a non zero cost.
The right mindset for hosting cost is to think of it in FTEs (full time employee cost for a year). If it's below 1 (most startups until they are well into scale up territory), you are doing great. Most of the optimizations you are going to get are going to cost you in actual FTEs spent doing that work. 1 FTE pays for quite a bit of hosting. Think 10K per month in AWS cost. A good ops person/developer is more expensive than that. My company runs at about 1K per month (GCP and misc managed services). It would be the wrong thing to optimize for us. It's not worth spending any amount of time on for me. I literally have more valuable things to do.
This flips when you start getting into the multiple FTEs per month in cost for just the hosting. At that point you probably have additional cost measured in 5-10 FTE in staffing anyway to babysit all of that. So now you can talk about trading off some hosting FTEs for modest amount of extra staffing FTEs and make net gains.
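The FTE heuristic above, written out as a quick sketch (the salary figure is an assumption; the $10K/month example is the one from this comment):

    # "Measure hosting cost in FTEs": below ~1 FTE, optimizing hosting
    # usually costs more engineer time than it saves.
    fte = 150_000                 # assumed fully-loaded annual cost of one ops person
    hosting = 10_000 * 12         # the $10K/month example above, annualized
    ratio = hosting / fte
    print(f"hosting = {ratio:.1f} FTE -> "
          + ("optimize later" if ratio < 1 else "worth trading FTEs for savings"))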
> At scale (like comma.ai), it's probably cheaper. But until then it's a long term cost optimization with really high upfront capital expenditure and risk. Which means it doesn't make much sense for the majority of startup companies until they become late stage and their hosting cost actually becomes a big cost burden.
You rent datacenter space, which is OPEX not CAPEX, and you just lease the servers, which turns a big CAPEX outlay into a monthly OPEX bill.
Running your own DC is a "we have two dozen racks of servers" endeavour, but even just renting DC space and buying servers is much cheaper than getting the same level of performance from the cloud.
> This flips when you start getting into the multiple FTEs per month in cost for just the hosting. At that point you probably have additional cost measured in 5-10 FTE in staffing anyway to babysit all of that. So now you can talk about trading off some hosting FTEs for modest amount of extra staffing FTEs and make net gains.
YOU NEED THOSE PEOPLE TO MANAGE CLOUD TOO. That's what always gets ignored in calculations: people go "oh, but we really need like 2-3 ops people to cover the datacenter and have shifts on the on-call", but you need the same thing for cloud too, it's just dumped on the programmers/DevOps guys on the team rather than handled by separate staff.
We have a few racks, and the part related to hardware is a small part of the total workload; most of it is the same as we would do (and do, for a few cloud customers) in the cloud: writing manifests for automation.
Finally, some sense! "Cloud" was meant to make ops jobs disappear, but they just increased our salaries by turning us into "DevOps Engineers", and the company's hosting bill increased fivefold in the process. You will never convince even 1% of devs to learn the ops side properly, therefore you'll still end up hiring ops people, and we will cost you more now. On top of that, everyone who started as a "DevOps Engineer" knows less about ops than those who started as ops and transitioned into being "DevOps Engineers" (or some flavour of it, like SREs or Platform Engineers).
If you're a programmer scared into thinking AI is going to take away your job, re-read my comment.
To be fair, I think people are vastly overestimating the work they would have and the power they would need. Yes, if you have to massively scale up, then it'll take some work, but most of it is one-time work. You do it, and once it runs, you only have a fraction of the work over the following months to maintain it. And by fraction, I mean below 5%. And keep in mind that >99% of startups who think "yeah, we need this and that cloud service, because we need to scale" will never scale. Instead they are happily locking themselves into a cloud service. And if they actually do scale at some point, this service will be massively more expensive.
But they should; the cloud won't magically make the architecture scale. A competent CTO should know the limits of the platform; it's called "load testing" or "stress testing". Scalability is independent of the provider. The cloud gives you a nicer interface to add resources, granted, but that's it.
As a hearsay anecdote, that's why some startups have DB servers with hundreds of GB of RAM and dozens of CPUs to run a workload that could be served from a 5-year-old laptop.
We have two on-site servers that we use. For various reasons (power cuts, internet outages, cleaners unplugging them) I'd say we have to intervene with them physically about once a month. It's a total pain in the ass, especially when you don't have _an_ IT person sitting in the office to mind them. I'm in the UK and our office is in Spain...
You should also calculate the cost of getting it up and running. With Google Cloud (I don't actually use AWS), I mainly worry about building docker containers in CI and deploying them to vms and triggering rolling restarts as those get replaced with new ones. I don't worry about booting them. I don't worry about provisioning operating systems or configuration to them. Or security updates. They come up with a lot of pre-provisioned monitoring and other stuff. No effort required on my side.
And for production setups, you need people on standby to fix the server in case of hardware issues, also outside office hours. Also, where does the hardware live? What's your process when it fails? Who drives to wherever the thing is and fixes it? What do you pay them to be available for that? What's the lead time for spare components? Do you actually keep those in supply? Where? Do you pay for security for wherever all that happens? What about cleaning, AC, or a special server room in your building? All that stuff is cost. Some of it is upfront cost. Some of it is recurring cost.
The article is about a company that owns its own data center. The cost they are citing (5 million) is substantial and probably a bit more complete. That's one end of the spectrum.
> I don't worry about booting them. I don't worry about provisioning operating systems or configuration to them. Or security updates. They come up with a lot of pre-provisioned monitoring and other stuff. No effort required on my side.
These are not difficult problems. You can use the same/similar cloud install images.
A 10 year old nerd can install Linux on a computer; if you're a professional developer I'm sure you can read the documentation and automate that.
> And for production setups, you need people on standby to fix the server in case of hardware issues, also outside office hours.
You could use the same person who is on standby to fix the cloud system if that has some failure.
> Also, where does the hardware live?
In rented rackspace nearby, and/or in other locations if you need more redundancy.
> What's your process when it fails? Who drives to wherever the thing is and fixes it? What do you pay them to be available for that? What's the lead time for spare components? Do you actually keep those in supply? Where?
It will probably report the hardware failure to Dell/HP/etc automatically and open a case. Email or phone to confirm, the part will be sent overnight, and you can either install it yourself (very, very easy for things like failed disks) or ask a technician to do it (I only did this once with a CPU failure on a brand new server). Dell/HP/etc will provide the technician, or your rented datacentre space will have one for simpler tasks like disks.
> You should also calculate the cost of getting it up and running.
I was not doing the calculation. I was only pointing out that it was not as simple as you make it out to be.
Okay, a few other things that aren't in most calculations:
1. Looking at job postings in my area, the highest-paid ones require experience with specific cloud vendors. The FTEs you need to "manage" the cloud are a great deal more expensive than developers.
2. You don't need to compare on-prem data center with AWS - you can rent a pretty beefy VPS or colocate for a fraction of the cost of AWS (or GCP, or Azure) services. You're comparing the most expensive alternative when avoiding cloud services, not the most typical.
3. Even if you do want to build your own on-prem rack, FTEs aren't generally paid extra for being on the standby rota. You aren't paying extra. Where you will pay extra is for hot failovers, or machine room maintenance, etc, which you don't actually need if your hot failover is a cheap beefy VPS-on-demand on Hetzner, DO, etc.
4. You are measuring the cost of absolute 0% downtime. I can't think of many businesses that have such high sensitivity to downtime. Even banks handle downtime much larger than that, even while their IT systems are still up. With such strict requirements you're getting into the territory where the business itself cannot continue because of a catastrophe, but the IT systems can :-/. What use are the IT systems when the business itself may be down?
The TLDR is:
1. If you have highly paid cloud-trained FTEs, and
2. Your only option other than Cloud is on-prem, and
3. Your FTEs are actually FT-contractors who get paid per hour, and
4. Your uptime requirements are more stringent than national banks',
yeah, then cloud services are only slightly more expensive.
You know how many businesses fall into that specific narrow set of requirements?
> it doesn't make much sense for the majority of startup companies until they become late stage
Here's what TFA says about this:
> Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out.
and I think they're right. Be careful how you start because you may be stuck in the initial situation for a long time.
This also depends so much on your scaling needs. If you need 3 mid-sized ECS/EC2 instances, a load balancer, and a database with backups, renting those from AWS isn't going to be significantly more expensive for a decent-sized company than hiring someone to manage a cluster for you and dealing with all the overhead of keeping it maintained and secure.
If you're at the scale of hundreds of instances, that math changes significantly.
And a lot of it depends on what type of business you have and what percent of your budget hosting accounts for.
I also think it's a risk model, too. Every time I see these kinds of posts, I think they miss the point that there is a balance not only of cost, like you describe, but of risk as well. You are paying to offload some of the risk from yourself.
No, low isn't good per se. I worked in a datacenter which in winter had less than 40% humidity, and RAM was failing all over the place. Low humidity causes static electricity.
Low is good if you are also adding more humidity back in. If you want to maintain 45-50% (guessing), then you would want <45% environmental humidity so that you can raise it to the level you want. You're right about avoiding static, but you'd still want to try to keep it somewhat consistent.
It is much cheaper to use external air for cooling if you can.
Yeah, but the article makes it sound as if lower is better, which it is definitely not. And yeah, you need to control humidity; that might mean sometimes lowering it and sometimes increasing it, with whatever solution you have.
Also, this is where cutting corners indeed results in lower cost, which was the reason for the OP to begin with. It just means you won't get as good a datacenter as people who are actually tuning this all day and have decades of experience.
I fully lost three small VPSes there, and their response was poor: they didn't even refund the time lost, they didn't compensate for the time lost (e.g. with a couple of months of free VPS), and I got better updates from the news than from them (the news was saying "almost total loss" while they were trying to convince me that I had the incredibly bad luck that my three VPSes were in the very small zone affected by the fire). The only way I had to recover what I lost was backups on local machines.
When someone points out how safe cloud providers are, as if they have multiple levels of redundancy and are fully protected against even an alien invasion, I remember the OVH fire.
They handled the fire terribly and after that they improved a bit, but an OVH VPS is just a VM running on a single piece of hardware.
Not quite the same thing as "Compute", which runs on clusters.
They use the datacenter for model training, not to serve online users. Presumably even if it were offline for a week or even a month, it would not be a total disaster as long as they have, for example, offsite tape backups.
Flooding due to burst frozen pipe, false sprinkler trigger, or many others.
Something very similar happened at work. Water valve monitoring wasn't up yet. Fire didn't respond because reasons. Huge amount of water flooded over a 3 day weekend. Total loss.
why build one when you can have two at twice the price?
But, if you're building a datacenter for $5M, spending $10-15M for redundant datacenters (even with extra networking costs), would still be cheaper than their estimated $25M cloud costs.
> Self-reliance is great, but there are other benefits to running your own compute. It inspires good engineering.
It's easy to inspire people when you have great engineers in the first place. That's a given at a place like comma.ai, but there are many companies out there where administering a datacenter is far beyond their core competencies.
I feel like skilled engineers have a hard time understanding the trade-offs from cloud companies. The same way that comma.ai employees likely don't have an in-house canteen, it can make sense to focus on what you are good at and outsource the rest.
Same thing. I was previously spending 5-8K on DigitalOcean, supposedly a "budget" cloud. Then the company was sold, and I started a new company on entirely self-hosted hardware. Cloudflare Tunnel + CC + microk8s made it trivial! And I spend close to nothing beyond the internet connection I'm already paying for. I do have solar power too.
Working at a non-tech regional bigco, where of course cloud is the default, I see every day how AWS costs get out of hand; it's a constant struggle just to keep costs flat. In our case, the reality is that NONE of our services require scalability, and the main upside of high uptime is primarily for my blood pressure... we only really need uptime during business hours; nobody cares what happens at night when everybody is sleeping.
On the other hand, there's significant vendor lock-in, complexity, etc. And I'm not really sure we actually end up with fewer people over time; headcount always expands, and there are always cool new projects like monitoring, observability, AI, etc.
My feeling is, if we rented 20-30 chunky machines and ran Linux on them, with k8s, we'd be 80% there. For specific things I'd still use AWS, like infinite S3 storage, or RDS instances for super-important data.
If I were to do a startup, I would almost certainly not base it off AWS (or other cloud), I'd do what I write above: run chunky servers on OVH (initially just 1-2), and use specific AWS services like S3 and RDS.
A bit unrelated to the above, but I'd also try to keep away from expensive SaaS like Jira, Slack, etc. I'd use the best self-hosted open source version, and be done with it. I'd try Gitea for git hosting, Mattermost for team chat, etc.
And actually, given the geo-political situation as an EU citizen, maybe I wouldn't even put my data on AWS at all and self-host that as well...
The #1 reason I would advocate for using AWS today is the compliance package they bring to the party. No other cloud provider has anything remotely like Artifact. I can pull Amazon's PCI-DSS compliance documentation using an API call. If you have a heavily regulated business (or work with customers who do), AWS is hard to beat.
If you don't have any kind of serious compliance requirement, using Amazon is probably not ideal. I would say that Azure AD is ok too if you have to do Microsoft stuff, but I'd never host an actual VM on that cloud.
Compliance and "Microsoft stuff" covers a lot of real world businesses. Going on prem should only be done if it's actually going to make your life easier. If you have to replicate all of Azure AD or Route53, it might be better to just use the cloud offerings.
Even at the personal blog level, I'd argue it's worth it to run your own server (even if it's just an old PC in a closet). Gets you on the path to running a home lab.
> Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out. If you want to control your own destiny, you must run your own compute.
Cost and lock-in are obvious factors, but "sovereignty" has also become a key factor in the sales cycle, at least in Europe.
Handling health data, Juvoly is happy to run AI workloads on-premise.
The company I work for used to have a hybrid setup where 95% was on-prem, but it became closer to 90% in the cloud when on-prem became more expensive because of VMware licensing. There are alternatives to VMware, but they're not officially supported with our hardware configuration, so the switch would require changing all the hardware, which still drives the cost higher than the cloud. Almost everything we have is cloud agnostic, and anything that requires resilience sits in two different providers.
Now the company is looking at further cost savings, as the buildings rented for running on-prem are sitting mostly unused, and the prices of buildings have gone up notably in recent years, so we're likely to save money by moving into the cloud. This is likely to make the cloud transition permanent.
The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I'd rather think about.
I used to colocate a 2U server that I purchased with a local data center. It was a great learning experience for me. I'm curious why a company wouldn't colocate their own hardware? Proximity isn't an issue when you can have the datacenter perform physical tasks. Bravo to the comma team regardless. It'll be a great learning experience and make each person on their team better.
PS: BX cable instead of conduit for the electrical looks cringe.
The main reason not to colocate is if you're somewhere with high real estate costs... e.g. Hetzner managed servers compete on price with colocation for me because I'm in London.
I colocate in London; a single server / firewall comes to around £5k a year. I also colocate two other servers at some northern UK location on an industrial estate for £2k as my backups. I've never enjoyed the cloud, and dedicated servers have their own caveats too.
Budget hosts such as Hetzner/OVH have been known to suddenly pull the plug for no reason.
My kit is old, second-hand old (Cisco UCS 220 M5, 2x Dell somethings), and last night I just discovered I can throw in two NVIDIA T4s and turn it into a personal LLM machine.
I'm quite excited to have my own colocated server with basic LLM abilities. My own hardware with my own data and my own cables. Just need my own IPs now.
> Budget hosts such as Hetzner/OVH have been known to suddenly pull the plug for no reason.
The same would apply for any number of hosts. Hetzner/OVH are cheap, but as your own numbers show the location price gap is more than sufficient to cover the costs of servers.
In fact you can colocate with Hetzner too, and you'd get a similar price gap - the lower cost of real-estate is a large part of the reason why they can be as cheap as they are.
Data centre operations is a real estate play - to the point that at least one UK data centre operator is owned by a real estate investment company.
Thanks. I hadn't seen it as such and you're right. I guess it comes down to personal preference.
Given that data has become a commodity, in that I can sell your username and email for a few pence, I would rather have my own hardware in my own possession, so that any request for it has to go through me, not some server provider.
I cancelled my digital ocean server of almost a decade late last year and replaced it with a raspberry pi 3 that was doing nothing. We can do it, we should do it.
> Maintaining a data center is much more about solving real-world challenges. The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I'd rather think about.
I find this to be applicable on a smaller scale too! I'd rather set up and debug a beefy Linux VPS via SSH than fiddle with various proprietary cloud APIs/interfaces. It doesn't go as low-level as Watts, bits and FLOPs, but I still consider knowledge about Linux more valuable than knowing which Azure knobs to turn.
> Cloud companies generally make onboarding very easy, and offboarding very difficult.
I reckon most on-prem deployments have significantly worse offboarding than the cloud providers. As a cloud provider you can win business by having something for offboarding, but internally you'd never get buy-in to spend on a backup plan if you decide to move to the cloud.
> As a cloud provider you can win business by having something for offboarding, but internally you'd never get buy-in to spend on a backup plan if you decide to move to the cloud.
It's the other way around. How do you think all businesses moved to the cloud in the first place?
What redundancy are we talking about? AWS has proven to the world on multiple occasions that redundancy across geo locations is useless, because if us-east-1 is down, their whole cloud is done, causing a big chunk of the world to be down.
Half sarcasm of course, but it goes to show that the world is not going to fall apart in many cases when it comes to software. Sure, it's not ideal in lots of cases, but we'll survive without redundancy.
Microsoft made the TCO argument and won. Self-hosting is only an option if you can afford expensive SysOps/DevOps/WhateverWeAreCalledTheseDays to manage it.
15 years ago or so, a spreadsheet was floating around where you could enter server costs, compute power, etc. and it would tell you when you would break even by buying instead of going with AWS. I think it was leaked from Amazon, because it was always three years to break even, even as hardware changed over time.
Azure provides their own "Total Cost of Ownership" calculator for this purpose [0]. Notably, this makes you estimate peripheral costs such as cost of having a server administrator, electricity, etc.
Thank you, I've wanted to see someone use this in the real world. When doing Azure certifications (AZ900, AZ204, etc.), they force you to learn about this tool.
I may be out of date with RAM prices. Dell's configuration tool wants £1000 each for 32GB RDIMMs – but prices in Dell's configuration tool are always significantly higher than we get if we write to their salesperson.
Even so, a rough configuration for a 2-processor, 16-core-per-processor server with 256GiB RAM comes to $20k, vs the $22k + 100% = $44k quoted by MS. (The 100% is MS's 20%-per-year "maintenance cost" that they add on to the estimate. In reality this is 0%, as everything is under Dell's warranty.)
And most importantly, the tool only compares the cost of Azure to constructing and maintaining a data centre! Unless there are other requirements (which would probably rule out Azure anyway) that's daft; a realistic comparison should be to colocation or hired dedicated servers, depending on the scale.
If you buy, maybe. Leasing or renting tends to be cheaper from day one. Tack on migration costs and ca. 6 months is a more realistic target. If the spreadsheet always said 3 years, it sounds like an intentional "leak".
Well, somebody should recreate it. I smell a potential startup idea somewhere. There's a ton of "cloud cost optimizer" software, but most of it involves tweaking AWS knobs and taking a cut of the savings. A startup that could offload non-critical services from AWS to colo and traditional bare metal hosting like Hetzner has a strong future.
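To make that concrete, here is a minimal sketch of the break-even arithmetic such a tool would do; every number below is a placeholder assumption, not a quote from any provider:

```python
# Rough buy-vs-cloud break-even sketch. Every number here is a placeholder
# assumption; plug in your own quotes, salaries and cloud bills.

def breakeven_months(hardware_capex: float,
                     colo_and_power_per_month: float,
                     ops_labour_per_month: float,
                     cloud_bill_per_month: float) -> float:
    """Months until owning becomes cheaper than renting from a cloud provider."""
    owned_monthly = colo_and_power_per_month + ops_labour_per_month
    monthly_saving = cloud_bill_per_month - owned_monthly
    if monthly_saving <= 0:
        return float("inf")  # owning never pays off at these numbers
    return hardware_capex / monthly_saving

if __name__ == "__main__":
    # Example: $44k of servers vs a $4k/month cloud bill.
    months = breakeven_months(
        hardware_capex=44_000,          # servers + switches + install
        colo_and_power_per_month=600,   # rack space, power, remote hands
        ops_labour_per_month=1_500,     # fraction of an engineer's time
        cloud_bill_per_month=4_000,
    )
    print(f"Break-even after roughly {months:.1f} months")
```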
One thing to keep in mind is that the GPU depreciation curve (over the last 5 years at least) is a little steeper than 3 years. Current estimates are that the capital depreciation cost plunges dramatically around the third year. For a top-tier H100, depreciation kicks in around the 3rd year, but for the less capable ones like the A100 it is reportedly even worse.
That's not factoring in the cost of labour. Labor at SF wages is dreadfully expensive; now, if your data center is right across the border in Tijuana on the other hand...
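For illustration only, a declining-balance depreciation sketch; the annual rates and purchase prices are assumptions, not market data:

```python
# Illustrative GPU depreciation schedule (declining balance). The 35% / 50%
# annual rates and prices are assumptions for the sake of the example.

def residual_values(purchase_price: float, annual_rate: float, years: int = 5):
    """Return the estimated residual value at the end of each year."""
    value = purchase_price
    out = []
    for _ in range(years):
        value *= (1 - annual_rate)
        out.append(round(value))
    return out

print("H100-class (assumed 35%/yr):", residual_values(30_000, 0.35))
print("A100-class (assumed 50%/yr):", residual_values(15_000, 0.50))
```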
I'm thinking about doing a research project at my university looking into distributed "data centers" hosted by communities instead of centralized cloud providers.
The trick is in how to create mostly self-maintaining deployable/swappable data centers at low cost...
Realistically, it's the speed with which you can expand and contract. The cloud gives unbounded flexibility - not on the per-request scale or whatever, but on the per-project scale. To try things out with a bunch of EC2s or GCEs is cheap. You have it for a while and then you let it go. I say this as someone with terabytes of RAM in servers, and a cabinet I have in the Bay Area.
I just read about Railway doing something similar. Sadly their prices are still high compared to other bare metal providers, and even to a VPS such as Hetzner with Dokploy: a very similar feature set, yet for the same 5 dollars you get way more CPU, storage and RAM.
Billing per used (non-idle) CPU cycle would be quite interesting. The number of cores would effectively just be your cost cap. Efficiency would be even more important. And if the provider oversubscribes cores, you just pay less. Actually, that's probably why they don't do it...
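A rough sketch of what such hypothetical per-cycle billing could look like, with the core count acting as the cost cap (all prices invented):

```python
# Hypothetical "pay per non-idle CPU-second" pricing, capped at what the
# same cores would cost if fully busy. All prices are invented.

def monthly_bill(busy_core_seconds: float,
                 cores: int,
                 price_per_core_hour: float = 0.03) -> float:
    seconds_in_month = 30 * 24 * 3600
    price_per_core_second = price_per_core_hour / 3600
    usage_cost = busy_core_seconds * price_per_core_second
    cap = cores * seconds_in_month * price_per_core_second  # fully-busy cost
    return min(usage_cost, cap)

# A mostly idle 8-core box (5% average utilisation) vs the capped maximum.
busy = 0.05 * 8 * 30 * 24 * 3600
print(f"usage-based: ${monthly_bill(busy, cores=8):.2f}")
print(f"cap (8 cores flat out): ${monthly_bill(float('inf'), cores=8):.2f}")
```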
The observation about incentives is underappreciated here. When your compute is fixed, engineers optimize code. When compute is a budget line, engineers optimize slide decks. That's not really a cloud vs on-prem argument, it's a psychology-of-engineering argument.
This is a great solution for a very specific type of team but I think most companies with consistent GPU workloads will still just rent dedicated servers and call it a day.
I agree, and cloud compute is poised to become even more commoditized in the coming years (gazillion new data centers + AI plateauing + efficiency gains, the writing is on the wall). There's no way this makes sense for most companies.
The advantage of renting vs. owning is that you can always get the latest gen, which brings you newer capabilities (e.g. fp8, fp4) and cheaper prices for current_gen-1. But betting on something plateauing when all the signs point towards the exact opposite is not one of the bets I'd make.
Well, the capabilities have already plateaued as far as I can tell :-/
Over the next few years we can probably wring out some performance improvements, maybe some efficiency improvements.
A lot of the current AI users right now are businesses trying to on-sell AI (code reviewers/code generators, recipe apps, assistant apps, etc), and there's way too many of them in the supply/demand ratio, so you can expect maybe 90% of these companies to disappear in the next few years, taking the demand for capacity with them.
Other benefits: easy access to reliable infrastructure and latest hardware which you can swap as you please. There are cases where it makes sense to navigate away from the big players (like dropbox going from aws to on-prem), but again you make this move when you want to optimize costs and are not worried about the trade-offs.
Not long ago Railway moved from GCP to their own infrastructure since it was very expensive for them. [0] Some go for an Oxide rack [1] as a full-stack solution (both hardware and software) for intense GPU workloads, instead of building it themselves.
It's very expensive and only makes sense if you really need infrastructure sovereignty. It makes more sense if you're profitable in the tens of millions after raising hundreds of millions.
It also makes sense for governments (including those in the EU) which should think about this and have the compute in house and disconnected from the internet if they are serious about infrastructure sovereignty, rather than depending on US-based providers such as AWS.
I like Hotz's style: simply and straightforwardly attempting the difficult and complex. I always get the impression: "You don't need to be too fancy or clever. You don't need permission or credentials. You just need to go out and do the thing. What are you waiting for?"
the "build your own datacenter" story is fun (and comma's setup is undeniably cool), but for most companies it's a seductive trap: you'll spend your rarest resource (engineer attention) on watts, humidity, failed disks, supply chains, and "why is this rack hot," instead of on the product. comma can justify it because their workload is huge and steady, they're willing to run non-redundant storage, and they've built custom GPU boxes and infra around a very specific ML pipeline. ([comma.ai blog][1])
## 1) capex is a tax on flexibility
a datacenter turns "compute" into a big up-front bet: hardware choices, networking choices, facility choices, and a depreciation schedule that does not care about your roadmap. cloud flips that: you pay for what you use, you can experiment cheaply, and you can stop spending the minute a strategy changes. the best feature of renting is that quitting is easy.
## 2) scaling isn't a vibe, it's a deadline
real businesses don't scale smoothly. they spike. they get surprise customers. they do one insane training run. they run a migration. owning means you either overbuild "just in case" (idle metal), or you underbuild and miss the moment. renting means you can burst, use spot/preemptible for the ugly parts, and keep steady stuff on reserved/committed discounts.
## 3) reliability is more than "it's up most days"
comma explicitly says they keep things simple and don't need redundancy for ~99% uptime at their scale. ([comma.ai blog][1]) that's a perfectly valid trade, if your business can tolerate it. many can't. cloud providers sell multi-zone, multi-region, managed backups, managed databases, and boring compliance checklists because "five nines" isn't achieved by a couple heroic engineers and a PID loop.
## 4) the hidden cost isn't power, it's people
comma spent ~$540k on power in 2025 and runs up to ~450kW, plus all the cooling and facility work. ([comma.ai blog][1]) but the larger, sneakier bill is: on-call load, hiring niche operators, hardware failures, spare parts, procurement, security, audits, vendor management, and the opportunity cost of your best engineers becoming part-time building managers. cloud is expensive, yes, because it bundles labor, expertise, and economies of scale you don't have.
## 5) "vendor lock-in" is real, but self-lock-in is worse
cloud lock-in is usually optional: you choose proprietary managed services because they're convenient. if you're disciplined, you can keep escape hatches: containers, kubernetes, terraform, postgres, object storage abstractions, multi-region backups, and a tested migration plan. owning your datacenter is also lock-in, except the vendor is past you, and the contract is "we can never stop maintaining this."
## the practical rule
*if you have massive, predictable, always-on utilization, and you want to become good at running infrastructure as a core competency, owning can win.* that's basically comma's case. ([comma.ai blog][1])
*otherwise, rent.* buy speed, buy optionality, and keep your team focused on the thing only your company can do.
if you want, tell me your rough workload shape (steady vs spiky, cpu vs gpu, latency needs, compliance), and i'll give you a blunt "rent / colo / own" recommendation in 5 lines.
Am I the only one that is simply scared of running their own cloud? What happens if your administrator credentials get leaked? At least with Azure I can phone Microsoft and initiate a recovery. Because of backups and soft-deletion policies quite a lot is possible. I guess you can build these failsafe scenarios locally too? But what if a fire happens, like in South Korea? Sure, most companies run more immediate risks such as going bankrupt, but at least the cloud relieves me of the stuff of nightmares.
Except now I have nightmares that the USA will enforce the Patriot Act and force Microsoft to hand over all their data in European data centers, and then we have to migrate everything to a local cloud provider. Argh...
Do you have a computer at home? Are you scared of its credentials leaking? A server is just another computer with a good internet connection.
You can equip your server with a mouse, keyboard and screen and then it doesn't even need credentials. The credential is your physical access to the mouse and keyboard.
One thing I don't really understand here is why they're incurring the costs of having this physically in San Diego, rather than further afield with a full-time server tech essentially living on-prem, especially if their power numbers are correct. Is everyone being able to physically show up on site immediately that much better than a 24/7 pair of remote hands + occasional trips for more team members if needed?
Having worked only with the cloud, I really wonder if these companies don't use other software with subscriptions. Even though AWS is "expensive", it's just another line item compared to most companies' overall SaaS spend. Most businesses don't need that much compute or data transfer in the grand scheme of things.
Or better: write your software such that you can scale to tens of thousands of concurrent users on a single machine. This can really put the savings into perspective.
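As a minimal sketch of the idea, a single asyncio process can hold a very large number of mostly idle connections; real services add parsing, auth and a database pool on top:

```python
# Minimal sketch: one asyncio process can comfortably hold tens of thousands
# of mostly idle connections, which is what typical web traffic looks like.

import asyncio

async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    await reader.readline()  # wait for one request line, then answer
    writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", 8080)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```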
Well the article starts out with a suggestion that we should all get a data center... It's quite a jump to assume that everyone reading this article needs to train their own LLMs.
This is an industry we're[0] in. Owning is at one end of the spectrum, with cloud at the other, and broadly a couple of options in between:
1 - Cloud – This is minimising cap-ex, hiring, and risk, while largely maximising operational costs (it's expensive) and cost variability (usage based).
2 - Managed Private Cloud – What we do. Still minimal-to-no cap-ex, hiring, or risk, and a medium-sized operational cost (around 50% cheaper than AWS et al). We rent or colocate bare metal, manage it for you, handle software deployments, deploy only open source, etc. Only really makes sense above roughly €5k/month spend.
3 - Rented Bare Metal – Let someone else handle the hardware financing for you. Still minimal cap-ex, but with greater hiring/skilling and risk. Around 90% cheaper than AWS et al (plus time).
4 - Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills, scale, and cap-ex, and if you plan to run the servers for at least 3-5 years.
A good provider for option 3 is someone like Hetzner. Their internal ROI on server hardware seems to be around the 3 year mark. After which I assume it is either still running with a client, or goes into their server auction system.
Options 3 & 4 generally become more appealing either at scale, or when infrastructure is part of the core business. Option 1 is great for startups who want to spend very little initially, but then grow very quickly. Option 2 is pretty good for SMEs with baseline load, regular-sized business growth, and maybe an overworked DevOps team!
[0] https://lithus.eu, adam@
I think the issue with this formulation is that what drives the cost at cloud providers isn't necessarily that their hardware is too expensive (which it is), but that they push you towards overcomplicated and inefficient architectures that cost too much to run.
At the core of this are all the 'managed' services - if you have a server box, it's in your financial interest to squeeze as much performance out of it as possible. If you're using something like ECS or serverless, AWS gains nothing by optimizing the servers to make your code run faster - their hard work results in fewer billed infrastructure hours.
This 'microservices' push usually means that instead of having an on-server session where you can serve stuff from a temporary cache, all the data that persists between requests needs to be stored in a DB somewhere, all the auth logic needs to re-check your credentials, and something needs to direct the traffic and load balance these endpoints, and all this stuff costs money.
I think if you have 4 Java boxes as servers with a redundant DB with read replicas on EC2, your infra is so efficient and cheap that even paying 4x for it rather than going for colocation is well worth it because of the QoL and QoS.
These crazy AWS bills usually come from using every service under the sun.
The complexity is what gets you. One of AWS's favorite situations is
1) Senior engineer starts on AWS
2) Senior engineer leaves because our industry does not value longevity or loyalty at all whatsoever (not saying it should, just observing that it doesn't)
3) New engineer comes in and panics
4) Ends up using a "managed service" to relieve the panic
5) New engineer leaves
6) Second new engineer comes in and not only panics but outright needs help
7) Paired with some "certified AWS partner" who claims to help "reduce cost" but who actually gets a kickback from the extra spend they induce (usually 10% if I'm not mistaken)
Calling it ransomware is obviously hyperbolic, but there are definitely some parallels one could draw.
On top of it all, AWS pricing is about to go up massively due to the RAM price increase. There's no way it won't, since AWS is over half of Amazon's profit while only around 15% of its revenue.
The end result of all this is that the percentage of people who know how to implement systems without AWS/Azure will be in the single digits. From that point on, it will be the only "economic" way; it won't matter what the prices are.
That's not a factual statement about reality, but more of a normative judgement used to justify resignation. Yes, professionals who know how to actually do these things are not abundantly available, but they are available enough to achieve the transition. The talent exists and is absolutely passionate about software freedom, and hence highly intrinsically motivated to work on it. The only thing lacking so far is the demand; the available talent will skyrocket when the market starts demanding it.
> The only thing lacking so far is the demand; the available talent will skyrocket when the market starts demanding it.
But will the market demand it? AWS just continues to grow.
It's all anecdotal, but in my experience it's usually the opposite. A bored senior engineer wants to use something new and picks a bespoke AWS service for a new project.
I am sure it happens in a multitude of ways, but I have never seen the case you are describing.
I've seen your case more than the ransom scenario too. But even more often: an early-to-mid-career dev saw a cloud pattern trending online, heard it was a new "best practice," and so needed to find a way to move their company to using it.
Just this week a friend of mine was spinning up some AWS managed service, complaining about the complexity and how any reconfiguration took 45 minutes to reload. It's a service you can just install with apt; the default configuration is fine. Not only are many services no longer cheaper in the cloud, the management overhead also exceeds that of on-prem.
I'd gladly use (and maybe even pay for!) an open-source reimplementation of AWS RDS Aurora. All the bells and whistles with failover, clustering, volume-based snaps, cross-region replication, metrics etc.
As far as I know, nothing comes close to Aurora functionality. Even in vibecoding world. No, 'apt-get install postgres' is not enough.
Serverless v2 is one of the products I was skeptical about, but it is genuinely one of the most robust solutions out there in that space. It has its warts, but I usually default to it for fresh installs because you get so much out of the box with it.
What managed service? Curious - I don't use the full suite of AWS services, but I'm wondering what would take 45 minutes. Maybe it was a large cluster of some sort that needed rolling changes?
My observation is that all these services are exploding in complexity, and they justify saying that there are more features now, so everyone needs to accept spending more and more time and effort for the same results.
It's basically the same dynamic as hedonic adjustment in the CPI calculations. Cars may cost twice as much, but now they have USB chargers built in, so inflation isn't really that bad.
I think this was MWAA
> your infra is so efficient and cheap that even paying 4x for it rather than going for colocation is well worth it because of the QoL and QoS.
You don't need colocation to save 4x though. Bandwidth pricing is 10x. EC2 is 2-4x, especially outside the US. EBS, for its IOPS, is just bad.
Great comment. I agree it's a spectrum, and for those of us who are comfortable on (4) like yourself, and probably us at Carolina Cloud [0] as well, (4) seems like a no-brainer. But there's a long tail of semi-technical users who are more comfortable in 2-3 or even 1, which is what ultimately traps them into the ransomware-adjacent situation that is a lot of the modern public cloud. I would push back on "usage-based". Yes, it is technically usage-based, but the base fee also goes way up and there are sometimes retainers on these services (i.e. minimum spend). So "usage-based" is not wrong, but what it usually means is "more expensive and potentially far more expensive".
[0] https://carolinacloud.io, derek@
The problem is that clouds have easily become 3 or 5 times the price of managed services, 10x the price of option 3, and 20x the price of option 4. To say nothing of the fact that almost all businesses could run fine on a "PC under the desk" type of setup.
So in practice the cloud becomes the more expensive option the second your spend goes over the price of one engineer.
Hetzner is definitely an interesting option. I'm a bit scared of managing the services on my own (like Postgres, site-to-site VPN, ...), but the price difference makes it so appealing. From our financial models, Hetzner can win over AWS when you spend over 10-15K per month on infrastructure and you're hiring really well. It's still a risk, but a risk that can definitely be worth it.
> I'm a bit scared of managing the services on my own
I see it from the other direction: if something fails, I have complete access to everything, meaning that I have a chance of fixing it. That's down to the hardware, even. When I run stuff in the cloud, things get abstracted away, hidden behind APIs, and data lives beyond my reach.
Security and regular mistakes are much the same in the cloud, but I then have to layer whatever complications the cloud provider comes with on top. The cost has to be much, much lower if I'm going to trust a cloud provider over running something in my own data center.
You sum it up very neatly. We've heard this from quite a few companies, and that's kind of why we started ours.
We figured, "Okay, if we can do this well, reliably, and de-risk it; then we can offer that as a service and just split the difference on the cost savings"
(plus we include engineering time proportional to cluster size, and also do the migration on our own dime as part of the de-risking)
I've just shifted my SWE infrastructure from AWS to Hetzner (literally in the last month). My current analysis suggests it will be about 15-20% of the cost - £240 vs 40-50 euros.
Expect a significant exit expense, though, especially if you are shifting large volumes of S3 data. That's been our biggest expense. I've moved this to Wasabi at about 8 euros a month (vs about $70-80 a month on S3), but I've paid transit fees of about $180 - and it was more expensive because I used DataSync.
Retrospectively, I should have just DIYed the transfer, but maybe others can benefit from my error...
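For anyone attempting the DIY route, a rough sketch using boto3 against two S3-compatible endpoints; the bucket names, endpoint and credentials are placeholders, and large objects would want multipart/parallel handling:

```python
# Rough DIY S3 -> S3-compatible (e.g. Wasabi) copy with boto3. Bucket names,
# endpoint and credentials are placeholders. You still pay AWS egress unless
# you have arranged the free migration egress mentioned below first.

import boto3

src = boto3.client("s3")  # normal AWS credentials from the environment
dst = boto3.client(
    "s3",
    endpoint_url="https://s3.eu-central-1.wasabisys.com",  # placeholder endpoint
    aws_access_key_id="WASABI_KEY",
    aws_secret_access_key="WASABI_SECRET",
)

SRC_BUCKET, DST_BUCKET = "my-aws-bucket", "my-wasabi-bucket"  # placeholders

paginator = src.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        body = src.get_object(Bucket=SRC_BUCKET, Key=key)["Body"]
        dst.upload_fileobj(body, DST_BUCKET, key)  # streams without a temp file
        print("copied", key)
```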
FYI, AWS offers free egress when leaving them (because they were forced to by EU regulation, but they chose to offer it globally):
https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-i...
But. Don't leave it until the last minute to talk to them about this. They don't make it easy, and require some warning (think months, IIRC)
Extremely useful information - unfortunately I just assumed this didn't apply to me because I am in the UK and not the EU. Another mistake, though given it's not huge amounts of money I will chalk it up to experience.
Hopefully someone else will benefit from this helpful advice.
I'm wondering if it makes sense to distribute your architecture so that the workers who do most of the heavy lifting are on Hetzner, while the other stuff stays in costly AWS. On the other hand, this means you don't have easy access to S3, etc.
networking costs are so high in AWS I doubt this makes sense
> I'm a bit scared of managing the services on my own (like Postgres, site-to-site VPN, ...)
Out of interest, how old are you? This was quite a normal expectation of a technical department even 15 years ago.
I'm curious to know the answer, too. I used to deploy my software on-prem back in the day, and that always included an installation of Microsoft SQL Server. So, all of my clients had at least one database server they had to keep operational. Most of those clients didn't have an IT staff at all, so if something went wrong (which was exceedingly rare), they'd call me and I'd walk them through diagnosing and fixing things, or I'd Remote Desktop into the server if their firewalls permitted and fix it myself. Backups were automated and would produce an alert if they failed to verify.
It's not rocket science, especially when you're talking about small amounts of data (small credit union systems in my example).
No, it was not. 15 years ago Heroku was all the rage. Even the places that had bare metal usually had someone running something similar to DevOps, and at least the core infra was not being touched. I am sure such places existed, but 15 years ago, while far away, was already pretty far along from what you describe. At least in SV.
Heroku was popular with startups who didn't have infrastructure skills, but the price was high enough that anyone who wasn't in that triangle of "lavish budget, small team, limited app diversity" wasn't using it. Things like AWS IaaS were far more popular due to the lower cost and greater flexibility, but even that was far from a majority service class.
I am not sure if you are trying to refute my lived experience or what exactly the point is. Heroku was wildly popular with startups at the time, not just those with lavish budgets. I was already touching RDS at this point, and even before RDS came around, no organization I worked at had me jumping on bare metal to provision services myself. There was always a system in place where someone helped engineering deploy systems. I know this was not always the case, but the person I was responding to made it sound like 15 years ago all engineers were provisioning their own databases and doing other kinds of dev/sysops work on a regular basis. It's not true, at least in SV.
No amount of money will make me maintain my own dbs. We tried it at first and it was a nightmare.
It's worth becoming good at.
You're missing 5: what they are doing.
There is a world of difference between renting some cabinets in an Equinix datacenter and operating your own.
Fair point!
5 - Datacenter (DC) - Like 4, except also take control of the space/power/HVAC/transit/security side of the equation. Makes sense either at scale, or if you have specific needs. Specific needs could be: specific location, reliability (higher or lower than a DC), resilience (conflict planning).
There are actually some really interesting use cases here. For example, reliability: if your company is in a physical office, how strong is the need to run your internal systems in a data centre? If you run your servers in your office, then there are no connectivity reliability concerns. If the power goes out, then the power is out to your staff's computers anyway (still get a UPS though).
Or perhaps you don't need as high reliability if you're doing only batch workloads? Do you need to pay the premium for redundant network connections and power supplies?
If you want your company to still function in the event of some kind of military conflict, do you really want to rely on fibre optic lines between your office and the data center? Do you want to keep all your infrastructure in such a high-value target?
I think this is one of the more interesting areas to think about, at least for me!
When I worked IT for a school district at the beginning of my career (2006-2007), I was blown away that every school had a MASSIVE server room (my office at each school - the MDF). 3-5 racks filled, depending on school size and connection speed to the central DC (data closet): 50-75% was networking equipment (5 PCs per class, hardwired), 10% was the Novell NetWare server(s) and storage, and the other 15% was application storage for app distribution on login.
Personally I haven't seen a scenario where it makes sense beyond a small experimental lab where you value the ability to tinker physically with the hardware regularly.
Offices are usually very expensive real estate in city centers and with very limited cooling capabilities.
Then again the US is a different place, they don't have cities like in Europe (bar NYC).
If you are a bank or a bookmaker or similar you may well want to have total control of physical access to the machines. I know one bookmaker I worked with had their own mini-datacenter, mainly because of physical security.
If you have less than a rack of hardware, if you have physical security requirements, and/or your hardware is used in the office more than from the internet, it can make sense.
5 was a great option for ML work last year, since rented colo didn't come with a 10kW cable. With RAM, SSD and GPU prices the way they are now, I have no idea what you'd need to do.
Thank goodness we did all the capex before the OpenAI RAM deal, when expensive NVIDIA GPUs were the worst we had to deal with.
> 4 - Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills, scale, cap-ex, and if you plan to run the servers for at least 3-5 years.
Is it still the cheapest after you take into account that skills, scale, cap-ex and long term lock-in also have opportunity costs?
That is why the second "if" is there.
You can get locked into cloud too.
The lock-in is not really long-term, as it is easy to migrate off.
What is the upper limit of Hetzner? Say you have an AWS bill in the $100s of millions - could Hetzner realistically take on that scale?
An interesting question, so time for some 100% speculation.
It sounds like they probably have revenue in the €500mm range today. And given that the bare metal cost of AWS-equivalent bills tends to be a 90% reduction, we'll say a €10mm+ bare metal cost.
So I would say a cautious and qualified "yes". But I know even for smaller deployments of tens or hundreds of servers, they'll ask you what the purpose is. If you say something like "blockchain," they're going to say, "Actually, we prefer not to have your business."
I get the strong impression that while they naturally do want business, they also aren't going to take a huge amount of risk on board themselves. Their specialism is optimising on cost, which naturally has to involve avoiding or mitigating risk. I'm sure there'd be business terms to discuss, put it that way.
Why would a client who wants to run a blockchain be risky for Hetzner? I'm not a fan, I just don't see the issue. If the client pays their monthly bill, who cares if they're using the machine to mine Bitcoin?
They are certain to run the machines at 100% continually, which will cost more than a typical customer who doesn't do this, and leave the old machines with less second-hand value for their auction thing afterwards.
I'd bet the main reason would be power. Running machines at 100% doesn't subtract much extra value, but a server running hard for 24 hours a day will use more power than a bursty workload.
(While weâre all speculating)
Who are you thinking of?
Netflix might be spending as much as $120m (but probably a little less), and I thought they were probably Amazon's biggest customer. Does someone (single-buyer) spend more than that with AWS?
Hetzner's revenue is somewhere around $400m, so it's probably a little scary taking on an additional 30% of revenue from a single customer, and Netflix's shareholders would probably be worried about the risk of relying on a vendor that is much smaller than they are.
Sometimes, if the companies are friendly to the idea, they could form a joint venture, or maybe Netflix could just acquire Hetzner (and compete with Amazon?), but I think it unlikely Hetzner could take on a Netflix-sized customer, for nontechnical reasons.
However, increasing PoP capacity by 30% within 6 months is pretty realistic, so I think they'd probably be able to physically service Netflix without changing too much, if management could get comfortable with the idea.
A $120M spend on AWS is equivalent to around a $12M spend on Hetzner Dedicated (likely even less, the factor is 10-20x in my experience), so that would be 3% of their revenue from a single customer.
> Hetzner's revenue is somewhere around $400m, so it's probably a little scary taking on an additional 30% of revenue from a single customer
A little scary for both sides.
Unless we're misunderstanding something I think the $100Ms figure is hard to consider in a vacuum.
That $120m will become $12m when they're not using AWS.
Figma apparently spends around 300-400k/day on AWS. I think this puts them up there.
How is this reasonable? At what point do they pull a Dropbox and de-AWS? I can't think of what they would gain with AWS over in-house hosting at that point.
I'm not surprised, but you'd think there would be some point where they would decide to build a data center of their own. It's a mature enough company.
Can someone explain 2 to me? How is a managed private cloud different from full cloud? Like, you are still using AWS or Azure, but you are keeping all your operations in a bundled, portable way, so you can leave that provider easily at any time rather than becoming very dependent on them? Is it like staying provider-agnostic but still cloud based?
To put it plainly: We deploy a Kubernetes cluster on Hetzner dedicated servers and become your DevOps team (or a part thereof).
It works because bare metal is about 10% the cost of cloud, and our value-add is in 1) creating a resilient platform on top of that, 2) supporting it, 3) being on-call, and 4) being or supporting your DevOps team.
This starts with us providing a Kubernetes cluster which we manage, but we also take responsibility for the services run on it. If you want Postgres, Redis, Clickhouse, NATS, etc, we'll deploy it and be SLA-on-call for any issues.
If you don't want to deal with Kubernetes then you don't have to. Just have your software engineers hand us the software and we'll handle deployment.
Everything is deployed on open source tooling, you have access to all the configuration for the services we deploy. You have server root access. If you want to leave you can do.
Our customers have full root access, and our engineers (myself included) are in a Slack channel with your engineers.
And, FWIW, it doesn't have to be Hetzner. We can colocate or use other providers, but Hetzner offer excellent bang-per-buck.
Edit: And all this is included in the cluster price, which comes out cheaper than the same hardware on the major cloud providers
You give customers root but you're on call when something goes tits up?
You're a brave DevOps team. That would cause a lot of friction in my experience, since people with root or other administrative privileges do naughty things, but others are getting called in on Saturday afternoon.
Instead of using the cloud's own Kubernetes service, for example, you just buy the compute and run your own Kubernetes cluster. At a certain scale that is going to be cheaper, if you have the know-how. And since you are no longer tied to whichever services are provided and just need access to compute and storage, you can also shop around for better prices than Amazon or Azure, since you can really go to any VPS provider.
I am using something in between 2 and 3: a hosted web-site and database service with excellent customer support. On shared hardware it is €22/month. A managed server on dedicated hardware starts at about €50/month.
#2.5ish
We rent hardware and also some VPS, as well as use AWS for cheap things such as S3 fronted with Cloudflare, and SES for priority emails.
We have other services we pay for, such as AI content detection, disposable email detection, a small postal email server, and more.
We're only a small business, so having predictable monthly costs is vital.
Our servers are far from maxed out, and we process ~4 million dynamic page and API requests per day.
Been using Hetzner Cloud for Kubernetes and generally like it, but it has its limitations. The network is highly unpredictable. At best you get 2 Gbit/s, but at worst a few hundred Mbit/s.
https://docs.hetzner.com/cloud/technical-details/faq/#what-k...
Is that for the virtual private network? I heard some people say that you actually get higher bandwidth if you're using the public network instead of the private network within Hetzner, which is a little bit crazy.
Hetzner dedicated is pretty bad at private networks, so bad you should use a VPN instead. Don't know about the cloud side of things.
This is what we did in the '90s into the mid-2000s:
> Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills
Back then this type of "skill" was abundant. You could easily get sysadmin contractors who would take a drive down to the data center (probably rented facilities in a building that belonged to a bank or an insurer) to exchange some disks that had died for some reason. Such a person was full-stack in the sense that they covered backups, networking and firewalls, and knew how to source hardware.
The argument was that this was too expensive and the cloud was better. So hundreds of thousands of SMEs embraced the cloud - most of them never needed Google-type scale, but got sucked into the "recurring revenue" grift that is SaaS.
If you opposed this mentality you were basically saying "we as a company will never scale this much" which was at best "toxic" and at worst "career-ending".
The thing is these ancient skills still exist. And most orgs simply do not need AWS type of scale. European orgs would do well to revisit these basic ideas. And Hetzner or Lithus would be a much more natural (and honest) fit for these companies.
I wonder how much companies pay yearly to avoid having an employee pick up a drive from a local store, drive to the data center, pull the drive caddy, swap out the failing hard drive for the new one, add it to the RAID, verify the rebuild has started, and then return to the office.
I don't think I've ever seen a non-hot-swap disk in a normal server. The oldest I dealt with had 16 HDDs per server, and only 12 were accessible from the outside, but the 4 internal ones were still hot-swap after taking the cover off.
Even some really old (2000s-era) junk I found in a cupboard at work was all hot-swap drives.
But more realistically in this case, you tell the data centre "remote hands" person that a new HDD will arrive next-day from Dell, and it's to go in server XYZ in rack V-U at drive position T. This may well be a free service, assuming normal failure rates.
In the Bay Area there are little datacenters that will happily colocate a rack for you and will even provide an engineer who can swap disks. The service is called "remote hands". It may still be faster to drive over.
> ancient skills https://youtu.be/ZtYU87QNjPw?&t=10
It baffles me that my career trajectory somehow managed to insulate me from ever having to deal with the cloud, while such esoteric skills as swapping a hot swap disk or racking and cabling a new blade chassis are apparently on the order of finding a COBOL developer now. Really?
I can promise you that large financial institutions still have datacenters. Many, many, many datacenters!
We had two racks in our office of mostly developers. If you have an office, you already have a rack for switches and patch panels. Adding a few servers is obvious.
Software development isn't a typical SME however. Mike's Fish and Chips will not buy a server and that's fine.
If someone on the DevOps team knows Nix, option 3 becomes a lot cheaper time-wise! Yeah, Nix flakes still need maintenance, especially on the `nixos-unstable` branch, but you get the quickest disaster recovery route possible!
Plus, infra flexibility removes random constraints that e.g. Cloudflare Workers have.
Indeed! We've yet to go down this route, but it's something we're thinking on. A friend and I have been talking about how to bring Nix-like constructs to Kubernetes as well, which has been interesting. (https://github.com/clotodex/kix, very much in the "this is fun to think about" phase)
There are a bunch of ways to manage bare metal servers apart from Nix. People have been doing it for years: Kickstart, Foreman, MAAS, etc. [0]. There are many to choose from, according to your needs and the layers you want them to manage.
The reality is that these days you just boot a basic image that runs containers.
[0] Longer list here: https://github.com/alexellis/awesome-baremetal
This is what we do, I gave a talk about our setup earlier this week at CfgMgmtCamp: https://www.youtube.com/watch?v=DBxkVVrN0mA&t=8457s
Option 4 as well, that's how we do it at work and it's been great. However, it can't really be "someone on the team knows Nix", anyone working on Ops will need Nix skills in order to be effective.
I'm a NixOS fan, but been using Talos Linux on Hetzner nodes (using Cluster-API) to form a Kubernetes Cluster. Works great too!
I would suggest to use both on-premise hardware and cloud computing. Which is probably what comma is doing.
For critical infrastructure, I would rather pay a competent cloud provider than be responsible for reliability issues. Maintaining one server room in the headquarters is something, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.
For running many Slurm jobs on good servers, cloud computing is very expensive, and owning can pay for itself in a matter of months. And who cares if the server room is a total loss after a while; worst case you write some more YAML and Terraform and deploy a temporary replacement in the cloud.
Another option in between is colocation, where you put hardware you own in a managed data center. It's a bit old-fashioned, but it may make sense in some cases.
I can also mention that research HPC may be worth considering. In research, we have some of the world's fastest computers at a fraction of the cost of cloud computing. It's great as long as you don't mind not being root and having to use Slurm.
I don't know about the USA, but in Norway you can run your private company's Slurm AI workloads on research HPC systems, though you will pay quite a bit more than universities and research institutions do. But you can also have research projects together with universities or research institutions, and everyone will be happy if your business benefits a lot from the collaboration.
> but two server rooms in different locations, with resilient power and network is a bit too much effort IMHO
I worked at a company with two server farms (essentially a main and a backup one) in Italy, located in two different regions, and we had a total of 5 employees taking care of them.
We didn't hear about them, we didn't know their names, but we had almost 100% uptime and terrific performance.
There was one single person out of 40 developers whose main responsibility was deploys, and that's it.
It cost my company €800k per year to run both server farms (hardware, salaries, energy), and it spared the company around €7-8M in cloud costs.
Now I work for clients that spend multiple millions on cloud for a fraction of the output and traffic, and I think they employ around 15+ DevOps engineers.
> I would rather pay a competent cloud provider than being responsible for reliability issues.
Why do so many developers and sysadmins think they're not competent to host services? It is a lot easier than you think, and it's also fun to solve the technical issues you may have.
The point was about redundancy / geo spread / HA. It's significantly more difficult to operate two physical sites than one. You can only be in one place at a time.
If you want true reliability, you need redundant physical locations, power, networking. That's extremely easy to achieve on cloud providers.
You can just rent the rack space in datacenter and have that covered. It's still much cheaper than running that in cloud.
It doesn't make sense if you only have a few servers, but if you are renting the equivalent of multiple racks of servers from the cloud and running them for most of the day, on-prem is staggeringly cheaper.
We have a few racks, and we do a "move to cloud" calculation every few years; without fail it comes out at least 3x the cost.
And before the "but you need to do more work" whining I hear from people who have never done it: it's not much more than navigating the forest of cloud APIs and dealing with random black-box issues in the cloud that you can't really debug, only work around.
How much does your single site go down?
On cloud it's out of your control when an AZ goes down. When it's your server you can do things to increase reliability. Most colos have redundant power feeds and internet. On prem that's a bit harder, but you can buy a UPS.
If your head office is hit by a meteor your business is over. Don't need to prepare for that.
You don't need full "cloud" providers for that, colocation is a thing.
Or just be good at hiding the round-trip latency.
Maybe you find it fun. I don't; I prefer building software, not running and setting up servers.
It's also nontrivial once you go past some level of complexity and volume. I have built my career on building software, and part of that requires understanding the limitations and specifics of the underlying hardware, but at the end of the day I simply want to provision and run a container. I don't want to think about the security and networking setup; it's not worth my time.
Also, I'd add this question: why do so many developers and sysadmins think that cloud companies always hire competent/non-lazy/non-pissed-off employees?
> Why do so many developers and sysadmins think they're not competent to host services? It is a lot easier than you think, and it's also fun to solve the technical issues you may have.
It is a different skillset. SRE is also under-valued/under-paid (unless one is in FAANGO).
It's all downside. If nothing goes wrong, then the company feels like they're wasting money on a salary. If things go wrong, they're all your fault.
At a previous job, the company had its critical IT infrastructure in its own data centers. It was not in the IT industry, but the company was large and rich enough to justify two small data centers. They notably had batteries, diesel generators, 24/7 teams, and some advanced security (for valid reasons).
I agree that solving technical issues is very fun, and hosting services is usually easy, but having resilient infrastructure is costly and I simply don't like to be woken up at night to fix stuff while the company is bleeding money and customers.
> Why do so many developers and sysadmins think they're not competent for hosting services.
Because those services solve the problem for them. It is the same thing with GitHub.
However, as predicted half a decade ago when GitHub was becoming unreliable [0], and as price increases begin to happen, you can see that self-hosting begins to make more sense: you have complete control of the infrastructure, it has never been easier to self-host, and it brings costs under control.
> its also fun to solve technical issues you may have.
What you have just seen with coding agents is going to have the same effect on "developers": a decline in skills the moment they become over-reliant on coding agents, to the point where they won't be able to write a single line of code to fix a problem they don't fully understand.
[0] https://news.ycombinator.com/item?id=22867803
> Maintaining one server room in the headquarters is something, but two server rooms in different locations, with resilient power and network is a bit too much effort IMHO.
Speaking as someone who does this, it is very straightforward. You can rent space from people like Equinix or Global Switch for very reasonable prices. They then take care of power, cooling, cabling plant etc.
Unfortunately we experienced an issue where our Slurm pool was contaminated by a misconstrained Postgres Daemon. Normally the contaminated slurm pool would drain into a docker container, but due to Rust it overloaded and the daemon ate its own head. Eventually we returned it to a restful state so all's well that ends well.
(hardware engineer trying to understand wtaf software people are saying when they speak)
The reason companies don't go with on-premises even if cloud is way more expensive is the risk involved in on-premises.
You can see quite clearly here that there are so many steps to take. Now, a good company would concentrate risk on their differentiating factor, or the specific part they have a competitive advantage in.
It's never about "is the expected cost of on-premises less than cloud?"; it's about the risk-adjusted costs.
Once you've spread risk not only across your main product but also across your infrastructure, it becomes hard.
I would be wary of a smallish company building their own Jira in-house in a similar way.
I'm starting to wonder though whether companies even have the in-house competence to compare the options and price this risk correctly.
>Now a good company would concentrate risk on their differentiating factor or the specific part they have competitive advantage in.
Yes, but one differentiating factor is always price and you don't want to lose all your margins to some infrastructure provider.
Software companies have higher margins so these decisions are lower stakes. Unless on premises helps the bottom line of the main product that the company provides, these decisions don't really matter in my opinion.
Think of a ~5000 employee startup. Two scenarios:
1. if they win the market, they capture something like ~60% margin
2. if that doesn't happen, they just lose, VC fund runs out and then they leave
In this dynamic, costs associated with infrastructure don't change the bottom line of profitability. The risk involved with rolling out their own infrastructure can hurt their main product's very existence.
I'm not disputing that there are situations where it makes sense to pay a high risk premium. What I'm disputing is that price doesn't matter. I get the impression that companies are losing the capability to make rational pricing decisions.
>Unless on premises helps the bottom line of the main product that the company provides, these decisions don't really matter in my opinion.
Well, exactly. But the degree to which the price of a specific input affects your bottom line depends on your product.
During the dot com era, some VC funded startups (such as Google) made a decision to avoid using Windows servers, Oracle databases and the whole super expensive scale-up architecture that was the risk-free, professional option at the time. If they hadn't taken this risk, they might not have survived.
[Edit] But I think it's not just about cloud vs on-premises. A more important question may be how you're using the cloud. You don't have to lock yourself into a million proprietary APIs and throw petabytes of your data into an egress jail.
Precious real-world engineering skills also play a role.
But most importantly, there's the attractive power that companies doing on-premises infrastructure have for the best talent.
It's also opex vs capex, which is a battle opex wins most of the time.
Well, capex has a multi-year depreciation schedule and has to cover interest rates. So the simplified "opex wins most of the time" is right.
But we are talking about a cost difference of tens of times, maybe a few hundred. The cloud is not like "most of the time".
Opex is faster. Login, click, SSH, get a tea.
Capex needs work. A couple of years, at least.
If you are willing to put in the work, your mundane computer is always better than the shiny one you don't own.
That's because of company policies. An SME owner will buy a server and have it in the rack the next day.
Of course, creating a VM is still a Terraform commit away (you're not using ClickOps in prod, surely).
"SME" and "a server" are doing some heavy lifting here.
If you want a custom server, one or a thousand, it's at least a couple of weeks.
If you want a powerful GPU server, that's rack + power + cooling (and a significant lead time). A respectable GPU server means ~2 kW of power dissipation and considerable heat.
If you want a datacenter of any size, now that's a year at least from breaking ground to power-on.
If you want something at all customized, it takes longer than that to receive the server. That being said, you can buy a server that will outperform anything the cloud can give you at much better cost.
I think it wins because opex is seen as stable recurring cost and capex is seen as the money you put in your primary differentiation for long term gains.
For mature Enterprises my understanding is that the financial math works out such that the cloud becomes smart for market validation, before moving to cheaper long term solution once revenue is stable.
Scale up, prove the market and establish operations on the credit card, and if it doesn't work the money moves on to more promising opportunities. If the operation is profitable, you transition away from the too-expensive cloud to increase profitability, and use the operation's incoming revenue to pay for it (freeing up more money to chase more promising opportunities).
Personally I can't imagine anything outside of a hybrid approach, if only to maintain power dynamics with suppliers on both sides. Price increases and forced changes can be met with instant redeployments off their services/stack, creating room for more substantive negotiations. When investments come in the form of saving time and money, it's not hard to get everyone aligned.
True, but for a lot of companies "our servers are on-prem" is not a primary differentiator.
I think we are saying the same thing?
Capex may also require you to take out loans
Which is incredibly difficult in the public sector. Yes, there are various financing instruments available for capital purchases but they're always annoying, slow and complicated. It's much easier to spend 5k per month than 500k outright.
It depends. Grant funding (e.g. in academia) makes capex easier to manage than opex (because when the grant runs out you still have the device).
At scale (like comma.ai), it's probably cheaper. But until then it's a long term cost optimization with really high upfront capital expenditure and risk. Which means it doesn't make much sense for the majority of startup companies until they become late stage and their hosting cost actually becomes a big cost burden.
There are in between solutions. Renting bare metal instead of renting virtual machines can be quite nice. I've done that via Hetzner some years ago. You pay just about the same but you get a lot more performance for the same money. This is great if you actually need that performance.
People obsess about the hardware, but there's also the software side to consider. For smaller companies, operations/devops people are usually more expensive than the resources they manage, so the real cost to optimize is the people cost. The hosting cost is usually a rounding error on the staffing cost. And on top of that, the amount of responsibility increases as soon as you own the hardware. You need to service it, monitor it, replace it when it fails, make sure those fans don't get jammed with dust bunnies, deal with outages when they happen, etc. All the stuff that you pay cloud providers to do for you now becomes your problem. And it has a non-zero cost.
The right mindset for hosting cost is to think of it in FTEs (the cost of a full-time employee for a year). If it's below 1 (most startups until they are well into scale-up territory), you are doing great. Most of the optimizations you are going to make will cost you actual FTEs spent doing that work. 1 FTE pays for quite a bit of hosting - think 10K per month in AWS cost. A good ops person/developer is more expensive than that. My company runs at about 1K per month (GCP and misc managed services). It would be the wrong thing to optimize for us. It's not worth spending any amount of time on for me; I literally have more valuable things to do.
This flips when you start getting into the multiple FTEs per month in cost for just the hosting. At that point you probably have additional cost measured in 5-10 FTE in staffing anyway to babysit all of that. So now you can talk about trading off some hosting FTEs for modest amount of extra staffing FTEs and make net gains.
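A back-of-the-envelope sketch of that "measure hosting in FTEs" framing. The dollar figures here (a $150k fully loaded FTE, a $10k/month bill, the savings and effort in the example) are illustrative assumptions, not numbers from this thread:

```python
# Rough sketch of the "think of hosting cost in FTEs" idea.
# All numbers are illustrative assumptions.

FTE_COST_PER_YEAR = 150_000      # assumed fully loaded cost of one ops/dev FTE
MONTHLY_CLOUD_BILL = 10_000      # assumed cloud spend per month

hosting_in_ftes = (MONTHLY_CLOUD_BILL * 12) / FTE_COST_PER_YEAR
print(f"Hosting cost ~= {hosting_in_ftes:.2f} FTEs/year")

def worth_optimizing(savings_per_month: float, engineer_months: float) -> bool:
    """First-year view: is the saving bigger than the engineering time it costs?"""
    savings_in_ftes = (savings_per_month * 12) / FTE_COST_PER_YEAR
    effort_in_ftes = engineer_months / 12
    return savings_in_ftes > effort_in_ftes

# e.g. saving $3k/month for two engineer-months of work
print(worth_optimizing(savings_per_month=3_000, engineer_months=2))
```

The point of the sketch is just the unit conversion: once hosting and engineering effort are in the same unit, the "is this worth optimizing?" question becomes a one-line comparison.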
> At scale (like comma.ai), it's probably cheaper. But until then it's a long term cost optimization with really high upfront capital expenditure and risk. Which means it doesn't make much sense for the majority of startup companies until they become late stage and their hosting cost actually becomes a big cost burden.
You rent datacenter space, which is OPEX not CAPEX, and you just lease the servers, which turns a big CAPEX outlay into a monthly OPEX bill.
Running your own DC is a "we have two dozen racks of servers" endeavour, but even just renting DC space and buying servers is much cheaper than getting the same level of performance from the cloud.
> This flips when you start getting into the multiple FTEs per month in cost for just the hosting. At that point you probably have additional cost measured in 5-10 FTE in staffing anyway to babysit all of that. So now you can talk about trading off some hosting FTEs for modest amount of extra staffing FTEs and make net gains.
YOU NEED THOSE PEOPLE TO MANAGE CLOUD TOO. That's what always gets ignored in these calculations: people go "oh, but we really need like 2-3 ops people to cover the datacenter and have shifts on the on-call", but you need the same thing for cloud too, it is just dumped on the programmers/devops people on the team rather than having separate staff.
We have a few racks, and the part related to hardware is a small part of the total workload; most of it is the same as what we would do (and do, for a few cloud customers) in the cloud: writing manifests for automation.
> YOU NEED THOSE PEOPLE TO MANAGE CLOUD TOO.
Finally, some sense! "Cloud" was meant to make ops jobs disappear, but they just increased our salary by turning us into "DevOps Engineers" and the company's hosting bill increased fivefold in the process. You will never convince even 1% of devs to learn the ops side properly, therefore you'll still end up hiring ops people and we will cost you more now. On top of that, everyone that started as a "DevOps Engineer" knows less about ops than those that started as ops and transitioned into being "DevOps Engineers" (or some flavour of it like SREs or Platform Engineers).
If you're a programmer scared into thinking AI is going to take away your job, re-read my comment.
Honestly, the way I've seen a lot of cloud done, they need _more_ people to manage that than a sensible private cloud setup.
To be fair, I think people are vastly overestimating the work they would have and the power they would need. Yes, if you have to massively scale up, then it'll take some work, but most of it is one-time work. You do it, and once it runs, you only have a fraction of that work over the following months to maintain it. And by fraction, I mean below 5%. And keep in mind that >99% of startups who think "yeah, we need this and that cloud service, because we need to scale" will never scale. Instead they are happily locking themselves into a cloud service. And if they actually scale at some point, this service will be massively more expensive.
Startups don't know how much hardware they need when they release to customers. The extreme flexibility of cloud makes a lot of sense for them.
But they should; cloud won't magically make the architecture scale. A competent CTO should know the limits of the platform; it's called "load testing" or "stress testing", and scalability is independent of the provider. Cloud gives you a nicer interface to add resources, granted, but that's it.
As a hearsay anecdote, that's why some startups have DB servers with hundreds of GB of RAM and dozens of CPUs to run a workload that could be served from a 5-year-old laptop.
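A minimal load-test sketch of the kind being alluded to, using only the Python standard library. The URL, worker count, and request count are placeholders; a real exercise would use a dedicated tool and realistic traffic patterns:

```python
# Minimal load-test sketch (stdlib only): hit an endpoint with N concurrent
# workers and report throughput and error rate. URL and numbers are placeholders.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/health"   # hypothetical endpoint
WORKERS = 50
REQUESTS = 2_000

def hit(_):
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

start = time.time()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = list(pool.map(hit, range(REQUESTS)))
elapsed = time.time() - start

print(f"{REQUESTS} requests in {elapsed:.1f}s "
      f"({REQUESTS / elapsed:.0f} req/s), errors: {results.count(False)}")
```

Even something this crude tells you roughly where a given box tops out, which is the number the "do we actually need to scale?" conversation should start from.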
We have two on-site servers that we use. For various reasons (power cuts, internet outages, cleaners unplugging them) I'd say we have to intervene with them physically about once a month. It's a total pain in the ass, especially when you don't have _an_ IT person sitting in the office to mind it. I'm in the UK and our office is in Spain…
But it is significantly cheaper and faster
Your calculation assumes that an FTE is needed to maintain a few beefy servers.
Once they are up and running that employee is spending at most a few hours a month on them. Maybe even a few hours every six months.
OTOH you are specifically ignoring that you'll require mostly the same time from a cloud trained person if you're all-in on AWS.
I expect the marginal cost of one employee over the other is zero.
> Once they are up and running
You should also calculate the cost of getting it up and running. With Google Cloud (I don't actually use AWS), I mainly worry about building docker containers in CI and deploying them to vms and triggering rolling restarts as those get replaced with new ones. I don't worry about booting them. I don't worry about provisioning operating systems or configuration to them. Or security updates. They come up with a lot of pre-provisioned monitoring and other stuff. No effort required on my side.
And for production setups: you need people on standby to fix the server in case of hardware issues, also outside office hours. Also, where does the hardware live? What's your process when it fails? Who drives to wherever the thing is and fixes it? What do you pay them to be available for that? What's the lead time for spare components? Do you actually keep those in supply? Where? Do you pay for security for wherever all that happens? What about cleaning, AC, or a special server room in your building? All that stuff is cost. Some of it is upfront cost. Some of it is recurring cost.
The article is about a company that owns its own data center. The cost they are citing ($5 million) is substantial and probably a bit more complete. That's one end of the spectrum.
You are massively overcomplicating this.
> I don't worry about booting them. I don't worry about provisioning operating systems or configuration to them. Or security updates. They come up with a lot of pre-provisioned monitoring and other stuff. No effort required on my side.
These are not difficult problems. You can use the same/similar cloud install images.
A 10 year old nerd can install Linux on a computer; if you're a professional developer I'm sure you can read the documentation and automate that.
> And for production setups. You need people on stand by to fix the server in case of hardware issues; also outside office hours.
You could use the same person who is on standby to fix the cloud system if that has some failure.
> Also, where does the hardware live?
In rented rackspace nearby, and/or in other locations if you need more redundancy.
> What's your process when it fails? Who drives to wherever the thing is and fixes it? What do you pay them to be available for that? What's the lead time for spare components? Do you actually keep those in supply? Where?
It will probably report the hardware failure to Dell/HP/etc automatically and open a case. Email or phone to confirm, the part will be sent overnight, and you can either install it yourself (very, very easy for things like failed disks) or ask a technician to do it (I only did this once with a CPU failure on a brand new server). Dell/HP/etc will provide the technician, or your rented datacentre space will have one for simpler tasks like disks.
Shush! The cloud companies want customers to think it is a complicated near death experience to run on their own hardware.
It is sad that the knowledge of how easy it really is, is getting extinct. The cloud and SaaS companies benefit greatly.
> You should also calculate the cost of getting it up and running.
I was not doing the calculation. I was only pointing out that it was not as simple as you make it out to be.
Okay, a few other things that aren't in most calculations:
1. Looking at jobs postings in my area, the highest paid ones require experience with specific cloud vendors. The FTEs you need to "manage" the cloud are a great deal more expensive than developers.
2. You don't need to compare on-prem data center with AWS - you can rent a pretty beefy VPS or colocate for a fraction of the cost of AWS (or GCP, or Azure) services. You're comparing the most expensive alternative when avoiding cloud services, not the most typical.
3. Even if you do want to build your own on-prem rack, FTEs aren't generally paid extra for being on the standby rota, so you aren't paying extra there. Where you will pay extra is for hot failovers, machine room maintenance, etc., which you don't actually need if your hot failover is a cheap beefy VPS-on-demand on Hetzner, DO, etc. (a rough sketch of that watchdog idea follows below).
4. You are measuring the cost of absolute 0% downtime. I can't think of many businesses that are that sensitive to downtime. Even banks tolerate far more downtime than that, even while their IT systems are still up. With such strict requirements you're getting into the territory where the business itself cannot continue because of a catastrophe, but the IT systems can :-/. What use are the IT systems when the business itself may be down?
The TLDR is:
1. If you have highly paid cloud-trained FTEs, and
2. Your only option other than Cloud is on-prem, and
3. Your FTEs are actually FT-contractors who get paid per hour, and
4. Your uptime requirements are more stringent than national banks',
yeah, then cloud services are only slightly more expensive.
You know how many businesses fall into that specific narrow set of requirements?
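As referenced in point 3 above, here is a rough sketch of the cheap warm-standby idea: a tiny watchdog on a secondary box that polls the primary and, after a few consecutive failures, runs whatever promotion step you have (DNS flip, floating IP, or just paging the on-call). The URL and the promotion step are placeholders, not anyone's actual setup:

```python
# Warm-standby watchdog sketch. Poll the primary; after repeated failures,
# run a placeholder "promote" step. All names and thresholds are assumptions.
import time
import urllib.request

PRIMARY_HEALTH_URL = "https://primary.example.com/health"  # hypothetical
FAILURES_BEFORE_PROMOTION = 3
CHECK_INTERVAL_SECONDS = 30

def primary_is_healthy() -> bool:
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

def promote_standby() -> None:
    # Placeholder: point DNS at the standby VPS, or page a human to do it.
    print("primary looks down -- promoting standby / paging on-call")

failures = 0
while True:
    failures = 0 if primary_is_healthy() else failures + 1
    if failures >= FAILURES_BEFORE_PROMOTION:
        promote_standby()
        break
    time.sleep(CHECK_INTERVAL_SECONDS)
```

That is the entire "failover infrastructure" many small shops actually need, which is the point being made about not paying for hot redundancy you will rarely use.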
> it doesn't make much sense for the majority of startup companies until they become late stage
Here's what TFA says about this:
> Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out.
and I think they're right. Be careful how you start because you may be stuck in the initial situation for a long time.
And not just any FTEs, probably a few senior/staff-level engineers who would cost a lot more.
You should keep in mind that for a lot of things you can use a servicing contract, rather than hiring full-time employees.
It's typically going to cost significantly less; it can make a lot of sense for small companies, especially.
This also depends so much on your scaling needs. If you need 3 mid-sized ECS/EC2 instances, a load balancer, and a database with backups, renting those from AWS isn't going to be significantly more expensive for a decent-sized company than hiring someone to manage a cluster for you and dealing with all the overhead of keeping it maintained and secure.
If you're at the scale of hundreds of instances, that math changes significantly.
And a lot of it depends on what type of business you have and what percent of your budget hosting accounts for.
I also think it's a risk model too. Every time I see these kinds of posts I think they miss the point that there is a balance not only of cost, like you describe, but of risk as well. You are paying to offload some of the risk from yourself.
Datacenters need cool dry air? <45%
No, low isn't good per se. I worked in a datacenter which in winters had less than 40%; RAM was failing all over the place. Low humidity causes static electricity.
Low is good if you are also adding more humidity back in. If you want to maintain 45-50% (guessing), then you would want <45% environmental humidity so that you can raise it to the level you want. You're right about avoiding static, but you'd still want to try to keep it somewhat consistent.
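A minimal sketch of the kind of control being described: hold a target band by adding moisture back when intake air is dry. The 45-50% band is the commenter's guess above, and the sensor/actuator functions are placeholders for whatever hardware is actually in place:

```python
# Hysteresis humidity-control sketch. Band, sensor, and actuator are assumptions.
import random
import time

TARGET_LOW, TARGET_HIGH = 45.0, 50.0    # assumed relative-humidity band (%)

def read_relative_humidity() -> float:
    return random.uniform(35.0, 60.0)   # stand-in for a real sensor read

def set_humidifier(on: bool) -> None:
    print(f"humidifier {'ON' if on else 'OFF'}")  # stand-in for a real relay

humidifier_on = False
for _ in range(10):                      # one control tick per loop iteration
    rh = read_relative_humidity()
    if rh < TARGET_LOW and not humidifier_on:
        humidifier_on = True
        set_humidifier(True)
    elif rh > TARGET_HIGH and humidifier_on:
        humidifier_on = False
        set_humidifier(False)
    time.sleep(0.1)                      # would be minutes in practice
```

Hysteresis (switching at the band edges rather than at a single setpoint) is what keeps the humidity reasonably consistent instead of oscillating, which matters for the static-electricity concern above.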
It is much cheaper to use external air for cooling if you can.
Yeah, but the article makes it sound as if lower is better, which it is definitely not. And yeah, you need to control humidity; that might mean sometimes lowering it and sometimes increasing it, with whatever solution you have.
Also, this is where cutting corners does result in lower cost, which was the OP's point to begin with. It just means you won't get as good a datacenter as people who are actually tuning this all day and have decades of experience.
The datacenter is in San Diego - a quick Google confirms that external humidity pretty much never drops below 50% there.
Things would be different in a colder climate where humidity goes --> 0% in the winter
It would be interesting to hear their contingency plan for any kind of disaster (most commonly a fire) that hits their data center.
Yep, does anyone remember the OVH fire[1][2]?
[1] https://www.techradar.com/news/remember-the-ovhcloud-data-ce...
[2] https://blocksandfiles.com/wp-content/uploads/2023/03/ovhclo...
I fully lost three small VPS there, and their response was poor: they didn't refund the time lost, they didn't offer any compensation (e.g. a couple of months of free VPS), and I got better updates from the news than from them (the news was saying "almost total loss", while they were trying to convince me that I had the incredible bad luck that my three VPS were in the very small zone affected by the fire). The only way I had to recover what I lost was backups on local machines.
When someone points out how safe cloud providers are, as if they have multiple levels of redundancy and are fully protected against even an alien invasion, I remember the OVH fire.
OVH VPS is not the same as say, AWS EC2.
It's their "Compute" under "Public Cloud" that is competing against AWS EC2. https://us.ovhcloud.com/public-cloud/compute/
They handled the fire terribly and after that they improved a bit, but an OVH VPS is just a VM running on a single piece of hardware. Not quite the same thing as the "Compute" offering, which runs on clusters.
contingency plan: Don't build your data center out of wood.
Plastic is made from the same stuff as gasoline.
Drain cleaner and hydrochloric acid make salt water. Water is made of highly explosive hydrogen. Salt is made of toxic chlorine and explosive sodium.
They use the datacenter for model training, not to serve online users. Presumably even if it were offline for a week or even a month, it would not be a total disaster as long as they have, for example, offsite tape backups.
Flooding due to burst frozen pipe, false sprinkler trigger, or many others.
Something very similar happened at work. Water valve monitoring wasn't up yet. Fire didn't respond because reasons. A huge amount of water flooded over a 3-day weekend. Total loss.
There's only one solution to this problem, and it's 2 data centres in some way or form.
What's the line from Contact?
why build one when you can have two at twice the price?
But, if you're building a datacenter for $5M, spending $10-15M for redundant datacenters (even with extra networking costs), would still be cheaper than their estimated $25M cloud costs.
Or build two $2.5M DCs (if you can parallelize your workload well enough) and, in case of disaster, you only lose capacity.
You do, however, need to plan for $1M+ p.a. in OPEX, because good SREs ain't cheap (nor are the hardware folks building and maintaining machines).
the plan is to not set it on fire. If your office burns down you are already screwed
> Self-reliance is great, but there are other benefits to running your own compute. It inspires good engineering.
It's easy to inspire people when you have great engineers in the first place. That's a given at a place like comma.ai, but there are many companies out there where administering a datacenter is far beyond their core competencies.
I feel like skilled engineers have a hard time understanding the trade-offs from cloud companies. The same way that comma.ai employees likely don't have an in-house canteen, it can make sense to focus on what you are good at and outsource the rest.
Same thing. I was previously spending 5-8K on DigitalOcean, supposedly a "budget" cloud. Then the company was sold, and I started a new company on entirely self-hosted hardware. Cloudflare tunnel + CC + microk8s made it trivial! And I spend close to nothing other than internet that I already am spending on. I do have solar power too.
Working at a non-tech regional bigco, where ofc cloud is the default, I see everyday how AWS costs get out of hand, it's a constant struggle just to keep costs flat. In our case, the reality is that NONE of our services require scalability, and the main upside of high uptime is nice primarily for my blood pressure.. we only really need uptime during business hours, nobody cares what happens at night when everybody is sleeping.
On the other hand, there's significant vendor lockin, complexity, etc. And I'm not really sure we actually end up with less people over time, headcount always expands over time, and there's always cool new projects like monitoring, observability, AI, etc.
My feeling is, if we rented 20-30 chunky machines and ran Linux on them, with k8s, we'd be 80% there. For specific things I'd still use AWS, like infinite S3 storage, or RDS instances for super-important data.
If I were to do a startup, I would almost certainly not base it off AWS (or other cloud), I'd do what I write above: run chunky servers on OVH (initially just 1-2), and use specific AWS services like S3 and RDS.
A bit unrelated to the above, but I'd also try to keep away from expensive SaaS like Jira, Slack, etc. I'd use the best self-hosted open source version, and be done with it. I'd try Gitea for git hosting, Mattermost for team chat, etc.
And actually, given the geo-political situation as an EU citizen, maybe I wouldn't even put my data on AWS at all and self-host that as well...
The #1 reason I would advocate for using AWS today is the compliance package they bring to the party. No other cloud provider has anything remotely like Artifact. I can pull Amazon's PCI-DSS compliance documentation using an API call. If you have a heavily regulated business (or work with customers who do), AWS is hard to beat.
If you don't have any kind of serious compliance requirement, using Amazon is probably not ideal. I would say that Azure AD is ok too if you have to do Microsoft stuff, but I'd never host an actual VM on that cloud.
Compliance and "Microsoft stuff" covers a lot of real world businesses. Going on prem should only be done if it's actually going to make your life easier. If you have to replicate all of Azure AD or Route53, it might be better to just use the cloud offerings.
> The #1 reason I would advocate for using AWS today is the compliance package they bring to the party.
I was going to post the same comment.
Most of the people agreeing to foot the AWS bill do it because they see how much the compliance is worth to them.
I'm impressed that San Diego electrical power manages to be even more expensive than in the UK. That takes some doing.
I love articles like this and companies with this kind of openness. Mad respect to them for this article and for sharing software solutions!
Even at the personal blog level, I'd argue it's worth it to run your own server (even if it's just an old PC in a closet). Gets you on the path to running a home lab.
> Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out. If you want to control your own destiny, you must run your own compute.
Cost and lock-in are obvious factors, but "sovereignty" has also become a key factor in the sales cycle, at least in Europe.
Handling health data, Juvoly is happy to run AI workloads on premises.
> We use SSDs for reliability and speed.
Hey, how do SSDs fail lately? Do they ... vanish off the bus still? Or do they go into read only mode?
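One way to get ahead of either failure mode is to watch SMART wear and error counters and swap drives before they hit the wall. A minimal sketch, assuming smartmontools is installed (smartctl 7+ for JSON output) and an NVMe drive at a placeholder path; exact JSON field names can vary by drive and smartctl version:

```python
# SMART health sketch: read wear and error counters via smartctl's JSON output.
# Device path and field names are assumptions for illustration.
import json
import subprocess

DEVICE = "/dev/nvme0"   # placeholder device path

out = subprocess.run(
    ["smartctl", "-j", "-a", DEVICE],
    capture_output=True, text=True, check=False,
)
data = json.loads(out.stdout)

healthy = data.get("smart_status", {}).get("passed")
nvme_log = data.get("nvme_smart_health_information_log", {})

print("overall SMART status:", "PASSED" if healthy else "FAILED/unknown")
print("percentage of rated endurance used:", nvme_log.get("percentage_used"))
print("media errors:", nvme_log.get("media_errors"))
print("available spare (%):", nvme_log.get("available_spare"))
```

Wired into whatever monitoring you already run, this turns "the SSD went read-only overnight" into a planned replacement instead of a surprise.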
The company I work for used to have a hybrid where 95% was on-prem, but became closer to 90% in the cloud when it became more expensive to do on-prem because of VMware licensing. There are alternatives to VMware, but not officially supported with our hardware configuration, so the switch requires changing all the hardware, which still drives it higher than the cloud. Almost everything we have is cloud agnostic, and for anything that requires resilience, it sits in two different providers.
Now the company is looking at further cost savings, as the buildings rented for running on-prem are sitting mostly unused, and building prices have gone up notably in recent years, so we're likely to save money moving into the cloud. This is likely to make the cloud transition permanent.
This quote is gold:
The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I rather think about.
> Having your own data center is cool
This company sounds more like a hobby interest than a business focused on solving genuine problems.
I used to colocate a 2U server that I purchased with a local data center. It was a great learning experience for me. I'm curious why a company wouldn't colocate their own hardware? Proximity isn't an issue when you can have the datacenter perform physical tasks. Bravo to the comma team regardless. It'll be a great learning experience and make each person on their team better.
PS: BX cable instead of conduit for electrical looks cringe.
The main reason not to colocate is if you're somewhere with high real estate costs. E.g. Hetzner managed servers compete on price with colocation for me because I'm in London.
I colocate in London; a single server / firewall comes to around £5k a year. I also colocate two other servers in some northern UK location on some industrial estate for £2k as my backups. I've never enjoyed the cloud, and dedicated servers have their own caveats too.
Budget hosts such as Hetzner/OVH have been known to suddenly pull the plug for no reason.
My kit is old, second-hand old (Cisco UCS 220 M5, 2x Dell somethings), and last night I just discovered I can throw in two NVIDIA T4s and turn it into a personal LLM machine.
I'm quite excited to have my own colocated server with basic LLM abilities. My own hardware with my own data and my own cables. Just need my own IPs now.
> Budget hosts such as Hetzner/OVH have been known to suddenly pull the plug for no reason.
The same would apply for any number of hosts. Hetzner/OVH are cheap, but as your own numbers show the location price gap is more than sufficient to cover the costs of servers.
In fact you can colocate with Hetzner too, and you'd get a similar price gap - the lower cost of real-estate is a large part of the reason why they can be as cheap as they are.
Data centre operations is a real estate play - to the point that at least one UK data centre operator is owned by a real estate investment company.
Thanks. I hadn't seen it as such and you're right. I guess it comes down to personal preference.
Where I feel that data has become a commodity, in that I can sell your username and email for a few pence, I would rather have my own hardware in my own possession, so that any request for it has to go through me, not some server provider.
I cancelled my digital ocean server of almost a decade late last year and replaced it with a raspberry pi 3 that was doing nothing. We can do it, we should do it.
> Maintaining a data center is much more about solving real-world challenges. The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I rather think about.
I find this to be applicable on a smaller scale too! I'd rather set up and debug a beefy Linux VPS via SSH than fiddle with various proprietary cloud APIs/interfaces. It doesn't go as low-level as Watts, bits and FLOPs, but I still consider knowledge about Linux more valuable than knowing which Azure knobs to turn.
> Cloud companies generally make onboarding very easy, and offboarding very difficult.
I reckon most on-prem deployments have significantly worse offboarding than the cloud providers. As a cloud provider you can win business by having something for offboarding, but internally you'd never get buy-in to spend on a backup plan if you decide to move to the cloud.
> As a cloud provider you can win business by having something for offboarding, but internally you'd never get buy-in to spend on a backup plan if you decide to move to the cloud.
It's the other way around. How do you think all businesses moved to the cloud in the first place?
Interesting that they go for no redundancy
What redundancy are we talking about? AWS has proven to the world on multiple occasions that redundancy across geo locations is useless, because if us-east-1 is down, their whole cloud is done, causing a big chunk of the world to be down.
Half sarcasm of course, but it goes to show that the world is not going to fall apart in many cases when it comes to software. Sure, it's not ideal in lots of cases, but we'll survive without redundancy.
Microsoft made the TCO argument and won. Self-hosting is only an option if you can afford expensive SysOps/DevOps/WhateverWeAreCalledTheseDays to manage it.
So.... you're saying they must be understaffed and paying poverty range wages to afford the San Diego climate and still cut a profit? ;)
15 years ago or so, a spreadsheet was floating around where you could enter server costs, compute power, etc. and it would tell you when you would break even by buying instead of going with AWS. I think it was leaked from Amazon, because it was always three years to break even, even as hardware changed over time.
Azure provides their own "Total Cost of Ownership" calculator for this purpose [0]. Notably, this makes you estimate peripheral costs such as cost of having a server administrator, electricity, etc.
[0] - https://azure-int.microsoft.com/en-us/pricing/tco/calculator...
I plugged in our own numbers (60 servers we own in a data centre we rent) and Microsoft thinks this costs us an order of magnitude more than it does.
Their "assumption" for hardware purchase prices seems way off compared to what we buy from Dell or HP.
It's interesting that the "IT labour" cost they estimate is $140k for DIY, and $120k for Azure.
Their saving is 5 times more than what we spend...
Thank you, I've wanted to see someone use this in the real world. When doing Azure certifications (AZ900, AZ204, etc.), they force you to learn about this tool.
I may be out of date with RAM prices. Dell's configuration tool wants £1000 each for 32GB RDIMMs, but prices in Dell's configuration tool are always significantly higher than we get if we write to their salesperson.
Even so, a rough configuration for a 2-processor 16 core/processor server with 256GiB RAM comes to $20k, vs $22k + 100% = $44k quoted by MS. (The 100% is MS' 20%-per-year "maintenance cost" that they add on to the estimate. In reality this is 0% as everything is under Dell's warranty.)
And most importantly, the tool only compares the cost of Azure to constructing and maintaining a data centre! Unless there are other requirements (which would probably rule out Azure anyway) that's daft; a realistic comparison should be with colocation or rented dedicated servers, depending on the scale.
If you buy, maybe. Leasing or renting tends to be cheaper from day one. Tack on migration costs and ca. 6 months is a more realistic target. If the spreadsheet always said 3 years, it sounds like an intentional "leak".
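A tiny version of the kind of break-even spreadsheet being described. Every number is an illustrative assumption (the $20k server echoes the rough config quoted upthread; the colo, maintenance, and cloud-equivalent figures are placeholders to swap for your own quotes):

```python
# Break-even sketch: buy-and-colocate vs. renting equivalent cloud capacity.
# All inputs are assumptions -- plug in your own quotes.

server_purchase = 20_000            # e.g. a 2-socket box with 256 GiB RAM
colo_per_month = 300                # rack space, power, bandwidth
maintenance_per_month = 100         # spares, occasional remote-hands call
cloud_equivalent_per_month = 2_000  # assumed rent for comparable cloud capacity

owned_monthly_saving = (
    cloud_equivalent_per_month - colo_per_month - maintenance_per_month
)
break_even_months = server_purchase / owned_monthly_saving

print(f"monthly saving once running: ${owned_monthly_saving:,.0f}")
print(f"break-even after ~{break_even_months:.1f} months")
```

With leased rather than purchased hardware, the upfront term drops toward zero, which is why the reply above lands on a much shorter break-even than the "always three years" spreadsheet.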
Did the AWS part include the egress costs to extract your data from AWS, if you ever want to leave them?
AWS says they will waive all egress costs when exiting https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-i...
Because the EU forced them to
Well, somebody should recreate it. I smell a potential startup idea somewhere. There's a ton of "cloud cost optimizers" software but most involve tweaking AWS knobs and taking a cut of the savings. A startup that could offload non critical service from AWS to colo and traditional bare metal hosting like Hetzner has a strong future.
One thing to keep in mind is that the curve for GPU depreciation (over the last 5 years at least) is a little steeper than 3 years. Current estimates are that the capital depreciation cost plunges dramatically around the third year. For a top-tier H100, depreciation kicks in around the 3rd year, but they mention that for less capable parts like the A100 the depreciation is even worse.
https://www.silicondata.com/use-cases/h100-gpu-depreciation/
Now this is not factoring in the cost of labour. Labour at SF wages is dreadfully expensive; now, if your data center is right across the border in Tijuana, on the other hand..
There's the HN I know and love
> San Diego power cost is over 40c/kWh, ~3x the global average. It's a ripoff, and overpriced simply due to political dysfunction.
Would anyone mind elaborating? I always thought this was a direct result of the free market. Not sure if by dysfunction the OP means lack of intervention.
Did you say "free market"? There is one provider. There is a lot of regulation, mostly incompetent. It's a mess.
I'm thinking about doing a research project at my university looking into distributed "data centers" hosted by communities instead of centralized cloud providers.
The trick is in how to create mostly self-maintaining deployable/swappable data centers at low cost...
Realistically, it's the speed with which you can expand and contract. The cloud gives unbounded flexibility - not on the per-request scale or whatever, but on the per-project scale. To try things out with a bunch of EC2s or GCEs is cheap. You have it for a while and then you let it go. I say this as someone with terabytes of RAM in servers, and a cabinet I have in the Bay Area.
I just read about Railway doing something similar; sadly their prices are still high compared to other bare-metal providers and even VPSes such as Hetzner with Dokploy: a very similar feature set, yet for the same 5 dollars you get way more CPU, storage and RAM.
https://blog.railway.com/p/launch-week-02-welcome
Their pricing page is so confusing: CPU: $0.00000772 per vCPU / sec
This seems to imply $40 / month for 2 vCPU which seems very high?
Or maybe they mean "used" CPU versus idle?
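The ~$40/month figure does follow from the listed rate, assuming the instance is billed for 2 vCPUs around the clock; a quick check of the arithmetic:

```python
# Sanity-check the quoted per-second vCPU rate against a full month of usage.
price_per_vcpu_second = 0.00000772
vcpus = 2
seconds_per_month = 30 * 24 * 3600   # ~2.59 million

monthly = price_per_vcpu_second * vcpus * seconds_per_month
print(f"${monthly:.2f} per month")   # ~$40
```

If the billing were instead only for non-idle vCPU time, the effective monthly cost for a mostly idle service would be far lower, which is what the reply below speculates about.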
Billing per used (non-idle) CPU cycle would be quite interesting. The number of cores would then effectively just be your cost cap. Efficiency would be even more important. And if the provider oversubscribes cores, you just pay less. Actually, that's probably why they don't do it...
This was one of the coolest job ads that I've ever read :). Congrats for what you have done with your infrastructure, team and product!
Look at the bottom of that page:
An error occurred: API rate limit already exceeded for installation ID 73591946.
Error from https://giscus.app/
Fellow says one thing and uses another.
The observation about incentives is underappreciated here. When your compute is fixed, engineers optimize code. When compute is a budget line, engineers optimize slide decks. That's not really a cloud vs on-prem argument, it's a psychology-of-engineering argument.
Well, their comment section is for sure not running on premises, but in the cloud:
"An error occurred: API rate limit already exceeded for installation ID 73591946."
This is a great solution for a very specific type of team but I think most companies with consistent GPU workloads will still just rent dedicated servers and call it a day.
I agree, and cloud compute is poised to become even more commoditized in the coming years (gazillion new data centers + AI plateauing + efficiency gains, the writing is on the wall). There's no way this makes sense for most companies.
> AI plateauing
Ummm is that plateauing with us in the room?
The advantage of renting vs. owning is that you can always get the latest gen, and that brings you newer capabilities (i.e. fp8, fp4, etc) and cheaper prices for current_gen-1. But betting on something plateauing when all the signs point towards the exact opposite is not one of the bets i'd make.
> Ummm is that plateauing with us in the room?
Well, the capabilities have already plateaued as far as I can tell :-/
Over the next few years we can probably wring out some performance improvements, maybe some efficiency improvements.
A lot of the current AI users right now are businesses trying to on-sell AI (code reviewers/code generators, recipe apps, assistant apps, etc), and there's way too many of them in the supply/demand ratio, so you can expect maybe 90% of these companies to disappear in the next few years, taking the demand for capacity with them.
It's the opposite. The more consistent your workload the more practical and cost-effective it is to go on-prem.
Cloud excels for bursty or unpredictable workloads where quickly scaling up and down can save you money.
Other benefits: easy access to reliable infrastructure and latest hardware which you can swap as you please. There are cases where it makes sense to navigate away from the big players (like dropbox going from aws to on-prem), but again you make this move when you want to optimize costs and are not worried about the trade-offs.
In case anyone from comma.ai reads this: the "CTO @ comma.ai" link at the end is broken, it's relative instead of absolute.
no because it's on premise you see? you don't need to access the world wide web, just their server
/s
Looks insanely daunting imo
Not long ago, Railway moved from GCP to their own infrastructure since it was very expensive for them. [0] Some go for an Oxide rack [1] as a full-stack solution (both hardware and software) for intense GPU workloads, instead of building it themselves.
It's very expensive and only makes sense if you really need infrastructure sovereignty. It makes more sense if you're profitable in the tens of millions after raising hundreds of millions.
It also makes sense for governments (including those in the EU) which should think about this and have the compute in house and disconnected from the internet if they are serious about infrastructure sovereignty, rather than depending on US-based providers such as AWS.
[0] https://blog.railway.com/p/data-center-build-part-one
[1] https://oxide.computer/
I was under the impression that the Oxide rack does not currently ship with GPUs, at least not built in. Has this changed recently?
Oxide racks don't yet have a GPU solution. But they're a good option for general compute, and even where GPUs are required, general compute hasn't gone away.
I like Hotz's style: simply and straightforwardly attempting the difficult and complex. I always get the impression: "You don't need to be too fancy or clever. You don't need permission or credentials. You just need to go out and do the thing. What are you waiting for?"
This was written by Harald Schäfer, the CTO of comma.ai. I'm not so sure if G. Hotz is still involved in comma.ai.
Don't think he is, but it does seem like he inspired a hacker mentality in the shop during his tenure.
Ah I missed that.
ChatGPT:
# don't own the cloud, rent instead
the "build your own datacenter" story is fun (and comma's setup is undeniably cool), but for most companies it's a seductive trap: you'll spend your rarest resource (engineer attention) on watts, humidity, failed disks, supply chains, and "why is this rack hot", instead of on the product. comma can justify it because their workload is huge and steady, they're willing to run non-redundant storage, and they've built custom GPU boxes and infra around a very specific ML pipeline. ([comma.ai blog][1])
## 1) capex is a tax on flexibility
a datacenter turns "compute" into a big up-front bet: hardware choices, networking choices, facility choices, and a depreciation schedule that does not care about your roadmap. cloud flips that: you pay for what you use, you can experiment cheaply, and you can stop spending the minute a strategy changes. the best feature of renting is that quitting is easy.
## 2) scaling isn't a vibe, it's a deadline
real businesses don't scale smoothly. they spike. they get surprise customers. they do one insane training run. they run a migration. owning means you either overbuild "just in case" (idle metal), or you underbuild and miss the moment. renting means you can burst, use spot/preemptible for the ugly parts, and keep steady stuff on reserved/committed discounts.
## 3) reliability is more than "it's up most days"
comma explicitly says they keep things simple and don't need redundancy for ~99% uptime at their scale. ([comma.ai blog][1]) that's a perfectly valid trade, if your business can tolerate it. many can't. cloud providers sell multi-zone, multi-region, managed backups, managed databases, and boring compliance checklists because "five nines" isn't achieved by a couple heroic engineers and a PID loop.
## 4) the hidden cost isn't power, it's people
comma spent ~$540k on power in 2025 and runs up to ~450kW, plus all the cooling and facility work. ([comma.ai blog][1]) but the larger, sneakier bill is: on-call load, hiring niche operators, hardware failures, spare parts, procurement, security, audits, vendor management, and the opportunity cost of your best engineers becoming part-time building managers. cloud is expensive, yes, because it bundles labor, expertise, and economies of scale you don't have.
## 5) "vendor lock-in" is real, but self-lock-in is worse
cloud lock-in is usually optional: you choose proprietary managed services because they're convenient. if you're disciplined, you can keep escape hatches: containers, kubernetes, terraform, postgres, object storage abstractions, multi-region backups, and a tested migration plan. owning your datacenter is also lock-in, except the vendor is past you, and the contract is "we can never stop maintaining this."
## the practical rule
*if you have massive, predictable, always-on utilization, and you want to become good at running infrastructure as a core competency, owning can win.* that's basically comma's case. ([comma.ai blog][1]) *otherwise, rent.* buy speed, buy optionality, and keep your team focused on the thing only your company can do.
if you want, tell me your rough workload shape (steady vs spiky, cpu vs gpu, latency needs, compliance), and i'll give you a blunt "rent / colo / own" recommendation in 5 lines.
[1]: https://blog.comma.ai/datacenter/ "Owning a $5M data center - comma.ai blog"
Am I the only one that is simply scared of running your own cloud? What happens if your administrator credentials get leaked? At least with Azure I can phone microsoft and initiate a recovery. Because of backups and soft deletion policies quite a lot is possible. I guess you can build in these failsafe scenarios locally too? But what if a fire happens like in South Korea? Sure most companies run more immediate risks such as going bankrupt, but at least Cloud relieves me from the stuff of nightmares.
Except now I have nightmares that the USA will enforce the patriot act and force Microsoft to hand over all their data in European data centers and then we have to migrate everything to a local cloud provider. Argh...
Do you have a computer at home? Are you scared of its credentials leaking? A server is just another computer with a good internet connection.
You can equip your server with a mouse, keyboard and screen and then it doesn't even need credentials. The credential is your physical access to the mouse and keyboard.
Then literally own the cloud, like run the hardware on-prem yourself.
One thing I don't really understand here is why they're incurring the costs of having this physically in San Diego, rather than further afield with a full-time server tech essentially living on-prem, especially if their power numbers are correct. Is everyone being able to physically show up on site immediately that much better than a 24/7 pair of remote hands + occasional trips for more team members if needed?
Coolness factor of having a datacenter right in your office.
Stopped reading at "Our main storage arrays have no redundancy". This isn't a data center, it's a volatile AI memory bank.
This turns out to be a more and more important primitive for companies who are building their own models [1].
[1] https://si.inc/posts/the-heap/
Having worked only with the cloud, I really wonder if these companies don't use other software with subscriptions. Even though AWS is "expensive", it's just another line item compared to most companies' overall SaaS spend. Most businesses don't need that much compute or data transfer in the grand scheme of things.
Or better: write your software such that you can scale to tens of thousands of concurrent users on a single machine. That can really put the savings into perspective.
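A toy sketch of what that looks like in practice: an asyncio server that can hold a very large number of mostly idle connections cheaply, because each connection is a coroutine rather than a thread. The protocol and port are placeholders; the real limits on one box are tuning (file descriptors, memory) and whatever each request actually does:

```python
# Single-machine concurrency sketch: one process, many concurrent connections.
import asyncio

async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    data = await reader.readline()        # wait cheaply while the client is idle
    writer.write(b"ok: " + data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", 8080)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```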
If you were to read TFA, it is about ML training workloads, not web servers
Well the article starts out with a suggestion that we should all get a data center... It's quite a jump to assume that everyone reading this article needs to train their own LLMs.
capex vs opex the Opera.