I'm thinking about doing a research project at my university looking into distributed "data centers" hosted by communities instead of centralized cloud providers.
The trick is in how to create mostly self-maintaining deployable/swappable data centers at low cost...
I just read about Railway doing something similar; sadly, their prices are still high compared to other bare-metal providers, and even to a VPS like Hetzner with Dokploy: a very similar feature set, yet for the same 5 dollars you get far more CPU, storage, and RAM.
Goes for small businesses and individuals as well. Sure, there are times when the cloud makes sense, but you can, and should, do a lot on your own hardware.
> San Diego power cost is over 40c/kWh, ~3x the global average. It’s a ripoff, and overpriced simply due to political dysfunction.
Would anyone mind elaborating? I always thought this was a direct result of the free market. I'm not sure if by "dysfunction" the OP means a lack of intervention.
Cloud, in the sense of "another company's infrastructure", always implies losing the competence to select, source, and operate hardware. Treat hardware as a commodity and eventually your own business gets treated as a commodity too: someone can just copy your software/IP and ruin your business. Every durable business needs some kind of intellectual property and human skills that aren't easily replaceable. This sounds binary, but it isn't. You can build long-lasting partnerships. The German Mittelstand did that over decades.
What is the underlying filesystem for your KV store? It doesn't appear to use raw devices.
I love this article. Great write-up. It gave me the same feeling as reading about Stack Overflow's handful of servers that ran all of their sites.
Don't even have to go this far. Colocating in a couple regions will give you most of the logistical thrills at a fraction of the cost!
Even at the personal blog level, I'd argue it's worth it to run your own server (even if it's just an old PC in a closet). Gets you on the path to running a home lab.
Hetzner bare metal ran much of crypto for many years before they cracked down on it.
I cancelled my DigitalOcean server of almost a decade late last year and replaced it with a Raspberry Pi 3 that was doing nothing. We can do it, we should do it.
Look at the bottom of that page:
An error occurred: API rate limit already exceeded for installation ID 73591946.
Error from https://giscus.app/
Fellow says one thing and uses another.
Microsoft made the TCO argument and won. Self-hosting is only an option if you can afford expensive SysOps/DevOps/WhateverWeAreCalledTheseDays to manage it.
The cloud is a psyop, a scam. Except at the tiniest free-tier / near free-tier use cases, or true scale to zero setups.
I've helped a startup with 2.5M revenue reduce their cloud spend from close to 2M/yr to below 1M/yr. They could have reached 250k/yr renting bare-metal servers. Probably 100k/yr in colos by spending 250k once on hardware. They had the staff to do it but the CEO was too scared.
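A rough back-of-envelope with the numbers above (the three-year amortization period is my assumption, not the commenter's):

```python
# Back-of-envelope comparison of the annual run rates cited above.
# Assumption (mine): colo hardware amortized over 3 years.

cloud_before = 2_000_000      # $/yr, original cloud spend
cloud_after = 1_000_000       # $/yr, after optimization
bare_metal_rented = 250_000   # $/yr, rented dedicated servers
colo_opex = 100_000           # $/yr, colocation + power
colo_capex = 250_000          # one-time hardware purchase
amortization_years = 3

colo_effective = colo_opex + colo_capex / amortization_years
print(f"colo effective annual cost: ${colo_effective:,.0f}")          # ~$183k/yr
print(f"savings vs optimized cloud: ${cloud_after - colo_effective:,.0f}/yr")
```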
Cloud evangelism (is it advocacy now?) messed up the minds of swaths of software engineers. Suddenly costs didn't matter and scaling was the answer to poor designs. Sizing your resource requirements became a lost art, and getting into reaction mode became law.
Welcome to "move fast and get out of business", all enabled by cloud architecture blogs that recommend tight integration with vendor lock-in mechanisms.
Use the cloud to move fast, but stick to cloud-agnostic tooling so that it doesn't suck you in forever.
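One concrete version of that (a sketch, not the only way): talk to object storage through the S3-compatible API with a configurable endpoint, so the same code can point at AWS, a self-hosted MinIO, or another provider. The env var names and bucket below are placeholders.

```python
# Sketch: S3-compatible object storage with a configurable endpoint,
# so the same code can target AWS, MinIO, or another S3-compatible store.
# Env var names, bucket, and key are placeholders.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("OBJECT_STORE_ENDPOINT"),  # None -> AWS default
    aws_access_key_id=os.environ["OBJECT_STORE_KEY"],
    aws_secret_access_key=os.environ["OBJECT_STORE_SECRET"],
)

s3.put_object(Bucket="my-bucket", Key="hello.txt", Body=b"portable by construction")
print(s3.get_object(Bucket="my-bucket", Key="hello.txt")["Body"].read())
```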
I've seen how much cloud vendors are willing to spend to get business. That's when you realize just how massive their margins are.
This is Hacker News; do the math, for the love of god.
There are good business and technical reasons to choose a public cloud.
There are good business and technical reasons to choose a private cloud.
There are good business and technical reasons to do something in-between or hybrid.
The endless "public cloud is a ripoff" or "private clouds are impossible" is just a circular discussion past each other. Saying to only use one or another is textbook cargo-culting.
> We use SSDs for reliability and speed.
Hey, how do SSDs fail lately? Do they ... vanish off the bus still? Or do they go into read only mode?
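For what it's worth, both failure modes still turn up; the practical move is to watch SMART health before either happens. A rough sketch that shells out to smartctl (assumes smartmontools is installed; the device path is a placeholder):

```python
# Rough sketch: poll SMART health via smartctl (from smartmontools).
# Device path is a placeholder; needs appropriate privileges to run.
import json
import subprocess

def smart_health(device: str = "/dev/nvme0n1") -> dict:
    # -H: overall health assessment, --json: machine-readable output
    out = subprocess.run(
        ["smartctl", "-H", "--json", device],
        capture_output=True, text=True,
    )
    report = json.loads(out.stdout)
    return report.get("smart_status", {})

print(smart_health())  # e.g. {'passed': True}
```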
TLDR:
> In comma’s case I estimate we’ve spent ~5M on our data center, and we would have spent 25M+ had we done the same things in the cloud.
IMO, that's the biggie. It's enough to justify paying someone to run their datacenter. I wish there was a bit more detail to justify those assumptions, though.
That being said, if their needs grow by orders of magnitude, I'd anticipate that they would want to move their servers somewhere with cheaper electricity.
This is a great solution for a very specific type of team but I think most companies with consistent GPU workloads will still just rent dedicated servers and call it a day.
Well, their comment section is for sure not running on premises, but in the cloud:
"An error occurred: API rate limit already exceeded for installation ID 73591946."
Is there a client for selling off your own unused private-cloud capacity?
Not long ago Railway moved from GCP to their own infrastructure because GCP was getting very expensive for them. [0] Some go for an Oxide rack [1], a full-stack solution (both hardware and software) for intense GPU workloads, instead of building it themselves.
It's very expensive and only makes sense if you really need infrastructure sovereignty. It makes more sense if you're profitable in the tens of millions after raising hundreds of millions.
It also makes sense for governments (including those in the EU), which should think about this and keep the compute in house, disconnected from the internet, if they are serious about infrastructure sovereignty rather than depending on US-based providers such as AWS.
In case anyone from comma.ai reads this: the "CTO @ comma.ai" link at the end is broken; it’s relative instead of absolute.
The observation about incentives is underappreciated here. When your compute is fixed, engineers optimize code. When compute is a budget line, engineers optimize slide decks. That's not really a cloud vs on-prem argument, it's a psychology-of-engineering argument.
Mark my words: cloud will fall out of fashion, but it will come back into fashion under another name in some number of years. It's cyclical.
I've just shifted to Hetzner, no regrets.
IT dinosaur here, who has run and engineered the entire spectrum over the course of my career.
Everything is a trade-off. Every tool has its purpose. There is no "right way" to build your infrastructure, only a right way for you.
In my subjective experience, the trade-offs are generally along these lines:
* Platform as a Service (Vercel, AWS Lambda, Azure Functions, basically anything where you give it your code and it "just works"): great for startups, orgs with minimal talent, and those with deep pockets for inevitable overruns. Maximum convenience means maximum cost. Excellent for weird customer one-offs you can bill for (and slap a 50% margin on top). Trade-off is that everything is abstracted away, making troubleshooting underlying infrastructure issues nigh impossible; also that people forget these things exist until the customer has long since stopped paying for them or a nasty bill arrives.
* Infrastructure as a Service (AWS, GCP, Azure, Vultr, etc; commonly called the "Public Cloud"): great for orgs with modest technical talent but limited budgets or infrastructure that's highly variable (scales up and down frequently). Also excellent for everything customer-facing, like load balancers, frontends, websites, you name it. If you can invoice someone else for it, putting it in here makes a lot of sense. Trade-off is that this isn't yours, it'll never be yours, you'll be renting it forever from someone else who charges you a pretty penny and can cut you off or raise prices anytime they like.
* Managed Service/Hosting Providers (e.g., ye olde Rackspace): you don't own the hardware, but you're also not paying the premium for infrastructure orchestrators. As close to bare metal as you can get without paying for actual servers. Excellent for short-term "testing" of PoCs before committing CapEx, or for modest infrastructure needs that aren't likely to change substantially enough to warrant a shift either on-prem or off to the cloud. You'll need more talent though, and you're ultimately still renting the illusion of sovereignty from someone else in perpetuity.
* Bare Metal, be it colocation or on-premises: you own it, you decide what to do with it, and nobody can stop you. The flip side is you have to bootstrap everything yourself, which can be a PITA depending on what you actually want - or what your stakeholders demand you offer. Running VMs? Easy-peasy. Bare metal K8s clusters? I mean, it can be done, but I'd personally rather chew glass than go without a managed control plane somewhere. CapEx is insane right now (thanks, AI!), but TCO is still measured in two to three years before you're saving more than you'd have spent on comparable infrastructure elsewhere, even with savings plans. Talent needs are highly variable - a generalist or two can get you 80% to basic AWS functionality with something like Nutanix or VCF (even with fancy stuff like DBaaS), but anything cutting edge is going to need more headcount than a comparable IaaS build. God help you if you opt for a Microsoft stack, as any on-prem savings are likely to evaporate at your next True-Up.
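To put that "two to three years" figure in concrete terms, a rough payback sketch; every number below is an illustrative placeholder, not a real quote:

```python
# Illustrative payback-period sketch; all figures are made-up placeholders.
capex = 600_000             # servers, switches, racks (one-time)
onprem_opex = 120_000       # $/yr: power, space, support contracts, spares
cloud_equivalent = 450_000  # $/yr: what comparable rented capacity would cost

annual_savings = cloud_equivalent - onprem_opex
payback_years = capex / annual_savings
print(f"payback in ~{payback_years:.1f} years")  # ~1.8 years with these inputs
```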
In my experience, companies have bought into the public cloud/IaaS because they thought it'd save them money versus the talent needed for on-prem; to be fair, back when every enterprise absolutely needed a network team and a DB team and a systems team and a datacenter team, this was technically correct. Nowadays, most organizational needs can be handled with a modest team of generalists or a highly competent generalist and one or two specialists for specific needs (e.g., a K8s engineer and a network engineer); modern software and operating systems make managing even huge orgs a comparable breeze, especially if you're running containers or appliances instead of bespoke VMs.
As more orgs like Comma or Basecamp look critically at their infrastructure needs versus their spend, or they seriously reflect on the limited sovereignty they have by outsourcing everything to US Tech companies, I expect workloads and infrastructure to become substantially more diversified than the current AWS/GCP/Azure trifecta.
One thing I don't really understand here is why they're incurring the costs of having this physically in San Diego, rather than further afield with a full-time server tech essentially living on-prem, especially if their power numbers are correct. Is everyone being able to physically show up on site immediately that much better than a 24/7 pair of remote hands + occasional trips for more team members if needed?
I like Hotz’s style: simply and straightforwardly attempting the difficult and complex. I always get the impression: “You don’t need to be too fancy or clever. You don’t need permission or credentials. You just need to go out and do the thing. What are you waiting for?”
Am I the only one that is simply scared of running your own cloud? What happens if your administrator credentials get leaked? At least with Azure I can phone microsoft and initiate a recovery. Because of backups and soft deletion policies quite a lot is possible. I guess you can build in these failsafe scenarios locally too? But what if a fire happens like in South Korea? Sure most companies run more immediate risks such as going bankrupt, but at least Cloud relieves me from the stuff of nightmares.
Except now I have nightmares that the USA will enforce the patriot act and force Microsoft to hand over all their data in European data centers and then we have to migrate everything to a local cloud provider. Argh...
Clouds suck. But so does “on premises”. Or co-location.
In the future, what you will need to remain competitive is computing at the edge. Only one company is truly poised to deliver on that at massive scale.
And finally we reach the point where you're not shot down for explaining that if you invest in ownership, then when it's all over you still have something left with intrinsic value, regardless of what you were doing with it.
Otherwise, well just like that gym membership, you get out what you put into it...
> In a future blog post I hope I can tell you about how we produce our own power and you should too.
Rackmounted fusion reactors, I hope. Would solve my homelab wattage issues too.
If I understood correctly, you don't use Kubernetes, right? Did you consider it?
Having only ever worked with the cloud, I really wonder whether these companies don't also have plenty of other subscription software. Even though AWS is "expensive", it's just another line item compared to most companies' overall SaaS spend. Most businesses don't need that much compute or data transfer in the grand scheme of things.
Stopped reading at "Our main storage arrays have no redundancy". This isn't a data center, it's a volatile AI memory bank.
Or better: write your software such that you can scale to tens of thousands of concurrent users on a single machine. That can really put the savings into perspective.
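As a toy illustration of the mechanism (not a claim about any particular stack): a single asyncio process can hold a very large number of mostly idle connections; the practical limits tend to be file-descriptor counts and the real work done per request. Port and handler below are placeholders.

```python
# Toy sketch: one asyncio process serving many concurrent connections.
# Port and handler logic are illustrative only.
import asyncio

async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    await reader.readline()  # read one request line per client
    writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```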
ChatGPT:
# don’t own the cloud, rent instead
the “build your own datacenter” story is fun (and comma’s setup is undeniably cool), but for most companies it’s a seductive trap: you’ll spend your rarest resource (engineer attention) on watts, humidity, failed disks, supply chains, and “why is this rack hot,” instead of on the product. comma can justify it because their workload is huge and steady, they’re willing to run non-redundant storage, and they’ve built custom GPU boxes and infra around a very specific ML pipeline. ([comma.ai blog][1])
## 1) capex is a tax on flexibility
a datacenter turns “compute” into a big up-front bet: hardware choices, networking choices, facility choices, and a depreciation schedule that does not care about your roadmap. cloud flips that: you pay for what you use, you can experiment cheaply, and you can stop spending the minute a strategy changes. the best feature of renting is that quitting is easy.
## 2) scaling isn’t a vibe, it’s a deadline
real businesses don’t scale smoothly. they spike. they get surprise customers. they do one insane training run. they run a migration. owning means you either overbuild “just in case” (idle metal), or you underbuild and miss the moment. renting means you can burst, use spot/preemptible for the ugly parts, and keep steady stuff on reserved/committed discounts.
## 3) reliability is more than “it’s up most days”
comma explicitly says they keep things simple and don’t need redundancy for ~99% uptime at their scale. ([comma.ai blog][1]) that’s a perfectly valid trade—if your business can tolerate it. many can’t. cloud providers sell multi-zone, multi-region, managed backups, managed databases, and boring compliance checklists because “five nines” isn’t achieved by a couple heroic engineers and a PID loop.
## 4) the hidden cost isn’t power, it’s people
comma spent ~$540k on power in 2025 and runs up to ~450kW, plus all the cooling and facility work. ([comma.ai blog][1]) but the larger, sneakier bill is: on-call load, hiring niche operators, hardware failures, spare parts, procurement, security, audits, vendor management, and the opportunity cost of your best engineers becoming part-time building managers. cloud is expensive, yes—because it bundles labor, expertise, and economies of scale you don’t have.
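(A quick sanity check of those two figures, assuming the ~40c/kWh rate quoted elsewhere in this thread; comma's actual tariff may well differ:)

```python
# Back-of-envelope: what average draw do $540k/yr and ~$0.40/kWh imply?
# The $0.40/kWh rate is the figure quoted elsewhere in this thread, not comma's.
annual_power_bill = 540_000   # $
price_per_kwh = 0.40          # $/kWh (assumed)
hours_per_year = 8760

kwh_consumed = annual_power_bill / price_per_kwh   # ~1.35M kWh
avg_draw_kw = kwh_consumed / hours_per_year        # ~154 kW
print(f"average draw ~{avg_draw_kw:.0f} kW vs ~450 kW peak capacity")
```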
## 5) “vendor lock-in” is real, but self-lock-in is worse
cloud lock-in is usually optional: you choose proprietary managed services because they’re convenient. if you’re disciplined, you can keep escape hatches: containers, kubernetes, terraform, postgres, object storage abstractions, multi-region backups, and a tested migration plan. owning your datacenter is also lock-in—except the vendor is past you, and the contract is “we can never stop maintaining this.”
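One way to keep those escape hatches real is to have application code depend on a tiny interface instead of a provider SDK; a sketch, with names that are mine rather than anything from the article:

```python
# Sketch of an "escape hatch" interface: application code depends on this
# protocol, and the S3 / filesystem / other backend is swappable.
# Names are illustrative only.
from pathlib import Path
from typing import Protocol

class BlobStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class LocalBlobStore:
    """Filesystem-backed implementation, handy for tests and on-prem."""
    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

# A cloud-backed class with the same two methods would slot in unchanged.
store: BlobStore = LocalBlobStore("/tmp/blobs")
store.put("models/latest.txt", b"v42")
print(store.get("models/latest.txt"))
```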
## the practical rule
*if you have massive, predictable, always-on utilization, and you want to become good at running infrastructure as a core competency, owning can win.* that’s basically comma’s case. ([comma.ai blog][1]) *otherwise, rent.* buy speed, buy optionality, and keep your team focused on the thing only your company can do.
if you want, tell me your rough workload shape (steady vs spiky, cpu vs gpu, latency needs, compliance), and i’ll give you a blunt “rent / colo / own” recommendation in 5 lines.
[1]: https://blog.comma.ai/datacenter/ "Owning a $5M data center - comma.ai blog"
capex vs opex the Opera.
And now go do that in another region. Bam, savings gone. /s
What I mean is that I'm assuming the math here works because the primary purpose of the hardware is training models; you don't need six or seven nines for that, I imagine. But when you have customers across geographies using your app on those servers pretty much 24/7, you can't afford much downtime.
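For context on what each extra nine actually buys, the allowed downtime per year at each availability level:

```python
# Allowed downtime per year for each availability level.
minutes_per_year = 365 * 24 * 60

for nines in (0.99, 0.999, 0.9999, 0.99999):
    downtime_min = (1 - nines) * minutes_per_year
    print(f"{nines:.5f} -> {downtime_min:,.1f} min/yr (~{downtime_min / 60:.1f} h)")
```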
Realistically, it's the speed with which you can expand and contract. The cloud gives unbounded flexibility - not on the per-request scale or whatever, but on the per-project scale. To try things out with a bunch of EC2s or GCEs is cheap. You have it for a while and then you let it go. I say this as someone with terabytes of RAM in servers, and a cabinet I have in the Bay Area.