Hacker News

Don't rent the cloud, own instead

1044 points by Torq_boi · today at 5:50 AM · 438 comments

Comments

adamcharnock · today at 8:57 AM

This is an industry we're[0] in. Owning is at one end of the spectrum, cloud is at the other, and there are broadly a couple of options in between:

1 - Cloud – This minimises cap-ex, hiring, and risk, while largely maximising operational cost (it's expensive) and cost variability (usage-based).

2 - Managed Private Cloud – What we do. Still minimal-to-no cap-ex, hiring, or risk, with a medium-sized operational cost (around 50% cheaper than AWS et al). We rent or colocate bare metal, manage it for you, handle software deployments, deploy only open source, etc. Only really makes sense above a €/$5k/month spend.

3 - Rented Bare Metal – Let someone else handle the hardware financing for you. Still minimal cap-ex, but with greater hiring/skilling and risk. Around 90% cheaper than AWS et al (plus time).

4 - Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills, scale, cap-ex, and if you plan to run the servers for at least 3-5 years.

A good provider for option 3 is someone like Hetzner. Their internal ROI on server hardware seems to be around the 3-year mark, after which I assume the hardware either keeps running with a client or goes into their server auction system.

Options 3 & 4 generally become more appealing either at scale, or when infrastructure is part of the core business. Option 1 is great for startups who want to spend very little initially, but then grow very quickly. Option 2 is pretty good for SMEs with baseline load, regular-sized business growth, and maybe an overworked DevOps team!

[0] https://lithus.eu, adam@
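To put rough numbers on the spectrum above, here is a sketch of monthly cost for a hypothetical steady workload; every figure (the cloud baseline, the 50%/90% discounts quoted above, the hardware price, the amortisation period, and the colo fee) is an assumption for illustration, not a quote from any provider, and staffing is deliberately left out.

```python
# Rough monthly-cost sketch for the four options above; all numbers are
# hypothetical placeholders (staffing cost excluded).
CLOUD_MONTHLY = 20_000                    # assumed AWS-style bill
MANAGED_PRIVATE = CLOUD_MONTHLY * 0.5     # "around 50% cheaper"
RENTED_BARE_METAL = CLOUD_MONTHLY * 0.1   # "around 90% cheaper"

HARDWARE_CAPEX = 60_000                   # assumed up-front hardware spend
SERVICE_LIFE_MONTHS = 48                  # assumed 4-year life (within 3-5 years)
COLO_AND_POWER = 500                      # assumed monthly colo + power + remote hands
owned_monthly = HARDWARE_CAPEX / SERVICE_LIFE_MONTHS + COLO_AND_POWER

for name, cost in [
    ("1. Cloud", CLOUD_MONTHLY),
    ("2. Managed private cloud", MANAGED_PRIVATE),
    ("3. Rented bare metal", RENTED_BARE_METAL),
    ("4. Buy + colocate", owned_monthly),
]:
    print(f"{name:<26} ~€{cost:>8,.0f}/month")
```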

scalemaxx · today at 4:25 PM

Everything comes full circle. Back in my day, we just called it a "data center". Or on-premises. You know, before the cloud even existed. A 1990s VP of IT would look at this post and say, what's new? Better computing for sure. Better virtualization and administration software, definitely. Cooling and power and racks? More of the same.

The argument made 2 decades ago was that you shouldn't own the infrastructure (capital expense) and instead just account for the cost as operational expense (opex). The rationale was you exchange ownership for rent. Make your headache someone else's headache.

The ping-pong between centralized vs. decentralized, owned vs. rented, will just keep going. It's never an either-or, but when companies make it all-or-nothing you have to really examine the specifics.

tgtweak · today at 6:24 PM

>San Diego has a mild climate and we opted for pure outside air cooling. This gives us less control of the temperature and humidity, but uses only a couple dozen kW. We have dual 48” intake fans and dual 48” exhaust fans to keep the air cool. To ensure low humidity (<45%) we use recirculating fans to mix hot exhaust air with the intake air. One server is connected to several sensors and runs a PID loop to control the fans to optimize the temperature and humidity.

Oh man, this is bad advice. Airborne humidity and contaminants will KILL your servers on a very short horizon in most places - even San Diego. I highly suggest enthalpy wheel coolers (KyotoCooling is one vendor - Switch runs very similar units in their massive data centers in the Nevada desert), as they remove heat from the indoor air using outdoor air (and can boost slightly with an integrated refrigeration unit to hit target intake temps) without exchanging air between the two sides. This has huge benefits for air quality control and outdoor-air tolerance, and a single 500 kW heat rejection unit uses only 25 kW of input power (when it needs to boost the AC unit's output). You can combine this with evaporative cooling on the exterior intakes to lower the temps even further at the expense of some water consumption (typically far cheaper than the extra electricity to boost the cooling through an HVAC cycle).

Not knocking the achievement, just speaking from experience: taking outdoor air (even filtered and mixed) into a data center is a recipe for hardware failure, and the mean time to failure is highly dependent on your outdoor air conditions. I've run 3 MW facilities with passive air cooling, and taking outdoor air directly into servers requires a LOT more conditioning and consideration than is outlined in this article.
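For readers unfamiliar with the control loop the quoted article mentions, here is a minimal sketch of a PID loop driving fan duty from a temperature reading. The gains, setpoint, and the two I/O functions are hypothetical placeholders, not comma's actual controller.

```python
import time

SETPOINT_C = 27.0           # assumed target intake temperature
KP, KI, KD = 4.0, 0.1, 1.0  # assumed gains; tune against the real plant

def read_intake_temp_c() -> float:
    # Placeholder: a real controller would read a temperature/humidity sensor.
    return 29.5

def set_fan_duty(duty_pct: float) -> None:
    # Placeholder: a real controller would write a PWM duty cycle to the fans.
    print(f"fan duty -> {duty_pct:.1f}%")

def pid_loop(steps: int = 10, dt: float = 5.0) -> None:
    integral, prev_error = 0.0, 0.0
    for _ in range(steps):
        error = read_intake_temp_c() - SETPOINT_C
        integral += error * dt
        derivative = (error - prev_error) / dt
        duty = KP * error + KI * integral + KD * derivative
        set_fan_duty(max(0.0, min(100.0, duty)))  # clamp to 0-100%
        prev_error = error
        time.sleep(dt)

if __name__ == "__main__":
    pid_loop()
```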

speedgoose · today at 8:11 AM

I would suggest using both on-premises hardware and cloud computing, which is probably what comma is doing.

For critical infrastructure, I would rather pay a competent cloud provider than be responsible for reliability issues myself. Maintaining one server room at headquarters is one thing, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.

For running many Slurm jobs on good servers, cloud computing is very expensive, and owning can pay for itself in a matter of months. And who cares if the server room is a total loss after a while; worst case you write some more YAML and Terraform and deploy a temporary replacement in the cloud.
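As a toy illustration of that "temporary replacement in the cloud" fallback, here is a sketch using boto3 rather than Terraform (purely to keep these examples in one language); the AMI ID, instance type, and key name are made-up placeholders.

```python
import boto3

# Hypothetical disaster-recovery stopgap: launch a temporary replacement
# instance while the server room is rebuilt. All identifiers are placeholders.
ec2 = boto3.client("ec2", region_name="eu-north-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder image with your stack baked in
    InstanceType="c5.4xlarge",        # placeholder size
    KeyName="dr-keypair",             # placeholder key pair
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "purpose", "Value": "temporary-dr-replacement"}],
    }],
)
print("Launched", response["Instances"][0]["InstanceId"])
```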

Another option in between is colocation, where you put hardware you own in a managed data center. It's a bit old-fashioned, but it may make sense in some cases.

I can also mention that research HPCs may be worth considering. In research, we have some of the world's fastest computers at a fraction of the cost of cloud computing. It's great as long as you don't mind not being root and having to use Slurm.

I don't know about the USA, but in Norway you can run your private company's Slurm AI workloads on research HPCs, though you will pay quite a bit more than universities and research institutions do. But you can also set up research projects together with universities or research institutions, and everyone will be happy if your business benefits a lot from the collaboration.

alecco · today at 10:39 PM

Counterpoint: "Why I'm Selling All My GPUs" https://www.youtube.com/watch?v=C6mu2QRVNSE

TL;DW: GPU rental arbitrage is dead. Regulation hell. GPU prices. Rental price erosion. Building costs rising. Complexity of things like backup power. Delays of connection to energy grid. Staffing costs.

IFC_LLC · today at 2:19 PM

This is cool. Yet, there are levels of insanity and those depend on your inability to estimate things.

When I'm launching a project it's easier for me to rent $250 worth of compute from AWS. When the project consumes $30k a month, it's easier for me to rent a colocation.

My point is that a good engineer should know how to calculate all the ups and downs here to propose a sound plan to the management. That's the winning thing.

kevinkatzke · today at 3:38 PM

Feels like I’ve lived through a full infrastructure fashion cycle already. I started my career when cloud was the obvious answer and on-prem was “legacy.”

Now on-prem is cool again.

Makes me wonder whether we’re already setting up the next cycle 10 years from now, when everyone rediscovers why cloud was attractive in the first place and starts saying “on-prem is a bad idea” again.

drnick1 · today at 4:02 PM

On-premises isn't only about saving money (the savings aren't always clear). The article neglects the most important benefits, which are freedom (control) and privacy. It's basically the same considerations that apply to owning vs. renting a house.

jillesvangurp · today at 7:42 AM

At scale (like comma.ai), it's probably cheaper. But until then it's a long term cost optimization with really high upfront capital expenditure and risk. Which means it doesn't make much sense for the majority of startup companies until they become late stage and their hosting cost actually becomes a big cost burden.

There are in-between solutions. Renting bare metal instead of renting virtual machines can be quite nice. I've done that via Hetzner some years ago. You pay about the same but get a lot more performance for the money. This is great if you actually need that performance.

People obsess about hardware, but there's also the software side to consider. For smaller companies, operations/devops people are usually more expensive than the resources they manage; that's the cost to optimize. The hosting cost is usually a rounding error on the staffing cost. And on top of that, the number of responsibilities increases as soon as you own the hardware. You need to service it, monitor it, replace it when it fails, make sure those fans don't get jammed by dust puppies, deal with outages when they happen, etc. All the stuff that you pay cloud providers to do for you now becomes your problem. And it has a non-zero cost.

The right mindset for hosting cost is to think of it in FTEs (the fully loaded cost of a full-time employee for a year). If it's below 1 (most startups until they are well into scale-up territory), you are doing great. Most of the optimizations you are going to get will cost you actual FTEs spent doing that work. 1 FTE pays for quite a bit of hosting - think $10K per month in AWS cost. A good ops person/developer is more expensive than that. My company runs at about $1K per month (GCP and misc managed services). It would be the wrong thing to optimize for us. It's not worth spending any amount of time on for me. I literally have more valuable things to do.

This flips when you start getting into multiple FTEs' worth of cost for just the hosting. At that point you probably have additional staffing cost of 5-10 FTEs anyway to babysit all of that. So now you can talk about trading some hosting FTEs for a modest number of extra staffing FTEs and making net gains.
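A back-of-the-envelope version of that FTE framing, with the salary and hosting figures as assumptions:

```python
# Express hosting spend in FTEs; both figures below are assumptions.
FTE_ANNUAL_COST = 150_000   # assumed fully loaded cost of one ops/dev FTE
hosting_monthly = 10_000    # assumed monthly cloud bill

hosting_ftes = hosting_monthly * 12 / FTE_ANNUAL_COST
print(f"Hosting is equivalent to {hosting_ftes:.1f} FTEs/year")
# Rule of thumb from the comment above: below ~1 FTE, optimising the hosting
# bill usually loses to spending that engineering time elsewhere.
```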

simianwords · today at 7:15 AM

The reason companies don't go with on-premises even when cloud is way more expensive is the risk involved in on-premises.

You can see quite clearly here that there are so many steps to take. A good company would concentrate risk on its differentiating factor, the specific part where it has a competitive advantage.

It's never about “is the expected cost of on-premises less than cloud”; it's about the risk-adjusted costs.

Once you've spread risk across not only your main product but also your infrastructure, it becomes hard.
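One way to read "risk-adjusted cost" here is expected spend plus the expected cost of the incidents and hires you now own; a toy sketch where every number is a made-up assumption:

```python
# Toy risk-adjusted comparison; every figure is an illustrative assumption.
cloud_monthly = 30_000              # assumed cloud bill
onprem_monthly = 12_000             # assumed amortised hardware + colo + power

p_major_outage_per_month = 0.02     # assumed probability of a serious incident
outage_cost = 250_000               # assumed revenue/engineering impact
extra_staff_monthly = 15_000        # assumed extra ops hiring to run it

onprem_risk_adjusted = (onprem_monthly
                        + extra_staff_monthly
                        + p_major_outage_per_month * outage_cost)

print(f"cloud: ${cloud_monthly:,}/mo   on-prem, risk-adjusted: ${onprem_risk_adjusted:,.0f}/mo")
```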

I would be wary of a smallish company building its own Jira in-house in a similar way.

3acctforcom · today at 6:30 PM

The lowest grade I got in my business degree was in the "IT management" course. That's because the ONLY acceptable answer to any business IT problem is to move everything to the cloud. Renting is ALWAYS better than owning because you transfer cost and risk to a 3rd party.

That's pretty much the dogma of the 2010s.

It doesn't matter that my org runs a line-of-business datacentre that is a fraction of the cost of public cloud. It doesn't matter that my "big" ERP and admin servers take up half a rack in that datacentre. MBA dogma says that I need to fire every graybeard sysadmin, raze our datacentre facility to the ground, and move to AWS.

Fun fact: salaries and hardware purchases typically track inflation, because switching costs for hardware are nil and hiring isn't that expensive. Software, meanwhile, usually goes up 5-10% every year, because vendors know that lock-in makes switching costs for software expensive.

vadepaysa · today at 6:03 PM

I was an on-prem maxi (if that's a thing) for a long time. I've run clusters that cost more than $5M, but these days I am a changed man. I start with a PaaS like Vercel and work my way down to on-prem depending on how important and cost-conscious that workload is.

Pains I faced running BIG clusters on-prem.

1. Supply chain management -- everything from power supplies all the way to GPUs and storage has to be procured, shipped, disassembled, and installed. You need a labor pool and dedicated management.

2. Inventory management -- You also need to manage inventory on hand for parts that WILL fail. You can expect 20% of your cluster to have some degree of issues on an ongoing basis.

3. Networking and security -- You are on your own defending your network or have to pay a ton of money to vendors to come in and help you. Even with the simplest of storage clusters, we've had to deal with pretty sophisticated attacks.

When I ran massive clusters, I had a large team dealing with these. Obviously, with PaaS, you don't need anyone.

swordsith · today at 10:17 PM

Recently learned about Tailscale and have been accessing my project from my phone; it's been a game changer. The fact that they support teams of up to 3 people and 100 devices on the free plan is awesome imo. Running locally just makes me feel so much more comfortable.

sgarland · today at 4:21 PM

Note that they're running R630/R730s for storage. Those are 12-year-old servers, and yet they say each one can do 20 Gbps (2.5 GB/s) of random reads. In comparison, the same generation of hardware at AWS ({c,m,r}4 instances) maxes out at 50% of that for EBS throughput on m4, and 70% on r4 - and that assumes carefully tuned block sizes.

Old hardware is _plenty_ powerful for a lot of tasks today.

hbogert · today at 7:06 AM

Data centers need cool, dry air? <45%?

No, low isn't good per se. I worked in a data center that had less than 40% humidity in winter; RAM was failing all over the place. Low humidity causes static electricity.

insuranceguru · today at 3:06 PM

The own vs rent calculus for compute is starting to mirror the market value vs replacement cost divergence we see in physical assets. Cloud is convenient because it lowers OpEx initially, but you lose control over the long-term CapEx efficiency. Once you reach a certain scale, paying the premium for AWS flexibility stops making sense compared to the raw horsepower of owned metal.

regular_trash · today at 5:14 PM

The distinction between rent/own is kind of a false dichotomy. You never truly own your platform - you just "rent" it in a more distributed way that shields you from a single stress point. The tradeoff is that you have to manage more resources to take care of it, but you have much greater flexibility.

I have a feeling AI is going to be similar in the future. Sure, you can "rent" access to LLMs and have agents doing all your code. And in the future, it'll likely be as good as most engineers today. But the tradeoff is that you are effectively renting your labor from a single source instead of having a distributed workforce. I don't know what the long-term ramifications are here, if any, but I thought it was an interesting parallel.

butterisgood · today at 2:17 PM

I think this is how IBM is making tons of money on mainframes. A lot of what people are doing with cloud can be done on premises with the right levels of virtualization.

https://intellectia.ai/news/stock/ibm-mainframe-business-ach...

60% YoY growth is pretty excellent for an "outdated" technology.

sys42590 · today at 6:51 AM

It would be interesting to hear their contingency plan for any kind of disaster (most commonly a fire) that hits their data center.

sakopov · today at 9:47 PM

Does anyone remember how cloud prices used to trend down? That was about 6 years ago and then seemingly after the pandemic everything started going the other way.

dh2022 · today at 9:34 PM

LOL’ed IRL at “In a future blog post I hope I can tell you about how we produce our own power and you should too.” Producing your own power as a prerequisite for running on-prem is a non-starter for many.

pja · today at 8:07 AM

I’m impressed that San Diego electrical power manages to be even more expensive than in the UK. That takes some doing.

epistasis · today at 5:40 PM

Ah Slurm, so good to see it still being used. As soon as I touched it in ~2010 I realized this was finally the solid queue management system we needed. Things like Sun Grid Engine or PBS were always such awful and burdensome PoS.

IIRC, Slurm came out of LLNL, and it finally made both usage and management of a cluster of nodes really easy and fun.

Compare Slurm to something like AWS Batch or Google Batch and just laugh at what the cloud has created...
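For anyone who hasn't touched it, day-to-day Slurm usage really is just a shell script with #SBATCH headers; here is a minimal sketch submitted from Python, where the partition, GPU count, and script paths are placeholders.

```python
import subprocess
import tempfile

# Minimal Slurm submission sketch; partition, resources, and paths are placeholders.
JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=train-sketch
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00
#SBATCH --output=%x-%j.out

srun python train.py --epochs 10
"""

with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
    f.write(JOB_SCRIPT)
    script_path = f.name

# sbatch replies with something like "Submitted batch job 12345".
result = subprocess.run(["sbatch", script_path], capture_output=True, text=True)
print(result.stdout.strip() or result.stderr.strip())
```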

yomismoaqui · today at 10:17 AM

This quote is gold:

The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I rather think about.

pu_pe · today at 8:24 AM

> Self-reliance is great, but there are other benefits to running your own compute. It inspires good engineering.

It's easy to inspire people when you have great engineers in the first place. That's a given at a place like comma.ai, but there are many companies out there where administering a datacenter is far beyond their core competencies.

I feel like skilled engineers have a hard time understanding the trade-offs cloud companies offer. Just as comma.ai likely doesn't run an in-house canteen, it can make sense to focus on what you are good at and outsource the rest.

ghc · today at 2:09 PM

If it were me, instead of writing all these bespoke services to replicate cloud functionality, I'd just buy oxide.computer systems.

MagicMoonlight · today at 8:15 PM

For ML it makes sense, because you’re using so much compute that renting it is just burning money.
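For the ML case the arithmetic is stark even on the back of an envelope; a sketch where the cloud hourly rate, server price, and utilisation are all assumptions rather than quotes:

```python
# GPU rent-vs-buy sketch; every price and ratio here is an assumption.
CLOUD_GPU_HOURLY = 2.50         # assumed on-demand rate per GPU, USD
GPUS = 64
UTILISATION = 0.9               # assumed fraction of hours spent training

OWNED_NODE_COST = 250_000       # assumed price of one 8-GPU node
NODES = GPUS // 8
POWER_AND_COLO_MONTHLY = 4_000  # assumed, across all nodes

cloud_monthly = CLOUD_GPU_HOURLY * GPUS * 24 * 30 * UTILISATION
owned_monthly = OWNED_NODE_COST * NODES / 36 + POWER_AND_COLO_MONTHLY  # 3-year amortisation

print(f"rented: ~${cloud_monthly:,.0f}/month")
print(f"owned:  ~${owned_monthly:,.0f}/month")
```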

For most businesses, it’s a false economy. Hardware is cheap, but having proper redundancy and multiple sites isn’t. Having a 24/7 team available to respond to issues isn’t.

What happens if their data centre loses power? What if it burns down?

ynac · today at 6:33 PM

Not nearly on the article's level, but I've been operating what I call a fog machine (itsy bitsy personal cloud) for about 15 years. It's just a bunch of local and off-site NAS boxes. It has kinda worked out great. Mostly Synology, but probably won't be when their scheduled retirement comes up. The networking is dead simple, the power use is distributed, and the size of it all is still a monster for me - back in the day, I had to use it for a very large audio project to keep backups of something like 750,000 albums and other audio recordings along with their metadata and assets.

apothegm · today at 1:05 PM

This also depends so much on your scaling needs. If you need 3 mid-sized ECS/EC2 instances, a load balancer, and a database with backups, renting those from AWS isn’t going to be significantly more expensive for a decent-sized company than hiring someone to manage a cluster for you and dealing with all the overhead of keeping it maintained and secure.

If you’re at the scale of hundreds of instances, that math changes significantly.

And a lot of it depends on what type of business you have and what percent of your budget hosting accounts for.

siliconc0w · today at 4:34 PM

You can also buy the hardware and hire an IT vendor to rack it and help manage it as smart hands, so you never need to visit the data center. With modern beefy hardware, even large web services only need a few racks, so most orgs don't even need to manage a large footprint.

Sure, you have to schedule your own hardware repairs and updates, but it also means you don't need to wrangle with ridiculous cost engineering, reserved instances, cloud product support issues, API deprecations, proprietary configuration languages, etc.

Bare metal is better for a lot of non-cost reasons too; as the article notes, it's just easier/better to reason about the lower-level primitives, and you get more reliable and repeatable performance.

JKCalhoun · today at 2:28 PM

Naive comment from a hobbyist with nothing close to $5M: I'm curious about the degree to which you build a "home lab" equivalent. I mean if "scaling" turned out to be just adding another Raspberry Pi to the rack (where is Mr. Geerling when you need him?) I could grow my mini-cloud month by month as spending money allowed.

(And it would be fun too.)

Maro · today at 11:00 AM

Working at a non-tech regional bigco, where of course cloud is the default, I see every day how AWS costs get out of hand; it's a constant struggle just to keep costs flat. In our case, the reality is that NONE of our services require scalability, and high uptime is nice primarily for my blood pressure... we only really need uptime during business hours; nobody cares what happens at night when everybody is sleeping.

On the other hand, there's significant vendor lock-in, complexity, etc. And I'm not really sure we actually end up with fewer people over time; headcount always expands, and there are always cool new projects like monitoring, observability, AI, etc.

My feeling is, if we rented 20-30 chunky machines and ran Linux on them, with k8s, we'd be 80% there. For specific things I'd still use AWS, like infinite S3 storage, or RDS instances for super-important data.

If I were to do a startup, I would almost certainly not base it off AWS (or other cloud), I'd do what I write above: run chunky servers on OVH (initially just 1-2), and use specific AWS services like S3 and RDS.

A bit unrelated to the above, but I'd also try to keep away from expensive SaaS like Jira, Slack, etc. I'd use the best self-hosted open source version, and be done with it. I'd try Gitea for git hosting, Mattermost for team chat, etc.

And actually, given the geo-political situation as an EU citizen, maybe I wouldn't even put my data on AWS at all and self-host that as well...

komali2 · today at 3:12 PM

> The cloud requires expertise in company-specific APIs and billing systems.

This is one reason I hate dealing with AWS. It feels like a waste of time in some ways. Like learning a fly-by-night javascript library - maybe I'm better off spending that time writing the functionality on my own, to increase my knowledge and familiarity?

0xbadcafebee · today at 6:32 PM

  If your business relies on compute, and you run that compute in the cloud, you are putting a lot of trust in your cloud provider. Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out. If you want to control your own destiny, you must run your own compute.
This is not a valid reason for running your own datacenter, or running your own server.

  Self-reliance is great, but there are other benefits to running your own compute. It inspires good engineering. Maintaining a data center is much more about solving real-world challenges. The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I rather think about.
This is not a valid reason for running your own datacenter, or running your own server.

  Avoiding the cloud for ML also creates better incentives for engineers. Engineers generally want to improve things. In ML many problems go away by just using more compute. In the cloud that means improvements are just a budget increase away. This locks you into inefficient and expensive solutions. Instead, when all you have available is your current compute, the quickest improvements are usually speeding up your code, or fixing fundamental issues.
This is not a valid reason for owning a datacenter, or running your own server.

  Finally there’s cost, owning a data center can be far cheaper than renting in the cloud. Especially if your compute or storage needs are fairly consistent, which tends to be true if you are in the business of training or running models. In comma’s case I estimate we’ve spent ~5M on our data center, and we would have spent 25M+ had we done the same things in the cloud.
This is one of only two valid reasons for owning a datacenter, and one of several valid reasons for running your own server.

The only two valid reasons to build/operate a datacenter: 1) what you're doing is so costly that building your own factory is the only profitable way for your business to produce its widgets, 2) you can't find a datacenter with the location or capacity you need and there is no other way to serve your business needs.

There's many valid reasons to run your own servers (colo), although most people will not run into them in a business setting.

nubela · today at 12:37 PM

Same thing. I was previously spending $5-8K on DigitalOcean, supposedly a "budget" cloud. Then the company was sold, and I started a new company on entirely self-hosted hardware. Cloudflare tunnel + CC + microk8s made it trivial! And I spend close to nothing beyond the internet connection I'm already paying for. I do have solar power too.

juvoly · today at 9:12 AM

> Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out. If you want to control your own destiny, you must run your own compute.

Cost and lock-in are obvious factors, but "sovereignty" has also become a key factor in the sales cycle, at least in Europe.

Handling health data, Juvoly is happy to run AI workloads on premises.

ex-aws-dude · today at 5:26 PM

I can see how this would work fine if the primary purpose is training rather than serving large volumes of customer traffic in multiple regions.

It would probably even make sense for some companies to still use the cloud for their API but do the training on-prem, as that may be the expensive part.

bob1029 · today at 11:09 AM

The #1 reason I would advocate for using AWS today is the compliance package they bring to the party. No other cloud provider has anything remotely like Artifact. I can pull Amazon's PCI-DSS compliance documentation using an API call. If you have a heavily regulated business (or work with customers who do), AWS is hard to beat.
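For example, something along these lines should list the available compliance reports, assuming a recent boto3 release that ships the Artifact client and an IAM role allowed to use it (the response field names are from memory, so treat them as assumptions):

```python
import boto3

# Sketch: enumerate compliance reports via AWS Artifact.
# Requires a recent boto3 that includes the "artifact" client plus the
# corresponding IAM permissions; field names below are assumptions from memory.
client = boto3.client("artifact", region_name="us-east-1")

for report in client.list_reports().get("reports", []):
    print(report.get("name"), "-", report.get("id"))
```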

If you don't have any kind of serious compliance requirement, using Amazon is probably not ideal. I would say that Azure AD is ok too if you have to do Microsoft stuff, but I'd never host an actual VM on that cloud.

Compliance and "Microsoft stuff" covers a lot of real world businesses. Going on prem should only be done if it's actually going to make your life easier. If you have to replicate all of Azure AD or Route53, it might be better to just use the cloud offerings.

cgsmith · today at 7:02 AM

I used to colocate a 2U server that I purchased with a local data center. It was a great learning experience for me. I'm curious why a company wouldn't colocate its own hardware. Proximity isn't an issue when you can have the data center perform physical tasks for you. Bravo to the comma team regardless. It'll be a great learning experience and make each person on their team better.

P.S. BX cable instead of conduit for the electrical looks cringe.

kavalg · today at 7:55 AM

This was one of the coolest job ads I've ever read :). Congrats on what you have done with your infrastructure, team, and product!

eubluue · today at 6:00 PM

On top of that, now that the US CLOUD Act is again a weapon against the EU, most European companies know better and are migrating in droves to colo, on-prem, and EU clouds. Bye-bye, US hyperscalers!

Dormeno · today at 9:24 AM

The company I work for used to have a hybrid setup where 95% was on-prem, but it became closer to 90% in the cloud when on-prem got more expensive because of VMware licensing. There are alternatives to VMware, but they aren't officially supported with our hardware configuration, so switching would require replacing all the hardware, which still costs more than the cloud. Almost everything we have is cloud-agnostic, and anything that requires resilience sits in two different providers.

Now the company is looking at further cost savings, as the buildings rented for running on-prem are sitting mostly unused and building prices have gone up notably in recent years, so we're likely to save money by moving into the cloud. This is likely to make the cloud transition permanent.

danpalmer · today at 7:28 AM

> Cloud companies generally make onboarding very easy, and offboarding very difficult.

I reckon most on-prem deployments have significantly worse offboarding than the cloud providers. As a cloud provider you can win business by having something for offboarding, but internally you'd never get buy-in to spend on a backup plan if you decide to move to the cloud.

comrade1234 · today at 7:05 AM

15 years ago or so, a spreadsheet was floating around where you could enter server costs, compute power, etc., and it would tell you when you would break even by buying instead of going with AWS. I think it was leaked from Amazon, because it was always three years to break even, even as hardware changed over time.
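The spreadsheet logic is easy to reproduce; a sketch with placeholder inputs (the roughly-three-year result obviously depends entirely on what you plug in):

```python
# Toy break-even calculation in the spirit of that spreadsheet.
# All inputs are placeholders; substitute real quotes.
server_capex = 15_000           # assumed purchase price per server
colo_power_monthly = 250        # assumed colo + power per server
admin_monthly = 300             # assumed share of sysadmin time per server
aws_equivalent_monthly = 970    # assumed on-demand cost of a comparable instance

owned_monthly = colo_power_monthly + admin_monthly
savings_per_month = aws_equivalent_monthly - owned_monthly
break_even_months = server_capex / savings_per_month
print(f"Break-even after ~{break_even_months:.0f} months")  # ~36 with these inputs
```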

b8 · today at 6:08 PM

SSDs don't last longer than HDDs. Also, they're much more expensive due to AI now. They should move to cut down on power costs.
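For a sense of scale on the power argument, a rough per-drive electricity calculation; the wattages and the electricity price are assumptions, not measurements:

```python
# Rough annual electricity cost per drive; wattages and price are assumptions.
HDD_WATTS = 8.0        # assumed typical 3.5" HDD draw
SSD_WATTS = 3.0        # assumed typical SATA/NVMe SSD draw
PRICE_PER_KWH = 0.35   # assumed commercial rate in USD (San Diego power is pricey)

def annual_cost(watts: float) -> float:
    return watts / 1000 * 24 * 365 * PRICE_PER_KWH

print(f"HDD: ${annual_cost(HDD_WATTS):.0f}/yr per drive, "
      f"SSD: ${annual_cost(SSD_WATTS):.0f}/yr per drive")
```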

durakot · today at 7:54 AM

There's the HN I know and love

imcritic · today at 12:57 PM

I love articles like this and companies with this kind of openness. Mad respect to them for this article and for sharing software solutions!

evertheylen · today at 9:25 AM

> Maintaining a data center is much more about solving real-world challenges. The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I rather think about.

I find this applicable on a smaller scale too! I'd rather set up and debug a beefy Linux VPS via SSH than fiddle with various proprietary cloud APIs/interfaces. It doesn't go as low-level as Watts, bits, and FLOPs, but I still consider knowledge of Linux more valuable than knowing which Azure knobs to turn.

wessorh · today at 8:40 PM

What is the underlying filesystem for your KV store? It doesn't appear to use raw devices.
