Insufficient mock data in the staging environment? Like no BYOIP prefixes at all? Since even one prefix should have shown that it would be deleted by that subtask...
From all the recent outages, it sounds like Cloudflare is barely tested at all. Maybe they have lots of unit tests etc, but they do not seem to test their whole system... I get that their whole setup is vast, but even testing that subtask manually would have surfaced the bug
Reliability was/is CF's label.
It's alarming already. Too many outages in the past months. CF should fix it, or it becomes unacceptable and people will leave the platform.
I really hope they will figure things out.
Not sure why everyone is complaining, new MCP features are more important than uptime
I do not work in the space at all, but it seems like Cloudflare has been having more network disruptions lately than they used to. To anyone who deals with this sort of thing, is that just recency bias?
is this blog post LLM generated?
the explanation makes no sense:
> Because the client is passing pending_delete with no value, the result of Query().Get(“pending_delete”) here will be an empty string (“”), so the API server interprets this as a request for all BYOIP prefixes instead of just those prefixes that were supposed to be removed. The system interpreted this as all returned prefixes being queued for deletion.
client:
resp, err := d.doRequest(ctx, http.MethodGet, `/v1/prefixes?pending_delete`, nil)
server: if v := req.URL.Query().Get("pending_delete"); v != "" {
// ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
if err != nil {
api.RenderError(ctx, w, ErrInternalError)
return
}
api.Render(ctx, w, http.StatusOK, renderIPPrefixAPIResponse(prefixes, nil))
return
}
even if the client had passed a value it would have still done exactly the same thing, as the value of "v" (or anything from the request) is not used in that blockThis blog post is inaccurate, the prefixes were being revoked over and over - to keep your prefixes advertised you had to have a script that would readd them or else it would be withdrawn again. The way they seemed to word it is really dishonest.
While neither am I nor the company I work for directly impacted by this outage, I wonder how long can Cloudflare take these hits and keep apologizing for it. Truly appreciate them being transparent about it, but businesses care more about SLAs and uptime than the incident report.
The irony is that the outage was caused by a change from the "Code Orange: Fail Small initiative".
They definitely failed big this time.
Hindsight is 20/20 but why not dry run this change in production and monitor the logs/metrics before enabling it? Seems prudent for any new “delete something in prod” change.
The one redeeming feature of this failure is staged rollouts. As someone advertising routes through CF, we were quite happy to be spared from the initial 25%.
They should have rewritten this code in Rust using these brilliant language models. /jk
I'm honestly amazed that a company CF's size doesn't have a neat little cluster of Mac Minis running OpenClaw and quietly taking care of this for them.
The eternal tech outage aphorism: It's always DNS, except for when it's BGP.
> Because the client is passing pending_delete with no value, the result of Query().Get(“pending_delete”) here will be an empty string (“”), so the API server interprets this as a request for all BYOIP prefixes instead of just those prefixes that were supposed to be removed.
Lmao, iirc long time ago Google's internal system had the same exact bug (treating empty as "all" in the delete call) that took down all their edges. Surprisingly there was little impact as traffic just routed through the next set of proxies.
One has to wonder when the board realises Dane was a bad replacement for JGC. These outages are getting ridiculous
Is this trend of oversharing code snippets and TMI postmortems done purposely to distract their customers from raging over the outage and the next impending fuckup?
This transparent report can earn my trust
If you track large SaaS and Cloud uptime, it seem to correlate pretty highly with compensation for big companies. Is cloudflare getting top talent?
Sure vibe-coded slop that has not been properly peer reviewed or tested prior to deployment is leading to major outages, but the point is they are producing lots of code. More code is good, that means you are a good programmer. Reading code would just slow things down.
again?
It's something we debated in our team: if there's an API that returns data based on filters, what's the better behavior if no filters are provided - return everything or return nothing?
The consensus was that returning everything is rarely what's desired, for two reasons: first, if the system grows, allowing API users to return everything at once can be a problem both for our server (lots of data in RAM when fetching from the DB => OOM, and additional stress on the DB) and for the user (the same problem on their side). Second, it's easy to forget to specify filters, especially in cases like "let's delete something based on some filters."
So the standard practice now is to return nothing if no filters are provided, and we pay attention to it during code reviews. If the user does really want all the data, you can add pagination to your API. With pagination, it's very unlikely for the user to accidentally fetch everything because they must explicitly work with pagination tokens, etc.
Another option, if you don't want pagination, is to have a separate method named accordingly, like ListAllObjects, without any filters.