Show HN: Superlog (YC P26) – Observability that installs itself and fixes bugs

60 points • by Magnanten • yesterday at 3:54 PM • 43 comments

Hey HN, we’re Nico and Arseniy, co-founders of Superlog (https://superlog.sh). We're building a self-installing, self-healing observability tool that you shouldn't need to open. It has a wizard that runs daily to set up proper logging, and an agent that investigates errors and opens PRs.

Super short demo: https://www.youtube.com/watch?v=xFhU9Mk247M.

In our earlier startups, we tried Sentry, Datadog, Grafana, and Dash0, and nothing was good enough. Proper telemetry and alerting still require a ton of manual setup. We struggled to add good logs, so debugging was tough, especially as codebases grew faster. Meanwhile, the Datadog/Dash0 bill kept climbing, and we still spent engineering hours learning, configuring, and maintaining our observability tooling.

With Sentry, we found ourselves flooded by a stream of alerts in our Slack channel. Most were duplicates or lacked context, so alert fatigue and constant interruptions were a real pain. The #ops notification is consistently the worst feeling on a Saturday morning.

Too many times we’ve seen servers run out of memory and disk, and three AWS metrics give us three different values. Half of the graphs on dashboards are normally empty or outdated, and manually clicking through UIs, especially when the team is small, seems like a huge waste of time.

At some point we realized that solving this problem would be more valuable than the things we had been working on, and we had the expertise to do it, since Arseniy had spent years at Datadog, getting paged during the night to debug production incidents. So we decided to build a platform that would just work: agent-first, MCP-native, zero-setup.

Here’s how Superlog works: we have a wizard that scans your repo, and automatically instruments it with well-structured logs, traces and metrics via OpenTelemetry. We make sure to highlight main failure modes, endpoint performance, usage per tenant, and LLM/upstream cost (by callsite, tenant and model).
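To make "cost by callsite, tenant and model" concrete, here is a minimal sketch of how such a breakdown could be computed once each LLM/upstream call emits a record carrying those attributes. The record fields (`callsite`, `tenant`, `model`, `cost_usd`) and the `cost_by` helper are assumptions for illustration, not Superlog's actual schema or API.

```python
from collections import defaultdict

# Hypothetical records: one per LLM/upstream call, as an instrumented
# callsite might emit them. Field names are assumptions for illustration.
records = [
    {"callsite": "summarize.py:42", "tenant": "acme", "model": "model-a", "cost_usd": 0.25},
    {"callsite": "summarize.py:42", "tenant": "acme", "model": "model-a", "cost_usd": 0.5},
    {"callsite": "search.py:10", "tenant": "globex", "model": "model-b", "cost_usd": 1.0},
]

def cost_by(records, *dimensions):
    """Sum cost_usd over records, grouped by the given attribute names."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        totals[key] += record["cost_usd"]
    return dict(totals)

per_tenant = cost_by(records, "tenant")
per_callsite_model = cost_by(records, "callsite", "model")
```

The same records can be sliced along any combination of dimensions, which is why tagging each call at the callsite matters more than the aggregation itself.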

Errors get fingerprinted and grouped into incidents, so you see one issue, not a thousand duplicates. When you get a notification from Superlog, you see a clear failure summary, its inferred severity and impact upfront.
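As a rough illustration of fingerprinting in general (not Superlog's actual algorithm, which the post doesn't describe), one common approach is to strip the parts of an error message that vary between occurrences, then hash what remains together with the error type and the top stack frame. The normalization rules and event fields below are assumptions.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical normalization rules: replace values that differ between
# occurrences of the same underlying error (UUIDs, hex addresses, numbers).
_VOLATILE = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I), "<uuid>"),
    (re.compile(r"0x[0-9a-f]+", re.I), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]

def fingerprint(error_type, message, top_frame):
    """Return a short stable hash for an error, ignoring volatile values."""
    normalized = message
    for pattern, placeholder in _VOLATILE:
        normalized = pattern.sub(placeholder, normalized)
    key = f"{error_type}|{normalized}|{top_frame}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

def group_into_incidents(events):
    """Group raw error events by fingerprint: one incident per fingerprint."""
    incidents = defaultdict(list)
    for event in events:
        fp = fingerprint(event["type"], event["message"], event["frame"])
        incidents[fp].append(event)
    return dict(incidents)

events = [
    {"type": "TimeoutError", "message": "request 4812 timed out after 30s", "frame": "billing.charge"},
    {"type": "TimeoutError", "message": "request 9920 timed out after 30s", "frame": "billing.charge"},
    {"type": "KeyError", "message": "missing tenant 17", "frame": "auth.lookup"},
]
incidents = group_into_incidents(events)
```

Here the two timeouts differ only in their request ID, so they collapse into one incident, while the `KeyError` forms its own. The post says agents also merge similar errors, which goes beyond a purely rule-based scheme like this one.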

Then the agent investigates and tries to solve the issue. If it has enough context, it produces a concise and tested PR. If it doesn't, it posts its findings for the investigating team, and automatically pulls in the engineers that could contribute more context based on documentation, previous investigations and Slack threads.

Either way the output is one clean PR per incident, posted in Slack, that you can merge, ignore, or open as a Claude Code session and modify.

Three things we think are different from other observability vendors:

(1) We solve the setup pain. The wizard will instrument everything with native OTel SDKs, respecting the semantic conventions, with proper service and environment tagging. We’re also working on native automatic dashboards and alerts, so that you can see what’s going on at a glance and don’t miss subtle failure modes.

(2) Our telemetry doesn’t decay. The wizard runs daily and keeps adding logs, alerts and dashboards where they’re needed. You don't have to remember to instrument new features. The next time something breaks, the data you need to debug it is already there.

(3) Our goal is to solve alert fatigue. We use agents to merge similar errors and refine the summaries, giving you relevant information upfront. We have a custom evaluation setup that makes sure that our summaries are dense and correct, and that severity and impact are on point. We also give you confidence scores for every LLM-enhanced metric so that wrong guesses don’t get boosted.
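One way to read "wrong guesses don’t get boosted" is as a confidence gate: an LLM-inferred value only takes effect when its confidence clears a threshold, and otherwise a neutral default is used. The sketch below assumes that reading. The `Estimate` type, the `effective_severity` helper, the 0.8 threshold and the "medium" baseline are all hypothetical, not Superlog's implementation.

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    value: str         # e.g. an LLM-inferred severity label
    confidence: float  # assumed to be a score between 0.0 and 1.0

def effective_severity(estimate, baseline="medium", threshold=0.8):
    """Use the inferred severity only when confidence clears the threshold.

    Below the threshold, fall back to a neutral baseline so that an
    uncertain guess cannot escalate (or bury) an incident on its own.
    """
    if estimate.confidence >= threshold:
        return estimate.value
    return baseline

confident = effective_severity(Estimate("critical", 0.95))
uncertain = effective_severity(Estimate("critical", 0.40))
```

With these inputs, the confident estimate keeps its "critical" label and the uncertain one falls back to "medium". How the confidence score itself is produced is the hard part, and the post doesn't specify it.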

Important: Superlog telemetry is vendor-neutral, so you keep all the logs/metrics/traces we install. Pricing is on the site. We're early, so expect rough edges and please tell us when you find them.

You can try it at https://superlog.sh. We'd love to hear what you're using today, what's broken about it, and whether the "one mergeable PR per incident" model sounds useful or terrifying. Especially keen to hear from folks running integration-heavy products, anyone who's rolled their own observability, and anyone who has tried Sentry / Datadog MCPs and given up. Comments and feedback welcome!

Comments

htrp • yesterday at 11:54 PM

Not their fault

Railway, their hosting provider, is entirely down as well

From https://status.railway.com/

>Identified

>Google Cloud has blocked our account, making some Railway services unavailable. We have escalated this directly with Google. The Railway Platform team has since confirmed access to Google Cloud and is working on restoring access to all workloads. We have access to some of our Google Cloud–hosted infrastructure and are working to restore the rest of the service. We apologize for the disruption.

jonnyasmar • yesterday at 11:27 PM

Building on the "investigation > patch" point — running Claude Code, Codex, and Gemini CLI daily, the pattern I keep noticing is that auto-fix is fine on "obvious bug, obvious fix" (off-by-one, null check, missing await, error not propagated). It falls over on "subtle invariant" bugs where the existing code is intentionally weird to preserve something non-obvious — the PR looks right and breaks something three modules away.

The tool I'd actually want isn't "tries harder to fix everything." It's one that credibly says "this touches an invariant I can't see — here's what I think might happen, you handle it." Calibrated humility beats confident patches.

Curious how your high-confidence threshold actually works. Self-reported model certainty (notoriously unreliable), test coverage in the affected area, blast-radius of the change, something else?

OsrsNeedsf2P • yesterday at 4:30 PM

There's very few startups that I look at these days and don't think to myself, "I could just write a Claude skill for that". This one seems pretty cool. Congrats on launch

behat • yesterday at 8:27 PM

>> Superlog scans your codebase and infrastructure to add new alerts, metrics and dashboards, preventing tricky failure modes and observability decay.

This is interesting, and my prior belief here has been that this automates a one time set up, and perhaps a quarterly clean-up or reactive monitoring changes that people do today. Curious what your experience has been - do teams accept these ongoing maintenance PRs at a good rate?

For full disclosure / context: we work in a related space - investigation agents for production issues.

ottoid • yesterday at 11:24 PM

I would love to use it but the website is down

"Please check your network settings to confirm that your domain has provisioned.

If you are a visitor, please let the owner know you're stuck at the station."

Would love to learn more and consider being a customer!

e12e • yesterday at 4:55 PM

Interesting project - but you need to add some information on where the data goes. As far as I can tell, code goes to some upstream ai provider (for installing, for analyzing).

Telemetry goes to some provider or local hosted solution? And then to your upstream ai provider for analysis?

tommy29tmar • yesterday at 10:01 PM

Before running the install prompt, I’d want to see a dry run: which files it would touch, what telemetry leaves the box, provider calls, and what “high confidence” means. For debugging tools, generating a PR is the easy part; knowing whether it’s grounded in enough evidence is the part I’d worry about.

tuo-lei • yesterday at 6:46 PM

investigation is the hard part, not generating patches. we've had prod issues where the fix was obvious once you knew the cause, but finding the cause meant connecting an error trace to a config change from 3 deploys ago. if the MCP only surfaces traces and logs from one service the agent is going to propose workarounds instead of actual fixes. how deep does the investigation context actually go?

exabrial • yesterday at 8:27 PM

It deleted the codebase, which technically.. is a valid way to get rid of all of the bugs.

I kid, nice work. As others have said, investigation and understanding "why it was originally done that way", not the patch, is usually the lion's share of the work.

0xferruccio • yesterday at 5:14 PM

Congrats on the launch, this looks very promising. I hadn't seen any installation that uses a URL to point to a skill, seems like an evolution of wizard scripts

That being said, for more complex setups like on Kubernetes, where you need a collector and an operator, I found OTel to be super painful to set up a couple of years ago. Has it gotten any easier now?

sskates • yesterday at 5:47 PM

I love the launch! Automated observability that feeds back into the product development process is the future of this category vs having to spend a lot of time configuring and managing the infrastructure yourself.

It's something we've thought a lot about at Amplitude. We'd love to talk.

solfox • yesterday at 4:34 PM

Love the concept! Some feedback: I went to sign up to give it a go, but the setup process left me feeling a bit untrusting, so I backed out for now. I'd prefer more explanation about what to expect, what I will get, how it is safe, etc. before asking me to run a prompt.

user- • yesterday at 4:49 PM

I would love to try it but I got stuck when it asked for Slack since I don't use that.

rdataguy • yesterday at 9:34 PM

Seems very useful, congratulations on the launch!

evil-olive • yesterday at 5:29 PM

on your pricing page:

> Start with one repo. Price the rest when the signal is real.

which makes it sound like possibly the $150/mo price is per-repo?

I think that could use some clarification - if I have 10 services in a monorepo vs 10 individual service repos, does that 10x my cost?

3form • yesterday at 4:54 PM

Any plans for an on-prem version?

FantasyLabai • yesterday at 5:01 PM

This is a very interesting idea and I'm excited to see where this goes. Congrats!

tontinton • yesterday at 4:18 PM

What's your moat?

aloknnikhil • yesterday at 5:57 PM

The typical issue I have seen is that LLMs / agents tend to be reactive in their fixes. So they tend to "patch" the symptom more than "fix" the root cause. Interested to see how you solve this problem.

TZubiri • yesterday at 5:21 PM

Sorry to be crude, but this sounds either dead on arrival, or at least needing a pivot, or a rephrasing of the pitch:

The moment something changes the system, it no longer observes it, in fact observing something might cause it to change ( https://en.wikipedia.org/wiki/Observer_effect_(physics) )

Either it's a tool for observing or it's a tool for fixing issues, it cannot be both, by physical principle.

Best case scenario here is that the product succeeds, and then you need to instrument the product itself in order to observe it, like debugging the debugger. But it wouldn't be an observability tool, it would shift the product that needs to be observed from the previous source code that is now a target language into the new source code that is now your product.

KaiShips • yesterday at 7:02 PM

[dead]

nyxw43347 • yesterday at 11:28 PM

[dead]

Ember_Wipe • yesterday at 5:29 PM

[flagged]
