Hacker News

shubhamintech · yesterday at 10:02 PM

The full-session evaluation framing is the right call - most teams don't realize the failure happened in turn 2 until they've spent 3 hours blaming the model. One thing worth thinking about as you grow: connecting caught regressions to production conversation data. When your simulation flags a new failure mode, being able to say "this pattern has already surfaced X times in prod this week" cuts the prioritization debate in half. Does Cekura currently let you correlate simulation failures back to real user sessions, or is that still a manual step?


Replies

atarus · yesterday at 10:35 PM

We track failure modes in production directly instead of relying on simulation, so if a failure mode suddenly starts popping up too often, we can alert promptly. With the simulation-first approach (going from simulation to monitoring), I worry the feedback would be delayed.

Doing it in production also lets us run simulations by replaying those production conversations, ensuring regressions are caught.
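The production-first loop described above (tag failure modes in prod, alert when one spikes, and save those conversations for regression replay) can be sketched roughly like this. This is a hypothetical illustration, not Cekura's actual API: `FAILURE_THRESHOLD`, `tag_failure_mode`, and `ingest` are all made-up names, and the classifier is a toy.

```python
from collections import Counter, deque

# Illustrative sketch only; names below are assumptions, not a real API.
FAILURE_THRESHOLD = 5  # alert once a mode is seen this often in a window

failure_counts = Counter()
replay_queue = deque()  # prod conversations saved for regression replay


def tag_failure_mode(conversation):
    """Toy classifier: flag conversations where the bot repeated itself."""
    turns = conversation["bot_turns"]
    if len(turns) != len(set(turns)):
        return "repeated_response"
    return None


def ingest(conversation):
    """Track failure modes directly in production; alert when one spikes."""
    mode = tag_failure_mode(conversation)
    if mode is None:
        return None
    failure_counts[mode] += 1
    replay_queue.append(conversation)  # keep it for later simulation replay
    if failure_counts[mode] >= FAILURE_THRESHOLD:
        return f"ALERT: {mode} seen {failure_counts[mode]} times this window"
    return None
```

The replay queue is what closes the loop: before a release, the saved conversations can be fed back through the agent as regression simulations, so the test set is grounded in real user behavior rather than synthetic scenarios alone.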