Resolved -
This incident has been resolved. Query processing has stabilized, all backlogs have cleared, and we are no longer seeing any errors or delays.
Feb 18, 10:06 PST
Monitoring -
Our mitigation efforts appear to have been successful, and we are seeing active recovery. We have identified the root cause as an infrastructure scaling event that introduced instability into our job processing pipeline. Queue depths are decreasing, and we are continuing to monitor closely.
Feb 18, 09:42 PST
Identified -
Our team has identified a stuck transaction in our job queue, which was preventing work from being dispatched to our backend workers. We have cycled the affected component and are seeing recovery. Investigation continues and we will provide another update shortly.
Feb 18, 09:32 PST
Investigating -
We are seeing increased application errors and are actively investigating.
Feb 18, 08:20 PST
Monitoring -
Error rates have remained low since our mitigation was applied. We are continuing to monitor.
Feb 18, 07:21 PST
Update -
We have identified an unhealthy resource in our infrastructure and applied a mitigation. Error rates are decreasing, but we are continuing to monitor and investigate.
Feb 18, 07:08 PST
Investigating -
Our monitoring detected an increase in error rates, and we have begun receiving reports of the issue. We are actively investigating and working on a mitigation.
Feb 18, 07:05 PST