This incident showed its first signs on January 14th at 6pm CET, but did not have a major impact on our systems at that point. While we were investigating, the issue appeared and disappeared intermittently over the following days and over the weekend. Each occurrence was too brief for us to pinpoint the cause immediately.
Our primary scheduling infrastructure came under unexplained load every few hours, each time for only a few minutes.
The issue intensified on Sunday, January 17th and Monday, January 18th. In addition to the system load problem, we received complaints on Monday about scheduling errors: individual sites would start at seemingly random times instead of at their scheduled times. This affected some, but not all, sites and was also difficult to isolate. While we suspected that the increased load was causing the scheduler to malfunction, the schedules were also off at times when no load peaks occurred.
On Monday, January 18th we were finally able to resolve the increased system load on our scheduling infrastructure, which also restored correct site scheduling.
The root cause was a sudden traffic increase lasting a few minutes at a time, which overwhelmed the scheduling infrastructure and left the systems unresponsive. We mitigated these traffic peaks so that the infrastructure can now handle them appropriately, and we have put additional monitoring in place for this scenario.
We recognize the inconvenience this may have caused and appreciate your patience while we worked to resolve the issue.