This incident showed its first signs on January 14th at 6pm CET, but did not have a major impact on our systems at that point. While we were investigating, the issue appeared and disappeared intermittently over the following days and over the weekend. Each occurrence was too brief for us to pinpoint the cause immediately.
Our primary scheduling infrastructure came under unexplained load every few hours, each time for only a few minutes.
The issue intensified on Sunday, January 17th and Monday, January 18th. In addition to the system load problem, we received complaints on Monday about scheduling errors: individual sites would start at seemingly random times instead of at their scheduled times. This affected some, but not all, sites and was also difficult to isolate. While we suspected that the increased load was causing the scheduler to malfunction, the schedules were also off at times when no load peaks occurred.
On Monday, January 18th we were finally able to resolve the increased system load on our scheduling infrastructure, which also restored correct site scheduling.
The root cause was a sudden traffic increase lasting a few minutes at a time, which overwhelmed the scheduling infrastructure and left the systems unresponsive. We mitigated these traffic peaks so that the infrastructure can now handle them appropriately, and we have put additional monitoring in place for this scenario.
We recognize the inconvenience this may have caused and appreciate your patience while we worked to resolve the issue.