Summary

On 17.01.2024, one of the processing storage cluster servers rebooted unexpectedly due to a configuration issue with the object storage gateway. This incident was compounded by previous changes made to accommodate a new replication zone in Frankfurt, leading to operational complications.

Incident Details

Background

Several months ago, modifications were made to include a new replication zone in Frankfurt. However, it appears the toolset used for this configuration did not execute correctly, resulting in the retention of a "default" zone and a "default" zone group alongside the active "fsn" and "fsn-1" zones and zone groups in Falkenstein.

Incident Timeline

Reboot: The server rebooted unexpectedly.
Configuration Issue: Upon investigation, it was found that the object storage gateway had connected to the "default" zone instead of the intended zone.
Service Behavior: The service started, but because the "default" zone contained no data, it reported missing data for incoming requests.
Monitoring Oversight: The monitoring system only tracked whether the service was "up," so it did not alert us to the underlying issue. We learned of it only when a user report was received.

Resolution

The issue was identified after user reports highlighted the data unavailability. Immediate actions were taken to correct the object storage gateway configuration so that it points to the correct zone.
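
A check along the following lines could confirm after each restart that the gateway is serving an intended zone and that no leftover zones remain. This is only a minimal sketch: the function name `check_gateway_zone` and its parameters are hypothetical, and how the active and configured zone names are obtained from the gateway is deliberately left out because it depends on the deployment's tooling.

```python
from typing import Iterable, List


def check_gateway_zone(
    active_zone: str,
    configured_zones: Iterable[str],
    expected_zones: Iterable[str],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK.

    All inputs are plain strings supplied by the caller. Fetching them
    from the gateway is out of scope for this sketch.
    """
    expected = set(expected_zones)
    problems = []

    # The gateway must serve one of the zones we intend to run.
    if active_zone not in expected:
        problems.append(
            f"gateway is serving zone {active_zone!r}, "
            f"expected one of {sorted(expected)}"
        )

    # Any configured zone outside the expected set is a leftover.
    leftover = set(configured_zones) - expected
    if leftover:
        problems.append(f"unexpected zones configured: {sorted(leftover)}")

    return problems


if __name__ == "__main__":
    # Zone names taken from this report; the state mirrors the incident.
    for problem in check_gateway_zone(
        active_zone="default",
        configured_zones=["default", "fsn", "fsn-1"],
        expected_zones=["fsn", "fsn-1"],
    ):
        print("WARNING:", problem)
```

Returning a list of problems rather than raising on the first one lets a single run report both the wrong active zone and the leftover configuration.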

Lessons Learned

Monitoring Improvements: To prevent similar incidents in the future, we will enhance our monitoring to include error rates on this cluster. This will allow us to detect unusual patterns that may indicate configuration mistakes or other issues.
Configuration Checks: A review of the configuration process for replication zones will be conducted to ensure that toolsets function as expected and do not leave behind unintended configurations.
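
The error-rate monitoring described under Monitoring Improvements could be sketched as follows. This is an illustration, not our actual monitoring setup: it assumes that missing data surfaces as HTTP 4xx/5xx responses and that per-status request counts for a time window are already available. The function names, the 5% threshold, and the minimum request count are all hypothetical values chosen for the example.

```python
from typing import Dict


def error_rate(status_counts: Dict[int, int]) -> float:
    """Fraction of requests in the window with an HTTP status >= 400."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for code, n in status_counts.items() if code >= 400)
    return errors / total


def should_alert(
    status_counts: Dict[int, int],
    threshold: float = 0.05,   # assumed value, tune per cluster
    min_requests: int = 100,   # assumed value, avoids noise on idle windows
) -> bool:
    """Alert when enough traffic was seen and the error rate is too high."""
    total = sum(status_counts.values())
    return total >= min_requests and error_rate(status_counts) > threshold


if __name__ == "__main__":
    # A gateway that is "up" but serving an empty zone would look like this.
    window = {200: 12, 404: 488}
    print(f"error rate: {error_rate(window):.1%}, alert: {should_alert(window)}")
```

Unlike a plain "up" check, this flags a gateway that accepts connections but answers most requests with errors.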