Issue Summary
A final deployment to our Platform API was required in preparation for Black Friday. The changes introduced stability and performance improvements to our infrastructure.
The deployment was targeted at a specific group of sites that we have been monitoring for a while because they cause more load than regular sites.
The deployment took place at 12:40 CET. Around 13:00 CET we noticed unusual errors coming in. A quick investigation revealed a problem with the deployment: due to human error, other sites had also been upgraded. These sites were reverted by 13:15 CET.
Later in the day we noticed irregularities with the mistakenly upgraded sites. We started investigating and realized the severity of the problem: all sites that were mistakenly upgraded and ran processing during this time frame were affected by a decrease in imported products. We published a status page update and immediately started working on a solution to prevent further harm. We then began restoring the affected sites' data to its original state. By 18:00 CET the affected sites were blocked from running, and by 22:20 CET all data was restored and processes were triggered to export the data.
Note: no data sent to the API was ultimately lost; all of it was restored by 22:20 CET.
Corrective and Preventative Measures
We learned from the upgrade that our automated deployment tests need to be expanded with more test cases that focus on the process of reverting upgrades. In addition, test cases should be extended with deeper integration across the whole Platform, so that indirect effects can be understood more quickly.
We will implement improved automated testing before and after deployments. We have also already improved our monitoring processes to catch potential issues as early as possible.
Productsup is committed to continually and quickly improving our technology and operational processes to prevent future mishaps. Unfortunately, we were not able to prevent yesterday's problems. We sincerely apologize for the inconvenience this has caused you, your team, and your organization. We thank you for your continued support.