Imported product count dropped significantly for some sites.
Incident Report for Productsup
Postmortem

Issue Summary

A final deployment to our Platform API was required in preparation for Black Friday. The changes introduced stability and performance improvements to our infrastructure.

The deployment targeted a specific group of sites that we have been monitoring for a while because they cause more load than regular sites.

The deployment happened at 12:40 CET. Around 13:00 CET we noticed unusual errors coming in. A quick investigation revealed a problem in our deployment: due to human error, other sites had also been upgraded. These sites were reverted by 13:15 CET.

Later that day we noticed irregularities with the mistakenly upgraded sites. We started investigating and realized the severity of the problem: all sites that were mistakenly upgraded and ran processing during this time frame were affected by a decrease in imported products. We created a status page update and immediately started working on a solution to prevent further harm. We began repairing the affected sites' data to restore it to its original state. By 18:00 CET the affected sites were blocked from running, and by 22:20 CET all data was restored and processes were triggered to export the data.

Note: No data sent to the API was ultimately lost; all of it had been restored by 22:20 CET.

Corrective and Preventative Measures

We learned from the upgrade that our automated deployment tests need to be expanded with more test cases that focus on the process of reverting upgrades. In addition, test cases should be integrated more deeply with the whole Platform, so that indirect effects can be understood more quickly.

We will implement improved automated tests that run before and after deployments. We have also already improved our monitoring processes to catch potential issues as soon as possible.

Productsup is committed to continually and quickly improving our technology and operational processes to prevent future mishaps. Unfortunately, we were not able to prevent yesterday's problems. We sincerely apologize for the inconvenience this has caused you, your team, and your organization. We thank you for your continued support.
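
As an illustration only, and not a description of the monitoring Productsup actually runs, a check of this kind could compare each site's imported product count before and after a run and flag large drops. The function name, the input dictionaries, and the 50% threshold below are all hypothetical.

```python
def find_import_drops(previous_counts, current_counts, max_drop_ratio=0.5):
    """Return the IDs of sites whose imported product count fell sharply.

    previous_counts / current_counts: hypothetical dicts mapping a site ID
    to its imported product count before and after the latest run.
    max_drop_ratio: assumed threshold; 0.5 flags drops of more than 50%.
    """
    flagged = []
    for site_id, before in previous_counts.items():
        # A site missing from the current run is treated as having imported 0.
        after = current_counts.get(site_id, 0)
        # Skip sites that had nothing imported before; no ratio is defined.
        if before > 0 and (before - after) / before > max_drop_ratio:
            flagged.append(site_id)
    return sorted(flagged)


if __name__ == "__main__":
    before = {"site-a": 1000, "site-b": 200}
    after = {"site-a": 100, "site-b": 190}
    print(find_import_drops(before, after))
```

A real check would also need to account for legitimate catalog changes, so a fixed threshold like this is only a starting point.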

Posted Nov 19, 2020 - 17:06 CET

Resolved
This incident has been resolved.
Posted Nov 19, 2020 - 17:05 CET
Update
After monitoring the imports from the Platform API for almost a full working day, we do not see any problems related to this issue.
Posted Nov 19, 2020 - 17:05 CET
Monitoring
The data for all affected sites has been restored. We triggered an import and export for each site, so that the latest data available will be exported to all channels.

The processing "pause" we introduced is now completely lifted, which means that the affected sites are fully operational again.
Posted Nov 18, 2020 - 22:20 CET
Update
The data of about 65% of all the sites is now restored. Once a site is restored we trigger a full run, so that all data gets imported and exported correctly.

We're continuing with the remaining 35%.
Posted Nov 18, 2020 - 21:12 CET
Update
We have started testing our automated solution to fix the data. Fixing the data manually is slow and has resolved only 5% of the affected sites so far.

As an extra precaution, all sites that will be fixed by the automated solution have already been backed up.
Posted Nov 18, 2020 - 19:45 CET
Update
While we are fixing the data, we have paused the affected sites. After the repair, we "unpause" each site and it will process the most recent data.
Posted Nov 18, 2020 - 18:16 CET
Update
We have compiled a complete list of affected sites. We're fixing the data manually while, in the meantime, someone else is looking into an automated solution for the remaining sites.
Posted Nov 18, 2020 - 17:58 CET
Identified
We have received reports from clients that they are experiencing major drops in imported product counts.
The affected sites ran an import via the Platform API between 12:00 and 13:00 CET today. Since then, changes in imported product counts have been noticed.
We've identified the problem and we're working on a fix. No data loss is expected, but in the meantime fewer products may have been exported, which could have caused problems down the line.
Posted Nov 18, 2020 - 17:12 CET
This incident affected: Platform API.