Platform API Import problems

Incident Report for Productsup

Postmortem

On the morning of Friday 06-12-2019 we received several complaints from customers: importing data from the Platform API Datasource resulted in the message "0 products imported".

Over the last couple of months, we have occasionally experienced extreme delays between the upload of data and that data being made ready for import. This behaviour was reported to us by multiple clients and was also picked up by our internal monitoring.

In the most extreme cases, this delay was several hours. The delay was related to fundamental issues with our infrastructure, more specifically, how we share data between our cluster of API servers.
We looked into several ways to solve this in order to provide a more stable service for our clients using the Platform API.

As a result, we planned three major changes. The first change was related to the number of IO operations we were doing on our shared filesystem and was deployed on Monday 02-12-2019. The second change altered our directory structure in order to speed up lookups on the shared filesystem. It was deployed on Thursday 05-12-2019 and was also the root cause of the problems we experienced on Friday 06-12-2019.
The third change will occur in the coming weeks.

The deployment of the second change finished around 17:00 CET on 05-12-2019, and we ran a verification on our import server to ensure our logic was working as expected and that the data to be imported was received in the correct order. This deployment caused the original issue: importing data from the Platform API failed with the notification "0 products imported". After some investigation, we realized that an unexpected error had occurred during the deployment and that we had not verified that the data was readable by our import scripts. It turned out that the API data was available on the import server, but in a location where the import script could not find it.

Once we realized this, we stopped copying data from our API servers to our import server to prevent more data from ending up in the wrong location. We immediately tackled the issue by writing scripts to fix the problem. Because we want to be as safe as possible when moving our clients' data around, we decided to move the data in multiple steps. A single step would have been faster, but if it had failed, restoring the original client data would have been even more challenging. We therefore preferred a safe and solid solution over a quick one.
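To illustrate what moving data in multiple steps can look like, here is a minimal sketch. It is not the script used during the incident; the function names and the copy, verify, then delete sequence are assumptions chosen for illustration. The idea is that the original file is only removed after the copy has been verified, so a failure midway leaves the source data untouched.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def staged_move(source: Path, destination: Path) -> None:
    """Hypothetical stepwise move: copy, verify, and only then remove the original."""
    destination.parent.mkdir(parents=True, exist_ok=True)

    # Step 1: copy the file, leaving the original in place.
    shutil.copy2(source, destination)

    # Step 2: verify the copy before touching the original.
    if file_digest(source) != file_digest(destination):
        destination.unlink()
        raise IOError(f"checksum mismatch while copying {source}")

    # Step 3: the copy is verified, so the original can be removed.
    source.unlink()
```

A single rename would be faster, but this variant keeps a verified copy of the data available at every point in the process.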

Restoring the data to the correct location therefore took longer. At around 17:00 CET on Friday 06-12-2019 (the day the issue was reported), we resumed normal operations by re-enabling the copy from our API servers to our import server and resuming imports for most of our clients. A few clients were excluded, because the volume of data they send through our API is so large that the processes took much longer and we had to repair their data individually.

At 20:00 CET we updated our status on the status page. Initially, we needed to keep an eye on the whole situation to ensure that the amount of data that was going through would not lead to problems when importing.

Since then, we have kept an eye on the situation, and all data has been copied correctly. In addition, the second fix for our delayed import issues was a success, and so far we haven't had any abnormal delays. To provide more confidence, we'll monitor the queues which copy data between our API servers and import servers for a week. Next week we'll introduce our third and final fix.

What did we do to prevent such an issue from happening again?
We've changed our QA processes: after QA'ing the changed API functionality, we now take additional steps to check the whole process from upload to import. This should ensure that we also catch errors which are not directly visible in the API itself.
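A check of this kind could be sketched as follows. This is an illustration only, not our actual QA tooling: `end_to_end_check`, `upload`, and `run_import` are hypothetical names standing in for "send data through the API" and "run the import and return what it found".

```python
from typing import Callable, Iterable, List


def end_to_end_check(
    upload: Callable[[List[dict]], None],
    run_import: Callable[[], Iterable[dict]],
    sample: List[dict],
) -> None:
    """Upload sample products, run an import, and fail unless they all arrive in order."""
    upload(sample)
    imported = list(run_import())

    # An API-only test would stop after upload(); this catches the case where
    # the upload succeeds but the import cannot find the data.
    if not imported:
        raise AssertionError("0 products imported")
    if imported != sample:
        raise AssertionError("imported products differ from the uploaded sample")
```

The point of passing both steps in as callables is that the same check can run against a test environment after every API change.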

Posted Dec 10, 2019 - 15:13 CET

Resolved

After monitoring the incident for a full working day, all issues are resolved and all sites using the API should perform as usual.
Posted Dec 09, 2019 - 17:49 CET

Monitoring

The data has been restored to our import server and will be imported in the correct order.
The queue that built up during the day is also resolved and we're up to date on that as well.

The next run of your site could take longer due to the build-up of files. If any issues occur, please let us know.

A postmortem will follow soon!
Posted Dec 06, 2019 - 20:04 CET

Update

All delayed data is now being copied to our import server and next runs will slowly pick up the new data.

Once all data is fully copied, we'll write another update.
Posted Dec 06, 2019 - 17:58 CET

Identified

We've received several reports that API imports result in 0 products.
We've investigated the issue and found the problem in our logic. The logic is fixed, and we're now working on moving the data to the correct location so it can be imported.

No data sent during this period was lost! We're restoring all data in the correct order, so that all deltas will be applied as expected.
Posted Dec 06, 2019 - 12:46 CET
This incident affected: Platform API.