On the morning of Friday 06-12-2019 we received several complaints from customers: importing data from the Platform API Datasource resulted in the message "0 products imported".
Over the last couple of months, we have occasionally experienced extreme delays between data being uploaded and that data being ready for import. Multiple clients brought this behaviour to our attention, and our internal monitoring identified it as well.
In the most extreme cases, this delay was several hours. The delay was related to fundamental issues with our infrastructure, more specifically with how we share data between our cluster of API servers.
We looked into several ways to solve this and to provide a more stable service for our clients using the Platform API.
As a result, we planned three major changes. The first change concerned the number of I/O operations we were doing on our shared filesystem, and was deployed on Monday 02-12-2019. The second change altered our directory structure in order to improve lookup times on the shared filesystem. It was deployed on Thursday 05-12-2019 and was also the root cause of the problems we experienced on Friday 06-12-2019.
The third change will occur in the coming weeks.
The deployment of the second change finished around 17:00 CET on 05-12-2019. We verified on our import server that our logic was working as expected and that the data to be imported was received in the correct order. Even so, this deployment caused the original issue: importing data from the Platform API failed with the notification "0 products imported". After some investigation, we realized that an unexpected error had occurred during the deployment and that we had not verified that the data was readable by our import scripts. It turned out that the API data was available on the import server, but the import script could not find it.
Once we realized this, we stopped copying data from our API servers to our import server to prevent the data from ending up in the wrong location. We immediately started writing scripts to fix the problem. Because we want to be as safe as possible when moving our clients' data around, we decided to move the data in multiple steps. A single step would have been faster, but if it had failed, restoring the original client data would have been even more challenging. We therefore preferred a safe and solid solution over a quick one.
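To illustrate what moving data in multiple steps can look like, here is a minimal Python sketch. It is not the script we ran; the function names and the checksum-based verification are assumptions made for the example. The idea is that the original file is only removed after the copy has been verified, so a failure halfway leaves the original data untouched.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def staged_move(source: Path, destination: Path) -> None:
    """Hypothetical helper: move a file as copy, verify, then delete."""
    destination.parent.mkdir(parents=True, exist_ok=True)

    # Step 1: copy. The original stays in place as a fallback.
    shutil.copy2(source, destination)

    # Step 2: verify the copy before touching the original.
    if sha256_of(source) != sha256_of(destination):
        destination.unlink()
        raise IOError(f"checksum mismatch while moving {source}")

    # Step 3: only now remove the original.
    source.unlink()
```

A single rename would be faster, but with this approach every step can be checked and, if needed, repeated without losing the source data.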
Restoring the data to the correct location therefore took longer. At around 17:00 on Friday 06-12-2019 (the day the issue was reported) we resumed normal operations by re-enabling the copy from our API servers to our import server and resuming the import for most of our clients. A few clients were excluded, because their level of data consumption through our API was so large that the processes took much longer and we had to repair their data individually.
At 20:00 CET we updated our status on the status page. Initially, we needed to monitor the whole situation to ensure that the volume of data going through would not lead to problems when importing.
Since then, we have continued to monitor the situation, and all data has been copied correctly. In addition, the second fix for our delayed import issues was a success, and so far we haven't had any abnormal delays. To provide more confidence, we'll monitor the queues that copy data between our API servers and import servers for a week. Next week we'll introduce our third and final fix.
What did we do to prevent such an issue from happening again?
We've changed our QA processes: after QA'ing the changed API functionality, we now also check the whole process from upload to import. This should ensure that we also catch errors that are not directly visible in the API itself.
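As a rough illustration of such an end-to-end check, the Python sketch below uploads a small test payload and then confirms that the import side can actually find and read it. Everything in it is hypothetical: the two directory layouts, the file name products.json and all function names are assumptions made for the example and do not describe our real structure or scripts.

```python
import json
import tempfile
from pathlib import Path


def new_layout(root: Path, client_id: str) -> Path:
    """Hypothetical sharded layout: <root>/<first two chars>/<client>/products.json."""
    return root / client_id[:2] / client_id / "products.json"


def old_layout(root: Path, client_id: str) -> Path:
    """Hypothetical flat layout: <root>/<client>/products.json."""
    return root / client_id / "products.json"


def upload(root: Path, client_id: str, products: list, layout) -> None:
    """Stand-in for the API side: write the products where the layout says."""
    path = layout(root, client_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(products))


def import_products(root: Path, client_id: str, layout) -> int:
    """Stand-in for the import script: return the number of products found."""
    path = layout(root, client_id)
    if not path.exists():
        return 0  # the situation behind a "0 products imported" message
    return len(json.loads(path.read_text()))


def end_to_end_check(writer_layout, reader_layout) -> bool:
    """Upload a test payload, then confirm the import side reads all of it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        products = [{"sku": "A1"}, {"sku": "B2"}]
        upload(root, "client-a", products, writer_layout)
        return import_products(root, "client-a", reader_layout) == len(products)
```

When both sides agree on the layout, the check passes. When the writer uses the new layout while the reader still expects the old one, the check fails, even though a test of the API alone would show that the data was stored.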