FTP Service Degraded Performance
Incident Report for Productsup
Postmortem

Executive Summary:

In this postmortem report, we will analyze the recent incident related to our FTP service. The incident was triggered by a migration of the FTP service storage to a new location, resulting in file upload errors and customer dissatisfaction. Our primary goal is to identify the root causes of the incident, outline the actions taken to mitigate the issue, and propose preventive measures to avoid similar problems in the future.

Incident Timeline:

  • October 25, 2023: Migration of FTP storage to a larger but slower I/O location commenced.
  • October 28, 2023: Increased I/O activity on the new storage led to timeouts and file upload failures.
  • October 30, 2023: Further analysis revealed insufficient retry logic for FTP uploads.
  • October 30, 2023: Communication with affected customers to keep them informed.
  • October 31, 2023: Initiation of the migration from slow storage to a faster alternative.

Root Causes:

  1. Storage Migration Decision: The incident was primarily triggered by the decision to migrate the FTP service storage to a location with more capacity but slower I/O. This decision was based on the assumption that the FTP service was used infrequently by most customers, and the migration was assessed as a low-impact operation that would not cause any downtime.
  2. Inadequate Retry Logic: The FTP upload process lacked proper retry logic for handling timeouts. This gap was not foreseen; had it been anticipated, the failures could have been avoided with exponential backoff and a higher number of retries.
  3. Insufficient Monitoring: Our monitoring systems tracked the uptime of the FTP service, but they gave no insight into increased upload error ratios.

Immediate Actions Taken:

  1. Migration Reversal: To mitigate the immediate impact, the migration process was reversed, and data was migrated back to faster storage.
  2. Retry Logic Enhancement: The retry logic for FTP uploads was improved by implementing exponential backoff and increasing the number of retries.
  3. Customer Communication: Affected customers were informed about the incident, the steps taken to address it, and the ongoing migration process.
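
To illustrate the retry logic enhancement described above, the following is a minimal Python sketch of retrying an operation with exponential backoff. It is not the actual Productsup implementation. The names retry_with_backoff and backoff_delays, the default retry count, the delay values, and the random jitter are all assumptions made for this example.

```python
import random
import time


def backoff_delays(max_retries, base_delay=1.0, max_delay=60.0):
    """Return the un-jittered wait before each retry: base, 2x, 4x, ... capped at max_delay."""
    return [min(max_delay, base_delay * (2 ** attempt)) for attempt in range(max_retries)]


def retry_with_backoff(operation, max_retries=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(TimeoutError,), sleep=time.sleep, jitter=True):
    """Call operation() and retry it on the given exceptions with exponential backoff.

    All defaults here are hypothetical values chosen for illustration.
    The last exception is re-raised once max_retries retries are used up.
    """
    delays = backoff_delays(max_retries, base_delay, max_delay)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the caller
            delay = delays[attempt]
            if jitter:
                # Randomize the wait so many clients do not retry at the same moment.
                delay = random.uniform(0, delay)
            sleep(delay)
```

A caller would wrap the upload step, for example retry_with_backoff(lambda: upload(path)), where upload is a hypothetical function that raises TimeoutError when the storage does not respond in time. The sleep parameter is injectable so the behavior can be checked without actually waiting.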

Preventive Measures:

  1. Improved Monitoring: Enhance monitoring systems to detect performance issues more quickly, ensuring timely responses to unexpected incidents.
  2. Proactive Communication: Establish proactive communication protocols to inform affected customers promptly when service disruptions occur.
  3. Retry Logic Enhancement: Continuously review and improve retry logic for all critical services to account for unforeseen issues and reduce the impact of timeouts.
  4. Risk Assessment: Conduct thorough risk assessments before making significant changes to services or infrastructure to anticipate potential problems.
  5. Customer Feedback Integration: Encourage customers to provide feedback on service performance, and actively integrate their insights into our ongoing improvement efforts.
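
As an illustration of the improved monitoring measure, the sketch below tracks the upload error ratio over a sliding window of recent uploads, which is the signal that uptime checks alone did not surface. It is a simplified example, not a description of our monitoring stack. The class name UploadErrorRatioMonitor, the window size, the alert threshold, and the minimum sample count are all assumed values.

```python
from collections import deque


class UploadErrorRatioMonitor:
    """Track the share of failed uploads among the most recent upload attempts.

    The defaults (window_size, threshold, min_samples) are hypothetical values
    chosen for illustration only.
    """

    def __init__(self, window_size=100, threshold=0.05, min_samples=20):
        # deque with maxlen drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, success):
        """Record the outcome of one upload attempt (True = success)."""
        self.outcomes.append(bool(success))

    def error_ratio(self):
        """Return failed / total for the current window, or 0.0 if it is empty."""
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def should_alert(self):
        """Alert only when enough samples exist and the ratio exceeds the threshold."""
        return (len(self.outcomes) >= self.min_samples
                and self.error_ratio() > self.threshold)
```

Requiring a minimum number of samples avoids alerting on a single failed upload during quiet periods, while the bounded window keeps the ratio focused on recent behavior.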

Conclusion:

This incident has highlighted the importance of anticipating and preparing for unforeseen issues in our systems. We apologize for the inconvenience and frustration this incident may have caused you and other affected customers. We are committed to applying the lessons learned from this incident to ensure that our services continue to meet your expectations.

Your trust in our services is invaluable to us, and we appreciate your patience and understanding during this challenging period. If you have any further concerns or questions, please do not hesitate to reach out to us.

We are committed to maintaining the highest standards of reliability and providing you with a smoother and more consistent experience in the future.

Posted Nov 03, 2023 - 22:48 CET

Resolved
Dear customers,

We are pleased to inform you that the recent incident related to our FTP service has been successfully resolved. Our team has worked diligently to address the issues, and we wanted to provide you with this brief update. The migration of our FTP service storage to a faster and more stable location has been completed, and the improvements to the retry logic for FTP uploads have been implemented. We have thoroughly tested these changes to ensure their effectiveness.

We sincerely apologize for any inconvenience this incident may have caused you, and we greatly appreciate your patience and understanding during this time. If you encounter any further issues or have any questions, please do not hesitate to reach out to us.
Posted Nov 03, 2023 - 22:42 CET
Update
Dear customers, for further clarification, although the service has been very stable for the past 12 hours, occasional failures may still happen while the FTP data is being migrated to a new cluster. Please bear with us in the meantime as the process is completing in the background. We are still actively monitoring the situation. Thanks for your understanding and we hope to update you shortly with better news.
Posted Nov 01, 2023 - 12:57 CET
Monitoring
The FTP service is fully operational again and did not encounter any timeouts in the past 2 hours. We will be monitoring the situation for the next 24 hours.
Posted Oct 31, 2023 - 17:10 CET
Identified
We have identified an issue with the storage cluster that backs up the Productsup FTP Service. To resolve this issue we need to migrate the data to another cluster. Unfortunately, this will cause reduced performance on the service while the migration is happening. We kindly ask you to bear with us while our engineers are working hard on solving the issue.
Posted Oct 31, 2023 - 10:45 CET
This incident affected: FTP Service.