Issue with delta processing

Incident Report for Productsup

Postmortem

Service Incident Notification - Ceph Storage Cluster (Processing)

Date: September 9, 2025
Incident Reference: CEPH-2025-009
Status: RESOLVED

What Happened

On September 9, 2025, our storage cluster experienced a service disruption while implementing a planned infrastructure upgrade to enable multi-datacenter operations. During the activation of "stretch mode" (a feature that allows data to be synchronized across multiple datacenters for improved disaster recovery), the cluster's internal authentication system encountered an unexpected failure.

Customer Impact

Duration: 12:30 → 15:00 (2 h 30 min)
Affected Services: Data Processing
Data Safety: All customer data remained secure and intact throughout the incident. No data was lost or corrupted.

During the incident period, customers may have experienced:

  • Temporary inability to access stored workspaces or export2datasource
  • Intermittent connectivity issues with applications using the storage service

Root Cause

The issue occurred when our storage system automatically reconfigured all data pools during the multi-datacenter setup process. This reconfiguration inadvertently affected critical system components responsible for authentication within the cluster, preventing normal access to the cluster even though the underlying data remained safe and intact.

Resolution

Our engineering team restored service by upgrading the storage cluster software to a newer version that provides recovery options not available in the previous version. This upgrade allowed us to safely reverse the multi-datacenter configuration and restore normal cluster operations.

What We're Doing to Prevent This

  • Enhanced Testing: We are improving our testing environments to better replicate production conditions and catch similar issues before they affect live services
  • Software Version Management: We are updating our deployment standards to use newer software versions that provide better recovery capabilities for major configuration changes
  • Monitoring Improvements: We are implementing additional real-time monitoring for authentication systems during major infrastructure changes
  • Rollback Procedures: We are developing more comprehensive rollback procedures for complex infrastructure modifications
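
As an illustration of the kind of pre- and post-change check such procedures could include, the following is a minimal sketch, not a description of our actual tooling. It assumes pool settings have been captured as plain dictionaries before and after a change; the pool names, setting names, and values are hypothetical. It flags any pool whose settings changed without being listed in the change plan.

```python
from typing import Dict, List, Set

# Hypothetical snapshot format: pool name -> setting name -> value.
PoolSettings = Dict[str, Dict[str, int]]


def changed_pools(before: PoolSettings, after: PoolSettings) -> List[str]:
    """Return the sorted names of pools whose settings differ between snapshots."""
    names = set(before) | set(after)
    return sorted(n for n in names if before.get(n) != after.get(n))


def unexpected_changes(
    before: PoolSettings, after: PoolSettings, expected: Set[str]
) -> List[str]:
    """Return pools that changed although the change plan did not list them."""
    return [n for n in changed_pools(before, after) if n not in expected]


# Illustrative values only; they do not reflect the real cluster configuration.
before = {
    "workspaces": {"size": 3, "min_size": 2},
    "internal": {"size": 3, "min_size": 2},
}
after = {
    "workspaces": {"size": 4, "min_size": 2},
    "internal": {"size": 4, "min_size": 2},
}

print(unexpected_changes(before, after, expected={"workspaces"}))
```

A non-empty result would be a signal to pause and review before proceeding, since a pool outside the planned scope was modified.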

Our Commitment

We sincerely apologize for any inconvenience this incident may have caused. Data security and service reliability are our highest priorities. We are committed to learning from this experience and implementing the necessary improvements to prevent similar incidents in the future.

If you have any questions or concerns about this incident, please contact our support team.

Posted Sep 09, 2025 - 18:36 CEST

Resolved

This incident has been fully resolved. All systems are operating normally. We will continue monitoring the infrastructure closely over the coming days to ensure continued stability.
Posted Sep 09, 2025 - 18:30 CEST

Monitoring

Dear customers,

The cluster has been rebooted into a working state. Delta operations, parallel processing, and export2datasource features should function normally again. We are continuing to monitor the restoration process and will publish a debrief of this incident later.

Thanks again for your patience and continued support,
Your Productsup Operations Team
Posted Sep 09, 2025 - 15:17 CEST

Identified

Dear customers,

The issue has been identified: our storage cluster, which stores Site Workspaces, is currently in a failed state. As a result, you may encounter workspace errors, which can be safely ignored. Delta processing runs and parallel processing may not work at all, while full processing runs will still succeed.

Our infrastructure team is working to restore the cluster to a fully operational state as soon as possible, and we'll keep you posted on any new developments.

Thank you for your understanding,
Your Productsup Tech Operations Team
Posted Sep 09, 2025 - 14:48 CEST

Update

We are continuing to investigate this issue.
Posted Sep 09, 2025 - 14:46 CEST

Investigating

Dear customers,
We are seeing an elevated number of full processing runs in cases where only the delta of changed products should be processed.
We are currently investigating the issue.

Best regards,
Your Productsup Tech Operations Team
Posted Sep 09, 2025 - 13:53 CEST
This incident affected: Data Processing.