404/502 Error Outage Issue-Dublin

Incident Report for Traveltek

Postmortem

Impact:

Between approximately 10:40 AM and 1:25 PM BST on 29 May, customers using our platform may have encountered intermittent “502” errors and noticeable slowness. At one stage the service appeared to return to normal, but navigating within iSell led to “404” errors (“page not found”).

Cause & Resolution:

The incident was caused by a problem accessing Ubuntu’s software repositories (archive.ubuntu.com), which are essential for building and provisioning our servers. These repositories became temporarily unavailable or extremely slow to respond, leading to timeouts during package downloads. Servers launched during this window could not complete their build process and were only partially configured, so they failed their health checks and were terminated automatically.

We attempted to work around this by rolling back to a previous snapshot of our servers, which allowed us to bypass some of the missing package downloads and bring new servers online faster. However, this snapshot lacked some required configurations, which resulted in “404” errors when navigating around iSell.

We identified the missing configurations and promptly deployed them to the new servers, restoring full service by 1:25 PM. Later that day, Ubuntu resolved the issue on their end, and we were able to provision servers normally again.
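The failure mode described above (a repository that is either unreachable or too slow to finish downloads) can be illustrated with a small probe. This is a minimal sketch only, not our actual tooling: the function names, the status labels, and both threshold values are assumptions chosen for illustration.

```python
import time
import urllib.request
from typing import Callable

# Assumed thresholds; the real build timeouts are not stated in this report.
SLOW_THRESHOLD_SECONDS = 5.0
TIMEOUT_SECONDS = 15.0


def fetch_head(url: str, timeout: float) -> None:
    """Send a HEAD request; raises OSError (URLError, timeout) on failure."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout):
        pass


def classify_repository(
    url: str,
    fetch: Callable[[str, float], None] = fetch_head,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Return 'ok', 'slow', or 'unavailable' for a package repository URL."""
    started = clock()
    try:
        fetch(url, TIMEOUT_SECONDS)
    except OSError:
        # Connection errors, HTTP errors, and timeouts all derive from OSError.
        return "unavailable"
    elapsed = clock() - started
    return "slow" if elapsed > SLOW_THRESHOLD_SECONDS else "ok"
```

A check such as classify_repository("http://archive.ubuntu.com/ubuntu/") run before launching new servers would let a build pipeline notice a slow or unreachable repository early, rather than discovering it through failed health checks.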

Prevention Measures:

While the unavailability of Ubuntu’s software resources is outside our direct control and is a rare occurrence, we are taking proactive steps to mitigate the impact of similar future events:

  • Failover Improvement: We plan to enhance our emergency fallback process. During this incident, rolling back to a previous snapshot helped to partially restore service, and we believe this process can be automated and improved to provide a more seamless failover path in the event of external repository outages.
  • Automation of Emergency Rollback: We will explore automating the configuration deployment step when falling back to a snapshot, to ensure full service readiness even if external resources remain unavailable.
  • Ongoing Monitoring and Collaboration: We will continue to monitor external resource availability and collaborate with upstream providers (such as Ubuntu) to stay informed about any ongoing or planned maintenance that could affect our platform’s build processes.
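
As a rough illustration of the automated fallback described in the first two measures, the decision could look something like the sketch below. It is hypothetical: the status strings, the ProvisionPlan type, and the configuration names in the usage note are invented for this example and do not describe our production systems.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class ProvisionPlan:
    source: str                   # "fresh_build" or "snapshot"
    configs_to_deploy: List[str]  # configurations the snapshot is missing


def plan_provisioning(
    repository_status: str,
    required_configs: Set[str],
    snapshot_configs: Set[str],
) -> ProvisionPlan:
    """Choose how to bring up a server given the external repository status."""
    if repository_status == "ok":
        # Normal path: a fresh build installs everything itself.
        return ProvisionPlan("fresh_build", [])
    # Fallback path: start from the snapshot, then deploy whatever it lacks
    # so the server is fully configured before it takes traffic.
    missing = sorted(required_configs - snapshot_configs)
    return ProvisionPlan("snapshot", missing)
```

For example, plan_provisioning("unavailable", {"routes", "vhost"}, {"vhost"}) selects the snapshot and lists "routes" as the configuration still to deploy. Computing that difference up front is what would avoid the “404” errors seen during this incident.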
Posted May 30, 2025 - 16:15 BST

Resolved

We’d like to inform you that a fix has been successfully implemented to address the recent 404/502 error issues. Our technical team has worked diligently to identify and resolve the root cause, and the affected services have now been restored.

Thank you for your patience and support while we worked through these issues. Should any further updates be necessary, we will share them promptly. In the meantime, please don’t hesitate to reach out if you experience any further difficulties or have any questions.
Posted May 29, 2025 - 16:51 BST

Update

We are continuing to monitor for any further issues.
Posted May 29, 2025 - 12:52 BST

Update

We are continuing to monitor for any further issues.
Posted May 29, 2025 - 12:51 BST

Monitoring

We’d like to inform you that a fix has been successfully implemented to address the 404/502 error issue. Our team is now closely monitoring the system to ensure that services remain stable and fully operational.

Thank you for your continued patience and support. We will provide further updates if needed or as we complete our monitoring.
Posted May 29, 2025 - 12:49 BST

Identified

We’d like to provide a quick update regarding the 404/502 error issue. Based on the latest developments, a fix is currently in progress, and our technical team is implementing the necessary changes to resolve the problem.

We appreciate your continued patience and understanding while we work to restore full functionality. Further updates will be shared as we make progress.
Posted May 29, 2025 - 12:39 BST

Update

We are continuing to investigate this issue.
Posted May 29, 2025 - 12:29 BST

Update

We are continuing to investigate this issue.
Posted May 29, 2025 - 11:22 BST

Investigating

We are currently investigating an issue causing 404 errors on one of our servers, which may be impacting access to certain pages or services. Our technical team is actively working to identify the root cause and implement a solution as quickly as possible.

We truly appreciate your patience, understanding, and continued cooperation while we work to restore full service. We will keep you updated as we make progress toward resolving the issue.
Posted May 29, 2025 - 11:21 BST
This incident affected: iSell Dublin (Databases, Back-office / Front End, Scheduled Jobs, Searches, Suppliers).