Impact
During the incident, users accessing the iSell platform experienced intermittent HTTP 502 (Bad Gateway) and 503 (Service Unavailable) errors. The issue affected the front-end application layer, with multiple load-balanced servers returning failed or delayed responses. The disruption lasted approximately 30 minutes, and intermittent errors continued until new instances were provisioned and stabilised.
Cause & Resolution
The incident was caused by a sudden spike in RAM usage across the front-end servers. As system memory became exhausted, the affected instances were unable to process incoming requests, leading to timeouts and connection failures when the load balancer attempted to route traffic to them.
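For illustration only, memory pressure of the kind described above is usually visible on the instance itself before requests begin to fail. The minimal sketch below (assuming Python with the psutil library and a hypothetical 90% threshold, neither of which is specified in this report) shows how an instance-level check could flag exhaustion before the load balancer starts receiving timeouts.

```python
import psutil

# Hypothetical threshold; the actual limits on the front-end servers are not documented here.
MEMORY_PRESSURE_THRESHOLD = 90.0  # percent of system RAM in use


def under_memory_pressure(threshold: float = MEMORY_PRESSURE_THRESHOLD) -> bool:
    """Return True when system memory usage exceeds the given threshold.

    An instance in this state is likely to time out on new requests,
    which the load balancer then surfaces as 502/503 responses.
    """
    usage = psutil.virtual_memory().percent
    return usage >= threshold


if __name__ == "__main__":
    if under_memory_pressure():
        print("Memory exhausted: instance should report itself unhealthy.")
    else:
        print("Memory usage within normal bounds.")
```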
The platform’s auto-scaling policy was triggered in response to the elevated load, automatically provisioning new instances to handle incoming requests. Once the additional servers came online and traffic was redistributed, normal service was restored and the error rate returned to baseline levels.
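As a rough illustration of the scaling behaviour described above (the report does not state the cloud provider or policy details, so the group name, metric namespace, and 70% target below are assumptions), a memory-based target-tracking policy on an AWS-style auto-scaling group might look like this:

```python
import boto3

# Hypothetical names and values; the actual auto-scaling group and targets are not given in this report.
ASG_NAME = "isell-frontend-asg"

autoscaling = boto3.client("autoscaling")

# Target-tracking policy: add instances when average memory usage across the
# group exceeds the target, and scale back in once it drops below it.
autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="frontend-memory-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        # Memory is not a built-in EC2 metric; it is assumed here to be
        # published by the CloudWatch agent under the CWAgent namespace.
        "CustomizedMetricSpecification": {
            "MetricName": "mem_used_percent",
            "Namespace": "CWAgent",
            "Dimensions": [
                {"Name": "AutoScalingGroupName", "Value": ASG_NAME},
            ],
            "Statistic": "Average",
        },
        "TargetValue": 70.0,  # keep average memory usage around 70%
    },
)
```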
Prevention Measures
While the existing auto-scaling configuration performed as intended and mitigated the issue quickly, the scaling response window has since been tightened so that new instances are provisioned sooner when memory pressure rises sharply.
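To give a sense of what tightening the response window involves (again assuming an AWS-style setup; the alarm name, periods, threshold, and cooldown below are illustrative, not the values actually deployed), the change amounts to evaluating the memory metric over shorter intervals and reducing the cooldown between scaling actions:

```python
import boto3

ASG_NAME = "isell-frontend-asg"  # hypothetical group name, as above

cloudwatch = boto3.client("cloudwatch")
autoscaling = boto3.client("autoscaling")

# Evaluate memory usage over two 60-second periods rather than a longer window,
# so a sudden spike triggers the scale-out alarm within roughly two minutes.
cloudwatch.put_metric_alarm(
    AlarmName="isell-frontend-high-memory",
    AlarmDescription="Scale out when front-end memory usage spikes",
    Namespace="CWAgent",
    MetricName="mem_used_percent",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
)

# Shorter default cooldown so consecutive scale-out actions are not delayed.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName=ASG_NAME,
    DefaultCooldown=120,
)
```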
Additional measures have also been implemented to help minimise service impact should similar conditions occur again.