Impact
During the incident, users accessing the iSell platform experienced intermittent HTTP 502 (Bad Gateway) and 503 (Service Unavailable) errors. The issue affected the front-end application layer, with multiple load-balanced servers returning failed or delayed responses. The disruption lasted approximately 30 minutes, and intermittent errors continued until new instances were provisioned and stabilised.
Cause & Resolution
The incident was caused by a sudden spike in RAM usage across the front-end servers. As system memory became exhausted, the affected instances were unable to process incoming requests, leading to timeouts and connection failures when the load balancer attempted to route traffic to them.
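For illustration only, memory pressure of the kind described above is usually visible on the instance itself before requests begin to fail. The minimal sketch below (assuming Python with the psutil library and a hypothetical 90% threshold, neither of which is specified in this report) shows how an instance-level check could flag exhaustion before the load balancer starts receiving timeouts.

```python
import psutil

# Hypothetical threshold; the actual limits on the front-end servers are not documented here.
MEMORY_PRESSURE_THRESHOLD = 90.0  # percent of system RAM in use


def under_memory_pressure(threshold: float = MEMORY_PRESSURE_THRESHOLD) -> bool:
    """Return True when system memory usage exceeds the given threshold.

    An instance in this state is likely to time out on new requests,
    which the load balancer then surfaces as 502/503 responses.
    """
    usage = psutil.virtual_memory().percent
    return usage >= threshold


if __name__ == "__main__":
    if under_memory_pressure():
        print("Memory exhausted: instance should report itself unhealthy.")
    else:
        print("Memory usage within normal bounds.")
```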
The platform’s auto-scaling policy was triggered in response to the elevated load, automatically provisioning new instances to handle incoming requests. Once the additional servers came online and traffic was redistributed, normal service was restored and the error rate returned to baseline levels.
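As a rough illustration of the scaling behaviour described above (the report does not state the cloud provider or policy details, so the group name, metric namespace, and 70% target below are assumptions), a memory-based target-tracking policy on an AWS-style auto-scaling group might look like this:

```python
import boto3

# Hypothetical names and values; the actual auto-scaling group and targets are not given in this report.
ASG_NAME = "isell-frontend-asg"

autoscaling = boto3.client("autoscaling")

# Target-tracking policy: add instances when average memory usage across the
# group exceeds the target, and scale back in once it drops below it.
autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="frontend-memory-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        # Memory is not a built-in EC2 metric; it is assumed here to be
        # published by the CloudWatch agent under the CWAgent namespace.
        "CustomizedMetricSpecification": {
            "MetricName": "mem_used_percent",
            "Namespace": "CWAgent",
            "Dimensions": [
                {"Name": "AutoScalingGroupName", "Value": ASG_NAME},
            ],
            "Statistic": "Average",
        },
        "TargetValue": 70.0,  # keep average memory usage around 70%
    },
)
```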
Prevention Measures
While the existing auto-scaling configuration performed as intended and mitigated the issue quickly, the scaling response window has since been tightened so that new instances are provisioned sooner when memory pressure rises sharply.
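To give a sense of what tightening the response window involves (again assuming an AWS-style setup; the alarm name, periods, threshold, and cooldown below are illustrative, not the values actually deployed), the change amounts to evaluating the memory metric over shorter intervals and reducing the cooldown between scaling actions:

```python
import boto3

ASG_NAME = "isell-frontend-asg"  # hypothetical group name, as above

cloudwatch = boto3.client("cloudwatch")
autoscaling = boto3.client("autoscaling")

# Evaluate memory usage over two 60-second periods rather than a longer window,
# so a sudden spike triggers the scale-out alarm within roughly two minutes.
cloudwatch.put_metric_alarm(
    AlarmName="isell-frontend-high-memory",
    AlarmDescription="Scale out when front-end memory usage spikes",
    Namespace="CWAgent",
    MetricName="mem_used_percent",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
)

# Shorter default cooldown so consecutive scale-out actions are not delayed.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName=ASG_NAME,
    DefaultCooldown=120,
)
```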
Additional measures have also been implemented to help minimise service impact should similar conditions occur again.