On October 10, 2019 at 7:18 PM ET, we experienced an outage impacting all customer sites. Customers saw intermittent slow and failing requests during a 22-minute window, after which we made a full recovery. In the interest of transparency, this post details the root causes of the outage and what we've learned.
At 7:18 PM ET on October 10th, we were alerted to intermittent errors loading Squarespace websites.
At 5:38 PM ET, prior to the outage, our database team received alerts from internal monitoring that a portion of our main database cluster was having performance problems. These issues were affecting only secondary database replicas and were not causing any disruption to service. Our on-call engineers began to triage the failures and discovered that replicas were failing while attempting to allocate memory. At 7:18 PM ET, a primary replica failed, which led the team to take mitigating action. To reduce the memory usage of the primary replicas and provide more time for debugging, they began a rolling restart of all affected instances. The restart lasted 22 minutes and prevented further impact. By 7:39 PM ET, all instances were in a healthy state, and the incident commander declared an all-clear at 7:58 PM ET.
After the systems were healthy again, the database team investigated further and found that the database failures were caused by instances hitting a memlock limit. The database uses memlock to lock pages into memory, preventing them from being swapped out to disk. Memlock limits cap how much memory a process may lock, which keeps any one process from locking too much and degrading system performance. After a recent configuration change to the database, locked memory usage had gradually approached, and eventually hit, the configured memlock limit, resulting in the initial incident.
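For readers who want to see where a process stands against this limit, here is a minimal sketch for Linux. It is illustrative only and is not the tooling our database team used. It reads the soft `RLIMIT_MEMLOCK` for the current process and parses the `VmLck` field from `/proc/<pid>/status`. The function names are hypothetical.

```python
import resource


def memlock_limit_bytes():
    """Return this process's soft RLIMIT_MEMLOCK in bytes, or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return None if soft == resource.RLIM_INFINITY else soft


def parse_vmlck_bytes(status_text):
    """Extract locked memory (the VmLck field) in bytes from /proc/<pid>/status text.

    Returns None if the field is missing.
    """
    for line in status_text.splitlines():
        if line.startswith("VmLck:"):
            # The kernel reports this field as, e.g., "VmLck:       128 kB".
            return int(line.split()[1]) * 1024
    return None


def current_locked_bytes(pid="self"):
    """Read locked memory for a process from procfs (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmlck_bytes(f.read())
```

Comparing `current_locked_bytes()` with `memlock_limit_bytes()` over time would show the kind of gradual approach described above.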
We took several steps to prevent a recurrence of this issue. First, we adjusted our internal memlock limits to reflect a value more appropriate for our expected usage. Second, we added monitoring to give us insight into when we’re approaching this particular limit. Together, these steps should keep this failure mode from recurring and help us provide a reliable platform for our customers.
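As a rough illustration of what monitoring against a limit like this can look like, the sketch below classifies locked-memory usage relative to the configured limit. It is not our actual monitoring. The function name and the 80% and 95% thresholds are assumptions chosen for the example.

```python
def memlock_alert(locked_bytes, limit_bytes, warn_ratio=0.8, crit_ratio=0.95):
    """Classify locked-memory usage against a memlock limit.

    limit_bytes=None means the limit is unlimited, so there is nothing to alert on.
    The thresholds are illustrative defaults, not recommended values.
    """
    if limit_bytes is None:
        return "ok"
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive or None")
    ratio = locked_bytes / limit_bytes
    if ratio >= crit_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"
```

A check like this, run periodically per database instance, would raise a warning well before allocations start to fail.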
We sincerely apologize for this outage and any inconvenience it may have caused our customers and their visitors. We have taken this opportunity to learn and to improve our infrastructure so that we can prevent similar incidents from occurring in the future.