Summary

On October 10, 2019, at 7:18 PM ET, we experienced an outage impacting all customer sites. Customers experienced intermittent slow and failing requests during a 22-minute window, after which we made a full recovery. In the interest of transparency, this post details the root causes of the outage and what we've learned.

Timeline and Investigation

At 7:18 PM ET on October 10th, we were alerted to intermittent errors loading Squarespace websites.

At 5:38 PM ET, prior to the outage, our database team received alerts from internal monitoring that a portion of our main database cluster was having performance problems. These issues affected only secondary database replicas and were not disrupting service. Our on-call engineers began to triage the failures and discovered that replicas were failing while attempting to allocate memory. At 7:18 PM ET, a primary replica failed, which led the team to take mitigating action. To reduce the memory usage of the primary replicas and provide more time for debugging, they began a rolling restart of all affected instances. The rolling restart lasted 22 minutes and prevented further impact. By 7:39 PM ET, all instances were in a healthy state, and the incident commander declared an all-clear at 7:58 PM ET.

After the systems were healthy again, the database team investigated further and found that the database failures were caused by instances hitting a memlock limit. The database uses memlock to lock pages into memory, preventing those pages from being swapped out to disk. Memlock limits cap how much memory a process may lock, which keeps processes from overusing locked memory and degrading performance. After a recent configuration change to the database, the system had gradually approached the configured memlock limit, resulting in the initial incident.
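
For readers unfamiliar with memlock limits, the following is a minimal sketch of how a process on a Unix-like system can read its own limit. It uses Python's standard resource module and is only an illustration: the function names memlock_limits and format_limit are hypothetical, and this is not the tooling our database team used.

```python
import resource


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK values in bytes for this process.

    A value of None means the limit is unlimited.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def normalize(value):
        # The resource module reports "no limit" as RLIM_INFINITY.
        return None if value == resource.RLIM_INFINITY else value

    return normalize(soft), normalize(hard)


def format_limit(value):
    """Render a limit value for display (hypothetical helper)."""
    return "unlimited" if value is None else f"{value} bytes"


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("memlock soft limit:", format_limit(soft))
    print("memlock hard limit:", format_limit(hard))
```

The soft limit is the one the kernel enforces when a process tries to lock more memory; the hard limit is the ceiling to which the soft limit can be raised.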

Remediation and Prevention

We took several steps to prevent a recurrence of this issue. First, we adjusted our internal memlock limits to reflect a value more appropriate for our expected usage. Second, we added monitoring to give us insight into when we’re approaching this particular limit. Together, these steps should help prevent similar outages and allow us to provide a reliable platform for our customers.
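
As an illustration of the kind of check such monitoring might perform, here is a minimal sketch that assumes a Linux host, where /proc/<pid>/status reports a process's locked memory in the VmLck field. The function names, the 80% threshold, and the sample values are hypothetical and do not describe our actual monitoring.

```python
import re

# Matches the VmLck line of /proc/<pid>/status, e.g. "VmLck:       64 kB".
VMLCK_RE = re.compile(r"^VmLck:\s+(\d+)\s+kB", re.MULTILINE)


def parse_locked_bytes(status_text):
    """Extract the locked-memory size in bytes from /proc/<pid>/status text."""
    match = VMLCK_RE.search(status_text)
    if match is None:
        raise ValueError("VmLck field not found in status text")
    return int(match.group(1)) * 1024


def approaching_memlock_limit(locked_bytes, limit_bytes, threshold=0.8):
    """Return True when locked memory is at or above threshold * limit.

    limit_bytes of None means unlimited, so there is nothing to alert on.
    The 0.8 default is an assumed value chosen for illustration.
    """
    if limit_bytes is None:
        return False
    return locked_bytes >= threshold * limit_bytes


if __name__ == "__main__":
    # Hypothetical sample: 819200 kB locked against a 1 GiB limit.
    sample_status = "Name:\tdbserver\nVmLck:\t  819200 kB\nVmRSS:\t 2048000 kB\n"
    locked = parse_locked_bytes(sample_status)
    limit = 1024 ** 3
    print("locked bytes:", locked)
    print("alert:", approaching_memlock_limit(locked, limit))
```

Alerting on a fraction of the limit, rather than on the limit itself, leaves time to act while usage is still climbing gradually.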

A Final Word

We sincerely apologize for this outage and any inconvenience it may have caused our customers and their visitors. We have taken this opportunity to learn and to improve our infrastructure so that we can prevent similar incidents from occurring in the future.

Posted Oct 30, 2019 - 10:15 EDT

Resolved

This incident has been resolved and we have confirmed that all systems are operational. Thank you for your patience.

Posted Oct 10, 2019 - 19:58 EDT

Monitoring

A resolution has been implemented and we are monitoring closely before resolving the issue.

Posted Oct 10, 2019 - 19:47 EDT

Identified

The issue has been identified and a fix is being implemented.

Posted Oct 10, 2019 - 19:31 EDT

Investigating

We are investigating connectivity issues related to most Squarespace sites. We will provide more information as soon as possible.

Posted Oct 10, 2019 - 19:23 EDT

This incident affected: Site Loading.