Connectivity Issues
Incident Report for Squarespace
Postmortem

SUMMARY

On Wednesday, April 20, 2016, traffic to Squarespace websites was impacted from 1:14 AM to 7:48 AM Eastern Time (ET). During this time, traffic was still being routed to our data centers, but customers and their visitors received an error message (HTTP 504) indicating that service was unavailable. While all Squarespace sites were initially unavailable, a partial restoration of service at 4:31 AM ET brought back the home page of all sites, and the remaining service was restored incrementally from there.

We sincerely apologize to our customers and their visitors for this very serious outage. We pride ourselves on being fast, responsive, and reliable. At this time, we fully understand the root cause of this issue and have resolved the problem.

TIMELINE AND ROOT CAUSE

When our systems were impacted at 1:14 AM ET on Wednesday, April 20, 2016, our monitoring system alerted our engineers, who responded immediately. Our initial investigation revealed that the cluster of software that serves web pages was offline. Further investigation revealed that until the moment our service stopped, every subsystem was operating within normal levels and externally measured response time was also normal. Basic attempts to reboot the nodes in this cluster resulted in immediate failure. This was an unprecedented situation for us: servers exited within a minute of startup, with no clear logging indicating what the root cause was. The performance of all systems was fine up until that event, and there were no code deploys within hours of the event. Our teams began searching for forensic evidence to explain the problem while pursuing a solution for partial restoration of service.

At 3:42 AM ET, we concluded that either traffic to a particular website or traffic from a particular group of visitors was causing the issue. We successfully made a change in our web application firewall to block traffic to all URL patterns except the root (home) page of every site. We ruled out a problem caused by particular visitors and knew that a particular URL pattern must be triggering a bug in our code. By 4:31 AM ET, the home pages of all Squarespace websites were loading properly.
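To illustrate the kind of rule described above, here is a minimal sketch of a filter that admits only requests for a site's root page. It is not the actual firewall configuration used during the incident; the function names `is_root_request` and `firewall_decision` are hypothetical and exist only for this example.

```python
from urllib.parse import urlsplit


def is_root_request(url: str) -> bool:
    """Return True only when the URL points at a site's root (home) page."""
    # urlsplit separates the path from the query string and fragment,
    # so "/?ref=1" still counts as a request for the root page.
    path = urlsplit(url).path
    return path in ("", "/")


def firewall_decision(url: str) -> str:
    """Hypothetical allow/block decision for an emergency root-only rule."""
    return "allow" if is_root_request(url) else "block"
```

With a rule like this in place, `firewall_decision("https://example.com/")` allows the request while `firewall_decision("https://example.com/blog/post")` blocks it. Widening the rule step by step then amounts to adding more allowed path patterns.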

For the next two hours, we methodically searched for the problematic URL pattern by replaying the requests found in our access logs against a subset of the servers in our web cluster. Our engineers continuously updated our web application firewall to allow more traffic to enter our systems. By 6:23 AM ET, we were successfully serving traffic at 95% of our normal level. At 7:48 AM ET, we identified and isolated the problematic URLs and all systems were back to normal.
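One common way to narrow down a search like this is to bisect the logged requests: replay half of them against an isolated server, see whether it crashes, and keep whichever half still reproduces the failure. The sketch below assumes that approach and assumes a single offending URL crashes a server on its own; the report does not say which search strategy was actually used. The `crashes` callback is a hypothetical stand-in for "replay this batch against a test server and report whether it died".

```python
from typing import Callable, Optional, Sequence


def find_crashing_url(
    urls: Sequence[str],
    crashes: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Bisect a list of logged URLs down to one that reproduces the crash.

    Returns None when replaying the full list does not crash the server.
    """
    if not urls or not crashes(urls):
        return None

    candidates = list(urls)
    # Invariant: replaying `candidates` crashes the server.
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        if crashes(first_half):
            candidates = first_half
        else:
            # The first half replayed cleanly, so the offender is in the rest.
            candidates = candidates[middle:]
    return candidates[0]
```

Each half that replays cleanly is also a set of URLs that can safely be allowed through the firewall again, which fits with restoring traffic incrementally while the search continues.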

Further investigation of the problematic URL showed that a certain data payload existing on a single high-traffic site was triggering an unexplored code path within our system. Whenever one of the servers in our web cluster received a request for that site, memory consumption would skyrocket immediately and the operating system would kill the underlying process. This event happened so quickly that no logging of the offending URL occurred, and the crash reports from the affected servers became corrupted, severely impeding our ability to determine the trigger for this issue.
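One general technique for this class of problem is to record each request before handling it, so that the URL survives even if the process is killed mid-request. The sketch below shows that idea under stated assumptions: the `InFlightLog` class, the `BEGIN`/`END` line format, and the `unfinished_requests` helper are all hypothetical and are not taken from Squarespace's systems.

```python
import os
from typing import List


class InFlightLog:
    """Append-only record of requests that have started but not yet finished."""

    def __init__(self, path: str) -> None:
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def begin(self, url: str) -> None:
        # os.write hands the bytes to the kernel, so they outlive this process
        # being killed; fsync additionally forces them to disk in case the
        # whole host goes down.
        os.write(self._fd, f"BEGIN {url}\n".encode("utf-8"))
        os.fsync(self._fd)

    def end(self, url: str) -> None:
        os.write(self._fd, f"END {url}\n".encode("utf-8"))

    def close(self) -> None:
        os.close(self._fd)


def unfinished_requests(path: str) -> List[str]:
    """Return URLs that have a BEGIN line but no matching END line."""
    pending: List[str] = []
    with open(path, encoding="utf-8") as log:
        for line in log:
            marker, _, url = line.rstrip("\n").partition(" ")
            if marker == "BEGIN":
                pending.append(url)
            elif marker == "END" and url in pending:
                pending.remove(url)
    return pending
```

After a crash, calling `unfinished_requests` on the log file lists the requests that were in progress when the process died, which gives a short list of suspects. The trade-off is an extra synchronous write per request, so a sketch like this is better suited to a diagnostic mode than to permanent use on a hot path.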

Our engineering team has repaired the code defect that caused this issue. After thorough testing, we deployed the fix and removed the filters blocking the problematic URLs. To avoid a future problem of this nature, we’re adding further checks and fail-safes to the affected code paths. Other efforts are underway to make these types of issues much faster to identify in the future.

A FINAL WORD

We take all outages very seriously and we are always working to improve our service’s reliability and uptime. Like many of our customers, we rely on Squarespace for our own business and livelihoods. This event was an especially difficult one for us, as we fell short of the high standards we set for ourselves and our service.

We hope that by being transparent about the causes, conclusions, and lessons learned from incidents such as this one, we can continue to build trust with our customers and offer reassurance that the reliability of all Squarespace products remains our number one priority.

Sincerely,
John Colton
Squarespace
VP Engineering

Posted Apr 22, 2016 - 17:37 EDT

Resolved
Systems are back to normal. We will post more detailed information at a later date.
Posted Apr 20, 2016 - 07:53 EDT
Update
We are now serving 95% of our normal traffic levels. The issue remains constrained to specific URLs, and does not affect generalized site traffic.
Posted Apr 20, 2016 - 06:33 EDT
Identified
Service has been restored to 88% of normal traffic levels, with the remaining affected traffic being at isolated URLs. We continue to work aggressively to restore access to 100% and will post a more detailed update at a later date. Thanks for your patience.
Posted Apr 20, 2016 - 06:23 EDT
Update
We are incrementally restoring access to sites.
Posted Apr 20, 2016 - 04:45 EDT
Update
Our team continues to pursue the root cause of this outage. Thanks for your patience.
Posted Apr 20, 2016 - 04:17 EDT
Update
All sites are down. Our team is actively engaged in determining the cause.
Posted Apr 20, 2016 - 03:31 EDT
Update
Our team continues to pursue the root cause of this outage. Thanks for your patience.
Posted Apr 20, 2016 - 02:13 EDT
Update
All sites are down. Our team is actively engaged in determining the cause.
Posted Apr 20, 2016 - 01:35 EDT
Investigating
We are investigating a general site access issue.
Posted Apr 20, 2016 - 01:22 EDT