Partial Outage Serving Site Media and Stylesheets

Incident Report for Squarespace

Postmortem

SUMMARY

On Tuesday, February 12, 2019, from 12:10 AM to 2:20 AM Eastern Time (ET), we experienced an issue with traffic to Squarespace's Content Delivery Network (CDN) origin fleet. This impacted the domains static.squarespace.com and static1.squarespace.com, which meant that images, scripts, and stylesheets were not loading for some sites. Many sites were not affected since our global CDN providers successfully served requests for cached assets. We completed the restoration of our CDN origin at 2:20 AM ET and CDN traffic returned to normal.

We sincerely apologize to our customers and their visitors for this serious outage. We pride ourselves on being fast, responsive, and reliable. At this time, we have completed an investigation of this incident and begun engineering efforts to prevent this issue from happening again.

TIMELINE AND INVESTIGATION

(All times are ET)

At 12:10 AM on Tuesday, February 12, 2019, our monitoring system alerted the CDN origin fleet’s on-call engineer that the fleet had fallen below a safe number of healthy nodes. The on-call engineer responded immediately and began triaging.

We manage large outages using a dedicated Incident Commander on-call rotation. Although each team is on call for its own services, the Incident Commander is responsible for coordination and communication during an outage, including posting to our status page. Regrettably, our team initially misjudged the impact of the issue and did not bring in an Incident Commander until it later became clear that customers had been significantly impacted.

By 1:22 AM, the on-call engineer had escalated the issue to additional team members and basic attempts to restart the affected service had proven unsuccessful. At this point, the severity of the issue became apparent. At 1:30 AM, the on-call engineer paged an Incident Commander, who created a Status Page post.

By 2:15 AM, further remediation steps restored partial service to the CDN origin fleet. At 2:20 AM, the service was restored to normal behavior. Our team continued to monitor the service. At 3:03 AM, with the fleet completely restored and our monitoring system indicating extended healthy operation, we declared the incident resolved.

The following morning, our engineering team performed an incident investigation. They found that an engineer had initiated a maintenance operation on February 11 at 10:30 PM across the Squarespace CDN origin fleet. This operation unexpectedly continued running in the background after the initiating engineer believed it had been terminated, and caused a CDN origin service outage at 12:10 AM.

REMEDIATION AND PREVENTION

The Squarespace engineer on call for this fleet was alerted to the issue only when the fleet of CDN origin servers fell below its critical availability threshold. We are revising our incident-response playbooks to include improved checks and actions that would quickly resolve most issues that affect this fleet. Initial checks in this updated process will determine customer impact and require the immediate involvement of an Incident Commander.

We have begun engineering efforts to improve our external monitoring capabilities so that we can more quickly assess the customer impact of issues affecting our CDN partners and our origin services. This effort will give our engineers better and more immediate insight into what our customers are experiencing.

Additionally, we are revising operational policies and adding safeguards around how engineers issue maintenance commands, including limiting the portion of the fleet that such commands can operate on. Our long-term work with these systems also includes a strong focus on minimizing the need for direct manual maintenance operations and on continually improving the resilience of our infrastructure through the creation of independent failure domains and automatic routing of requests around unhealthy portions of the system.
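As an illustration of the kind of safeguard described above, the following is a minimal sketch of a maintenance runner that only ever operates on a bounded portion of a fleet at a time and stops as soon as fleet health drops below a floor. It is not Squarespace's actual tooling: the function names (plan_batches, run_maintenance), the parameters (max_fraction, min_healthy_fraction), and their default values are all hypothetical and chosen only for this example.

```python
import math
from typing import Callable, List, Sequence


def plan_batches(nodes: Sequence[str], max_fraction: float) -> List[List[str]]:
    """Split a fleet into batches no larger than max_fraction of its size.

    Hypothetical helper: a batch always contains at least one node, so a very
    small max_fraction degrades to one-node-at-a-time maintenance.
    """
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    if not nodes:
        return []
    batch_size = max(1, math.floor(len(nodes) * max_fraction))
    return [list(nodes[i:i + batch_size]) for i in range(0, len(nodes), batch_size)]


def run_maintenance(
    nodes: Sequence[str],
    command: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_fraction: float = 0.1,          # assumed limit, not from the report
    min_healthy_fraction: float = 0.8,  # assumed health floor, not from the report
) -> List[str]:
    """Run command batch by batch, halting if fleet health falls below the floor.

    Returns the nodes the command was actually run on.
    """
    touched: List[str] = []
    for batch in plan_batches(nodes, max_fraction):
        for node in batch:
            command(node)
            touched.append(node)
        # Re-check the whole fleet after every batch before continuing.
        healthy = sum(1 for n in nodes if is_healthy(n))
        if healthy < min_healthy_fraction * len(nodes):
            break
    return touched
```

With a design like this, a command that unexpectedly harms nodes is stopped after a few batches instead of reaching the whole fleet, because the health check runs between batches and not only at the end.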

A FINAL WORD

We know that not having to worry about keeping sites up is a big part of why people choose Squarespace, and we take outages very seriously. It is critical to us that we learn from outages and lapses in communication so we can live up to the high standards that we set for ourselves and that our customers deserve.

Sincerely,
John Colton
Squarespace
SVP Engineering

Posted Feb 15, 2019 - 11:28 EST

Resolved

This incident has been resolved and we have confirmed that error rates have returned to acceptable levels. Thank you for your patience.

Posted Feb 12, 2019 - 03:03 EST

Monitoring

A resolution has been implemented and we are monitoring closely before resolving the issue.

Posted Feb 12, 2019 - 02:50 EST

Identified

We have identified the issue causing elevated error rates serving media and stylesheets and are implementing a resolution.

Posted Feb 12, 2019 - 02:24 EST

Update

We are investigating a partial outage of serving site media and stylesheets.

Posted Feb 12, 2019 - 01:53 EST

Investigating

We are investigating a partial outage of serving site media and stylesheets.

Posted Feb 12, 2019 - 01:38 EST

This incident affected: Site Loading.