PolarisMail Service Outage Postmortem

As mentioned in our previous post, “Service Outage Incident”, we experienced a service outage on Friday, December 8th, 2017, starting at approximately 16:30 EST. Today we’re providing an incident report that details the nature of the outage and our response.

Thanks to heavy investment in our infrastructure over the past few years, we had gone 10 years without an outage or any other kind of incident. Unfortunately, this streak came to an end on Friday due to events beyond our control.
This service issue impacted all of our customers, and we apologize to everyone who was affected.

Root Cause

The root cause was a power outage in downtown Montreal, which turned into a service outage because Peer1’s backup generators failed to start when the power loss was detected.

Issue Summary

All times are EST.

Friday December 8th

15:45 (approx.) A fire breaks out at a power station serving downtown Montreal.

16:15 The power station cuts off electricity to most of downtown Montreal. The Peer1 DataCenter UPS units take over and carry the load.

16:30 Peer1 DataCenter backup generators fail to start and power is cut off to all Peer1 customers located at the 1080 Beaver Hall DC, including us.

16:45 We initially suspect a networking/routing issue, given the complete loss of communication with the DC.

16:55 Peer1 confirms the loss of power and we initiate outage procedures. Our main service domain is configured with a 60-second refresh, so we are able to quickly divert incoming e-mails to offsite pooling servers.

17:00 A PolarisMail engineer is dispatched onsite

18:08 Power is restored to the DC. All servers come back online but e-mail service is not available.

18:30 As described in our “E-mail Backups” post, we keep several copies of all mail data on different kinds of architecture in order to avoid data loss. Following the sudden power loss, our main storage array came back up but required a lengthy data integrity check (24h+) before becoming operational. Our live hotspare storage array was also affected and required the same check. Fortunately, a third-level backup storage array, which runs on ZFS, was completely unaffected by the unexpected shutdown.

19:30 After a few attempts at restoring the original storage array and the live hotspare, we decided it was best to let the data integrity check complete before using that infrastructure again.

19:30 We decided to promote the ZFS backup storage array to primary status.

23:00 ZFS backup storage is completely reconfigured to take on the primary role

23:15 Incoming e-mails are re-routed back to our main infrastructure

23:45 Outbound SMTP service is restored

Saturday December 9th

00:45 IMAP service is restored

01:20 POP3 service is restored

01:30 We keep complete, separate backups of all incoming e-mails on yet another storage array, taken before the e-mails are even delivered to customers. These are the gateway-level backups. The ZFS backup lacked the most recent e-mails, so customers noticed a gap in their e-mail timeline.

01:30 Restore procedure from Gateway Backup starts for all customers

07:00 Restore procedure completes; all e-mails are safely back in the users’ mailboxes.

07:15 All services are running as expected with no loss of data. Some desktop client software (Outlook, Thunderbird, etc.) starts resyncing mail folders as a result of these changes.
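To illustrate the idea behind the gateway-level restore at 01:30, here is a minimal sketch of how a gap in a mailbox could be backfilled from gateway copies. It is not our actual restore procedure: the function name `backfill_mailbox`, the dictionary fields `message_id` and `received_at`, and the sample data are all hypothetical, and it assumes every message carries a unique Message-ID.

```python
def backfill_mailbox(mailbox, gateway_copies):
    """Append gateway copies that the mailbox does not already contain.

    Hypothetical sketch: `mailbox` and `gateway_copies` are lists of dicts
    with assumed keys "message_id" and "received_at" (ISO 8601 strings).
    Returns the Message-IDs that were restored, oldest first.
    """
    seen = {msg["message_id"] for msg in mailbox}
    restored = []
    # Replay gateway copies in arrival order so the timeline stays ordered.
    for msg in sorted(gateway_copies, key=lambda m: m["received_at"]):
        if msg["message_id"] not in seen:
            mailbox.append(msg)
            seen.add(msg["message_id"])
            restored.append(msg["message_id"])
    return restored


# Illustrative data only: the mailbox restored from the older backup holds
# message A, while the gateway backup also holds the later messages B and C.
mailbox = [{"message_id": "<a@example>", "received_at": "2017-12-08T15:00"}]
gateway_copies = [
    {"message_id": "<c@example>", "received_at": "2017-12-08T16:20"},
    {"message_id": "<a@example>", "received_at": "2017-12-08T15:00"},
    {"message_id": "<b@example>", "received_at": "2017-12-08T16:05"},
]

first_run = backfill_mailbox(mailbox, gateway_copies)
second_run = backfill_mailbox(mailbox, gateway_copies)
```

Keying on the Message-ID makes the sketch safe to re-run: the second call restores nothing because every gateway copy is already present.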

Sunday December 10th

13:00 Main storage array and live hotspare storage array finish the data integrity check.

13:30 Mail data comparison checks are run across all systems to ensure they are consistent.

23:00 Procedure to promote the original storage array to its primary role is started

23:30 All mail services are now running from the primary storage array with no incidents reported
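For readers who want the per-service downtime, the following small sketch derives it from the timestamps in the timeline above. It assumes the outage starts at the 16:30 power loss and ends for each service at its listed restoration time; the variable and function names are ours for illustration only.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all EST, December 2017).
FMT = "%Y-%m-%d %H:%M"
power_lost = datetime.strptime("2017-12-08 16:30", FMT)

restored_at = {
    "Outbound SMTP": "2017-12-08 23:45",
    "IMAP": "2017-12-09 00:45",
    "POP3": "2017-12-09 01:20",
    "Gateway restore complete": "2017-12-09 07:00",
}


def downtime_minutes(start, end_text):
    """Whole minutes between `start` and the timestamp in `end_text`."""
    end = datetime.strptime(end_text, FMT)
    return int((end - start).total_seconds() // 60)


def format_duration(minutes):
    """Render a minute count as hours and minutes, e.g. 495 -> '8h 15m'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"


durations = {
    name: downtime_minutes(power_lost, ts) for name, ts in restored_at.items()
}
for name, minutes in durations.items():
    print(f"{name}: {format_duration(minutes)}")
```

Measured this way, outbound SMTP was down for 7h 15m, IMAP for 8h 15m and POP3 for 8h 50m, and the gateway-level restore finished 14h 30m after the power loss.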

Corrective and Preventative Measures

Although power had been restored, Peer1 brought in a backup generator the same night in order to guard against further power losses.
We have decided to switch all of our storage arrays to the ZFS filesystem in order to minimize downtime in case of another unexpected power loss. This will be completed in the next few weeks.
We are analyzing the option of expanding to another DC in order to provide even more reliability.

Thank you for your support and understanding.

George Breahna

CEO, PolarisMail Inc
