PolarisMail Service Outage Postmortem
As mentioned in our previous post, Service Outage Incident, we experienced a service outage on Friday, December 8th, 2017 at approximately 16:30 EST. Today we’re providing an incident report that details the nature of the outage and our response.
Thanks to heavy investment in our infrastructure, we had enjoyed 10 years without an outage or other major incident. Unfortunately, that streak came to an end on Friday due to events outside our control.
This service issue has impacted all of our customers, and we apologize to everyone who was affected.
The root cause was a power outage in downtown Montreal, which turned into a service outage when the Peer1 backup generators failed to start after the power loss was detected.
All times are EST.
Friday December 8th
15:45 A fire breaks out at a power station serving downtown Montreal.
16:15 The power station cuts off electricity to most of downtown Montreal. Peer1 DataCenter UPS units go live and shoulder the load.
16:30 Peer1 DataCenter backup generators fail to start and power is cut off to all Peer1 customers located at the 1080 Beaver Hall DC, including us.
16:45 We initially suspect a networking/routing issue, given the complete loss of communication with the DC.
16:55 Peer1 confirms the loss of power and we initiate outage procedures: our main service domain is configured with a 60-second DNS refresh, so we quickly divert incoming e-mail to offsite spooling servers.
17:00 A PolarisMail engineer is dispatched onsite
18:08 Power is restored to the DC. All servers come back online, but e-mail service is not yet available.
18:30 As described in our “E-mail Backups” post, we keep several copies of all mail data on different storage architectures in order to avoid data loss. Following the sudden power loss, our main storage array came back up but required a lengthy data integrity check (24h+) before becoming operational. Our live hotspare storage array was also affected and required the same check. Luckily, a third-level backup storage array, which runs on ZFS, was completely unaffected by the unexpected shutdown.
19:30 After a few attempts at restoring the original storage array and the live hotspare, we decide it is best to let the data integrity checks complete before using that infrastructure again.
19:30 We decide to promote the ZFS backup storage array to primary status.
23:00 The ZFS backup storage is completely reconfigured to take on the primary role.
23:15 Incoming e-mails are re-routed back to our main infrastructure
23:45 Outbound SMTP service is restored
Saturday December 9th
00:45 IMAP service is restored
01:20 POP3 service is restored
01:30 We keep complete, separate backups of all incoming e-mails on yet another storage array, captured before the messages are even delivered to customers; these are the gateway-level backups. The ZFS backup was missing these latest e-mails, so customers were noticing a gap in their e-mail timeline.
01:30 Restore procedure from Gateway Backup starts for all customers
07:00 Restore procedure completes; all e-mails are safely back in users’ mailboxes.
07:15 All services running as expected with no loss of data. Some desktop client software (Outlook, Thunderbird, etc.) starts resyncing mail folders as a result of these changes.
Sunday December 10th
13:00 Main storage array and live hotspare storage array finish their data integrity checks.
13:30 Mail data comparison checks are run across all systems to ensure they are all consistent.
23:00 Procedure to promote the original storage array to its primary role is started
23:30 All mail services are now running from the primary storage array with no incidents reported
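For readers curious how the 16:55 diversion took effect so quickly: our main service domain uses a very short DNS refresh, so record changes are picked up by sending servers within about a minute. A minimal sketch of what such a zone can look like (the hostnames here are hypothetical, not our actual infrastructure):

```
$TTL 60            ; 60-second TTL: record changes propagate within about a minute
; Normal operation: mail is delivered to the MX host in the datacenter.
example-mail.com.   IN  MX  10  mx1.example-mail.com.
; During an outage, the MX record is repointed to an offsite spooling
; server, which queues incoming mail until the primary infrastructure
; is back:
; example-mail.com. IN  MX  10  spool1.offsite.example.
```

Because resolvers cache the record for at most 60 seconds, a repointed MX takes effect almost immediately, and sending servers simply queue and retry anything in flight.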
Corrective and Preventative Measures
Although power had been restored, Peer1 brought in a backup generator the same night in order to mitigate further potential power losses.
We have made the decision to switch all of our storage arrays to the ZFS filesystem in order to minimize downtime in the event of another unexpected power loss. This will be completed in the next few weeks.
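As context for this decision: ZFS is copy-on-write, so the on-disk state remains consistent through a sudden power loss, and a pool can be brought back and verified online rather than sitting through a lengthy offline integrity check. A rough sketch of the post-crash procedure, assuming a hypothetical pool named tank:

```
# Import the pool after the unclean shutdown (-f forces import of a
# pool that was not exported cleanly)
zpool import -f tank

# Report health problems, if any; copy-on-write semantics mean the
# on-disk state is always a consistent snapshot
zpool status -x tank

# Optionally verify every block checksum in the background while the
# pool remains online and serving data
zpool scrub tank
```

The scrub runs while the filesystem stays mounted and in use, which is what lets a ZFS array return to service quickly instead of requiring a 24h+ offline check.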
We are evaluating the option of expanding to another DC in order to provide even more reliability.