The outage on N6 lasted an unacceptably long time, so we have started working on new ways to get alerts and fix problems faster. In short, when the outage happened, the on-call staff who were alerted did not have the tools or experience to fix this specific error.
The error was a small line error in a certificate PEM file, which made it extremely hard to find. When the on-call staff could not find it, the lead was contacted and, once online, was able to solve it in a few minutes. It turns out this error had been seen before, and a script to track it down was already saved in our internal documentation. The script found the error, and after a manual edit of the PEM file everything was back online.
The error itself was in one small section of a PEM file. During the file write process, an old and a new version of the certificate chain info were mixed together, which left a gap in the file with the previous line duplicated.
Once the doubled line and the blank line are removed, the file is a valid PEM file and the web servers can boot (yes, we have multiple web servers per node).
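For readers who want to check their own certificate files for the same pattern, here is a minimal sketch in Python. It is not the script from our internal documentation; the function names `find_pem_problems` and `repair_pem` are hypothetical, and it assumes the damage always looks like what is described above: a blank line inside a BEGIN/END block and a line repeated right next to it.

```python
def find_pem_problems(text):
    """Return (line_number, description) pairs for suspicious lines.

    Assumption: a blank line or a repeated line is only a problem
    inside a -----BEGIN ...----- / -----END ...----- block.
    """
    problems = []
    inside = False
    prev = None  # last non-blank line seen inside the current block
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("-----BEGIN "):
            inside = True
            prev = None
            continue
        if stripped.startswith("-----END "):
            inside = False
            prev = None
            continue
        if not inside:
            continue
        if stripped == "":
            problems.append((lineno, "blank line inside block"))
            continue  # keep prev so a repeat across the gap is still caught
        if stripped == prev:
            problems.append((lineno, "duplicate of previous line"))
        prev = stripped
    return problems


def repair_pem(text):
    """Drop every line flagged by find_pem_problems and return the rest."""
    bad = {lineno for lineno, _ in find_pem_problems(text)}
    kept = [
        line
        for lineno, line in enumerate(text.splitlines(), start=1)
        if lineno not in bad
    ]
    return "\n".join(kept) + "\n"
```

A check like this only catches this one failure pattern. After any repair, the file should still be verified with your usual certificate tooling before the web servers are restarted.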
This outage made us realize that we do not have a reliable path for errors to reach the person who can fix them, so we are changing our outage detection and alerting flow to contact the correct team members. We also did not reach out to affected users directly until after the error was fixed. We will be working on a process to post an alert that we can link users to when they contact us while we are still working on a fix.
We will make sure an error like this cannot cause this amount of downtime again. Our goal is that any outage that takes down a server, although rare, lasts no longer than 1 hour. Uptime is a priority for us, but things do fail, and this is an example of what happens when things fail at the wrong time.
Again, we want to apologize to all users affected by this outage on N6. We will work to improve how we handle outages going forward.