Google has published a post-mortem of an incident in February in which Google App Engine went down for more than two hours.
All Google App Engine applications were "degraded" from 7:48am to 10:09am PST on 24 February after a power failure at the company's main data centre, the firm said.
About 25 percent of the servers failed within five minutes owing to a delay in back-up power generation. Questions from users began appearing on Google's message boards almost immediately.
"By this time, our primary on-call engineer had determined that App Engine is down," the report said.
"The on-call engineer, according to procedure, paged our product managers and engineering leads to handle communicating the outage to users. A few minutes later, the first post from the App Engine team about this outage is made on the external group."
There was confusion about the instructions for switching to a back-up data centre, and the decision-maker for the crossover could not be found. The team then received data suggesting that the data centre was recovering and that a changeover was not necessary.
However, the data turned out to be inaccurate, and this extended the outage considerably. By the time the move to the back-up servers had been made, App Engine had been down for more than two hours.
The report found that Google had not developed plans for a partial data centre failure, nor a way of determining whether the data centre could continue running on such a reduced server count.
The company will now hold regular failure drills covering a wider spectrum of possible situations, along with a bi-monthly audit of all operations documents.
Google claimed that, with the new procedures, a similar failure today would cause a service slowdown lasting a maximum of 20 minutes rather than a complete outage.