A maintenance bug knocked Facebook offline for six hours, the company says
October 5 (Reuters) - A bug during routine maintenance on Facebook's data center network caused the collapse of its global system for more than six hours on Monday, creating a spate of problems that delayed repairs, the company said on Tuesday.
The outage was the largest that Downdetector, a web monitoring company, had ever seen. It blocked access to apps for billions of users of Facebook (FB.O), Instagram and WhatsApp, further intensifying the week-long scrutiny of the nearly $1 trillion company.
At a US Senate hearing on Tuesday, a former employee turned whistleblower accused Facebook of putting profits above people's safety, an accusation the company denies.
In a blog post, Facebook Vice President of Engineering Santosh Janardhan said the company's engineers issued a command that inadvertently disconnected Facebook's data centers from the rest of the world.
Facebook’s systems are designed to check commands to avoid mistakes, but the audit tool had a bug and couldn’t stop the command that caused the outage, the company said.
The outage was not caused by malicious activity, he added.
While users lost access to WhatsApp, one of the world's most popular messaging apps with more than 2 billion users, employees were also locked out of internal tools.
The outage also knocked out the tools that engineers would normally use to investigate and repair such failures, making the task even more difficult, Facebook said.
The company said it sent a team of engineers to its data center locations to try to debug and reboot the systems.
However, because of the sites' high level of physical and system security, it took the company additional time to get engineers access to work on the servers.
Even after network connectivity to the data centers was restored, Facebook was concerned that a surge in traffic could cause its websites and apps to crash.
However, since the company had conducted exercises to prepare for such situations, access to its services returned relatively quickly.
“Every mistake like this is an opportunity to learn and get better,” wrote Janardhan. “From now on it is our job to … make sure that such events happen as infrequently as possible.”
Reporting by Sheila Dang in Dallas; Editing by Sonya Hepinstall, Grant McCool and Richard Pullin