10 Causes of Data Center Downtime, From Human Error to Fire

While the severity of data center outages is declining, the cost of outages continues to rise. Power failures are “the number one cause of critical site outages.” Network failures and IT system problems also cause data center outages, and human error is often a contributing factor.

The Uptime Institute pointed to this problem in its recent data center outage report, looking at the type and frequency of outages and analyzing the damage in terms of cost and impact.

Unreliable data continues to be a problem

Uptime said outage data should be treated with skepticism, given the lack of transparency on the part of companies that experience outages and the low quality of reporting mechanisms. “Outage information is opaque and unreliable,” Uptime Director of Research Andy Lawrence said during a briefing on Uptime’s Annual Outages Analysis 2023.

“We have to rely on our own means and methods to get the data,” says Lawrence. “For various reasons, some companies do not want to share the details of an outage. So sometimes there is a very detailed root cause analysis, but sometimes there is almost no information.”

Uptime collects its data from three main sources: Uptime’s Abnormal Incident Report (AIR) database, its own surveys, and public reports. Public reports include news articles, social media, outage trackers, and corporate statements. Accuracy varies by source. For example, public reports may lack detail, or the sources providing the information may be unreliable. Uptime considers the data from its own surveys to be fair and of high quality because the respondents are anonymous and hold a variety of roles. AIR is also considered to be of very high quality, as it consists of detailed facility-level data voluntarily shared within the industry by data center owners and operators.

Outage rates are slightly declining

Uptime says the downtime rate has been gradually dropping in recent years.

This does not mean that the total number of outages is declining. In fact, the number of global outages is increasing each year as the data center industry expands. “While this number gives the illusion of higher outage rates versus IT load, the opposite is true,” Uptime reports.

Uptime tracked data center managers and operators through four self-reported surveys conducted from 2020 to 2022, and confirmed that overall, outage rates by site were steadily declining. In the 2022 survey, 60% of respondents said they had experienced downtime in the past three years, down from 69% in 2021 and 78% in 2020.
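The size of that decline can be checked with simple arithmetic. The short Python sketch below uses only the three survey figures quoted above; the function and variable names are illustrative assumptions, not part of Uptime's methodology.

```python
# Share of respondents (in percent) who reported an outage in the past
# three years, keyed by survey year. Figures are those quoted in the article.
outage_share = {2020: 78, 2021: 69, 2022: 60}


def year_over_year_change(series):
    """Return the change in percentage points between consecutive survey years."""
    years = sorted(series)
    return {
        later: series[later] - series[earlier]
        for earlier, later in zip(years, years[1:])
    }


changes = year_over_year_change(outage_share)

# Relative decline from the 2020 survey to the 2022 survey, in percent.
relative_drop = (outage_share[2020] - outage_share[2022]) / outage_share[2020] * 100

print(changes)                  # change in percentage points per survey year
print(round(relative_drop, 1))  # relative decline, 2020 to 2022
```

In other words, the share fell by nine percentage points in each of the two later surveys, a relative decline of roughly 23% from 2020 to 2022.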

“The downtime rate seems to be improving little by little,” says Lawrence.

Outage severity is declining

Although 60% of data center sites have experienced an outage in the past three years, only a small percentage of those outages are classified as severe or very severe.

Uptime measures the severity of an outage on a scale of 1-5, with 5 being the most severe. Level 1 outages are negligible and do not cause service disruption. Level 5 mission-critical disruptions are significant service and/or operations disruptions that cause real harm, often resulting in significant financial losses, safety issues, regulatory violations, lost customers, and reputational damage.

Historically, level 4 and 5 (severe and very severe) outages have accounted for about 20% of all outages. In 2022, the share of outages in the severe and very severe categories fell to 14%.
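To make the metric concrete, the following Python sketch shows how such a share could be computed from a list of outages rated on the 1-5 scale. The sample data is entirely hypothetical and chosen only so that the result matches the 14% figure; it is not Uptime's data, and the function name is an assumption made for illustration.

```python
# Levels 4 and 5 are the ones the article counts as severe or very severe.
SEVERE_THRESHOLD = 4


def severe_share(levels):
    """Return the percentage of outages rated at or above SEVERE_THRESHOLD."""
    if not levels:
        return 0.0
    for level in levels:
        if not 1 <= level <= 5:
            raise ValueError(f"severity level out of range: {level}")
    severe = sum(1 for level in levels if level >= SEVERE_THRESHOLD)
    return 100 * severe / len(levels)


# Hypothetical sample of 100 outages (NOT real survey data):
# 30 at level 1, 36 at level 2, 20 at level 3, 9 at level 4, 5 at level 5.
sample_levels = [1] * 30 + [2] * 36 + [3] * 20 + [4] * 9 + [5] * 5

share = severe_share(sample_levels)
print(share)  # share of level 4 and 5 outages in the hypothetical sample
```

With this made-up sample, 14 of the 100 outages fall in levels 4 and 5, which corresponds to the 14% share reported for 2022.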

Uptime’s chief technology officer, Chris Brown, said that a major reason for the reduction in catastrophic outages is the increased ability of data center operators to respond to unforeseen incidents. “There have been fewer cases leading to disruptions,” he said.

Brown said today’s systems are built with redundancy, and operators are trained to create systems that can respond to abnormal situations and avoid outages.

Financial damage is increasing

The cost of outages is rising, and the trend is likely to continue as reliance on digital services grows.

Uptime survey data over the past four years shows an increasing percentage of large outages with direct or indirect costs of $100,000 or more. The percentage of outages with a recovery cost of less than $100,000 was 60% in 2019, but only 39% in 2022.

Also in 2022, 25% of respondents said their most recent disruption cost them $1 million or more, and 45% said their most recent disruption cost them between $100,000 and $1 million.

Brown said one of the reasons is that inflation has driven up the cost of replacement equipment and labor.

More important is the degree to which companies rely on digital services to run their business. Loss of core IT services can lead directly to business disruption and lost revenue. “Outages like this, especially in the severe or very severe categories, can affect many organizations and many people,” Brown said. “The cost of resolving outages is also increasing.”

Most major outages are caused by third parties

As more workloads are outsourced to external service providers, the reliability of third-party digital infrastructure companies becomes increasingly important from an enterprise customer perspective, with the majority of publicly disclosed outages involving these providers.

Of all public outage incidents tracked since 2016, 66% occurred at third-party commercial IT and data center operators, including cloud providers, digital service providers and telecom providers. The percentage is increasing year by year. The percentage of outages caused by cloud, colocation, telecom and hosting companies increased from 70% in 2021 to 81% in 2022.

“Companies need to do their due diligence when outsourcing IT services,” says Brown. “This due diligence should continue even after the deal is signed.”

Human error is a factor that can be dealt with relatively easily

Uptime’s estimates, based on 25 years of data, suggest that human error is rarely the sole or root cause of outages but plays a part in 66% to 80% of all outages. But as Uptime admits, analyzing human error is tricky. Problems such as inadequate training, operator fatigue, and lack of resources are difficult to pinpoint.

Uptime found that human error-related outages are mostly caused by staff not following procedures (selected by 47% of respondents) or by problems with the procedures themselves (40%). Other common reasons include in-service issues (27%), installation issues (20%), staff shortages (14%), preventative maintenance frequency issues (12%), and data center design issues or omissions (12%).

On the bright side, investing in quality training and management processes can go a long way in reducing outages without significant outlay.

“You can solve this problem without going to the bank and raising a lot of money,” says Brown. “You have to create procedures, test them, make sure they’re correct, train employees to follow them, and then supervise that they’re actually following them.”

“Human error is involved in a lot of things, so addressing this part is an easy way to avoid outages,” says Lawrence.

Power issues remain the bane of data center reliability

Uptime’s survey found that, as in previous years, on-site power issues remain the number one cause of severe site outages. Although most power-related outages have multiple causes and the quality of reports about them is uneven, power remains the main cause of outages.

In the 2022 survey, 44% of respondents said power was the main cause of their most recent significant incident or outage. Power was also the number one cause of severe outages in the 2021 (43%) and 2020 (37%) surveys.

Uptime also cited network issues, IT system failures, and cooling failures as other causes of the problem.

Network complexity leads to more outages

Uptime looked at network outage trends using its own data from the 2023 Uptime Resiliency Survey. 44% of survey respondents said they had experienced a major outage caused by network or connectivity issues in the past three years. 45% of respondents said no, and 12% said they did not know.

The two most common causes of networking and connectivity-related outages are configuration or change management issues (45% of respondents) and failures of third-party network providers (39%).

Uptime attributes this trend to the complexity of today’s networks. “In today’s dynamically shifting, software-defined environment, programs to manage and optimize networks are constantly tweaked and reconfigured,” said Uptime. “Errors are inevitable in this process, and small errors that often occur in complex, high-throughput environments can propagate throughout the network, leading to cascades of failures that are difficult to contain, diagnose, and correct.”

Other common causes of major network-related outages include:
  • Hardware failure: 37%
  • Dropped lines: 27%
  • Firmware/software errors: 23%
  • Cyberattacks: 14%
  • Network/congestion disruption: 12%
  • Weather-related accidents: 7%
  • Corrupted firewall/routing table issues: 6%

Common causes of IT system and software outages

In its resiliency survey, Uptime asked respondents whether they had experienced a major outage due to an IT system or software failure in the past three years. 36% of respondents said yes, 50% said no, and 15% said they did not know. The most common causes of outages related to IT systems and software are:
  • Configuration/change management issues: 64%
  • Firmware/software failures: 40%
  • Hardware failure: 36%
  • Capacity/congestion issues: 22%
  • Data sync/corruption: 14%
  • Cyberattacks/security issues: 10%

Fires are rare but cause great damage

Publicly documented outages include those reported in the media. The causes revealed here are diverse and may differ from the causes reported by data center operators and IT teams, because media sources’ knowledge and understanding of outages are based on their own perspectives. “What’s interesting is the diversity of causes,” says Lawrence. “This may be partly a reflection of public and media perceptions of the cause.”

Among the publicly reported causes of outages, fire is a cause that does not rank highly in IT-related sources. Uptime confirmed that 7% of publicly reported data center outages were caused by fires. In a web briefing, the Uptime research team explained that data center fires are related to the increased use of lithium-ion batteries.

Compared to lead-acid batteries, lithium-ion batteries take up less space, are easier to maintain, and have a longer lifespan, but present a higher fire risk. “We’re looking at it as a lithium-ion battery fire,” Lawrence said, referring to a major fire that broke out on March 28, 2023 at the Maxnod data center in France. Lithium-ion batteries were also said to be the cause of a major fire that occurred on October 15, 2022 at a colocation facility in South Korea owned by SK Group and operated by its affiliate SK C&C.

Lawrence said, “Fire is a constant item in the survey.”

Source: ITWorld Korea by www.itworld.co.kr.
