Column | Digital Twins and Chaos Engineering: How to Respond to Data Center Failures

Over the past year, digital transformation has accelerated at a tremendous pace and remote work has become widespread, placing unprecedented demands on IT infrastructure. This has highlighted the importance of maintaining server uptime, which is key to preventing infrastructure outages and the financial losses that follow.

Digital twins and chaos engineering can help address this challenge. With these techniques, data center managers can prepare for incident scenarios and test changes ahead of time to reduce the impact of an outage.


According to the ‘Uptime Institute Global Survey of Data Center Managers 2021’, published by Uptime Institute, a digital infrastructure consulting firm, 47% of outages cost between $100,000 and $1 million to recover from. This is a significant increase from 40% in 2020 and 28% in 2019.

To prevent outages and minimize their damage, data center managers need to monitor potential risks more closely and focus on making their data centers more resilient. The goal should go beyond limiting the damage once downtime occurs: it should be possible to spot the warning signs in advance and prevent the outage altogether.

Testing Infrastructure Changes on Digital Twins


One reliable way to reduce the risk of outages is to create a digital twin of your data center. A digital twin uses Computational Fluid Dynamics (CFD) to simulate airflow through the facility, virtually recreating the physical site and exposing the thermal issues that could cause future downtime.

This technology allows managers to test and evaluate the impact of a change on the digital model before applying it to the physical facility. As a result, potential problems can be identified before they occur, reducing the likelihood of an outage.

Chaos Engineering with Digital Twins
Chaos engineering is a practice introduced by Netflix in 2011, as the company was moving its infrastructure to AWS. The idea is to ‘break things on purpose’ in order to test how stable a system remains in the face of unexpected failures. At the application layer, for example, experiments include intentionally failing servers and clusters, dropping packets, or filling up a hard drive. According to chaos engineering provider Gremlin, the practice can be used to improve system availability and reduce the mean time to repair (MTTR) of server incidents.
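
As a rough illustration of the two ideas in this paragraph, the following Python sketch injects a failure into a toy cluster and computes MTTR from a list of incidents. It is a minimal sketch under stated assumptions, not how Netflix or Gremlin implement their tools: `ToyCluster`, `run_experiment`, `mean_time_to_repair`, the server names, and the incident timestamps are all hypothetical and exist only for this example.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical in-memory stand-in for a group of redundant servers."""
    healthy: set = field(default_factory=lambda: {"srv-a", "srv-b", "srv-c"})

    def kill(self, name: str) -> None:
        # Inject the fault: mark one server as failed.
        self.healthy.discard(name)

    def serves_traffic(self) -> bool:
        # Assumption: the toy service stays up while any server is healthy.
        return len(self.healthy) > 0


def run_experiment(cluster: ToyCluster, rng: random.Random) -> tuple[str, bool]:
    """Kill one randomly chosen healthy server; report whether the service survived."""
    victim = rng.choice(sorted(cluster.healthy))
    cluster.kill(victim)
    return victim, cluster.serves_traffic()


def mean_time_to_repair(incidents: list[tuple[float, float]]) -> float:
    """Average of (recovered_at - failed_at), in the time unit of the input."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [recovered - failed for failed, recovered in incidents]
    return sum(durations) / len(durations)


if __name__ == "__main__":
    victim, survived = run_experiment(ToyCluster(), random.Random(0))
    print(f"killed {victim}; service still up: {survived}")
    # Made-up (failed_at, recovered_at) pairs in minutes, for illustration only.
    print("MTTR in minutes:", mean_time_to_repair([(0, 30), (100, 145), (200, 215)]))
```

A real experiment would target actual infrastructure and limit the blast radius; the point here is only the shape of the loop: inject a fault, observe whether the service holds, and track how long recovery takes.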

In 2015, Netflix conducted an experiment called ‘Chaos Kong’, which completely shut down the servers in one AWS region, and analyzed the aftermath. The company explained that as long as the trend of the global metric (shown at the top of its chart) does not change significantly, the system can be considered highly resilient. ⓒNetflix Engineering Blog
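
The resilience criterion described here, a global metric that does not move significantly during the experiment, can be sketched as a simple steady-state check. This is an illustrative assumption about how such a comparison might be coded, not Netflix's actual analysis: the function name `steady_state_holds`, the 5% tolerance, and the sample values are all hypothetical.

```python
from statistics import mean


def steady_state_holds(baseline, during, tolerance=0.05):
    """Return True if the mean of the metric during the experiment stays
    within `tolerance` (as a fraction) of the mean baseline value."""
    base = mean(baseline)
    return abs(mean(during) - base) <= tolerance * base


if __name__ == "__main__":
    # Made-up samples of a global metric, e.g. successful requests per second.
    normal_week = [1000, 1010, 990, 1005]
    region_down = [995, 1002, 985, 1001]    # traffic rerouted successfully
    region_down_bad = [700, 650, 720, 690]  # failover did not absorb the load

    print(steady_state_holds(normal_week, region_down))      # True
    print(steady_state_holds(normal_week, region_down_bad))  # False
```

Comparing means is the simplest possible choice; a production check would more likely compare against the expected trend for that time of day and week.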

However, it is difficult to conduct chaos engineering experiments in a physical data center, and many physical scenarios simply cannot be tested. Do you want to know how your facility would respond to a complete failure of the cooling system? Or what happens if a local cooling unit fails just as a rack running the passive side of an active-passive redundant application ramps up? And would you run that test under extreme conditions, such as the hottest day of the year? Demanding as they are, these extremes are exactly what cause unplanned outages. This is where digital twins can shine.

Digital twin software allows a data center to be modeled and simulated in any configuration in order to test for unexpected problems. Even catastrophic events such as cooling or airflow failures or malfunctioning circuit breakers can be tested, showing how the system responds to these extreme conditions.

This allows administrators not only to discover and fix vulnerabilities, but also to determine the amount of time needed to resolve issues in a disaster situation. The digital twin safely simulates the resilience of a facility and provides an environment in which to strengthen incident response capabilities, helping administrators take countermeasures that avoid the worst case of prolonged server downtime.
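
To give a feel for the kind of question a digital twin answers after a cooling failure, here is a deliberately crude Python sketch that estimates how long room air takes to reach a temperature limit. It is not CFD and not how any digital twin product works: it assumes a single well-mixed volume of air, ignores the heat absorbed by equipment and the building (so it tends to underestimate the available time), and uses hypothetical names and example numbers throughout, including `seconds_until_limit`, the room size, the load, and the 35 °C limit.

```python
from __future__ import annotations

AIR_DENSITY = 1.2         # kg/m^3, approximate value at room conditions
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K), approximate value for dry air


def seconds_until_limit(
    it_load_w: float,
    remaining_cooling_w: float,
    room_volume_m3: float,
    start_c: float,
    limit_c: float,
) -> float | None:
    """Lumped estimate of the time for room air to heat from start_c to limit_c.

    Returns None when the remaining cooling covers the IT load, because the
    limit is then never reached in this simplified model.
    """
    if start_c >= limit_c:
        return 0.0  # already at or above the limit
    net_heat_w = it_load_w - remaining_cooling_w
    if net_heat_w <= 0:
        return None
    # Heat capacity of the air in the room, in joules per kelvin.
    heat_capacity = room_volume_m3 * AIR_DENSITY * AIR_SPECIFIC_HEAT
    return heat_capacity * (limit_c - start_c) / net_heat_w


if __name__ == "__main__":
    # Hypothetical room: 500 m^3 of air, 100 kW of IT load, 24 C start, 35 C limit.
    total_loss = seconds_until_limit(100_000, 0, 500, 24, 35)
    half_loss = seconds_until_limit(100_000, 50_000, 500, 24, 35)
    print(f"all cooling lost:  about {total_loss:.0f} s")
    print(f"half cooling lost: about {half_loss:.0f} s")
```

Even this toy model shows why such scenarios are rehearsed virtually: with air as the only thermal buffer, the margin is on the order of a minute or two, and a full simulation is needed to see where in the room the limit is reached first.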

Preparing for the Future
Operational outages can cause disastrous damage. One example is the Microsoft Azure outage at the end of last year. A problem with the cooling system brought servers down in the UK, with a severe impact on the UK government’s COVID-19 information portal and the many other services that depend on Azure.

It is clear that outage prevention must remain a top priority. The good news is that digital twins offer a countermeasure. A good digital twin helps administrators prepare for failure scenarios as well as implement changes safely. As a result, systems can be kept running and continue to provide reliable service to customers even when demand soars.

*Dave King is Product Manager at Future Facilities and has over 15 years of experience in data center simulation. He helps data center managers get the most out of their facilities by drawing on his knowledge of data center cooling technologies and thermal performance.

Source: ITWorld Korea
