How to Overcome Data Center Failures

January 3, 2022

Unplanned data center outages are far more common than they should be. A data center failure is both disruptive and expensive for the operators running the facility.

A recent survey from Uptime Institute estimated that one in six data centers that suffered a major outage event incurred costs of over $1 million. Additionally, 48% of data center outages cost operators between $100,000 and $1 million. Many operators do not record the smaller data center failures they experience, and many even admit that they’d encounter fewer of these incidents with improved infrastructure resiliency.

Most data center outages can be prevented by keeping up with maintenance and following the correct procedures for daily operations. By focusing on the common reasons failures occur, data center managers can reduce the chance of major outages, saving time and money.

7 Common Reasons Data Centers Fail

Understanding common data center failure scenarios is the first step to saving your data center from disastrous outages. Frequent incidents include:

  1. Insufficient backup power: The most common reason data centers fail is power loss. Power outages can happen at any time. Due to this possibility, data centers typically have additional power sources in case their primary one is interrupted. The most commonly used backup power sources are generators and batteries. However, issues arise when operators do not run power failure tests or replace batteries often enough. Without taking the necessary preventative steps, your backup power may not be available when you need it.
  2. Too many changes and updates at once: Administrators can find it tempting to pack as many changes as possible into a maintenance window. However, when too many tasks are scheduled for a short period, administrators may rush through them to make up for the lack of time, which leads to avoidable errors. Implementing too many changes at once also makes it hard to tell which change caused a problem, so troubleshooting becomes much more difficult.
  3. Changes outside of maintenance windows: There may be a time when a minor change request comes in and you feel it can be easily made outside the formal data center change process. More often than not, it can be. However, sometimes a small modification can have a huge effect that could turn catastrophic for the rest of the data center. Not following update protocol can lead to unexpected outages and a substantial loss of money for a data center.
  4. Hoarding of old hardware: All hardware will likely fail at some point, and the longer you keep older equipment, the more likely it is to fail. Even so, critical data center applications still go down because they run on out-of-date systems. Administrators must stay informed of updates and improvements in technology to avoid relying on old systems.
  5. Wet fire suppression systems: Data centers’ most important equipment can be severely damaged by water. Because of this, most data centers use non-water fire suppression systems. Non-water fire suppression systems prevent equipment damage if the fire system is triggered. Although this safe solution exists, many older data centers still use wet fire suppression systems, which puts their equipment at risk of damage and major outages.
  6. Cooling failures: Because data centers generate an incredible amount of heat, effective cooling solutions are vital to preventing equipment from overheating or suffering from shortened life spans. If your cooling solutions don’t work as intended, your data center may experience erratic temperatures — it could be freezing one minute and sizzling the next. Failing to implement backup cooling procedures and properly maintain the ones you currently have can cause your data center’s productivity to take a hit.
  7. Cybersecurity threats: Cyberthreats, including phishing and ransomware attacks, are among the most dangerous causes of data center downtime. Cyber attackers can exploit the weaknesses within your organization and get access to your sensitive data, exposing vital information and endangering your business.
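The cooling failures in item 6 are usually caught by monitoring temperature readings. The following is a minimal sketch of that idea, not a production monitoring tool. The function name `check_cooling` and the threshold values are assumptions made for illustration; set real limits from your equipment vendor's guidance and your facility's design.

```python
from statistics import pstdev

# Assumed thresholds for illustration only; tune them for your own facility.
MIN_INLET_C = 18.0   # hypothetical lower bound for server inlet temperature
MAX_INLET_C = 27.0   # hypothetical upper bound for server inlet temperature
MAX_STDDEV_C = 2.0   # hypothetical limit on temperature swing within one window


def check_cooling(readings_c):
    """Return a list of warnings for one window of inlet temperatures (Celsius)."""
    if not readings_c:
        return ["no readings received"]
    warnings = []
    if min(readings_c) < MIN_INLET_C:
        warnings.append("temperature below minimum")
    if max(readings_c) > MAX_INLET_C:
        warnings.append("temperature above maximum")
    # A large spread within one window suggests erratic cooling behavior.
    if pstdev(readings_c) > MAX_STDDEV_C:
        warnings.append("erratic temperature swings")
    return warnings
```

A window of steady readings such as `[22.0, 22.5, 21.8]` produces no warnings, while a window that swings between 17 and 29 degrees triggers all three.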

Ways to Overcome These Data Center Failures

You don’t have to accept data center and network outages as regular occurrences in your facility. With proper management and the preventative measures below, you can significantly reduce outages and maximize productivity:

  1. Minimize human error: Human error accounts for about 22% of unplanned outages. Lack of experience can cause major problems in day-to-day data center operations. Get ahead of this by conducting regular training and certification programs for data center staff to ensure your team is up to date on best practices. Doing so enhances their skills and provides a path for career advancement. Another way to control human error is to provide and document step-by-step directions on completing complex tasks. With clear guidelines, your team can provide a more consistent quality of work.
  2. Prepare your data center for severe weather: Natural disasters are unavoidable, but taking the appropriate preventative measures will minimize the potential impact of an outage. Ensure your facility has a severe weather contingency plan and test your backup power supplies regularly to make sure they will work when you need them.
  3. Prevent equipment failure: Perform regular inspections on your hardware to ensure it’s in good working condition. Replace outdated equipment with newer, more efficient machines. One faulty machine can be a single point of failure, with consequences for the entire facility if it is not addressed appropriately.
  4. Invest in an uninterruptible power supply (UPS): A UPS can keep your data center running through a power interruption by providing surge-protected battery power until backup generators start or systems can be shut down safely. Additionally, inspect your UPS regularly for signs of failure or other issues, since 25% of data center downtime can be attributed to UPS failures.
  5. Consider colocating with a reputable data center: Colocation facilities are designed with redundant power capabilities and robust cooling systems. There are many benefits of colocating your servers and networking equipment in such a facility, including better uptime reliability, enhanced security and access to hybrid cloud services.
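Several of the measures above depend on tests and inspections happening on schedule. As a rough sketch, a small script can flag the tasks that are overdue. The task names and intervals below are hypothetical placeholders, not recommended values; use the schedule your vendors and site procedures call for.

```python
from datetime import date, timedelta

# Hypothetical task intervals in days; real intervals come from vendor guidance.
INTERVALS = {
    "generator_load_test": 30,
    "ups_inspection": 90,
    "battery_replacement": 365 * 4,
}


def overdue_tasks(last_done, today):
    """Return the names of tasks whose interval has elapsed, sorted alphabetically."""
    overdue = []
    for task, interval_days in INTERVALS.items():
        last = last_done.get(task)
        # A task with no recorded completion date is treated as overdue.
        if last is None or today - last > timedelta(days=interval_days):
            overdue.append(task)
    return sorted(overdue)


# Example: the UPS inspection is late and no battery replacement is on record.
history = {
    "generator_load_test": date(2021, 12, 20),
    "ups_inspection": date(2021, 9, 1),
}
print(overdue_tasks(history, date(2022, 1, 3)))
```

Treating a task with no record as overdue is a deliberate choice in this sketch: a missing log entry should prompt a check rather than be assumed fine.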

Optimize Your Data Center With DataSpan

DataSpan is a national technology solutions provider that helps customers accomplish more with fewer resources. We deliver specialized products and services for all your data center needs, from installation to maintenance. Let us become an extension of your data center staff and help you keep your facilities up and running. Find a rep in your state or contact us to learn more about how we can help.

Linked Sources:

 

  1. https://www.computerweekly.com/news/252486928/Costs-incurred-by-major-datacentre-outages-continue-to-rise-Uptime-Institute-research-shows
  2. https://www.cnet-training.com/news/human-errors-the-biggest-challenge-to-data-center-availability-and-how-we-can-mitigate-them-part-1/
  3. https://dataspan.com/data-center/cooling-solutions/
  4. https://dataspan.com/blog/preparing-data-centers-for-severe-weather/
  5. https://dataspan.com/cloud-co-location-solutions/
  6. https://dataspan.com/about/find-your-rep/
  7. https://dataspan.com/contact-us/