Data centers require power, and the vast majority of them use the local utility company as their primary power source. Inevitably, utility power will fail, which is why multiple utility feeds, Uninterruptible Power Supplies (UPSs), standby generators and other on-site redundancy systems are common components of a data center’s infrastructure. Many of these components are complex, and their emergency response functions sit idle during normal operations. A loss of power also often requires personnel to execute unique procedures.
Trusting that your infrastructure systems and emergency procedures will function correctly during a loss of utility power is vital. One of the surest ways to verify that they do is utility power interruption testing. Depending on the test design, shutting off power can confirm that the redundant systems work. However, that same outage carries a small risk of causing a power loss to connected critical devices, which could damage power supplies and associated hardware. It’s a Catch-22: you want to test the redundant systems that maintain critical device operation, but the testing itself puts critical device operation at risk. In my experience, it’s better to find out that systems or procedures don’t work as expected during a scheduled, controlled test than during a high-transaction period in the middle of the night.
One doesn’t casually decide to simulate a utility outage. A test of this magnitude should be well planned, scheduled during a lower-risk time period and have buy-in from all necessary stakeholders. Assistance from your data center engineering partner in creating a customized Method of Procedure for the test can save a lot of time and reduce the chance of error. All building and data center teams, as well as your infrastructure maintenance contractors, should be active participants. Depending on the design, the entire electrical system could be tested systematically, from the utility to the in-rack PDUs. Depending on your company’s safety policies, you can have the building facility personnel operate the equipment during testing so they can experience the sights, sounds and feel of the process.
The test should be scripted with details on who performs each step and when. Risks at each step should be documented and communicated prior to final approvals. In addition to the electrical systems, the test can evaluate how the cooling and monitoring systems respond during power events. Use the opportunity to determine how long it really takes for the fully loaded chiller or air-cooled cooling units to restart. Evaluate whether the Building Management Systems (BMS) need to be on UPS and whether the chilled water and condenser water pumps stay on. Test distribution equipment that might be inaccessible at other times. Have all monitoring and server teams watch for issues with the IT devices. They can quickly determine which devices were not fed by the correct power bus or which have bad power supplies.
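As a rough illustration of what a scripted test could look like in a form that is easy to review before approvals, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `TestStep` fields, the helper names and the three example steps are hypothetical and do not come from any real Method of Procedure.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class TestStep:
    number: int
    start: time   # planned start time of the step
    owner: str    # person or team performing the step
    action: str   # what is done
    risk: str     # documented risk; empty string means not yet documented


def steps_missing_details(steps):
    """Return numbers of steps that lack an owner or a documented risk."""
    return [s.number for s in steps if not s.owner.strip() or not s.risk.strip()]


def in_time_order(steps):
    """Check that the planned start times never go backwards."""
    return all(a.start <= b.start for a, b in zip(steps, steps[1:]))


# Hypothetical example steps, not taken from any real procedure.
script = [
    TestStep(1, time(2, 0), "Facility lead",
             "Confirm generator and UPS status", "None expected"),
    TestStep(2, time(2, 15), "Electrical contractor",
             "Open utility main breaker", "Critical load drops if UPS fails"),
    TestStep(3, time(2, 20), "", "Record chiller restart time", ""),
]

print(steps_missing_details(script))  # step 3 has no owner or risk yet
print(in_time_order(script))
```

A check like this does not replace review by the people involved. It only flags steps that are not yet ready to be communicated to stakeholders.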
Prior to the test, document your expectations, and afterwards compare them to what actually happened. If possible, don’t let the loss of a server or another unexpected result stop the test. Take notes and keep moving. Getting approval for this type of testing on a live data center is usually challenging, but keep the goal in mind. This test will verify that your team and facility are ready for the next utility interruption.
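The comparison of documented expectations against what actually happened can be kept very simple. The sketch below is one possible way to do it in Python. The function name `compare_results`, the item names and the recorded values are all hypothetical examples, not results from a real test.

```python
def compare_results(expected, observed):
    """Return {item: (expected, observed)} for every item that differs
    from the expectation or was never recorded during the test."""
    deviations = {}
    for item, want in expected.items():
        got = observed.get(item, "not recorded")
        if got != want:
            deviations[item] = (want, got)
    return deviations


# Hypothetical expectations written down before the test.
expected = {
    "generator": "started and carried load",
    "chilled water pumps": "stayed on",
    "BMS": "stayed online",
}

# Hypothetical notes taken during the test.
observed = {
    "generator": "started and carried load",
    "chilled water pumps": "stopped",
}

for item, (want, got) in compare_results(expected, observed).items():
    print(f"{item}: expected {want!r}, observed {got!r}")
```

Treating a missing note as a deviation is a deliberate choice here: an expectation that nobody checked is as much a finding as one that failed.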