The following excerpt about data center management is taken from Data Centers: Servers, Storage, and Voice Over IP.
Causes of data center downtime
About 80 percent of the unplanned downtime is caused by process or people issues, and 20 percent is caused by product issues. Solid processes must be in place throughout the IT infrastructure to avoid process-, people-, or product related outages. Planned or scheduled downtime is one of the biggest contributors (30 percent). It is also the easiest to reduce. It includes events that are preplanned by IT (system, database, and network) administrators and usually done at night. It could be just a proactive reboot. Other planned tasks that lead to host or application outage are scheduled activities such as application or operating system upgrades, adding patches, hardware changes, and so forth.
Most of these planned events can be performed without service interruption. Disks, fans, and power supplies in some servers and disk subsystems can be changed during normal run-time, without need for power-offs. Data volumes and files systems can be increased, decreased, or checked for problems while they are online. Applications can be upgraded while they are up. Some applications must be shut down before an upgrade or a configuration change.
Data center outages for planned activities can be avoided by having standby devices or servers in place. Server clustering and redundant devices and links help reduce service outages during planned maintenance. If the application is running in a cluster, it can be switched to another server in the cluster. After the application is upgraded, the application can be moved back. The only downtime is the time duration required to switch or failover services from one server to another. The same procedure can be used for host-related changes that require the host to be taken off-line. Apart from the failover duration, there is no other service outage.
Another major cause of data center downtime is people-related. It is caused by poor training, a rush to get things done, fatigue, lots of nonautomated tasks, or pressure to do several things at the same time. It could also be caused by lack of expertise, poor understanding of how systems or applications work, and poorly defined processes. You can reduce the likelihood of operator-induced data center outages by following properly documented procedures and best practices. Organization must have several, easy-to-understand how-tos for technical support groups and project managers. The documentation must be placed where it can be easily accessed, such as internal Web sites. It is important to spend time and money on employee training because in economically good times, talented employees are hard to recruit and harder to retain. For smooth continuity of expertise, it is necessary to recruit enough staff to cover emergencies and employee attrition and to avoid overdependence on one person.
Avoiding unplanned data center downtime takes more discipline than reducing planned downtime. One major contributor to unplanned downtime is software glitches. The Gartner Group estimates that U.S. companies suffer losses of up to $1 billion every year because of software failure. In another survey conducted by Ernst and Young, it was found that almost all the 310 surveyed companies had some kind of business disruption. About 30 percent of the disruptions caused losses of $100,000 or more each to the company.
When production systems fail, backups and business-continuance plans are immediately deployed and are every bit worth their weight, but the damage has already been done. Bug fixes are usually reactive to the outages they wreak. As operating systems and applications get more and more complex, they will have more bugs. On the other hand, software development and debugging techniques are getting more sophisticated. It will be interesting to see if the percentage of data center downtime attributed to software bugs increases or decreases in the future. It is best to stay informed of the latest developments and keep current on security, operating system, application, and other critical patches. Sign up for e-mail-based advisory bulletins from vendors whose products are critical to your business.
Environmental factors that can cause downtime are rare, but they happen. Power fails. Fires blaze. Floods gush. The ground below shakes. In 1998, the East Coast of the United States endured the worst hurricane season on record. At the same time, the Midwest was plagued with floods. Natural disasters occur mercurially all the time and adversely impact business operations. And, to add to all that, there are disasters caused by human beings, such as terrorist attacks.
The best protection is to have one or more remote, mirrored disaster recovery (DR) sites. In the past, a fully redundant system at a remote DR site was an expensive and daunting proposition. Nowadays, conditions have changed to make it very affordable:
- Hardware costs and system sizes have fallen dramatically.
- The Internet has come to provide a common network backbone.
- Operating procedures, technology, and products have made an off-site installation easy to manage remotely.
To protect against power blackouts, use uninterruptible power supplies (UPS). If Internet connection is critical, use two Internet access providers or at least separate, fully redundant links from the same provider.