Data center management: Servers, storage, and voice over IP

The following excerpt about data center management is taken from Administering Data Centers: Servers, Storage, and Voice over IP.

Causes of data center downtime

About 80 percent of unplanned downtime is caused by process or people issues, and 20 percent is caused by product issues. Solid processes must be in place throughout the IT infrastructure to avoid process-, people-, or product-related outages. Planned or scheduled downtime is one of the biggest contributors (30 percent). It is also the easiest to reduce. It includes events that are preplanned by IT (system, database, and network) administrators and are usually done at night. It can be as simple as a proactive reboot. Other planned tasks that lead to host or application outages are scheduled activities such as application or operating system upgrades, patch installation, hardware changes, and so forth.

Data center management tales from the tech turf

In 1995, a building in downtown Oklahoma City was destroyed in a terrorist attack. Many offices and data centers lost servers and valuable data. One law firm had no off-site backup of its data and lost its records. The firm was in the business of managing public relations for its clients. It was unable to restore any data pertaining to its customers. Sadly enough, it went out of business within three months. Another firm, a commodity trader, had a remote mirror site just outside the city and was able to bring up its applications on standby servers at the remote site. It quickly transferred its operations and business to the secondary site.

Most of these planned events can be performed without service interruption. Disks, fans, and power supplies in some servers and disk subsystems can be changed during normal run-time, without the need for power-offs. Data volumes and file systems can be increased, decreased, or checked for problems while they are online. Many applications can be upgraded while they are up, although some must be shut down before an upgrade or a configuration change.

Data center outages for planned activities can be avoided by having standby devices or servers in place. Server clustering and redundant devices and links help reduce service outages during planned maintenance. If the application is running in a cluster, it can be switched to another server in the cluster. After the application is upgraded, it can be moved back. The only downtime is the time required to switch, or fail over, services from one server to another. The same procedure can be used for host-related changes that require the host to be taken off-line. Apart from the failover duration, there is no other service outage.
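The switch-over procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Cluster` class, the node names, and the 30-second failover time are hypothetical and do not correspond to any particular clustering product.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical two-node cluster; names and timings are illustrative."""
    nodes: list
    active: str
    failover_seconds: float = 30.0  # assumed time to switch services
    downtime_seconds: float = 0.0
    versions: dict = field(default_factory=dict)

    def switch_to(self, node):
        """Move the application to another node; only this step costs downtime."""
        if node not in self.nodes:
            raise ValueError(f"unknown node: {node}")
        if node != self.active:
            self.active = node
            self.downtime_seconds += self.failover_seconds

    def upgrade(self, node, version):
        """Upgrade a node only while it is not serving the application."""
        if node == self.active:
            raise RuntimeError("refusing to upgrade the active node")
        self.versions[node] = version


def planned_upgrade(cluster, version):
    """Upgrade the primary node using a switch-over and a switch-back."""
    primary = cluster.active
    standby = next(n for n in cluster.nodes if n != primary)
    cluster.switch_to(standby)          # first failover
    cluster.upgrade(primary, version)   # service keeps running on the standby
    cluster.switch_to(primary)          # move the application back
    return cluster.downtime_seconds
```

With `Cluster(nodes=["node-a", "node-b"], active="node-a")`, calling `planned_upgrade(cluster, "2.0")` accumulates downtime only for the two switches, which matches the point made above: the upgrade itself adds no outage beyond the failover time.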

Another major cause of data center downtime is people-related. It is caused by poor training, a rush to get things done, fatigue, too many nonautomated tasks, or pressure to do several things at the same time. It can also be caused by lack of expertise, poor understanding of how systems or applications work, and poorly defined processes. You can reduce the likelihood of operator-induced data center outages by following properly documented procedures and best practices. Organizations must have several easy-to-understand how-tos for technical support groups and project managers. The documentation must be placed where it can be easily accessed, such as on internal Web sites. It is important to spend time and money on employee training because in economically good times, talented employees are hard to recruit and harder to retain. For smooth continuity of expertise, it is necessary to recruit enough staff to cover emergencies and employee attrition and to avoid overdependence on one person.

Data center management: On-call woes

One organization I worked at rebooted its UNIX servers every Sunday morning at 4 a.m. to clear memory, swap, and process tables. The practice was meant to avoid application-related problems on systems with high uptime. It was initially implemented because of a Solaris problem on Sun servers that had not been rebooted in the previous 350 days and were running an old release of Oracle Database. Of course, sometimes the boxes would not boot up all the way and the NOC had to call someone at a weird hour. Later, the reboot time was moved to 6 a.m.

Avoiding unplanned data center downtime takes more discipline than reducing planned downtime. One major contributor to unplanned downtime is software glitches. The Gartner Group estimates that U.S. companies suffer losses of up to $1 billion every year because of software failure. A survey conducted by Ernst and Young found that almost all of the 310 companies surveyed had experienced some kind of business disruption. About 30 percent of the disruptions caused losses of $100,000 or more each to the company.

When production systems fail, backups and business-continuance plans are immediately deployed and are worth every bit of their cost, but the damage has already been done. Bug fixes usually arrive only in reaction to the outages that the bugs cause. As operating systems and applications get more and more complex, they will have more bugs. On the other hand, software development and debugging techniques are getting more sophisticated. It will be interesting to see if the percentage of data center downtime attributed to software bugs increases or decreases in the future. It is best to stay informed of the latest developments and keep current on security, operating system, application, and other critical patches. Sign up for e-mail-based advisory bulletins from vendors whose products are critical to your business.

More info on this book about data center management

Administering Data Centers: Servers, Storage, and Voice over IP
By Kailash Jayaswal
Published by John Wiley & Sons
ISBN: 0-471-77183-X
632 pages; November 2005

Environmental factors that can cause downtime are rare, but they happen. Power fails. Fires blaze. Floods gush. The ground below shakes. In 1998, the East Coast of the United States endured the worst hurricane season on record. At the same time, the Midwest was plagued with floods. Natural disasters strike unpredictably and adversely impact business operations. And, to add to all that, there are disasters caused by human beings, such as terrorist attacks.

The best protection is to have one or more remote, mirrored disaster recovery (DR) sites. In the past, a fully redundant system at a remote DR site was an expensive and daunting proposition. Nowadays, conditions have changed to make it very affordable:

  • Hardware costs and system sizes have fallen dramatically.
  • The Internet has come to provide a common network backbone.
  • Operating procedures, technology, and products have made an off-site installation easy to manage remotely.

To protect against power blackouts, use uninterruptible power supplies (UPS). If an Internet connection is critical, use two Internet access providers or at least separate, fully redundant links from the same provider.
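The idea of redundant links can be illustrated with a minimal Python sketch. It assumes the links are listed in order of preference and that the caller supplies a health check; the function name `pick_link`, the link labels, and the check itself are hypothetical, and in practice this decision is usually made by routing equipment rather than by a script.

```python
def pick_link(links, is_up):
    """Return the first link that passes the health check, or None.

    `links` is ordered by preference; `is_up` is a caller-supplied
    function (an assumption here) that reports whether a link works.
    """
    for link in links:
        if is_up(link):
            return link
    return None
```

For example, with `links = ["provider-a", "provider-b"]` and a check that reports `provider-a` as down, the function returns `"provider-b"`. If every link fails the check, it returns `None`, which is the case that two independent providers are meant to make unlikely.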

 


This was first published in December 2005
