There is an industry phobia about the "single point of failure." It is largely assumed that redundant systems are more reliable than non-redundant systems. It is, of course, inarguable that given a long enough time line, a truly redundant/fault-tolerant system will be more reliable than a non-redundant/non-fault-tolerant one. However, a fully redundant/fault-tolerant system can cost an order of magnitude more than one which is not, and it may not deliver the reliability you seek.
The purpose of a redundant/fault-tolerant system is to improve reliability, which, of course, it does. It should be noted, however, that a good, high-quality system (quality components, good cooling, etc.) can offer practical reliability approaching that of a redundant system. Equally important, operating systems like Linux, FreeBSD, and Solaris (on good quality hardware) have shown continuous operation times measured in years. In the real world, you'll probably upgrade the system before you see a failure. In practical terms, the expected reliability can be quite high.
When we speak of reliability in the context of a web service infrastructure, we are really trying to quantify the risk, impact, and condition of failure which affects "availability." The risk of failure of any system is more than merely the product of its parts. It must include random chance and events out of our control. It must also be evaluated in a larger context which includes the impact of failure, i.e. what happens when it does fail? Lastly, what are the conditions of a possible failure? Can they be prioritized?
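As a rough, purely illustrative sketch of that point, the following compares the naive "product of its parts" model with the same system once an allowance for external events is included. All of the numbers are invented for the example.

    # Hypothetical illustration: the naive "product of its parts" model
    # versus the same system with an allowance for external events.
    # All reliability figures below are invented for this example.

    components = {
        "power_supply": 0.999,
        "disk_array":   0.998,
        "network_card": 0.9995,
        "os_software":  0.995,
    }

    # Naive series model: the system works only if every part works.
    naive = 1.0
    for reliability in components.values():
        naive *= reliability

    # Add a guessed probability of an external event (backhoe, electrician,
    # weather, ISP outage) that no amount of internal redundancy prevents.
    p_external = 0.01
    with_external = naive * (1.0 - p_external)

    print(f"Naive 'product of parts' reliability: {naive:.4f}")
    print(f"With external events factored in:     {with_external:.4f}")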
A while back, U.S. administrators were defending the cost of components in a troop transport aircraft. One item was the airplane's coffee maker, which cost thousands of dollars. Why? Because it was part of the aircraft and had to operate under all the conditions of every other component of the plane. In the words of one of the engineers, "The plane could be crashing to the ground with the entire crew dead, and this would still make a good cup of coffee."
The blind bureaucracy failed to evaluate the impact of failure, which is minimal: no coffee. Furthermore, they failed to understand the conditions under which the system would fail. If the plane is crashing or the crew is in peril, it hardly matters whether the coffee maker works.
Small and medium-sized companies, or the veritable shoestring "internet startup," often waste valuable resources creating and deploying systems so reliable and expensive that the cost and maintenance of the system becomes one of the biggest risks of failure to the business. If the company goes out of business because it runs out of money before it becomes profitable, does it matter if the system it built has five nines reliability?
Furthermore, it is impossible to guarantee that something is 100%, or even 99.999%, reliable; there are always events that cannot be predicted. A 99.999% guarantee does not mean you will not have failures. It only means that someone has done the math on a more realistic estimate of the probability of failure and figured out a pricing strategy that allows profitability when the inevitable failures do happen. It's nothing more than an insurance policy.
One of the often overlooked issues with the five nines guarantee is the "planned outage," which doesn't count. So, although someone has guaranteed that you'll have no more than 0.001% downtime a year, you'll still have downtime; you'll just know about it in advance. Is that really five nines? Will your users know (or care about) the difference?
In the late 1990s I was working at an internet search company that had built a fully redundant site. It was great: you could kill a bunch of systems, no one would ever notice, and no data would be lost. It was in a well known commercial data center with full battery backup, generators, and redundant connectivity. Nothing was left to chance. Well, one day the site stopped responding. It turned out that an electrician had casually turned off the main power switch to the cage.
The point is that while reliability is obviously very important, it cannot be guaranteed to 99.999% in any real sense. There are always factors out of your control. Trying to achieve a five nines system is an admirable goal, but for a small or medium-sized company it is probably a waste of time and money. You may be able to do it in the abstract, but a backhoe can dig into the ground and cut your wires, an electrician can pull the plug by accident, a hurricane can flood the generator room with water, or your ISP may plan an outage.
Even if your system somehow achieves five nines, the weather is good, the electricians are locked out of the building, there are no construction crews in the neighborhood, and your ISP hasn't planned any downtime, your users are still connected to the internet through DSL, cable, or the telephone. On any given day, a measurable portion of your users, or potential users, will be unable to reach your site. That is just a fact of life on the internet.
This raises the really important question: how much downtime can you risk? Five nines (99.999%) comes out to about five minutes a year, not including planned outages. How about five minutes a month? How about 15 minutes every three months? How much "unplanned" downtime can you accept? From a business perspective, this is the most important question that must be answered before you design your data center. Even though it's "IT," you still have to do a proper cost/benefit analysis.
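To make the arithmetic concrete, here is a small sketch that converts an availability percentage into a downtime budget. It is a back-of-the-envelope calculation, not a planning tool.

    # Convert an availability percentage into an allowed downtime budget.

    MINUTES_PER_YEAR = 365.25 * 24 * 60

    def downtime_budget(availability_percent):
        """Minutes of downtime per year allowed at a given availability."""
        return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)

    for nines in (99.0, 99.9, 99.99, 99.999):
        per_year = downtime_budget(nines)
        print(f"{nines:7.3f}% -> {per_year:8.1f} minutes/year "
              f"({per_year / 12:7.2f} minutes/month)")

    # 99.999% works out to roughly 5.3 minutes a year, which is where the
    # "about five minutes" figure above comes from.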
The only real way to never(1) experience an outage is to have multiple geographically separate and independent sites, located in different weather systems, on different power grids, and using different internet backbones. Anything less faces diminishing returns on investment, as the odds of something out of your control happening become greater than the odds of the failures for which you are planning.
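For what it's worth, the math behind independent sites is simple: the service is unreachable only when every site is down at once, so, assuming the failures really are independent, the combined unavailability is the product of each site's unavailability. A small sketch, with availability figures invented for illustration:

    # Combined availability of independent sites: the service is down only
    # when every site is down at the same time. Figures are illustrative.

    def combined_availability(site_availabilities):
        p_all_down = 1.0
        for a in site_availabilities:
            p_all_down *= (1.0 - a)
        return 1.0 - p_all_down

    # Two ordinary "three nines" sites, assumed to fail independently.
    sites = [0.999, 0.999]
    print(f"Combined availability: {combined_availability(sites):.6f}")
    # -> 0.999999, roughly six nines, *if* the failures are truly independent.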
Short of building a private internet empire, your best bet for a single cost-effective data center is to focus on OS and software stability, hardware maintenance, backups, and reasonable redundancy. Over any reasonable period of time, the probability that a stable system built with quality components will fail is quite low. Where redundancy is difficult, a hot spare strategy with periodic replication can limit downtime.
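As one possible shape of the hot spare idea, here is a minimal sketch that pushes data to a standby machine on a schedule using rsync over SSH. The host name, paths, and interval are hypothetical, and a real setup would also need monitoring and a failover procedure.

    # Minimal sketch of periodic replication to a hot spare.
    # Assumes rsync and SSH access to a standby host; the host name, paths,
    # and interval below are hypothetical placeholders.

    import subprocess
    import time

    SOURCE_DIR = "/var/www/"              # data to protect (hypothetical)
    SPARE_HOST = "spare.example.com"      # standby machine (hypothetical)
    DEST_DIR = "/var/www/"                # path on the spare (hypothetical)
    INTERVAL_SECONDS = 15 * 60            # replicate every 15 minutes

    def replicate():
        """Push a copy of SOURCE_DIR to the hot spare."""
        result = subprocess.run(
            ["rsync", "-az", "--delete", SOURCE_DIR,
             f"{SPARE_HOST}:{DEST_DIR}"],
            capture_output=True, text=True)
        if result.returncode != 0:
            # In practice you would alert someone here; data newer than the
            # last successful run is what you stand to lose on failover.
            print("replication failed:", result.stderr)

    if __name__ == "__main__":
        while True:
            replicate()
            time.sleep(INTERVAL_SECONDS)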
You will always, even with five nines, have to accept some risk of downtime (planned or unplanned). There is no magic bullet. However, with careful planning, it is possible to simplify and reduce the cost of your architecture without any practical impact on availability.
(1) never: It should be clear that never is not possible. Even Microsoft's domain name expired once and their site was temporarily unreachable for several hours.
By Mark L. Woodward
February 27, 2007
Copyright © Mohawk Software 2001, 2010