High availability. What are the “five nines” and how to achieve them?


High availability is usually expressed as a number. We’ve absorbed enough marketing to think that an availability of 99% is extraordinarily high. Only a fraction of clients realize that an availability of 98-99% is rather poor, and sometimes outright unacceptable.

Take a look at these numbers, and you’ll see the difference between a 90% availability and a 99.999% availability:

Availability    Downtime/month    Downtime/year
90%             3 days            37 days
98%             14.6 hours        7.3 days
99%             7.3 hours         3.7 days
99.8%           1.5 hours         18 hours
99.9%           44 minutes        8.8 hours
99.99%          4.4 minutes       53 minutes
99.999%         26 seconds        5.3 minutes
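
The figures in the table follow from simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the length of the period. Here is a minimal sketch that reproduces them, assuming a 365-day year and an average month of one twelfth of a year (the function and constant names are illustrative, not taken from any SLA):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12  # average month, about 30.4 days


def allowed_downtime_seconds(availability_percent, period_seconds):
    """Return the downtime budget in seconds for a given availability."""
    return (1 - availability_percent / 100) * period_seconds


for pct in (90, 98, 99, 99.8, 99.9, 99.99, 99.999):
    per_month = allowed_downtime_seconds(pct, SECONDS_PER_MONTH)
    per_year = allowed_downtime_seconds(pct, SECONDS_PER_YEAR)
    # Print both budgets in hours so the rows are easy to compare.
    print(f"{pct}%: {per_month / 3600:.3f} h/month, {per_year / 3600:.3f} h/year")
```

Small differences from the table come down to rounding and to how long a "month" is taken to be.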

The table shows that a data center guaranteeing 99% availability is allowed more than 7 hours of downtime a month. Imagine this: the data center is under maintenance for an entire working day, your site is unavailable, and you are taking losses; yet you cannot file a complaint, because the data center is holding up its end of the bargain.

I find network availability of 99% to be poor. I prefer data centers that provide at least 99.9% uptime.

Perhaps there are internet projects that can survive even 37 days of downtime per year (more than a month!). Most online stores, portals and sites (especially those that handle transactions internally) can’t afford even 18 hours of downtime per year. A damaged reputation is always hard to restore, and if the cause of the damage was simply your admin’s off day, that’s just unfortunate.

“Five nines” — that’s what high availability is all about.

The term “five nines” means an availability of 99.999% and can be encountered in marketing literature about as often as in technical materials. A system with a “five nines” level of availability is considered a high-uptime system.
It follows from the table that 99.999% availability means a mere 5.3 minutes of downtime per year, but even data centers that guarantee 100% availability often rely on marketing tricks.

For example, scheduled maintenance is often not taken into account when calculating availability. A data center may promise 99.99% availability, but while planned maintenance is being carried out you’ll see a message similar to “maintenance for 2 hours”, and this won’t be counted as downtime. So it’s essential that you read the SLA (Service-Level Agreement) closely.
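
To see how much this exclusion can matter, here is a rough sketch comparing the availability a provider might report with the availability a visitor actually experiences. The downtime figures are hypothetical, chosen only for illustration: unplanned outages that use exactly the yearly 99.99% budget, plus one 2-hour maintenance window per month.

```python
HOURS_PER_YEAR = 365 * 24


def availability_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Availability in percent for a given amount of downtime in the period."""
    return 100 * (1 - downtime_hours / period_hours)


unplanned_hours = 0.876  # hypothetical: 0.01% of a year of unplanned outages
planned_hours = 12 * 2   # hypothetical: one 2-hour maintenance window a month

# What the provider reports if planned maintenance is excluded by the SLA.
reported = availability_percent(unplanned_hours)
# What a visitor sees: the site is down in both cases.
experienced = availability_percent(unplanned_hours + planned_hours)

print(f"reported:    {reported:.3f}%")
print(f"experienced: {experienced:.3f}%")
```

Under these assumptions the reported figure stays at 99.99%, while the experienced one drops to roughly 99.7%.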

If you want to ensure maximum uptime for your site on a single server, choose a data center with a high SLA-guaranteed availability.

Important! The SLA should guarantee hardware replacement time and, ideally, response time as well.

In addition, your admin must monitor all service-related activity and quickly react to downtime.
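
As a minimal sketch of what such monitoring could look like, the snippet below probes a URL and computes the availability observed over a series of checks. It uses only the Python standard library; the function names, the timeout and the example URL are assumptions made for illustration, and a real setup would more likely rely on a dedicated monitoring tool with alerting.

```python
import urllib.request


def check_once(url, timeout=5.0):
    """Return True if the URL answers with a 2xx or 3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError (4xx/5xx) and timeouts are OSError subclasses.
        return False


def measured_availability(results):
    """Percentage of successful checks in a list of booleans."""
    if not results:
        raise ValueError("no checks recorded")
    return 100 * sum(results) / len(results)
```

Running `check_once("https://example.com/")` from a scheduler once a minute and passing the collected results to `measured_availability` gives a rough figure to compare against what the SLA promises. Ideally the probe runs from outside the data center, so that network outages are seen as well.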

A few words on what high availability consists of:

The two kinds of availability are network and service availability.
Network availability is when your server is accessible through a network.
Service availability is when your server is able to serve clients.

Service availability cannot exceed network availability unless you use alternative connections (that have independent network availability).
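
This limit can be illustrated with the usual series/parallel model, under the simplifying assumption that failures are independent: components that must all work multiply their availabilities, while redundant components fail only when all of them fail at once. The numbers below are hypothetical.

```python
def in_series(*availabilities):
    """All components must be up: multiply the availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def in_parallel(*availabilities):
    """One working component is enough: down only if all are down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1 - a
    return 1 - all_down


network = 0.999   # hypothetical availability of one uplink
server = 0.9995   # hypothetical availability of the server itself

single_link = in_series(network, server)
dual_link = in_series(in_parallel(network, network), server)

print(f"one uplink:  {single_link:.6f}")
print(f"two uplinks: {dual_link:.6f}")
```

With one uplink the service ends up below the network's own 99.9%; with two independent uplinks it can exceed that figure, but it is then capped by the server.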

Service availability depends on:

  • your server’s network availability
  • your admin’s reaction time
  • response time of the data center support team
  • hardware replacement speed at the data center

Downtime is caused by:

  • network problems
  • hardware issues
  • server load problems (can’t process the information fast enough)
  • software problems (bugs introduced by programmers)

A monthly availability of 99.8% (barring hardware issues), and even more so a yearly one, can be achieved in a good data center on a single server without going out of your way to provide fault tolerance. Reaching 99.9% takes some luck.
If you require an availability of 99.8% or higher, you should invest in fault tolerance, and that means running multiple servers. But that’s another story.
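
As a rough back-of-the-envelope sketch of why multiple servers help: if each server is available independently with probability a, then n redundant servers are all down at once with probability (1 - a)^n. The function below (an illustrative name) estimates how many servers that idealized model calls for. It assumes independent failures and instant, flawless failover, which real systems rarely achieve, so treat the result as a lower bound.

```python
import math


def servers_needed(single_availability, target_availability):
    """Smallest n such that 1 - (1 - a)**n reaches the target (idealized)."""
    if not 0 < single_availability < 1 or not 0 < target_availability < 1:
        raise ValueError("availabilities must be strictly between 0 and 1")
    if single_availability >= target_availability:
        return 1
    ratio = math.log(1 - target_availability) / math.log(1 - single_availability)
    return math.ceil(ratio)


# Single servers at 99.8% each, aiming for "five nines" overall.
print(servers_needed(0.998, 0.99999))
```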