Redundancy in the Data Center

Redundancy Overview & Importance

It's funny how the concept of redundancy can be perceived in completely different ways — on the one hand, it's often associated with surplus, excess, or even losing one's job. However, in other scenarios, redundancy in the form of spares or backups is considered beneficial and necessary, much like in the case of a spare tire or a reserve parachute (um, yes, I think I'd like to have a little redundancy there).

In the engineering and IT sphere, it's common practice to plan for extra components, paths, and processes that may not be strictly necessary but are crucial in case of failures. But knowing what we should do and putting it into practice are often very different things. We've seen all kinds of environments, and in many instances, a more robust redundancy plan would have made things easier and significantly reduced downtime.

With this in mind, we put together some recommendations for making a redundancy plan that makes sense for your environment, budget, and needs. No need to live dangerously.

Meme showing Austin Powers. Text reads I see you don't have enough redundancy in your data center. I also like to live dangerously.

Data Center Redundancy Recommendations

Every environment will have its own recipe for the ideal amount of redundancy based on its size, how critical the devices are, the budget, and more. But having a single point of failure in any part of the data center is a recipe for a headache. While the basic formula of N+1 (the N components you need to run, plus at least one spare) applies to pretty much every area of the data center, there are some nuances as to why and how to implement it in each category.

Redundancy in Power Supply

Flames engulf outlet and cord in smoke and blaze.

There's a good reason most devices have spots for two or more power supplies — if you lose power, you lose access to the device. That can have far-reaching consequences — if a power supply failure occurs on a production device, it can bring the systems and processes to a screeching halt. And if you don't have locally stocked spares on hand, systems will be down while waiting for the fix.

While it's generally a pretty basic fix, power supply issues are one of the most common failures in the data center.

What You Can Do

If your device has slots for multiple power supplies, don't hesitate to use them! If possible, take it a step further and run each of those cables to a different power source. That way, if you have a failure either with the cord or at the outlet, you are covered.

You might also consider keeping spare power supplies on hand — they're relatively easy and inexpensive to source and stock.

Tip: Don’t plug everything into the same power strip plugged into the same outlet! You are asking for trouble!
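The advice about separate power sources can be turned into a quick inventory check. The sketch below is only an illustration: the device names, the feed labels (`pdu-a`, `pdu-b`), and the `find_power_risks` helper are all hypothetical, and it assumes you keep a simple record of which feed each power supply is plugged into.

```python
def find_power_risks(inventory):
    """Return {device: reason} for devices that lack power redundancy.

    `inventory` maps a device name to the list of power feeds its
    supplies are plugged into (one entry per installed power supply).
    """
    risks = {}
    for device, feeds in inventory.items():
        if len(feeds) < 2:
            risks[device] = "only one power supply installed"
        elif len(set(feeds)) < 2:
            risks[device] = "all power supplies share one feed"
    return risks


# Hypothetical inventory, for illustration only.
inventory = {
    "core-switch-1": ["pdu-a", "pdu-b"],   # two supplies, two feeds
    "san-array-1": ["pdu-a", "pdu-a"],     # two supplies, same feed
    "backup-server": ["pdu-b"],            # single supply
}

risks = find_power_risks(inventory)
for device, reason in sorted(risks.items()):
    print(f"{device}: {reason}")
```

Anything the check flags is a candidate for a second supply or a cable moved to a different feed.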

Networking Redundancy

Aerial view traffic junction with highways, streets, overpasses and freeway

Just like other areas of the data center, redundancy in networking hardware and connections is key to smooth operation. Having additional data paths and multiple switches in a stack is like having alternate routes during rush hour traffic.

For instance, traffic would be a nightmare if I-5 (for those of us on the West Coast of the US) were to be closed during peak times! Having another way around would be crucial to traffic flow. Similarly, having another data path provides an additional outlet to move the data, allowing operations to continue.

It's the difference between dealing with a failure while everything is still operational versus panicking and rushing to fix things because the whole system is down. Having the network go down could mean business comes to a screeching halt until it's fixed.

What You Can Do

If possible, have multiple switches in a stack so that if one of them fails, another one will pick up the slack. It's also a good idea to have dual data paths for the same reason — the alternate route for the data will keep connectivity in the event of a failure. Even with a stack of multiple switches, it's still a good idea to stock spares of the replaceable components of the switches (if applicable).
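One rough way to see whether a second switch or data path actually removes the single point of failure is to model the network as a list of links and test what happens when each device is taken away. This is a minimal sketch, not a monitoring tool: the device names and the two helper functions are made up for illustration.

```python
from collections import deque


def reachable(links, start, goal, removed=None):
    """Breadth-first search: can traffic get from start to goal
    when the `removed` device is out of service?"""
    graph = {}
    for a, b in links:
        if removed in (a, b):
            continue  # links touching the failed device are unusable
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(links, start, goal):
    """List the intermediate devices whose loss cuts start off from goal."""
    nodes = {n for link in links for n in link} - {start, goal}
    return sorted(n for n in nodes
                  if not reachable(links, start, goal, removed=n))


# Hypothetical topologies: one switch versus two switches with dual paths.
single_path = [("server", "switch-1"), ("switch-1", "router")]
dual_path = single_path + [("server", "switch-2"), ("switch-2", "router")]

print(single_points_of_failure(single_path, "server", "router"))
print(single_points_of_failure(dual_path, "server", "router"))
```

With one switch, that switch shows up as a single point of failure. With the second switch and path added, the list comes back empty.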

Tips: Perform periodic updates and reboots. Environments vary, but most network equipment benefits from a once-a-year check. Make sure it is up to date with the most current versions.

Keep a copy of your configuration in case all of your redundancy fails and you have to replace the switch. It will prevent chaos! Extra points if it is in a text doc that is easy to access.

Storage & Data Redundancy

Hot air balloon has collapsed but man in basket has a backup balloon

Most storage devices have a certain amount of built-in redundancy, and you can customize the amount in various elements, such as power supplies, controllers, and hard drives.

The purpose of redundancy in storage devices is much like in the rest of the data center — keeping the system operational despite a failure. Hard drive redundancy (we'll get more into RAID groups in a minute), additional controllers, and dual power supplies all help, so that when a failure happens, the spare component steps in and keeps things running.

A common (and very handy) feature of storage devices is the ability to have hot spare hard drives. A hot spare is a drive that sits idle in the system and automatically takes over for a failed drive, so the RAID group can rebuild onto it without anyone having to swap hardware first. It adds another layer of protection and means that you may not even notice right away when a failure occurs. We've helped customers who live on the edge with no spares configured, and in many instances a hot spare would have benefited their environment!

Disk Redundancy: A Little About RAID Groups

RAID (Redundant Array of Independent Disks) groups work on the basic concept of safety in numbers. Disks in the group hold copies of, or parity information about, the data stored on the other disks, so the data can be recovered in the event of a failure.

There are different ways of setting up RAID groups based on your environment's capacity, budget, needs, and risk comfort levels. They use classifications like RAID 0, 1, 5, and 6.

Large school of fish - safety in numbers

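RAID 0 - striping. Data is split across all the disks in the group, which in theory improves performance, but there is no redundancy: if one drive fails, the data in the group is lost. (It is sometimes lumped in with JBOD, "just a bunch of disks," which also has no redundancy but does not stripe data.)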

RAID 1 - usually involves 2 disks mirroring each other. If one fails, the other kicks in. Mirroring can be extended to more disks, typically as mirrored pairs (often called RAID 10), which is why those groups use even numbers of disks (4, 6, etc.).

RAID 5 - is a group of disks with distributed parity, which means that parity information (used to rebuild missing data) is spread across all the disks in the group. If a drive fails, its data can be reconstructed from the remaining disks. One drive can fail without data loss. A minimum of three disks is needed for a RAID 5 group.

RAID 6 - works on the same idea as RAID 5 but adds a level of redundancy with dual parity so that you can lose 2 drives without data loss. It requires a minimum of four drives.
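To make the parity idea a little more concrete, here is a toy sketch of single parity using XOR, which is the basic operation behind RAID 5. It is a simplification under stated assumptions: real arrays work on stripes and rotate parity across the disks, and the second parity in RAID 6 uses a different calculation that is not shown here. The `xor_blocks` helper and the sample data are made up for illustration.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


# Three "disks" holding data, plus one parity block computed from them.
data_blocks = [b"DATA-ONE", b"DATA-TWO", b"DATA-3!!"]
parity = xor_blocks(data_blocks)

# Simulate losing the second disk, then rebuild it from the
# surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])  # True: the lost block is recovered
```

Losing a second block at the same time would leave too little information to rebuild either one, which is why single parity only covers one drive failure.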

What You Can Do

The more redundancy you implement in your RAID group, the more physical space will be used to store the same amount of data. Basically, more redundancy = more hard drive space. This, in turn, could impact the budget as the infrastructure needs, physical space, and energy consumption will increase without increasing the overall storage capacity.
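The trade-off is easy to put numbers on. The sketch below estimates usable capacity for the RAID levels described above. It assumes identical disks, treats RAID 1 as a plain mirror (every disk holds the same data), and ignores formatting overhead and hot spares. The function name and the minimum of two disks for RAID 0 and RAID 1 are our assumptions for the example.

```python
# Minimum disks per RAID level (3 for RAID 5 and 4 for RAID 6 as described
# above; 2 for RAID 0 and RAID 1 is an assumption for this sketch).
MIN_DISKS = {0: 2, 1: 2, 5: 3, 6: 4}


def usable_capacity_tb(level, disk_count, disk_size_tb):
    """Estimate usable capacity in TB for a group of identical disks."""
    if level not in MIN_DISKS:
        raise ValueError(f"unsupported RAID level: {level}")
    if disk_count < MIN_DISKS[level]:
        raise ValueError(
            f"RAID {level} needs at least {MIN_DISKS[level]} disks")
    if level == 0:
        data_disks = disk_count        # striping only, nothing held back
    elif level == 1:
        data_disks = 1                 # every disk mirrors the same data
    elif level == 5:
        data_disks = disk_count - 1    # one disk's worth of parity
    else:
        data_disks = disk_count - 2    # RAID 6: two disks' worth of parity
    return data_disks * disk_size_tb


for level in (0, 1, 5, 6):
    usable = usable_capacity_tb(level, disk_count=4, disk_size_tb=4)
    print(f"RAID {level}: {usable} TB usable of 16 TB raw")
```

For four 4 TB disks, that works out to 16, 4, 12, and 8 TB usable for RAID 0, 1, 5, and 6 respectively.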

However, too little redundancy can also be problematic. You might like to live dangerously like Austin Powers, but data loss can be a big pain. It's important to evaluate and make decisions based on your unique situation. A common practice is to use RAID 1 for the OS drives in servers and RAID 5 in storage devices.

Tips: Make sure that your multipathing is configured properly (if the cables are not plugged into the correct spots, the redundancy can fail).

Consider setting up a hot spare if you are able — it can save time and help things continue smoothly if you aren't able to deal with the failure right away.

Best Backups

No discussion about redundancy would be complete without mentioning the ever-controversial but oh-so-necessary topic of backups. Having a backup of your data can make the difference between an issue that's kind of a pain but fixable and an absolute headache and nightmare.

Car with flat tire and spare leaning on car

Multiple drives can and do fail, sometimes causing catastrophic data loss (especially if your redundancy plan is a little lacking or you are a bit behind on swapping out the failed drives that haven't caused problems yet). If the data has been backed up, you can restore what was lost.

A 5 or even 10-year-old backup is far from ideal. It's better than nothing, but not nearly as useful as one that is up to date.

The debate rages on about how often backups should be updated and how necessary they are for particular situations. Rather than outline specifics that will likely depend on your environment, staff resources, and more, we want to encourage our customers to err on the side of safety when it comes to backups.

What You Can Do

Because we often see equipment at its worst and in scenarios that could have been avoided, we advocate for updated and tested backups. We've seen them save the day, and we've seen the lack of usable backups cause major headaches.

Tips: Test your backups to make sure they work. Often, people will do regular backups but never test them to make sure they can be used for a restore. We've seen it happen where the backup was in the wrong format and was unusable. Yikes!
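One simple piece of a restore test is confirming that a restored file is byte-for-byte identical to what was backed up. The sketch below compares SHA-256 checksums using Python's standard `hashlib` module. The file names are hypothetical and the demo uses temporary files in place of real data. In practice you would compare against a checksum recorded at backup time, since the live data may have changed since then.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original, restored):
    """True if the restored file has the same contents as the original."""
    return sha256_of(original) == sha256_of(restored)


# Demo with temporary files in place of real data and restores.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original.db"
    good_restore = Path(tmp) / "good_restore.db"
    bad_restore = Path(tmp) / "bad_restore.db"
    original.write_bytes(b"customer records")
    good_restore.write_bytes(b"customer records")
    bad_restore.write_bytes(b"customer rec")  # truncated restore

    good_ok = restore_matches(original, good_restore)
    bad_ok = restore_matches(original, bad_restore)

print(f"good restore matches: {good_ok}")
print(f"bad restore matches: {bad_ok}")
```

A matching checksum only shows the bytes came back intact. It does not replace a full restore test where the application actually opens and uses the data.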

Perform disaster recovery testing and exercises (if you are an M Global client, let us know before you run these exercises so we can be ready in case of an issue).

The ultimate goal of redundancy in the data center is to eliminate single points of failure wherever they might be. A data center is an ecosystem that is often evolving, so evaluating your redundancy plan regularly is a great way to avoid kicking yourself with the would've, could've, should've regret that comes from catastrophic (but avoidable) failures.

Let M Global Help

We want you to consider us an extension of your team, a trusted resource and advisor. Call us today at 855-304-4600 to find out more.