Making the Grade: Guidelines to Ensure Solid Disaster Recovery with Testing
Assume recovery will fail. Use testing to reduce failure points and keep the timeline realistic.
December 28, 2020
By Dan Timko
While managed service providers and enterprise IT pros back up and restore data all the time, few have ever had to recover from a true disaster. Disaster recovery (DR) is a different, more complex beast than restoring an application or discrete data from backups. Because most teams have limited experience with these situations, taming DR requires regular and comprehensive testing. Unfortunately, testing, if it happens at all, is often partial and slipshod. A study from Spiceworks showed that nearly a quarter of organizations do no DR testing at all.
Without thorough testing, a service provider or enterprise can suffer serious damage to its business and brand should it fail to recover promptly from a massive cyberattack, fire or other disaster. Essentially, all DR efforts are wasted without regular testing. Without a restoration plan that has been tested and shown to work, applications may not restore to a running state, data could be infected with malware, and IT may restore applications in the wrong order, leaving dependencies unmet.
These aren’t problems you want to learn about as employees sit idly by and the C-suite anxiously paces.
Why It’s Best to Test
Assume recovery can fail. Why? Recovery can fail in many scenarios. For example, if backups are corrupted due to a storage issue, recovery is doomed, and that’s an issue testing would expose. But that’s just the start of what testing can unearth. Related shortfalls also create vulnerabilities, including the following:
Unrealistic recovery time objectives (RTOs): Setting RTOs is vital, but recovering a given server, application or environment can take longer than planned. Only testing will reveal what is realistic and prevent other stakeholders from making incorrect assumptions.
Recovery is more than restoring: Data and apps must be available to everyone. Recovery may require VPN clients on remote endpoints, site-to-site connections or virtual desktops, and testing is the only way to know. Until every user is connected, recovery isn’t finished.
Environments change faster than plans: Building a DR plan requires a lot of effort, and organizations often don’t update the plan after making changes to the environment. Testing brings those gaps out of the shadows and into the light.
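To make the point about unrealistic RTOs concrete, here is a minimal sketch of how recovery times measured in a test could be compared with the stated objectives. The application names, the target values and the `rto_misses` helper are all hypothetical, chosen only for illustration.

```python
from datetime import timedelta

# Hypothetical RTO targets per application (illustrative values only).
RTO_TARGETS = {
    "email": timedelta(hours=4),
    "erp": timedelta(hours=8),
    "file-share": timedelta(hours=24),
}


def rto_misses(measured, targets):
    """Compare measured recovery times from a DR test against RTO targets.

    Returns a dict mapping each problem application to its overrun
    (a timedelta), or to None if the application was never tested.
    """
    misses = {}
    for app, target in targets.items():
        actual = measured.get(app)
        if actual is None:
            misses[app] = None  # no test result: the RTO is unverified
        elif actual > target:
            misses[app] = actual - target  # how far past the objective
    return misses


# Hypothetical results from one test run.
measured_times = {
    "email": timedelta(hours=6),
    "erp": timedelta(hours=7),
}

for app, overrun in rto_misses(measured_times, RTO_TARGETS).items():
    status = "not tested" if overrun is None else f"missed RTO by {overrun}"
    print(f"{app}: {status}")
```

A report like this gives stakeholders measured numbers in place of assumptions, and it flags applications whose objectives have never been exercised at all.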
Guidelines for Solid DR Testing
Define what needs to be tested. IT should first evaluate what must be recovered, assess each application’s criticality to establish priorities, and identify interdependencies. More complex operational aspects may also need testing, including:
Dependencies: Ensure any app or service that’s depended upon by another is available. Then be sure anything involved with either side of the dependency is up to date.
Alternate sites: Some organizations plan to recover in the cloud; others maintain one or more contingency sites. Whatever the approach, each site must be tested, ideally more than once.
Failover: Recovery is about restoring operations, not just systems and data. Practicing the actual failover of some or all of the environment is imperative.
Employee remote connectivity: Again, recovery is useless if no one can connect from their remote locations, and with the pandemic creating a home-based workforce, this is essential.
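The dependency item above implies a recovery order: anything another application relies on must come up first. As a rough sketch, assuming the dependencies have been written down as a simple mapping, Python’s standard `graphlib` module (Python 3.9 and later) can derive one valid order. The service names and the `recovery_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical map: each service -> the services it depends on.
DEPENDS_ON = {
    "storage": set(),
    "directory": {"storage"},
    "database": {"storage"},
    "erp": {"database", "directory"},
    "vpn": {"directory"},
}


def recovery_order(depends_on):
    """Return a restore order in which every dependency precedes its dependents."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # A cycle means the documented dependencies cannot all be satisfied;
        # the plan needs fixing before any order can be produced.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


for step, service in enumerate(recovery_order(DEPENDS_ON), start=1):
    print(step, service)
```

The order produced is only as good as the dependency map behind it, which is another reason to rerun the exercise whenever the environment changes.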
Testing Models
There are three primary purposes when testing DR: Test the plan, test the people, test the technology. Each of the following models covers one or more of these elements:
Plan review: In this model, the DR team walks through the plan looking for missteps, unaddressed scenarios, missing details, etc. If you take this route, at least test backups periodically.
Tabletop run-through: Rather than perform recovery simulation offsite or in the cloud, some organizations do a sit-down, scenario-based walk-through. This involves discussing activities, bringing up potential issues and generally preparing the DR team for a true recovery situation.
Scenario simulation: Some environments are so complex that a full simulation would be too time-consuming. Instead, pick a test scenario (workload- or application-based, location-based, or one devised by management) so the DR team can work through the actual steps to bring those parts back.
Full simulation: In this case, the whole enchilada is at stake: the test evaluates an organization’s ability to recover an entire copy of operations. For smaller entities, a full simulation is about as complex as a scenario simulation is for larger organizations.
Making the Grade
When finances get tight, decision makers, and sometimes even IT leaders, will cut DR and count on the unlikelihood of a catastrophe working in their favor. It’s a classic case of being penny wise and pound foolish when you consider how even minor downtime can result in missed sales, unhappy customers and blemished reputations, all to save a few bucks in the short term.
MSPs and IT need to convince decision makers that disasters can and do happen. If they’re not swayed, remind them that at the start of 2020 no one predicted that a global workforce would go remote with only a few days’ notice. With this new paradigm, IT must re-evaluate and adjust its strategy so that business keeps going under any circumstance.
Testing is more important than ever. By testing regularly, you can ensure that, come what may, your DR efforts will make the grade.
Dan Timko serves as chief strategy officer for cloud backup at J2 Global, which includes J2’s KeepItSafe, OffsiteDataSync, Livedrive and LiveVault businesses. Prior to joining J2, Dan co-founded Cirrity, a BaaS and DRaaS provider that was sold to Green Cloud Technologies in 2017. Follow him on LinkedIn or @dan_timko or @j2global on Twitter.