How to Cost Optimize Cloud Disaster Recovery
Cost-optimizing one of the most expensive decisions you can make
Disaster recovery is a hot topic this year among many organizations. Everything from the past few years coalesced in 2024, and protecting against outages became a top concern.
As someone who talks with a variety of customers of differing sizes and needs on this topic, I can definitively state that DR is expensive no matter your size. That shouldn’t surprise anyone considering what it is supposed to do: recreate your environment and make it functional during an outage. However, can it be done more efficiently? Absolutely.
How fast that needs to happen drives costs to the extreme, so I’ll start with that and work my way through all the things to be considered when pricing out a DR scenario.
Perform a Business Impact Analysis
This might come across as “old-school,” but every business should perform a business impact analysis before creating a DR plan. This doesn’t change because you move to the cloud. You just have more tools to address it.
Why? DR fits under a broader umbrella called Business Continuity, which covers both Disasters and Incidents. A Business Continuity Plan (BCP) includes a disaster recovery plan (DRP) and an incident recovery plan (IRP).
What are the goals of a BIA?
Prioritize the criticality of each business process
Estimate the maximum tolerable outage (MTO) for each process
Identify resource requirements for each critical process
How do you perform a BIA? By first performing an asset inventory. A business should be able to state quantitatively what its assets are worth in order to apply the proper mitigation during an outage.
From this, you need to identify the possible threats, their likelihood, the annual rate of occurrence (ARO), the relative risk, and the acceptable interruption window, and then determine the appropriate response for each asset.
For each asset, calculate the Single Loss Expectancy (SLE): SLE = Asset Value (AV) x Exposure Factor (EF). If it were a total outage, the EF would be 100%.
You can then calculate the Annual Loss Expectancy (ALE) by multiplying the SLE by the Annual Rate of Occurrence (ARO). If an outage happens once a year, that’s the full SLE x 1.0 (ARO); once every 4 years would be an ARO of 0.25 by comparison. Your mitigation spend should then be capped at the ALE!
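To make that arithmetic concrete, here’s a minimal sketch of the SLE/ALE math in Python. The asset value, exposure factor, and ARO below are made-up illustration numbers, not figures from a real BIA.

# Hypothetical example numbers -- substitute your own BIA figures.
asset_value = 500_000      # AV: what the asset is worth to the business ($)
exposure_factor = 1.0      # EF: 1.0 = total outage, 0.5 = half the value lost
annual_rate = 0.25         # ARO: outage expected once every 4 years

sle = asset_value * exposure_factor   # Single Loss Expectancy ($ per event)
ale = sle * annual_rate               # Annual Loss Expectancy ($ per year)

print(f"SLE: ${sle:,.0f} per event")
print(f"ALE: ${ale:,.0f} per year")
print(f"Mitigation budget ceiling: ${ale:,.0f} per year")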
What comes out of a BIA? Recovery Time Objectives (RTOs). You then use these to create a priority order of assets to be restored and associate a cost with each. Quantifying the RTO is extremely important because it directly drives the mitigation measures: the shorter the RTO, the more expensive the mitigation.
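One way to turn those outputs into a priority order is to rank assets by hourly loss and note the worst-case exposure each RTO implies. A rough sketch, with invented asset names and figures:

# Invented assets and figures, purely for illustration.
assets = [
    {"name": "Order processing", "hourly_loss": 20_000, "rto_hours": 0.5},
    {"name": "Employee intranet", "hourly_loss": 5_000, "rto_hours": 8},
    {"name": "Archive storage", "hourly_loss": 100, "rto_hours": 48},
]

# Restore the most expensive-to-lose assets first, and note the worst-case
# loss each one's RTO implies.
for asset in sorted(assets, key=lambda a: a["hourly_loss"], reverse=True):
    worst_case = asset["hourly_loss"] * asset["rto_hours"]
    print(f'{asset["name"]:<18} ${asset["hourly_loss"]:>6,}/hr  '
          f'RTO {asset["rto_hours"]}h  worst-case loss ~${worst_case:,.0f}')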
The BIA is useful because it provides clarity. You can’t accurately put a dollar sign on what a mitigation should cost if you don’t understand the value of the asset it’s protecting, or the risks the mitigation is meant to protect against.
Frequently, I see this done in reverse. IT is told to create a DR plan not based on business needs, but simply based on technical requirements. “We need a 1 hour RTO for our documents that get accessed once every few months”—wait, what? That’s a totally inappropriate technical mitigation for the requirement.
This would be akin to hiring security guards to protect a bicycle while you’re at work. The guards probably cost more than the bike! Buy a lock instead. It’s cheaper and more likely to stop the most common threat—a common bike thief.
All this to say: the BIA is a useful tool for appropriately pricing out the assets to be protected and mitigated for. If there’s not at the very least an effort to appropriately value IT assets, the DR plan will be based entirely on theory, not actual numbers. And in my experience, that almost always ends up too expensive. IT leaders should be cognizant of this and push to get these numbers right the first time.
Mitigate Appropriately
Performing this step correctly is where many organizations stumble.
It’s tempting to lump everything under your IT umbrella into a single hard RTO. It makes planning simple, but it is disproportionately expensive most of the time, and not in the way you might think.
Unless you have an extremely small and purpose-built IT environment, it’s unlikely that the “one RTO to rule them all” approach is for you.
It’s far more likely you have a variety of home-grown and 3rd party applications spread across your IT environment, possibly hosted at various sites around the country and across different clouds. Your users are also likely geographically dispersed and will require access in a variety of ways.
Scenario
I’m going to make up a scenario here.
Let’s say that you work for an online retail organization that only serves US customers and is cloud-based in one region. There are roughly 10,000 visitors per hour to your website during business hours, of which about 5% convert, purchasing roughly $50 worth of goods each ($25k/hr). As an online retail store, you have a frontend web application that serves customers, along with a way for them to authenticate and access backend data like their customer account, shopping cart, inventory of available goods, etc.
You have internal services accessed by roughly 200 employees scattered throughout the USA, online hosted file servers, a few 3rd party applications for authentication (Okta) and productivity (O365), ServiceNow for ticketing, VPNs to connect to cloud resources, yada yada.
The cost of all that infrastructure monthly is about $50k/month, or about $69/hr.
Cost of downtime in this case would be Number of Employees (200) x Average Employee Wage ($50/hr) x A Productivity Impact of 80% (Estimated) = $8,000/hr.
All told, you have:
Critical Retail App Loss | $25k/hr (SLE)
Productivity Loss | $8k/hr (assuming 80% productivity impact)
Client Loss | 10k potential buyers/hr from 8AM-5PM EST
Infrastructure Costs | $69/hr
Total: ~$33,000/hr
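Working those scenario figures out in code (they’re the made-up numbers from above, so treat this as a sketch rather than a template):

# Scenario figures from above (all assumed for illustration).
visitors_per_hour = 10_000
conversion_rate = 0.05
average_order = 50            # $ spent per converted visitor
employees = 200
average_wage = 50             # $ per hour
productivity_impact = 0.80
infra_monthly = 50_000        # $ per month
hours_per_month = 730

retail_loss = visitors_per_hour * conversion_rate * average_order    # $25,000/hr
productivity_loss = employees * average_wage * productivity_impact   # $8,000/hr
infra_cost = infra_monthly / hours_per_month                         # ~$69/hr

hourly_cost = retail_loss + productivity_loss + infra_cost
print(f"Estimated downtime cost: ~${hourly_cost:,.0f}/hr")            # ~$33,000/hr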
What happens if there’s a regional outage in the main cloud provider? Regional outages often last a few hours, although they might only occur once a year at most.
Let’s assume a 4-hour outage that occurs once a year. That outage would cost your organization ~$132,000 (the ALE) at an absolute minimum. To make mitigation worthwhile, the infrastructure should cost less than ~$11k/month to maintain, or less than ~$132k annually.
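Annualizing it under the same assumptions:

# Assumed outage profile from the scenario: one 4-hour regional outage per year.
hourly_cost = 33_000          # rounded downtime cost from the sketch above ($/hr)
outage_hours = 4
outages_per_year = 1.0        # ARO

ale = hourly_cost * outage_hours * outages_per_year    # $132,000/yr
monthly_budget = ale / 12                              # $11,000/month

print(f"ALE: ${ale:,.0f}/yr -> keep mitigation under ${monthly_budget:,.0f}/month")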
Imagine you had gone ahead and applied a blanket 1-hour RTO without knowing how much revenue was brought in by the critical retail app alone. You’d effectively be accepting a $25k loss in that hour, significantly more than the cost of the infrastructure over the same period, which is all that would be visible to you as an IT leader.
In reality, that application needs a much more stringent RTO than any of the internal services. To avoid any possibility of going negative over a one-hour window, you’d need something like a 15-minute RTO for that critical application. Even if you lost a full hour of employee productivity, you’d still be in the green. Building whatever infrastructure is required to make that app available in another region makes total sense considering the alternative!
Does paying a lot more for infrastructure that supports your employees’ internal activities make sense when the downtime cost is roughly $8k/hr in lost productivity? Not really. For the relative amount of infrastructure needed to support it, you’d have to be very careful that the mitigation made monetary sense. You could afford a longer RTO for those internal services and lean more heavily on a backup/restore strategy, or a basic pilot light implementation instead.
This is why the BIA is so darn important. If you can’t quantify what your most critical applications cost, how can you possibly design a mitigation for them?
However, when you don’t have a killer money-making application like that one, the math changes significantly.
Imagine instead that you’re a healthcare organization, where an outage won’t hit some revenue-generating app but will directly impact the functionality of your hospitals. With very little revenue upside, it becomes a matter of deciding how deep a loss you’re willing to absorb and for how long. You have more expensive employees, more substantial infrastructure investments, and patients depending on having their records readily available.
A loss of an hour or longer in this case could be devastating in many ways. All the more reason to do that BIA and ensure that you’re focusing on recovering what needs to be recovered in a timely manner.
Cloud Benefits
Cloud DR has some serious benefits over traditional DR setups.
Benefits (Off the Top of My Head):
No physical hardware
No physical site
Simpler administration
Pay-as-you-go (PAYG) pricing
Rapid scalability
Backbone networking (private and lower latency)
A GUI that oversees multi-account, multi-region environments
Click-button style recovery
Managed DR services
Drills and testing as a feature
The biggest benefit to a cloud DRP is that it doesn’t require you to have hardware or a physical site. In traditional DRPs that was always a concern, and a big hill to get over in order to justify the plan.
With cloud, the ability to spin up services on demand and pay as you go gives you much more flexibility in how you approach less stringent RTO and RPO requirements. Cloud made the pilot light and backup/restore configurations much more viable. Also, having a GUI and maintaining continuous access over a multi-region environment is a huge advantage vs. requiring a team of people to spin up a new physical site.
Maximizing that flexibility is a cost optimization exercise in itself. I recommend getting as granular as possible: identify all the individual assets and which ones can be restored without a substantial infrastructure footprint in the DR region. For monolithic items, this is more straightforward. However, with some of the more complex and dependency-driven designs, it might not always be possible. Put those RTOs to use to identify what qualifies and what doesn’t.
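As a sketch of what that granular mapping might look like, here’s one way to bucket assets into the common cloud DR patterns by RTO. The thresholds and asset names are illustrative assumptions, not prescriptive guidance:

def dr_strategy(rto_hours: float) -> str:
    """Map a recovery time objective to a common cloud DR pattern.
    Thresholds are assumptions for illustration; tune them to your BIA."""
    if rto_hours <= 0.25:
        return "active/active (multi-region, always on)"
    if rto_hours <= 1:
        return "warm standby (scaled-down copy running in the DR region)"
    if rto_hours <= 12:
        return "pilot light (core data replicated, compute off until needed)"
    return "backup and restore (rebuild from backups on demand)"

# Hypothetical assets and RTOs taken from a BIA.
for name, rto in [("Retail web app", 0.25), ("Internal services", 8), ("Archive", 72)]:
    print(f"{name:<18} RTO {rto:>5}h -> {dr_strategy(rto)}")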
Identifying how you can best use cloud flexibility and scalability to cut costs as part of a wider DR plan is one of the best ways to make the investment more palatable.
The Hidden Costs
I’ll be honest: some of these aren’t really hidden, but it’s important to see them at a high level.
Professional Services
Earlier I talked about performing the mitigation. What if you can’t do that internally?
Now we’re talking about a pro services cost to do that. How much should an organization pay for this?
It’s important to answer one question first:
Do we lack the expertise, or do we lack the availability?
If you lack the expertise to build out a new DR site in the cloud, you should certainly consider paying up to what it would cost to hire employees or contractors to do it. Since it can’t be done internally, you’re handcuffed until you hire someone who can.
Building a DR site is usually a one-time deal, so if it cost you $50k-$100k in a single year and you lack the expertise in-house, wouldn’t it make sense to pay professional services to do that work?
On the other hand, if it’s just availability, you need to determine how big the DR site in question is, how much time you’d have to pull away from other projects and initiatives, and what all those internal resources would cost. Oftentimes it’s still advantageous to leverage services, but other times it’s not. Be scrupulous and math it out.
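A back-of-the-envelope way to math it out; every figure below is a placeholder you’d swap for your actual quote, hours, and rates:

# All figures are placeholders -- swap in your actual quote and internal rates.
pro_services_quote = 75_000       # one-time build cost quoted by the vendor ($)

internal_hours = 600              # estimated engineering hours to build it yourself
internal_rate = 85                # loaded hourly cost of your engineers ($/hr)
opportunity_cost = 15_000         # value of the projects those hours displace ($)

internal_cost = internal_hours * internal_rate + opportunity_cost

if pro_services_quote < internal_cost:
    print(f"Pro services (${pro_services_quote:,}) beats internal (${internal_cost:,})")
else:
    print(f"Internal build (${internal_cost:,}) beats pro services (${pro_services_quote:,})")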
Testing
Oh yes, this one often gets omitted from the planning. There are various levels to testing, and not all of them have to be expensive. However, every increase in confidence comes from doing a more involved DR exercise.
Tabletops have their place to refresh on processes, roles, and responsibilities, but there’s nothing quite like having a full-on rehearsal to get the point across.
You shouldn’t do a full-on DR exercise often. It’s typically a once-a-year event, although that’s of course decided by business needs. However, the cost for it is going to be high. It involves actually running a mock recovery where resources are spun up and replicated to prove functionality and to prove that the established RTOs and RPOs can be hit.
Cloud does make it easy to attribute costs to resources appropriately, but there are other costs associated with testing, like employee involvement and training, not to mention some potential impact on other priorities.
There are some handy tools to test and perform drills in the cloud in a more managed fashion; both third-party vendors and native solutions offer that capability. These are suitable for smaller-scale DR exercises and far easier to control, manage, and observe.
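Even a simple script that records drill results against your targets adds confidence. A minimal sketch with hypothetical targets and measurements:

# Hypothetical drill results: measured recovery time and data loss per asset.
targets = {"Retail web app": {"rto_min": 15, "rpo_min": 5},
           "Internal services": {"rto_min": 240, "rpo_min": 60}}

drill_results = {"Retail web app": {"recovery_min": 22, "data_loss_min": 3},
                 "Internal services": {"recovery_min": 180, "data_loss_min": 45}}

for asset, target in targets.items():
    result = drill_results[asset]
    rto_ok = result["recovery_min"] <= target["rto_min"]
    rpo_ok = result["data_loss_min"] <= target["rpo_min"]
    status = "PASS" if (rto_ok and rpo_ok) else "FAIL"
    print(f"{asset:<18} {status}  "
          f"recovery {result['recovery_min']}min (target {target['rto_min']}), "
          f"data loss {result['data_loss_min']}min (target {target['rpo_min']})")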
Infrastructure Maintenance
If you opt to maintain infrastructure in another region, and most DRPs require that, it becomes part of not only your monthly expenses but also your maintenance plan.
Taking care of infrastructure you don’t plan on using is always contentious, but it’s important to ensure that it will function as intended when required. Allocating time and resources for it might not be a significant lift at all for small pilot light implementations, but for bigger warm-standby and active/active configurations it will certainly require more attention and of course money.
Being upfront about that when taking on a new DR site helps to ensure everyone is informed and the business leadership isn’t surprised.
Multi-Cloud
I’m only going to touch upon this because I see some organizations pursuing it as a reasonable DRP in the event of a whole-cloud outage. What I’m talking about here is failing over from one cloud into another, not having two separate clouds that each fail over into other regions within the same cloud.
Multi-cloud DR is an expensive endeavor.
On top of requiring multi-cloud expertise, there will likely be cloud-agnostic tooling required, a higher complexity of management, licensing requirements, cloud-specific failback procedures, as well as many other items.
Do not go into multi-cloud DR without a clear understanding of what is required. It’s a huge investment against a risk whose likelihood of impact is so low that only the most demanding customers should even consider it.
Conclusion
I say all this because I have met a lot of customers confronting the DR monster without the appropriate numbers in hand. You can’t cost optimize it without those numbers up front. Do that BIA and it will make your life 10x simpler on the backend.
There are certainly cheaper ways to go about it as a whole, but as a long-term play, maintaining a DR environment and having a plan that will effectively do what it is supposed to when a disaster occurs is typically a big expense and should be treated as such. I promise the additional planning required is well worth the time investment!