The article "How to master disaster recovery in a DevOps world" by Jonathan Hassell raised a point about the idle capacity of DR infrastructure. It set me thinking over the weekend, and this article is the result.
DR (Disaster Recovery) and BCP (Business Continuity Planning) are a reality for companies of almost every size and deserve priority. Small and mid-sized companies in particular can avoid the trap of leaving their disaster recovery infrastructure unutilized, minimizing both Capex and Opex while still providing 24x7 uptime.
In a typical disaster recovery setup, the test and production environments are hosted on a primary site and the disaster recovery environment sits on a secondary site. The code base and database are delta-synced to the DR site at regular intervals to meet the agreed RPO (Recovery Point Objective), RTO (Recovery Time Objective) and RCO (Recovery Consistency Objective) targets. DR drills are scheduled frequently, so we typically run at least three environments in all, with one environment sitting non-operational while production is operational. A good starting point to learn more about DR is here.
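To make the delta-sync idea concrete, here is a minimal sketch, assuming a primary site that can reach the DR host over SSH and a PostgreSQL database; the host names, paths and `pg_dump` choice are my own placeholders rather than part of any particular setup, and how often the job runs should be derived from your agreed RPO.

```python
#!/usr/bin/env python3
"""Hypothetical delta-sync job: ships the code base and a DB dump to the DR site."""
import subprocess
from datetime import datetime, timezone

DR_HOST = "dr.example.internal"           # assumed DR site host
CODE_SRC = "/opt/app/"                    # assumed code-base location on the primary site
CODE_DST = f"{DR_HOST}:/opt/app/"         # mirrored location on the DR site
DB_DUMP = f"/backups/app-{datetime.now(timezone.utc):%Y%m%dT%H%M%SZ}.dump"

def sync_codebase() -> None:
    # rsync transfers only the delta since the last run, keeping the sync window short
    subprocess.run(["rsync", "-az", "--delete", CODE_SRC, CODE_DST], check=True)

def ship_db_dump() -> None:
    # A logical dump is the simplest option; streaming replication would tighten RPO further
    subprocess.run(["pg_dump", "-Fc", "-f", DB_DUMP, "appdb"], check=True)
    subprocess.run(["rsync", "-az", DB_DUMP, f"{DR_HOST}:/backups/"], check=True)

if __name__ == "__main__":
    sync_codebase()
    ship_db_dump()
```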
CXOs, more often than not, feel the cost pinch of a non-operational disaster recovery site, hoping that when disaster strikes, the routinely non-operational DR environment will work exactly as expected. That expectation seems warranted by the enormous effort that goes into performing successful DR drills throughout the year. But does a routinely non-operational DR site always work wonders for you? The "non-operational" state, at least, makes me uncomfortable, so here is a proposition.
If the situation above feels familiar, read on for a solution that may help.
Simply keeping a parallel (non-operational) DR environment ready on a secondary site and running frequent DR drills may not guarantee successful DR and business continuity. Hoping it behaves "as expected" on D-day is far from realistic: most of the teams involved stay behind the curve in catching up with the delta changes that keep happening on the live production site all the time. Only when a DR event (or drill) happens do these delta changes surface, and only then are they applied to the DR site, which can be too late or too costly. The strategy below reduces that delta and keeps the DR environment closer to production. Again, the nitty-gritty of disaster recovery is out of scope for this article.
One way of handling this situation is to follow ToDR, aka "Test on DR": move your test environments to the DR site. The primary site then hosts only the production environment, while the test and DR environments live on the DR (secondary) site.
The benefits of this are:
- The DR site remains operational at all times, not "only" during drills.
- All infrastructure components on the DR site stay tested "all" the time.
- Issues on the DR site surface routinely rather than as last-minute surprises (or shocks!).
- Infrastructure can be scaled in seconds using DevOps automation as load demands (see the sketch after this list).
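To give the scaling point some shape, here is a minimal sketch assuming the test/DR workload sits behind an AWS Auto Scaling group; the group name, region and capacity figure are hypothetical, and any IaC or automation tool of your choice would do the same job.

```python
import boto3

# Minimal sketch: bump an (assumed) Auto Scaling group that backs the test/DR
# environment from test-sized to production-sized capacity on demand.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.set_desired_capacity(
    AutoScalingGroupName="dr-site-app-asg",  # hypothetical group name
    DesiredCapacity=8,                       # production-grade capacity
    HonorCooldown=False,                     # scale immediately during a DR event
)
```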
Whenever disaster strikes, the test environment can be taken down and the DR environment brought "up and running" by applying the necessary network/DNS changes (I love this concept of switching over DNS names; it helps a lot, but again it is out of scope for this article) and making the latest copy of the production database available. No customer would object to having their test environments down during a DR event. Moreover, RPO, RTO and RCO do not apply to test environments, so this fact can be capitalized on during the DR as well.
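As a minimal sketch of that DNS cut-over, assuming the public record lives in AWS Route 53 (the zone ID, record name and DR site address below are illustrative placeholders), the switch boils down to a single record change:

```python
import boto3

route53 = boto3.client("route53")

# Repoint the production DNS name from the primary site to the DR site.
# Zone ID, record name and target address are hypothetical placeholders.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",
    ChangeBatch={
        "Comment": "DR failover: primary site -> DR site",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com.",
                "Type": "A",
                "TTL": 60,  # a low TTL keeps the cut-over window short
                "ResourceRecords": [{"Value": "203.0.113.10"}],  # DR site address
            },
        }],
    },
)
```

Keeping the TTL low in normal operation is what lets the switch take effect in minutes rather than hours.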
Many small and mid-sized companies with SaaS-based offerings can apply this proposition successfully, for internal yet critical products as well as customer-facing products or services, and benefit considerably from the saved Opex.
As Jonathan correctly points out, DR and BCP benefit greatly from being able to spin up new environments in no time using DevOps, but let's keep idle DR infrastructure to a minimum and give the folks in the finance team a reason to smile and cheer. Happy DaRing!
“Wishing Won’t Keep You Safe, Safety Will”