While researching DR best practices, I came across a statistic from the Bank of Japan's 2012 survey on business continuity. Banks were asked how many days their power generators can run on stored fuel. The answer is surprisingly low. Look at the chart on page 25.
New York Blackout
I remember the 2003 New York blackout. I was working for a bank, and its New York data centre staff were forced to drive a long distance and wait for hours to buy generator fuel. With over 40% of Japanese banks holding no more than one day of fuel, this number is quite worrying. Japanese banks and professionals are well aware of large-scale catastrophes, and yet their risk assessments and impact analyses still arrive at one-day fuel stockpiles.
Thinking deeper, there are some reasons for not being able to store extra fuel. First, fire safety. Permanently storing a large amount of fuel requires extra safety measures. Industrial buildings or data centre locations may not allow storage of such flammable substances.
Second would be cost. What else?
Third, the estimated recovery time is not directly linked to fuel supply. Most people believe they can confidently replenish fuel within 24 hours. However, the large-scale 2003 blackout in New York lasted for 2 days. All generators were put into use, so supply was bound to be tight and you should expect a long queue. The assumption of continuous fuel supply during a disaster or large-scale blackout simply does not hold.
A better approach is to secure priority access to fuel supply when disaster strikes. So in your next data centre audit, you should ask, “Show me the contract.” Auditors are paid to ask tough questions.
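The fuel argument above can be put into numbers. This is only a rough sketch: the one-day stockpile and two-day outage come from the figures quoted in this post, while the function name `fuel_gap_hours` and the resupply delays are hypothetical values chosen for illustration.

```python
def fuel_gap_hours(stock_hours: float,
                   outage_hours: float,
                   resupply_delay_hours: float) -> float:
    """Hours the generator sits dry during an outage.

    stock_hours:          how long the on-site fuel lasts
    outage_hours:         how long grid power is out
    resupply_delay_hours: time from the start of the outage until the
                          first fuel delivery arrives (hypothetical input;
                          use float("inf") if no delivery arrives at all)
    """
    # The gap ends when either grid power returns or fuel is delivered.
    gap_ends = min(outage_hours, resupply_delay_hours)
    # The gap starts when the on-site stock runs out.
    return max(0.0, gap_ends - stock_hours)


# One day of fuel, two-day blackout, three assumed resupply scenarios.
for delay in (12, 40, float("inf")):
    gap = fuel_gap_hours(stock_hours=24, outage_hours=48,
                         resupply_delay_hours=delay)
    print(f"resupply after {delay} h -> {gap} h without power")
```

With a priority supply contract the delay stays short and the gap is zero. Once the delivery slips past the 24-hour mark, every extra hour in the queue is an hour without power.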
In my last post I mentioned an analysis done by a group of VCPs. One slide in their presentation deserves more discussion: the four-hour RTO defined in the MAS notice to banks.
VCPs' technical analysis of the MAS Technology Risk Management guidelines.
The Concept of Recovery Time Objective
Recovery time objective is a well-established concept, and I have been seeing it in large-scale project design documents and procurement RFPs. Wikipedia has this definition: “The recovery time objective (RTO) is the duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.”
The reader has to distinguish between recovering to full service and recovering to a service level. When disaster happens, everything has to be prioritized. Not all programs are equal when you have limited resources and time. We may not expect to pay a telephone bill via ATM during serious flooding, but we do expect the ATM to still let us withdraw money.
The slide (shown below) highlights the time difference between when the event happens and when a disaster is declared. Due to the complexity of current systems and networks, fully assessing a system malfunction may take hours. Usually the incident handling procedure requires a few rounds of clarification (if not finger pointing) before senior staff are informed of the major outage.
How a bank responds to an outage is now a critical element in meeting the MAS requirement on RTO. The authors of this slide contended that the time actually left for recovery is far less than four hours, and that manual steps are not going to meet this requirement. I believe they do have a point.
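The slide's point can be shown with simple arithmetic. The sketch below assumes the four-hour clock starts at the event rather than at the declaration, which is how the slide's argument reads. The function name `remaining_recovery_time` and the durations for detection, assessment and escalation are hypothetical, not figures from the slide or from MAS.

```python
from datetime import timedelta

RTO = timedelta(hours=4)  # the four-hour RTO discussed above


def remaining_recovery_time(detect: timedelta,
                            assess: timedelta,
                            escalate: timedelta,
                            rto: timedelta = RTO) -> timedelta:
    """Time left for the actual recovery once a disaster is declared.

    A negative result means the RTO was already missed before the
    recovery work even started.
    """
    time_to_declare = detect + assess + escalate
    return rto - time_to_declare


# Hypothetical incident timeline, for illustration only.
left = remaining_recovery_time(
    detect=timedelta(minutes=20),    # monitoring raises the alarm
    assess=timedelta(minutes=90),    # teams work out what actually failed
    escalate=timedelta(minutes=40),  # clarification until senior staff decide
)
print(f"time left to recover: {left}")
```

With these assumed numbers, two and a half hours are gone before anyone starts the failover, which leaves little room for manual recovery steps.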