As a CTO, I am involved in assessing infrastructure risks and in planning for what happens in the event of an IT disaster.
You consider every scenario: what happens if your data centre fails, the power fails, hardware faults or IP transit fails. You then design solutions that mitigate each risk based on its probability of failure and its impact.
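As a rough illustration of weighing probability against impact, the sketch below ranks a few failure scenarios by a simple score (probability multiplied by impact). The probabilities, the 1 to 5 impact scale and the names `Risk` and `rank_risks` are assumptions made up for this example, not figures from any real assessment.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    annual_probability: float  # assumed chance of the failure in a year (0..1)
    impact: int                # assumed business impact, 1 (minor) to 5 (severe)

    @property
    def score(self) -> float:
        # Simple exposure score: likelihood weighted by impact.
        return self.annual_probability * self.impact


def rank_risks(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Illustrative values only.
risks = [
    Risk("data centre failure", 0.02, 5),
    Risk("power failure", 0.10, 4),
    Risk("hardware fault", 0.30, 2),
    Risk("transit fault", 0.15, 3),
]

ranked = rank_risks(risks)
for risk in ranked:
    print(f"{risk.name}: {risk.score:.2f}")
```

A score like this is only a starting point for prioritising mitigation. A rare event with severe impact, such as losing a whole data centre, may still justify DR even when its score is low.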
You also consider the acceptable data loss and the RTO (recovery time objective), the time it takes to return to operations.
Our enterprise customers have DR. It's essential for any major service. Below is an example of one such system, which provided DR for essential services via a SAN replicated across two data centres (with an off-site backup in a third). It was an active-passive replica to manage costs. In the event of a disaster, web server images would be instantiated in the DR environment, which gave a 2h RTO.
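To make the 2h RTO concrete, here is a minimal sketch that adds up estimated durations for an active-passive failover and checks the total against the objective. The step names and durations are hypothetical, chosen only to illustrate the check. Real figures would come from timing an actual DR test.

```python
from datetime import timedelta

RTO = timedelta(hours=2)  # the objective quoted above

# Hypothetical failover steps with assumed durations.
failover_steps = [
    ("declare disaster and notify staff", timedelta(minutes=15)),
    ("promote replicated SAN in the secondary data centre", timedelta(minutes=30)),
    ("instantiate web server images", timedelta(minutes=40)),
    ("switch traffic and verify services", timedelta(minutes=20)),
]


def total_recovery_time(steps):
    """Sum the durations, assuming the steps run one after another."""
    return sum((duration for _, duration in steps), timedelta())


def meets_rto(steps, rto):
    """True when the estimated recovery time fits within the objective."""
    return total_recovery_time(steps) <= rto


elapsed = total_recovery_time(failover_steps)
print(f"estimated recovery: {elapsed}, within RTO: {meets_rto(failover_steps, RTO)}")
```

Recording the measured duration of each step during a DR test shows which step consumes most of the budget and how much margin remains.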
DR is tested every 6 months: we assume failure and test the readiness of systems and staff. ITIL processes define regular communication that provides updates on key services.
On 27th May, BA reported that their primary data centre had suffered a power failure, which caused their systems to fail and take hours to come back online. It was necessary to repair or rebuild systems. Their recovery time was significantly longer than 2h.
If their primary system failed, why didn't they switch to DR? It must exist, and even with minimal data loss it would have protected 99.9% of customers. Even if their DR wasn't of the same capacity, they could have protected airport services.
I find the amount of downtime a mystery, and I doubt the real issues will surface. But this is a reminder to everyone that we must expect systems to fail, and that only when we plan for and assume failure can we minimise losses in real-life scenarios.
Communication in times like this is essential and, for me, regardless of the technical issues, this was a failure at BA, and one I hope they learn from (providing secondary systems for communication, for one!).