This joins my other non-security pieces: a smarter, more social bank; preparing for Chrome; living without Windows; and turning bankers into engineers in a decade.
I wrote this on my iPhone on the tube home from work, in response to a question on Help a Reporter, and thought it may be of interest to you too.
How do you recover from a data centre failure?
The high-level recovery strategy is going to depend on whether you have a hot, warm or cold recovery site.
There are some steps common to all, though:
- Don't panic
- Execute your plan - you will not rise to the occasion, you will fall to your training. Trust the plan and the practice
- Confirm that you have actually had a full data centre failure. The most common causes are a network link down, a power failure where the UPS/backup has also failed, or whole racks failing (usually caused by power, cooling or user error). Acts of god like an earthquake or 9/11 are pretty rare
- Mobilize your team - you should have a call tree; this is the time to get people out of bed
- Execute your disaster comms plan - making sure the issue is escalated and that your users, customers and vendors are kept frequently and accurately informed is critical
- Make sure whoever is authorized calls it a disaster and invokes the disaster recovery and business continuity plans. If you need to get your users to an alternate site, this needs to start in parallel with the DR plan. Your job is to ensure they have the right information (i.e. the confirmed extent of the disaster, and a realistic time and cost to invoke DR rather than just recover the primary)
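A call tree is just a fan-out: each person rings their own direct contacts so no single caller becomes the bottleneck. A minimal sketch of the activation order, with entirely made-up roles and structure:

```python
# Sketch of a call tree: each contact rings their own direct reports.
# Every name here is illustrative, not a real role mapping.
CALL_TREE = {
    "incident-manager": ["network-lead", "server-lead", "comms-lead"],
    "network-lead": ["net-eng-1", "net-eng-2"],
    "server-lead": ["sysadmin-1", "sysadmin-2", "dba-on-call"],
    "comms-lead": ["helpdesk", "vendor-liaison"],
}

def activation_order(root="incident-manager"):
    """Breadth-first walk of the tree: the order people get woken up."""
    order, queue = [], [root]
    while queue:
        person = queue.pop(0)
        order.append(person)
        queue.extend(CALL_TREE.get(person, []))
    return order
```

The breadth-first order matters: all the leads are woken before any individual engineer, so each branch of the tree starts ringing in parallel.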
Hot: automated fail-over and/or load balanced services
Not much to do really - your DR site(s) should already be handling the load. Just run diagnostics, get application owners to confirm everything is working, and monitor your load. If you have horizontally scalable, virtualized applications, bring up extra servers as required for capacity; if not, shut down or freeze some non-core services. If you have quality of service on your network, your high-value applications should have the bandwidth; if you don't, assign a lower priority to any non-core traffic, or just block things like video streaming sites temporarily.
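The "freeze non-core services" decision can be sketched as a simple capacity calculation. The service names and capacity units below are illustrative assumptions, not a real inventory:

```python
# Sketch: decide which services to freeze when the surviving site is
# short of capacity. Names and capacity units are made up.
SERVICES = [
    # (name, is_core, capacity units consumed)
    ("payments",        True,  40),
    ("customer-portal", True,  30),
    ("reporting",       False, 20),
    ("video-training",  False, 15),
]

def services_to_freeze(available_capacity):
    """Freeze non-core services, in list order, until the load fits."""
    load = sum(units for _, _, units in SERVICES)
    frozen = []
    for name, is_core, units in SERVICES:
        if load <= available_capacity:
            break
        if not is_core:
            frozen.append(name)
            load -= units
    return frozen
```

In practice the same logic applies to network QoS: rank traffic by business value, then shed from the bottom until the core fits in what you have left.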
Warm: servers built, patched, racked and powered - just needs manual switch over
- Work down your prioritized list
- Confirm the servers, network links, databases are up
- Perform the manual switch
- Get application owners to run diagnostics
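The warm switch-over steps above amount to working down a priority-ordered list, only switching an application once its preconditions (servers, links, databases up) are confirmed. A sketch with stubbed checks and hypothetical application names:

```python
# Sketch of working down a prioritized switch-over list. The readiness
# flags are stubs; in reality you'd ping servers, test network links
# and query the database replicas. App names are illustrative.
APPS = [
    # (priority, app, preconditions confirmed?)
    (1, "core-banking", True),
    (2, "email", True),
    (3, "intranet", False),  # e.g. its database replica is behind
]

def switch_over(apps):
    """Switch ready apps in priority order; park the rest for follow-up."""
    switched, skipped = [], []
    for _, app, ready in sorted(apps):
        (switched if ready else skipped).append(app)
    return switched, skipped
```

The point of keeping a `skipped` list rather than blocking on one app is that a stuck priority-3 system shouldn't hold up everything behind it.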
Cold: god help you :) - a site with hardware connected but nothing else
- Get someone over to the cold site to power on the servers
- Work down your prioritized list
- Verify the integrity of your last backup - hopefully it is on disk, not tape
- Once servers are powered on and have iLO connectivity, run your build scripts
- Perform the restore
- Run diagnostics
- Switch users and get app owners to verify
After you have survived and got everything back to the primary site, take a breath, examine the root cause, and take stock of the lessons learned during the recovery.
Importance of planning and preparation
The key to doing any sort of incident recovery well is everything you did before the incident, i.e.:
- Plan - Having a solid, complete plan that is well understood and simple enough to be followed in a crisis. Keeping copies on Dropbox or other online storage you can reach from an iPhone or BlackBerry is vital; paper copies are useless because no one remembers to update them
- People and process as well as IT - both a DR and a BCP plan
- RTO / RPO - Every application having a recovery time objective and recovery point objective that the business knows and has accepted
- Resilient architecture - One that supports the required level of resilience and recovery, e.g. if you have a near-zero RTO you need a multi-data-centre or cloud strategy, virtualized systems, horizontal scalability, and redundant everything including network connections and ISPs. I know security professionals bang on about network segmentation and quality of service, but these can really help you recover from even a non-security scenario
- Testing and rehearsal - A plan that has been completely and frequently tested (at least every six months for critical systems, annually for everything else) by all relevant stakeholders. Yes, it is expensive to test, it hits productivity, and the business will bitch that they don't have the change window. But if it is ever needed they will thank you for it - and if it all goes to hell, who is going to get fired?
- Excellent comms plans and up-to-date call trees
- Experienced staff
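RTO and RPO only mean something if you measure your DR tests against them. A sketch of that comparison, with made-up objectives and figures in hours:

```python
# Sketch: compare each application's last DR test against its agreed
# RTO (max time to recover) and RPO (max data loss), both in hours.
# All objectives and results here are illustrative.
OBJECTIVES = {
    "payments":  {"rto": 1,  "rpo": 0.25},
    "reporting": {"rto": 24, "rpo": 24},
}

def breaches(test_results):
    """test_results: app -> (hours to recover, hours of data lost)."""
    failed = []
    for app, (time_taken, data_lost) in test_results.items():
        obj = OBJECTIVES[app]
        if time_taken > obj["rto"] or data_lost > obj["rpo"]:
            failed.append(app)
    return failed
```

Any app on the breach list means either the architecture needs work or the business needs to accept a looser objective - both are decisions to make before the disaster, not during it.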