Your data centre has just blown up

Just as the anniversary of 9/11 passes, I thought this was a pertinent time to say: your primary data centre failing is not a question of IF but WHEN (OK, that sounded a bit like FUD). Still, it pays to be prepared, and unlike most security risks this one is real rather than theoretical, and the business actually cares whether your systems are working (as opposed to secure).

This joins my other non-security pieces: a smarter, more social bank; preparing for Chrome; living without Windows; and turning bankers into engineers in a decade.

I wrote this as a response to a question on Help A Reporter, on my iPhone on the tube home from work, and thought it may be of interest to you as well.

How do you recover from a data centre failure?

Executing recovery

The high-level recovery strategy is going to depend on whether you have a hot, warm or cold recovery site.

There are some steps common to all of them though:
  1. Don't panic
  2. Execute your plan. You will not rise to the occasion; you will fall back to your training, so trust in the practice
  3. Confirm that you have actually had a full data centre failure. The most common causes are a network link going down, power failing along with the UPS/backup generators, or whole racks failing (usually down to power, cooling or user error); acts of god like an earthquake or 9/11 are pretty rare
  4. Mobilize your team - you should have a call tree, and this is the time to get people out of bed (see the sketch after this list)
  5. Execute your disaster comms plan - making sure the issue is escalated and that your users, customers and vendors are kept frequently and accurately informed is critical
  6. Make sure whoever is authorized calls it a disaster and invokes the disaster recovery and business continuity plans. If you need to get your users to an alternate site, this needs to be started in parallel with the DR plan. Your job is to ensure they have the right information (i.e. the confirmed extent of the disaster, and a realistic time and cost to invoke DR rather than just recover the primary site)
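To take the fumbling out of step 4, a minimal call-tree sketch like the one below can help. Everything in it - the roles, the numbers and the notify() stub - is a placeholder you would wire up to your own paging or SMS gateway.

```python
# Minimal call-tree sketch: walk the tree top-down and record who was reached.
# The roles, numbers and notify() body are placeholders - wire this up to your
# own paging, SMS or email gateway.
CALL_TREE = [
    ("Incident manager",       "+44 7700 900001"),
    ("Head of infrastructure", "+44 7700 900002"),
    ("Network on-call",        "+44 7700 900003"),
    ("Database on-call",       "+44 7700 900004"),
    ("Application owners",     "+44 7700 900005"),
]

def notify(role, number, message):
    # Placeholder: call your paging or SMS provider here and return whether
    # the page was delivered.
    print(f"Paging {role} on {number}: {message}")
    return True

def mobilise(message):
    # Work down the tree in order and keep track of who acknowledged the page.
    return [role for role, number in CALL_TREE if notify(role, number, message)]

if __name__ == "__main__":
    mobilise("Primary data centre down - invoking DR, bridge call open now")
```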

IT recovery

Hot: automated fail-over and/or load balanced services

Not much to do really; your DR site(s) should already be handling the load. Just run diagnostics, get application owners to confirm everything is working and monitor your load. If you have horizontally scalable, virtualized applications, bring up extra servers as required for capacity. If not, shut down or freeze some non-core services. If you have quality of service on your network, your high-value applications should have the bandwidth they need; if you don't, again assign a lower priority to any non-core traffic on your network, or just block things like video streaming sites temporarily.
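As a rough illustration of that "add capacity, then shed non-core load" logic, here is a minimal sketch. The three hooks (get_cluster_load(), start_standby_vm() and freeze_service()) are hypothetical stubs standing in for whatever monitoring and virtualization tooling you actually run.

```python
# Sketch of the hot-site capacity logic: if the surviving site is running hot,
# bring up spare virtualised capacity first, then freeze non-core services.
# The three hooks below are hypothetical stubs - replace them with calls into
# your own monitoring, hypervisor or cloud tooling.

LOAD_CEILING = 0.80  # start scaling/shedding once sustained load passes 80%
NON_CORE = ["video-streaming-proxy", "reporting-batch", "intranet-search"]

def get_cluster_load():
    # Stub: ask your monitoring system for current utilisation (0.0 - 1.0).
    return 0.85

def start_standby_vm():
    # Stub: ask your hypervisor or cloud API to power on a standby server.
    print("Starting one standby VM for extra capacity")

def freeze_service(name):
    # Stub: stop or throttle the named non-core service to free capacity.
    print(f"Freezing non-core service: {name}")

def manage_capacity(spare_vms=4):
    """Add capacity while load is above the ceiling, then shed non-core load."""
    while get_cluster_load() > LOAD_CEILING:
        if spare_vms > 0:
            start_standby_vm()
            spare_vms -= 1
        elif NON_CORE:
            freeze_service(NON_CORE.pop())
        else:
            print("Nothing left to scale or shed - escalate to the incident manager")
            break

manage_capacity()
```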


Warm: servers built, patched, racked and powered - just needs manual switch over

  • Work down your prioritized list (a rough sketch of this checklist follows below)
  • Confirm the servers, network links and databases are up
  • Perform the manual switch
  • Get application owners to run diagnostics
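A warm switch-over is essentially a checklist, so a small script that works down the prioritized list and confirms each DR endpoint answers before you throw the switch can save some fumbling. This is only a sketch: the host names and ports are invented, and switch_dns() is a placeholder for however you actually repoint users (DNS, load balancer, GTM and so on).

```python
# Sketch of working down a prioritised warm-site list: confirm each DR
# endpoint answers before handing it to the application owner for the manual
# switch and diagnostics. Host names and ports are made up; switch_dns() is a
# placeholder for your actual cut-over mechanism.
import socket

PRIORITISED = [
    ("payments-db",  "dr-payments-db.example.internal", 1521),
    ("payments-app", "dr-payments-app.example.internal", 443),
    ("crm-app",      "dr-crm-app.example.internal", 443),
]

def is_up(host, port, timeout=5):
    # A TCP connect is a crude but quick "is it even there" check.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def switch_dns(service):
    # Placeholder: repoint users and page the application owner.
    print(f"TODO: repoint {service} to the DR site and page the app owner")

for service, host, port in PRIORITISED:
    if is_up(host, port):
        switch_dns(service)
    else:
        print(f"{service}: DR endpoint {host}:{port} not answering - fix before switching")
```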


Cold: god help you :) - a site with hardware connected but nothing else

  • Get someone over to the cold site to power on the servers
  • Work down your prioritized list
  • Verify the integrity of your last backups - hopefully they are on disk, not tape (see the sketch after this list)
  • Once servers are powered on and have iLO connectivity, run your build scripts
  • Perform the restore
  • Run diagnostics
  • Switch users and get app owners to verify
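Most of the cold-site work is mechanical: verify the backup, build, restore, verify again. Here is a rough sketch of that loop, assuming on-disk backups with a SHA-256 checksum file alongside each dump; all the paths and the build/restore scripts are assumptions you would replace with your own.

```python
# Rough sketch of the cold-site loop: verify each backup's checksum, then run
# the (hypothetical) build and restore scripts in priority order. All paths
# and commands are assumptions - substitute your own.
import hashlib
import subprocess

PRIORITISED_BACKUPS = [
    ("payments-db", "/backups/payments-db.dump", "/backups/payments-db.dump.sha256"),
    ("crm-db",      "/backups/crm-db.dump",      "/backups/crm-db.dump.sha256"),
]

def checksum_ok(backup_path, checksum_path):
    # Compare the stored SHA-256 against a fresh hash of the backup file.
    try:
        with open(checksum_path) as f:
            expected = f.read().split()[0]
        sha = hashlib.sha256()
        with open(backup_path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                sha.update(chunk)
        return sha.hexdigest() == expected
    except OSError:
        return False

for name, dump, digest in PRIORITISED_BACKUPS:
    if not checksum_ok(dump, digest):
        print(f"{name}: backup failed integrity check - find an older copy")
        continue
    subprocess.run(["/opt/dr/build.sh", name], check=True)         # hypothetical build script
    subprocess.run(["/opt/dr/restore.sh", name, dump], check=True)  # hypothetical restore script
    print(f"{name}: restored - hand to the app owner for diagnostics")
```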

After you have survived and got everything back to the primary site, take a breath, examine the root cause and take stock of the lessons learned during the recovery.


Importance of planning and preparation

The key to doing any sort of incident recovery well is everything you did before the incident, i.e.:

  • Plan - having a solid and complete plan that is well understood and simple enough to be followed in a crisis. Having copies of this on Dropbox or other online storage you can access on an iPhone or BlackBerry is vital; paper copies are useless because no one remembers to update them
  • People and process as well as IT - have both a DR and a BCP plan
  • RTO / RPO - every application having a recovery time objective and recovery point objective that the business knows about and has accepted (see the sketch after this list)
  • Resilient architecture - one that supports the required level of resilience and recovery, e.g. if you have a near-zero RTO you need a multi-data-centre or cloud strategy, virtualized and horizontally scalable systems, and redundant everything, including network connections and ISPs. I know security professionals bang on about network segmentation and quality of service, but they can really benefit you in recovering from even a non-security scenario
  • Testing and rehearsal - a plan that has been completely and frequently tested (at least every six months for critical systems, annually for everything else) by all relevant stakeholders. Yes, it is expensive to test, it impacts productivity and the business will bitch that they don't have the change window. But if it is ever needed they will thank you for it, and if it all goes to hell, who is going to get fired?
  • Excellent comms plans and up-to-date call trees
  • Experienced staff
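On the RTO / RPO point, one cheap way to keep yourself honest between tests is a small daily check that flags any application whose newest backup is older than its agreed recovery point objective. The applications, paths and RPO values below are invented for the example.

```python
# Tiny daily check: flag any application whose newest backup is older than its
# agreed RPO. Application names, backup paths and RPO values are invented.
import os
import time
from datetime import timedelta

RPO = {
    "payments": ("/backups/payments-db.dump", timedelta(minutes=15)),
    "crm":      ("/backups/crm-db.dump",      timedelta(hours=4)),
    "intranet": ("/backups/intranet.tar.gz",  timedelta(hours=24)),
}

for app, (path, rpo) in RPO.items():
    try:
        age = timedelta(seconds=time.time() - os.path.getmtime(path))
    except OSError:
        print(f"{app}: no backup found at {path} - that is an RPO breach in itself")
        continue
    status = "OK" if age <= rpo else "BREACH"
    print(f"{app}: last backup is {age} old against an RPO of {rpo} -> {status}")
```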

Conclusion

I hope that helped, or at least gave you something to think about. An ounce of planning and preparation is worth a pound of cure. Obviously responding to a security incident that takes out your data centre is a bit different: a distributed denial of service attack is just going to flood your secondary sites as well, and a rampaging worm is the same - so you need some different tactics. The points about staying calm, having a plan, rehearsing your response and having an architecture that supports defence still stand though. If you are interested in any of those, let me know in the comments and I will write another post on it.
