Business continuity test FAQs – 2015
Why was the testing necessary?
As an institution, we are now so dependent on IT that anything that interrupts our services could have a significant impact on the University’s reputation and its ability to continue to operate as normal. Disasters such as fire, flood, gas explosion or terrorist attack are obvious potential catastrophes for the University and are covered by our critical incident planning. In the world of IT, however, even seemingly small incidents – a bug in the system, the failure of our network connection, an email hacking attack or the loss of a server – could severely impact our services. We need to know that we could run the University’s IT services from an alternative site or using different servers in the event of any system failure.
Bugs, hacking attacks and network failures happen all the time. Does that mean our IT systems are very vulnerable?
Yes and no. These kinds of threats are common and potentially devastating but staff in IS&T spend a lot of time and effort making sure that such incidents don’t cause the University problems. Most of the time, when something undermines the University’s IT infrastructure, you won’t notice because we’re continually monitoring and managing the way the systems operate to compensate for any failures and cope with the repercussions.
Occasionally, however, a set of circumstances conspires to thwart our best efforts. Last summer, for instance, a continually repeated disruption to the data centre’s power supply damaged a lot of our equipment – including the back-up systems and servers we would normally have switched services to.
We managed to get the University’s key services up and running in about a day. Less critical services, however, took a little longer to recover because of the number and complexity of the services that we run. Situations like that are, thankfully, very rare, but we need to be better prepared for them.
Why have you only tested those particular services? There are many more that the University uses on a daily basis – printing for instance.
Yes, there are many more services, but these are the ones which the University has determined are essential to its survival. If these were down for a long time at a critical time of year, the reputation and financial viability of the University could be compromised. However, we plan to vary the test each year so that different systems and different scenarios can be prepared for and practised, continually improving our resilience.
What are the advantages of a large-scale test like this?
• It allows us to test different scenarios – for instance, the loss of our internet connection, damage to a particular server or a problem at one of our data centres. We have processes in place to run all the University’s services as normal in any of these situations but because such failures aren’t common, the processes are usually only tested in theory unless we do a test of this kind.
• It lets us practise the process of recovering services and check that the instructions we have created for doing so work in a simulation of a real-life disaster. We can also tweak and update those processes because our systems and services will have changed since the last time a test of this kind was undertaken.
• The test also gave us the opportunity to investigate hidden flaws in some of our suppliers’ software. Because different systems have individual configurations and servers don’t fail very often, issues can lie undiscovered. Even the software writers (many of them major corporations) don’t understand all these weaknesses, as elements of some programs depend on third-party applications (sometimes many years old). Deliberately taking the servers down enabled us to document potential issues and write workarounds to prevent disruption of our services if a server failure should occur.
What would an IT failure cost the University?
If the worst happened and our IT failed completely, it would be very costly to the University. These estimates are conservative but illustrate why it’s so important to prepare for disasters which might affect our IT provision:
• An hour of complete IT downtime is estimated to cost about £63,000 to the University (in lost staff time).
• During clearing, an hour of complete IT downtime might cost many hundreds of thousands of pounds, quite apart from the reputational damage of a loss of services to our prospective students.
Because of this, we need to invest in contingency arrangements to keep everything running when disaster strikes. We can plan and examine our systems as much as possible but the only way to properly check that they are capable of dealing with failures is by conducting what we call ‘failover tests’. This requires us to deliberately take down services and their underpinning systems.
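As a rough illustration of the hourly estimate above, the figure can be scaled to a longer outage. This is a sketch only: the function name and the assumption that the cost grows in a straight line with the length of the outage are illustrative, not part of the estimate itself.

```python
# Illustrative sketch only. Assumes the cost of an outage grows linearly
# with its length, which the estimate above does not itself state.
HOURLY_COST_GBP = 63_000  # estimated cost of one hour of complete IT downtime (lost staff time)


def downtime_cost(hours: float, hourly_cost: float = HOURLY_COST_GBP) -> float:
    """Return a rough cost in pounds for an outage lasting `hours` hours."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return hours * hourly_cost


# Example: an outage lasting one eight-hour working day.
print(f"£{downtime_cost(8):,.0f}")  # prints "£504,000"
```

The clearing-period figure is not included because the estimate gives only a broad range for it.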
Why haven’t we done this before?
Actually, we have. About five years ago we did a business continuity test, but our systems were very different then and, since then, we’ve found it very difficult to find the opportunity to repeat it. At the time of the last test, staff and students didn’t use IT much outside normal working hours. Now most of our services are available 24 hours a day, 7 days a week, on and off campus. As an institution, we are now so dependent on IT that any break in provision causes a good deal of inconvenience and risks interfering with the University’s business. Of course, it’s this very dependence on IT that makes the tests so important.
Do other universities conduct tests of this kind?
Some are already doing it and others are following. We’re ahead of the field but most institutions realise the huge risk they’re taking if they don’t do this kind of planning. More and more businesses now build testing into their disaster recovery planning.
Aren’t there other ways of testing these services?
Yes, and we use them continually during our day-to-day management of the systems to check for problems and ensure reliable services. None of these tools, though, can fully replicate what happens when we actually deactivate the infrastructure that supports the services to check that they can operate in a different way.
If it’s that important, shouldn’t we be testing them more often?
It’s a balance. Testing is a disruptive, inconvenient and time-consuming process and the configuration of our IT systems is always changing. We have many different suppliers providing software and systems – all of which are continually updated – and the University is constantly tweaking and improving the way its various services use the systems provided. There are also external threats which could alter the way the different elements of our service interact with each other. In a year, much will have changed, but we hope that our processes will be able to cope with the shifts which might arise over the course of a few months. The pace of change in IT is very fast, though, and it may, in years to come, be necessary to carry out tests like this more often. There is also a balance in ensuring that any planned downtime does not affect the University’s academic schedule. This is why an annual test around the Christmas/New Year break is possibly the only time when we can test the University’s IT without impacting research, teaching and learning.