Posts Tagged problems

When IT goes Critical

Although I took a few days’ leave at the end of last week, I didn’t manage to make much of a break from work. With a critical incident around our VLE service, it was important to keep in touch to check what was happening and that issues were being escalated and communicated as needed.

Back at work, the pressure continues as the same systems problems occurred again today. What makes it particularly difficult is that the solution isn’t in our hands, and we can at best only provide mitigation and communication.

We have many small and medium IT incidents (individual instances of queries or failures), and in the main these are dealt with using our Incident Management process, which logs and records incidents and tracks them through to resolution. We also look to escalate incidents that aren’t resolved in the kinds of timescales we’d expect (we have targets for these depending on category and urgency). Some incidents, though, have such massive impact that they are treated differently, as critical IT incidents.
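Purely as an illustration of that escalation step – the post doesn’t describe the actual tooling, and the categories, urgencies and target times below are invented for the example – the check behind “escalate anything that has exceeded its target” amounts to something like this:

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical resolution targets (in hours) by category and urgency; the real
# categories and figures used by the service desk aren't given in the post.
RESOLUTION_TARGETS = {
    ("standard", "low"): 72,
    ("standard", "high"): 24,
    ("major", "low"): 8,
    ("major", "high"): 4,
}

def needs_escalation(category: str, urgency: str, logged_at: datetime,
                     now: Optional[datetime] = None) -> bool:
    """True if an open incident has exceeded its resolution target and should be escalated."""
    now = now or datetime.now()
    target = RESOLUTION_TARGETS.get((category, urgency), 24)  # fall back to a default target
    return now - logged_at > timedelta(hours=target)

# Example: an unresolved 'major'/'high' incident logged six hours ago is overdue.
print(needs_escalation("major", "high", datetime.now() - timedelta(hours=6)))  # True
```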

The recent VLE issues are definitely in this category. The impact, when the system has been down, has been on all students and almost all academic staff. That we have had such instability for an extended period of time, not simply as a one-off unusual occurrence, means that people are forced to put workarounds in place because they cannot rely confidently on the service being available when needed.

In this instance, our role is to take whatever action we can to reduce the likelihood of a failure, and to put in place whatever we can to ensure that any failure is recovered from as quickly as possible. We need to communicate the situation to all affected users and make sure we’re escalating the issues with our suppliers. In this case, we’re reliant on the supplier finding the cause of the problems and solving them.

And it does seem to be a ‘them’ and not a single cause. As such, it’s not a simple or quick matter to resolve, and we’re focused at the moment on the stability of the service whilst we resolve the underlying problem or problems.

My role is to make sure we’re doing all that we can, that we’re identifying any actions we need to take (and making available money, resource, time, whatever is needed), to make sure we’re communicating clearly and regularly, to brief very senior managers, and to keep talking to our supplier so they know and understand how critical this is to us. I’m also liaising with our supplier’s lead technical person daily to get updates on progress and then meeting with key people here to assess the situation and whether there’s more we need to do.



And another week

These weeks are whizzing by.

More work on business planning as the deadline for the first lot of information nears. Strategic objectives and investment/disinvestment proposals this week, service costing next week.

Major programmes and projects governance and the portfolio groups this week, plus discussions on specific work and a view of how things are progressing with our systems development agenda. We are seeing considerable demand for systems development (enhancements to current systems as well as bringing in new systems), both as a result of changing sector needs and government requirements, and in support of the objectives and aims of the university.

It was also a temporary goodbye to one colleague, off on maternity leave. It hardly seems any time at all since she was first telling us her news. We all obviously wish her luck and the chance to get some rest before the birth.

Finally, some issues we’d been having with our SAN manifested themselves in more serious ways at the end of the week, causing some disruption to a range of services. Whilst most services were recovered quickly, there was still an impact for staff and students. Having got the suppliers involved, they’ve identified the fault and a solution, and our staff have put in place a short-term fix that should resolve the issues affecting services in the meantime.

Not the best end to the week, especially with teaching resuming.



Results

A pressured day for a number of staff today – though none more so than for our students.

Today results were available online for a couple of faculties, but we began to experience problems in the morning when the web service began to show signs of serious overloading. As that got worse we found connections timing out, so it was almost impossible for anyone to log in.

A team of people in Infrastructure and in Systems worked on diagnosis but we also had to bring the supplier in as the problem wasn’t obvious or simple to fix.

By mid-afternoon we had managed to get the service back working, but we still have some work to do to understand what happened and why. With more results due tomorrow, we’re monitoring how the system copes overnight and tomorrow.

We’re also looking at contingency plans in the event that we start to see the kinds of calls on systems resources we saw today. There’s no obvious reason for the problems, which makes it harder to be sure we won’t get a repeat. However, with plans in place we should be able to respond quickly if needed.
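As a loose sketch of the kind of early-warning check implied by that monitoring – the post doesn’t say which metrics were actually watched, and the threshold and interval here are invented – something along these lines would flag unusual demand on the system before logins start timing out:

```python
import os
import time

# Invented threshold and polling interval; not taken from the post.
LOAD_THRESHOLD = 8.0   # 1-minute load average at which we'd invoke the contingency plan
CHECK_INTERVAL = 60    # seconds between checks

def watch_load() -> None:
    """Poll the 1-minute load average and warn when it crosses the threshold (Unix only)."""
    while True:
        load_1m, _, _ = os.getloadavg()
        if load_1m > LOAD_THRESHOLD:
            print(f"WARNING: load {load_1m:.2f} exceeds {LOAD_THRESHOLD} - consider contingency plan")
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    watch_load()
```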

Later, we’ll need to look at what happened and how we dealt with it to draw out any lessons, but right now the priority is keeping the service up and available.

A lot of people helped today – both in IS&T and elsewhere – it was a real team effort.



Thursday's Woe

After some communications problems between components in the data centre, hundreds of ‘distress’ signals were received overnight as servers (physical and virtual) reported problems. With some of the team over here in Cork, it’s something of an international effort to sort things out between people here and people back in Sheffield.

This morning a workaround was put in place that should have lessened the impact of the problems for most people, although staff at Collegiate may have had problems logging on. By 1030ish, most if not all of the priority services were back and working normally. I sent an All Staff email out via Corporate Comms around 1030 to let people know what the issues were and what was being done about them.

Staff back in Sheffield had also identified problems with SI (the Student Information system) yesterday, and an emergency meeting of relevant staff was called this morning, with input from colleagues over here. The decision was taken to stop and restart the service to clear residual problems. However, that doesn’t seem to have gone totally smoothly, and colleagues here and in Sheffield have again been working to try to resolve things. It looks like it is now beginning to get sorted, but we’ll be keeping close to an internet connection to monitor things this afternoon.

It’s been very instructive in some ways, watching people remotely administering the systems, receiving alerts, changing configurations, etc. When I think back to my early days in computing, as a computer operator, there was very little we could do with the system away from the console on the server itself. Shows how much things move on, and for the better.

Meanwhile, at Arundel Gate Court, the building has now been declared safe from an electrical perspective, I’m told, but the carpets are still soggy underfoot. Those key staff who were asked to come in to manage the clean-up with FD are checking whether it makes more sense now, given the time, to only bring back in key local staff for today and ask everyone else to come back into work as normal tomorrow. The alternative is to have everyone back in this afternoon, working either from a commandeered PC lab or in the less soggy parts of AGC. Either way, managers will be using the Telephone Tree again to contact staff and let them know.

A challenging day for people, to say the least.

