Post-mortems on the JSE IT failure highlight the costs of disaster

It hasn't happened since the on-again, off-again IT problems of 1996, but last week's IT shutdown at the JSE highlighted how costly system failures can be.

The Johannesburg Stock Exchange's system experienced six hours of downtime last week Monday, resulting in a staggering R7 billion in lost trade. After being down for most of the day, the JSE only managed to get its systems up and running by 3:15pm. Trading hours were extended to 7pm, but even so, the exchange did only about R5 billion worth of deals that day. JSE CEO Russell Loubser said the bourse's usual daily trade averaged about R12 billion.

At the time of going to print, the JSE was issuing a formal apology on its website, and the bourse's CIO, Riaan van Bamelen, was still conducting a post-mortem on the network fault that brought down the entire system.

While not wanting to disclose exactly what caused the problem until further investigations had been concluded, Loubser said "the hardware or software that doesn't fail from time to time has not yet been made".

He said: "We isolated the problem and solved it. Not even a full disaster recovery site could have avoided it - we could still have encountered it. If we failed over to the disaster recovery site, there is no guarantee that it would not have happened again."

DOING THINGS RIGHT

Regardless of the losses, it looks like the JSE did all the right things, said Craig Jones, operational director at Econarch Data Centre Services. "Disaster recovery is always a learning experience. There is no way to foresee every eventuality."

The process of disaster recovery requires businesses to perform an impact analysis, he notes. "Within that, organisations then need to make a cost-versus-risk decision: essentially, how high the risk of a particular failure is, and then whether preventing that failure is worth the cost." It is most likely that the particular failure the JSE experienced was considered a low risk and would normally be covered by insurance. "If they haven't failed like this since 1996, then the risk of this failure is minimal."

This kind of massive failure was last experienced at the JSE in 1996, when the trading floor was on and off for a period of five days.

Jones says there could have been many reasons why the company decided not to fail over to the disaster recovery (DR) site. "There could have been a delay in replication, which would mean the DR site would not be up to date. A trading environment needs to be current."

There could also have been an undetected vulnerability that would be present on the DR site as well, or even a connectivity issue between the two sites, he adds. "There is no way to speculate on what the problem could have been, unless they tell you."

FIXING THE PROBLEM

Continuity SA's GM for service delivery said the JSE would have done a risk mitigation exercise and, had it identified its network as a risk area, would have built in extra redundancy. But, he adds, "this kind of problem can happen".

He said: "To reduce their risk, they should do some kind of duplication of their network infrastructure, which is where their DR site comes in. They can triangulate to it, or they can have multiple [data] feeds into their production site [the JSE itself]."

The JSE will have to review its impact assessment and decide whether protecting against this kind of failure is now worth the cost, says Jones. "The exchange will probably analyse the problem in detail and change its DR site."

Late last year, the JSE announced that it had decided to move its IT function back in-house, after only two years of outsourcing it to Accenture. Loubser stressed that this move had no bearing on the problem, as the bourse's CIO was well equipped to deal with the issue at hand.
