THE MAGAZINE FOR FINANCIAL DIRECTORS AND TREASURERS

TECHNOLOGY December/January 2003

RETREAT AND RECOVER
When disaster strikes, the only option is to cut downtime by switching operations to a secret location, be it an exotic island in the Indian Ocean or a nondescript building in Kowloon Bay.
By Karen Winton

The phone call is anonymous, the male voice calm, threatening: “I’ve put a bomb in Exchange Square,” he says. Police evacuate the targeted building in Hong Kong’s financial district, which houses the stock exchange and hundreds of financial and legal firms. They also decide to empty two other high-rise towers in the vicinity, plus the nearby Jardine House and International Financial Center. Several thousand people flee their offices and, as police search for the bomb, minutes, then hours, pass. The entire square-kilometer area is cordoned off with meters of official yellow tape and grim-faced policemen. Dealing room floors sit abandoned, telephones ring in deserted back offices, cold coffee congeals, keyboards collect dust on desks left in a hurry.

Sadly, the scenario is far from fanciful. These days, it’s perfectly reasonable to think the unthinkable. Not least in Asia, where political and economic turmoil, earthquakes and typhoons, the occasional nuclear stand-off, or even the odd bombing, make it fertile ground for a host of risks – and, arguably, turn disaster recovery into more of a priority here than anywhere else in the world. US-based technology consultancy Gartner has published research that shows that despite the attention given to disaster recovery since September 11, IT spending on business continuity has actually declined worldwide since mid-2000. This factoid may not ease Asian CFOs’ paranoia, but there’s no need to fret. New options have emerged on Asian soil, making business continuity planning feasible, and even affordable. Some facilities have popped up in exotic locales, others in such familiar territory as workaday Kowloon.

Many of Hong Kong’s multinational financial institutions and trading houses subscribe to duplicate facilities run by COL Ltd on the north side of Hong Kong’s Victoria Harbor that mirror their own dealing room and back office positions. These include a data center that has backed up their critical data and applications since they signed up to the service several years or months ago. It’s a classic outsourcing model courted by several companies that pay a monthly subscription fee of between HK$20,000 and US$50,000 to the 30-year-old IT services provider in order to secure a disaster recovery site and back-up facilities. A quick phone call as the emergency unfolds and a dozen people spring into action at two recovery sites. Unmarked rooms in secret locations are unlocked and opened up, power, telecommunications and feeds to Bloomberg and Reuters switched on, and staff dispersed to lobbies to welcome and direct disorientated clients to their respective positions in each facility. Almost, but not quite, a case of business as usual.

Bomb threats – and bombs – are an unfortunate reality of life, but daily existence for most is a lot less eye-popping. So far, the majority of real “disasters” in Hong Kong have been caused by rather more mundane plumbing-induced power failures, says Norris Hickerson, director and general manager of COL Ltd. “You can drive yourself insane thinking about the different types of disaster scenario,” says Hickerson. “But because of the cost of space in Hong Kong, few people have control over their own buildings, and they don’t know what the people upstairs or downstairs are doing,” he adds.

Hickerson has helped companies including international bank Rothschild engage their disaster recovery plans and continue to operate in the face of an interruption. In Rothschild’s case, a seawater pipe that burst in August 2001 disrupted electricity supply to its Hong Kong headquarters building, and the subsequent power failure affected treasury dealing room and settlement facilities. A telephone call to COL saw the restoration of the bank’s data and applications on the back-up servers in COL’s Kowloon Bay recovery center, and staff transferred quickly to the center to work. The power outage lasted less than 24 hours but Rothschild was able to minimize its downtime. This was crucial. Such a financial institution could lose up to US$6.5 million an hour during downtime, according to estimates from the US-based Disaster Recovery Institute International (DRII). A failure at an airline reservations system could cost up to US$90,000 an hour, a credit card sales system knockout US$2.6 million an hour, according to DRII’s figures. No surprise then that in Hong Kong, Steve Beason, the Hong Kong Jockey Club’s (HKJC) executive director of Information Technology, says disaster recovery and business continuity planning are “no brainers” for him and his peers – including the CFO. The HKJC turns over US$150 million on each race day in betting transactions, half of which are cashless. On average, 40,000 people bet each race. The club is, says Beason, Hong Kong’s biggest user of IT.

“Horse racing is not resilient as far as having a customer wait and then reinvest,” he says. “A horse race happens over 90 seconds. If people miss that race they don’t bet twice as much on the next one, and we lose that revenue. So it’s pretty easy for us to make a cost benefit analysis for business continuity when we’re putting in place different channels or products, or something that affects the betting system.”

Beason’s worst nightmare – an outage with the potential to disrupt the HKJC’s lucrative betting systems – came home to roost twice in 1999, once as a result of bad weather and the second time because of a bug in the computer operating system. The first incident saw the enormous light-board displaying the odds, which looms over the track at Happy Valley racecourse, simply fizzle and die. For a split second, it seemed that the HKJC’s entire betting system had gone down. The punters certainly thought so. But Beason knew better. Drizzle and a strong wind aided by exhaust fans intended to cool the light-board had driven moisture into it and shorted its circuits. “Electricity and water don’t like each other much,” comments Beason, wryly. It quickly became obvious with multiple light-boards and CCTVs still operational throughout the racecourse that it was the failure of one piece of equipment – easily fixed – rather than the entire betting system, which would probably have created riots among Hong Kong’s gambling-mad populace.

The second “disaster” saw the failure of some betting terminals (for the selling and payout of wagers) controlled by a computer bug-stricken operating system. “It wasn’t an entire system outage, just an outage of that specific box,” says Beason. “We try and spread the terminals between boxes according to how much we want to be affected by an outage of multiple components that share these terminals,” he adds. Unfortunately, even if half the racecourse terminals are on one system and half on another, having an outage in one is still going to be seen as a major situation. But in this instance, with half an hour between races and a 15-minute postponement of the next race, Beason’s team was able to get the system back up without losing a cent of revenue.

Belt And Suspenders

The HKJC’s ability to recover from a situation is based around the service level agreements (SLAs) that Beason’s IT group, in its role as an IT services provider to the rest of the organization, has with the HKJC’s business users, for example, the executive director of betting. These include the requirements from that user for business continuity in the event of an outage of some nature. “I try and put into my SLAs what my recovery times will be for different outages and what number of components would have to go down for an outage to take place,” says Beason.

He adds that he has a “belt and suspenders mentality” over the transaction processing and results entry of the betting system that says no single outage will cause service disruption. In other words he has to maintain 99.9 percent uptime – at a significant cost – and replicate every “mission critical” piece of data or application in at least two separate locations in order to be able to run a back-up in the event of failure. The HKJC’s two sites are at its racecourses – Happy Valley and Shatin – and operations can run out of either. “We can suffer single outages within each site and remain running. If there’s a secondary outage we have anything up to 30 seconds downtime depending on what component fails,” says Beason. Communication lines are on separate circuits and never co-located; even cross-harbor connections are in different locations, so “if someone drops an anchor in the harbor and digs up one of our lines, we have another as a back-up in a physically different location,” he adds.
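To put the 99.9 percent figure in perspective, the arithmetic can be sketched in a few lines of Python. This is an illustration only, not the HKJC's own model: the function names are invented here, and the second function assumes that sites fail independently of each other, which real installations sharing power or network paths may not satisfy.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def allowed_downtime_hours(uptime_pct: float) -> float:
    """Hours of downtime per year permitted by a given uptime percentage."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime_pct must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)


def combined_availability(site_availabilities: list[float]) -> float:
    """Availability of a service that survives as long as any one site is up.

    Assumption (not from the article): site failures are independent,
    so the chance that all sites are down is the product of each
    site's individual unavailability.
    """
    probability_all_down = 1.0
    for availability in site_availabilities:
        probability_all_down *= 1 - availability
    return 1 - probability_all_down


if __name__ == "__main__":
    print(f"99.9% uptime allows {allowed_downtime_hours(99.9):.2f} hours down per year")
    two_sites = combined_availability([0.999, 0.999])
    print(f"Two independent 99.9% sites: {two_sites * 100:.4f}% combined")
```

At 99.9 percent, a single site may be down for roughly 8.76 hours a year; mirroring across two sites, under the independence assumption, shrinks the window in which both are down at once to a tiny fraction of that.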

The data replication in the betting system is split into two areas – the transactional replication of data; and the replication of all the results, pay-outs, etc. Both are backed up onto a storage area network infrastructure based on EMC’s Symmetrix™ facility and TimeFinder™ software. Sitting between the two transactional areas is what Beason calls a command and control box that pulls totals from all the different processors that busily calculate the odds coming in via a distributed system. This happens every 12 seconds and puts a phenomenal amount of pressure on the box sitting in the middle of the processors. The “mission critical” nature of that operation is that the odds have to be correct all the time. When the club calculates how much a winner is paid, this figure is sent out to all terminals on the distributed system so that when that pay transaction comes up, the terminals know how to pay out and for how much money. The box at the center of the transactions is backed up locally with another and then remotely over a network link to Shatin. Both boxes are “hot”, meaning that the data is constantly updated in real time. “We have the data stored in two locations before we finish any transactions,” says Beason. “That ensures my ability to take them over in real time to another site and know that my data is updated as of the minute.”

The EMC software has another trick. It allows Beason to take a snapshot of all critical data – both transactional and command and control – so that he can go back to any point in time and do a spot financial analysis. While the system cuts the data required for analysis, it continues processing as normal, not stopping operations while the snapshot is taken. It’s used for end-of-day accounting purposes, serving as a reference point for sales for that day, and it makes money for the HKJC because, says Beason: “We can continue operating. We tie in our back-up and our roll-over from one day to another so that we back up the data that is relevant to that financial day and don’t have to stop during the snapshot.” A common shared system from EMC’s storage area network allows data to be backed up to the physical medium – tape – automatically. One set of tapes is stored onsite in Happy Valley or Shatin, and another offsite in a secret location.

Silvana Cheng, EMC’s Hong Kong-based financial controller for Asia Far East, says that the TimeFinder™ software was a Y2K development, intended to help customers forward-project their mission critical data from 31 December 1999 to 1 January 2000 to establish whether it would continue running trouble-free. The ability to freeze the data at a point in time while still continuing to run operations and, therefore, back-ups, was an added bonus, as was the speed of information retrieval. “Real time data helps in reporting numbers. Your system is updated, always in real time, and if you want to know how you’re doing this month, this quarter, this year, you can access the information readily,” she says. An uninterrupted system also means real time data 365 days a year, so it also assists forward planning because it helps Cheng forecast much further ahead. The software does various analyses using historic and real time data for comparison by the month, quarter or year, and allows rolling forecasts based on these expected revenues and expenses. And it cuts expenses. “If I have four downtimes in one month, my staff are sitting there idle. Real time data and application mirroring means that there is not a lot of downtime,” Cheng comments.

The Y2K Disaster

Look at Y2K as the pinnacle of the business world’s romance with disaster recovery. Worldwide spending on software applications and network storage facilities for disaster recovery and back-up purposes grew steadily during the 1990s to peak in mid-2000, according to September 2002 research from Gartner in the US. In the aftermath of a disaster – for example, the 1993 World Trade Center bombing – enterprises rushed to invest in their disaster recovery systems and practices, spending on mainframe data centers. Spending on mainframes as a percentage of data center budget grew from 2 percent in 1993 to peak at 4.8 percent in mid-2000. Interestingly, the research also shows a widespread spending decline on disaster recovery mainframe technology since September 11, 2001 – a decrease of 5.2 percent for mainframes, 12.2 percent for Windows NT and 15.1 percent for Unix systems.

“Organisations’ focus was higher, their attention was higher after September 11, but I never saw that translated into major expenditure,” says Phil Sargeant, research director, Servers and Storage for Gartner in Sydney. “Companies revisited the plans they had made but suddenly the vendors that were hyping up disaster recovery weren’t actually seeing the revenue flow,” he says.

Some vendors, particularly those on the software side like Veritas, have bucked the downslide and are still experiencing revenue growth in excess of 25 percent year-on-year in Asia, partly as a result of being able to partner with server or storage platform vendors like EMC. But Sargeant says that instead of plunging wholesale into new disaster recovery software or hardware platforms, many CFOs and CIOs are becoming more concerned about the preservation and integrity of their information. “I’ve seen a lot of activity in things like replicating data and backing up data more regularly,” he says. Hence the HKJC’s HK$36 million spent on consolidating its storage area networks in the past three years. But, says Sargeant, “business continuity is more than the information. It’s about people and communications, and I’ve yet to see huge expenditure in those areas.”

What has changed, and what has perhaps mitigated spending since September 11, is that organisations have become more prudent about what processes they’re going to provide disaster recovery and business continuity plans for. They’re looking more carefully at the cost of their downtime and the cost of downtime on an application basis. In a stagnant economy it makes sense to look at the “mission critical” applications – the betting systems at the HKJC, for example – and provide for disaster recovery and business continuity application by application rather than for the organisation in its entirety. The nature of a critical application depends on each organisation but revenue-generating and customer-facing applications tend to be earmarked as crucial.

“Many applications in organisations don’t need the same sort of mechanisms in place to restore them quickly. If they’re restored in a day, maybe several days or a week, that’s fine,” says Sargeant, “the company still continues. CFOs are going through a process of understanding what is important. Once they do that they can get a more realistic plan, and one that’s better from a price perspective.”

Offshore Escapes

There are those, however, daring to challenge the trend in terms of spending by setting up their own disaster recovery facilities. Not surprisingly, they are IT vendors. Infosys, the Bangalore-based IT services giant, recently announced the establishment of its first disaster recovery center outside India, on the island of Mauritius. At a cost of US$25 million in terms of capital expenditure, you could view this move as something of a desert island risk, a facility for 1,500 software developers lying practically vacant until the hour of need, on a tropical island several thousand kilometers east of Africa. Then again, you could just label it sound planning. Mauritius has good relationships with India on a government level. It’s close to Europe and India, with direct flights to Bangalore taking about five hours. It has a good technology infrastructure and telecommunications facilities and, most importantly, its government was willing to organise work permits for Infosys in advance. Should nuclear war break out between India and Pakistan over border tensions involving Kashmir, Infosys can command an armada of aircraft to airlift its 1,500 developers and engineers to Mauritius in a matter of hours.

“In a disaster recovery situation, you need to have the space available,” says S Gopalakrishnan, chief operating officer and deputy managing director of Infosys. “That means it has to be unused till a disaster happens. It’s empty unless there is a disaster,” he says.

He maintains that once the capex investment is made, there is little to spend in terms of ongoing costs. The servers used to store programs and data, the network and connectivity are in place. The theory is that in the event of an emergency that cuts India off, the 1,500 developers will arrive and hook up their notebooks to the network. Voilà. Instantaneous and effective business communication with customers.

Infosys already has eight data recovery centers, any one of which can take over from another. But they’re all based in India. After September 11 and the border tensions, customers started asking how the company could protect their data and operations. Mauritius is the first attempt to satisfy these questions, but Gopalakrishnan says there may be more similar recovery centers once Mauritius is operational in January 2003.

“This whole plan is for Infosys to serve its customers,” says Gopalakrishnan. “In the event of a disaster, Mauritius, along with people already based outside India, should help us keep all the customer projects running.”

Slightly less exotic, but still an island located outside India, Singapore is providing a haven for Polaris Software Lab, the US$200 million market cap Indian IT company. Polaris’ S$1.5 million (US$770,000) 150-seat Business Continuity Center was set up in Singapore in August 2002 to provide recovery services for Polaris’ mainly banking and insurance clients. It is linked to the company’s five development centers in India through a dedicated private link, used exclusively for replicating data to the server installed at the Singapore BCC and directly accessible by key clients. “Disaster recovery and business continuity are imperative to our clients, given the mission critical nature of their business,” says N Vaidyanathan, CFO of Polaris. “With the BCC, clients will have direct access to their data, and the assurance that the back-up data will remain intact.”

Unlike Infosys or Polaris, India’s Tata Consultancy Services (TCS), a division of the US$11.3 billion business conglomerate Tata Group, is using its development centers worldwide as back-up facilities for customers’ critical business processes. In conjunction with a well-documented handbook detailing everything a client needs to know in the event of having to negotiate the critical steps to recovery, the centers – 24 in India plus six in the US, and one each in the UK, Hungary, Melbourne, Yokohama, Hangzhou in China, and Uruguay – are production support centers whose cost is underpinned by software development and customer servicing onsite.

“I’m not a great believer in creating the one Fort Knox and assuming that it is the be-all and end-all of crisis management,” says Girija Pande, regional director for TCS in Singapore. “We have to give customers more flexibility and a global crisis resolution story because some of them will require help in different places.”

Pande, who has a banking background, says part of TCS’ competitive advantage for its clients is in the geographical spread of its development centers – effectively across all time zones – and in its inclusion of disaster recovery into already existing facilities. “Disaster recovery is a process and a mindset, not a physical environment only,” he says. “The right processes, documented, are the key. People have to be aware that unless disaster recovery and business continuity are thoroughly detailed, quality-controlled processes, there’s no point in having the expensive fixed assets only,” he says.

The Cost Of Recovery

Of the companies that lack contingency plans and experience two to five days of network downtime, 25 percent will go bankrupt immediately, 40 percent will close their doors within two years and, of the 35 percent remaining, virtually none will exist five years on. Vendors might try to scare CFOs into acting on the US-based Contingency Planning Organization statistics and the oft-quoted cliché that: “The day after disaster recovery plans are created, they are out of date”. But what’s worth bearing in mind is that a disaster recovery plan’s effectiveness in protecting an organization fades progressively with time. CFOs should at least consider how to prevent this from happening as well as the merits of keeping an organisation upright in the face of disaster by investing in business continuity strategies of some nature.

But is investment in a disaster, which may or may not happen, a simple leap of faith or a poorly executed back flip? The HKJC’s Beason says that for the club’s new off-course betting network, he and his peers looked at the ROI in terms of whether the investment was an opportunity cost or an opportunity lost. They then looked at the kind of pricing involved to make the network reliable, versus having it redundant. The basic issue was whether to have two physical telephone lines going into every off-course betting center as opposed to one. They looked at how many outages might occur over five years and what that additional cost would be. “We said if it happens once a year it will pay for itself,” says Beason. “If for five minutes we go down at the wrong time – and it’s happened once in the last five years – we’ll make up all that money within that time. If it never happens and we never have an outage we’ll have spent another HK$10 million that we didn’t need to spend. But our historic information says that it’s likely to happen, and it has happened before. And if it happens twice then we’re saving money. We made the call that way. We’re forced to make all of our decisions that way.”
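Beason’s reasoning amounts to a break-even calculation, which can be sketched in Python as follows. The HK$10 million redundancy cost comes from his account; the loss per outage is not stated in the article, so the HK$10 million used in the example run is a purely hypothetical value chosen so that a single outage pays for the second line. The function names are likewise illustrative.

```python
def break_even_outages(redundancy_cost: float, loss_per_outage: float) -> float:
    """Number of avoided outages at which the redundant line pays for itself."""
    if loss_per_outage <= 0:
        raise ValueError("loss_per_outage must be positive")
    return redundancy_cost / loss_per_outage


def net_benefit(redundancy_cost: float, loss_per_outage: float,
                outages_avoided: float) -> float:
    """Money saved (positive) or overspent (negative) over the planning period."""
    return outages_avoided * loss_per_outage - redundancy_cost


if __name__ == "__main__":
    REDUNDANCY_COST_HKD = 10_000_000   # extra spend quoted by Beason
    LOSS_PER_OUTAGE_HKD = 10_000_000   # hypothetical; not given in the article

    print("Break-even outages:",
          break_even_outages(REDUNDANCY_COST_HKD, LOSS_PER_OUTAGE_HKD))
    for outages in (0, 1, 2):
        result = net_benefit(REDUNDANCY_COST_HKD, LOSS_PER_OUTAGE_HKD, outages)
        print(f"{outages} outage(s) avoided over five years: HK${result:,.0f}")
```

With these placeholder figures, no outages means the extra spend is wasted, one outage breaks even, and a second puts the club ahead, which mirrors the shape of the argument Beason describes.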

Malcolm Harkins, director of information security and business continuity at Intel in the US, acknowledges that there’s a spectrum of cost trade-off, depending on how much a CFO is prepared to pay to mitigate a potential disaster. But he also notes that there are some things in a business continuity sense that lend themselves to a clean ROI. “For back-up solutions or desktops in the office environment, for example, you can work out a strong ROI just by looking at industry average hard-disk failure rates, determining the cost of that, your cost of replacement, your IT support costs, and your end-user issues. Then you ask, ‘how much does it cost me to provide them with back up solutions?’ That’s a very clear ROI that’s a no-brainer if you go through those calculations,” says Harkins.
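The desktop back-up calculation Harkins outlines might look something like the Python sketch below. Every number in the example run is invented for illustration, and the single “avoidable cost per failure” parameter is a simplifying assumption that lumps together whichever replacement, support and end-user costs a back-up would actually spare; a real analysis would break these out separately.

```python
def backup_roi(machines: int,
               annual_failure_rate: float,
               avoidable_cost_per_failure: float,
               backup_cost_per_machine: float) -> float:
    """Annual return on a desktop back-up scheme, as a multiple of its cost.

    annual_failure_rate is the fraction of hard disks expected to fail
    in a year (e.g. 0.03 for 3 percent). avoidable_cost_per_failure is
    the loss per failure that having a back-up would prevent.
    """
    if machines <= 0 or backup_cost_per_machine <= 0:
        raise ValueError("machines and backup_cost_per_machine must be positive")
    expected_failures = machines * annual_failure_rate
    avoided_loss = expected_failures * avoidable_cost_per_failure
    total_backup_cost = machines * backup_cost_per_machine
    return (avoided_loss - total_backup_cost) / total_backup_cost


if __name__ == "__main__":
    # All inputs below are hypothetical placeholders, not industry data.
    roi = backup_roi(machines=1000,
                     annual_failure_rate=0.03,
                     avoidable_cost_per_failure=2000.0,
                     backup_cost_per_machine=20.0)
    print(f"ROI: {roi:.1f}x the annual back-up spend")
```

A positive result means the expected losses avoided exceed what the back-up scheme costs; a negative one means the scheme costs more than the failures it guards against.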

His approach when ROI is less clear is to look at the strategic value and competitive advantage inherent in contingency planning like the reduction in customer and employee uncertainty. “If you have the ability to operate continuously and in a nonstop fashion, in the event of a disaster or disruption, when your other competitors may not be able to do that, you’ve got a great competitive advantage,” says Harkins.

It’s also possible to quantify the cost of downtime and the cost of disruption. For example, take an online brokerage company with millions of transactions a day on its website for web trading. The cost of that site being down is enormous. There’s a clear cost benefit trade-off to make and it’s not what you might think, says Harkins. “Because the business is automated and trading is on the web, business continuity is also cheaper than if that brokerage had 500 people taking phone calls that they then needed to replicate in another location,” he says.

In other words, if disaster strikes, be it a burst pipe or a terrorist attack, replicating a call center is vastly more expensive in the long run than replicating a server environment and some internet connections in another location. And the cost of disaster recovery and business continuity could dwindle further as more and more critical business processes are transacted over the internet and backed up on storage area networks connected via high-speed network links to servers offshore.

Karen Winton is executive editor of eCFO and a senior writer at CFO Asia based in Hong Kong.

Recovery Objectives

There are two key concepts in disaster recovery that vendors attempt to clarify before they draw up a contract with a customer: recovery point objective (RPO) and recovery time objective (RTO). Both cover the risks inherent in managing data or application loss, and having systems out of action for a period of time.

Alvin Ow, regional technical consulting manager, Asia South, Veritas Software, based in Singapore, says that different clients and processes require tailor-made RPOs and RTOs. “For one particular application the RPO could be no data loss at all,” he says, “while the RTO might be a couple of days. It’s up to us to map our solutions to the RPO and RTO requirements of each end user.”

A typical RTO for “mission critical” applications is four hours, says Norris Hickerson, director and general manager of Hong Kong-based IT services provider COL Ltd. “Four hours can sound like a long time, but think about the logistics of getting people moved after a disaster has occurred or you’re under threat of one,” he comments, dryly.
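The two objectives can be checked against a given recovery arrangement with a short Python sketch. It is a simplification under stated assumptions: the class and field names are invented here, the worst-case data loss is taken to equal the interval between back-ups (zero for real-time mirroring), and restore time is treated as a single known figure.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    rpo_hours: float  # most data, measured in hours of work, the business can lose
    rto_hours: float  # longest the process can stay out of action


@dataclass
class RecoveryPlan:
    backup_interval_hours: float  # 0 means data is mirrored in real time
    restore_hours: float          # time needed to bring the process back up


def meets_objectives(plan: RecoveryPlan,
                     objectives: RecoveryObjectives) -> dict[str, bool]:
    """Report whether a plan satisfies each objective.

    Assumption: a failure just before the next back-up loses one full
    interval of data, so the interval is the worst-case data loss.
    """
    return {
        "rpo": plan.backup_interval_hours <= objectives.rpo_hours,
        "rto": plan.restore_hours <= objectives.rto_hours,
    }


if __name__ == "__main__":
    # Ow's example: no data loss allowed, a couple of days to recover.
    zero_loss = RecoveryObjectives(rpo_hours=0, rto_hours=48)
    # Hickerson's typical mission-critical RTO; the RPO here is hypothetical.
    mission_critical = RecoveryObjectives(rpo_hours=0, rto_hours=4)

    nightly_tape = RecoveryPlan(backup_interval_hours=24, restore_hours=12)
    mirrored_site = RecoveryPlan(backup_interval_hours=0, restore_hours=2)

    print(meets_objectives(nightly_tape, zero_loss))
    print(meets_objectives(mirrored_site, mission_critical))
```

Treating the two objectives separately makes the trade-off visible: a nightly tape back-up may restore quickly enough yet still fail a zero-loss RPO, while a mirrored site can satisfy both at a higher price.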

There are three questions to ask yourself when determining RPO and RTO.

What’s my IT support to the business processes I’m trying to protect?
What’s an acceptable recovery time for each business process in the event of failure?
What’s the acceptable amount of data loss for each business process in the event of failure?

“You need to evaluate each point differently,” says Malcolm Harkins, director of information security and business continuity at Intel in the US. “Once you’ve determined the framework for continuity, it gives you the key elements necessary to do a risk analysis, document the trade-off and determine what you’re willing to spend to mitigate those issues.”

How you spend depends on whether you plan to outsource or handle business continuity in-house. If you don’t outsource to a provider that can offer some continuity of operations for a period of time in the event of a disaster, you can implement your own storage and back-up capabilities in either “hot” or “cold” sites. A hot site is one that mirrors your full business processes, including the data and applications, in real time. A cold site gives you some basic capabilities, but if it’s not entirely dedicated to back-ups, be prepared for some data loss and knock-on effects should the primary production site fail before the secondary site is operational. KW