TECHNOLOGY
December/January 2003
RETREAT AND RECOVER
When disaster strikes, the only option
is to cut downtime by switching operations to a secret location,
be it an exotic island in the Indian Ocean or a nondescript
building in Kowloon Bay.
By Karen Winton
The phone call is anonymous, the male
voice calm, threatening: “I’ve put a bomb in Exchange
Square,” he says. Police evacuate the targeted building
in Hong Kong’s financial district, which houses the
stock exchange and hundreds of financial and legal firms.
They also decide to empty two other high-rise towers in the
vicinity, plus the nearby Jardine House and International
Finance Centre. Several thousand people flee their offices
and, as police search for the bomb, minutes, then hours, pass.
The entire square-kilometer area is cordoned off with meters
of official yellow tape and grim-faced policemen. Dealing
room floors sit abandoned, telephones ring in deserted back
offices, cold coffee congeals, keyboards collect dust on desks
left in a hurry.
Sadly, the scenario is far from fanciful.
These days, it’s perfectly reasonable to think the unthinkable.
Not least in Asia, where political and economic turmoil, earthquakes
and typhoons, the occasional nuclear stand-off, or even the
odd bombing, make it fertile ground for a host of risks –
and, arguably, turn disaster recovery into more of a priority
here than anywhere else in the world. US-based technology
consultancy Gartner has published research that shows that
despite the attention given to disaster recovery since September
11, IT spending on business continuity has actually declined
worldwide since mid-2000. This factoid may not ease Asian
CFOs’ paranoia, but there’s no need to fret. New
options have emerged on Asian soil, making business continuity
planning feasible, and even affordable. Some facilities have
popped up in exotic locales, others in such familiar territory
as workaday Kowloon.
Many of Hong Kong’s multinational
financial institutions and trading houses subscribe to duplicate
facilities run by COL Ltd on the north side of Hong Kong’s
Victoria Harbor that mirror their own dealing room and back
office positions. These include a data center that has backed
up their critical data and applications since they signed
up to the service months or years ago. It’s
a classic outsourcing model courted by several companies that
pay a monthly subscription fee of between HK$20,000 and US$50,000
to the 30-year-old IT services provider in order to secure
a disaster recovery site and back-up facilities. A quick phone
call as the emergency unfolds and a dozen people spring into
action at two recovery sites. Unmarked rooms in secret locations
are unlocked and opened up, power, telecommunications and
feeds to Bloomberg and Reuters switched on, and staff dispersed
to lobbies to welcome and direct disorientated clients to
their respective positions in each facility. Almost, but not
quite, a case of business as usual.
Bomb threats – and bombs –
are an unfortunate reality of life, but daily existence for
most is a lot less eye-popping. So far, the majority of real
“disasters” in Hong Kong have been caused by rather
more mundane plumbing-induced power failures, says Norris
Hickerson, director and general manager of COL Ltd. “You
can drive yourself insane thinking about the different types
of disaster scenario,” says Hickerson. “But because
of the cost of space in Hong Kong, few people have control
over their own buildings, and they don’t know what the
people upstairs or downstairs are doing,” he adds.
Hickerson has helped companies including
international bank Rothschild engage their disaster recovery
plans and continue to operate in the face of an interruption.
In Rothschild’s case, a seawater pipe that burst in
August 2001 disrupted electricity supply to its Hong Kong
headquarters building, and the subsequent power failure affected
treasury dealing room and settlement facilities. A telephone
call to COL saw the restoration of the bank’s data and
applications on the back-up servers in COL’s Kowloon
Bay recovery center, and staff transferred quickly to the
center to work. The power outage lasted less than 24 hours
but Rothschild was able to minimize its downtime. This was
crucial. Such a financial institution could lose up to US$6.5
million an hour during downtime, according to estimates from
the US-based Disaster Recovery Institute International (DRII).
A failure at an airline reservations system could cost up
to US$90,000 an hour, a credit card sales system knockout
US$2.6 million an hour, according to DRII’s figures.
No surprise then that in Hong Kong, Steve Beason, the Hong
Kong Jockey Club’s (HKJC) executive director of Information
Technology, says disaster recovery and business continuity
planning are “no-brainers” for him and his peers
– including the CFO. The HKJC turns over US$150 million
on each race day in betting transactions, half of which are
cashless. On average, 40,000 people bet on each race. The club
is, says Beason, Hong Kong’s biggest user of IT.
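To put the DRII figures in perspective, the sketch below multiplies each quoted hourly rate by an outage length. The hourly rates are the upper-bound estimates cited above; the outage lengths and the function name are assumptions chosen purely for illustration, not figures from any of the companies mentioned.

```python
# Hourly downtime costs as quoted from DRII above (upper-bound estimates, US$).
HOURLY_DOWNTIME_COST_USD = {
    "financial_institution": 6_500_000,
    "credit_card_sales": 2_600_000,
    "airline_reservations": 90_000,
}


def downtime_cost(system: str, hours: float) -> float:
    """Upper-bound revenue at risk for an outage of the given length."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return HOURLY_DOWNTIME_COST_USD[system] * hours


# Hypothetical outage lengths, used only to show the scale of the exposure.
for hours in (0.5, 4, 24):
    cost = downtime_cost("financial_institution", hours)
    print(f"{hours} hours down: up to US${cost:,.0f}")
```

Even on these rough numbers, a few hours of downtime can cost far more than a year of recovery-site subscription fees.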
“Horse racing is not resilient as
far as having a customer wait and then reinvest,” he
says. “A horse race happens over 90 seconds. If people
miss that race they don’t bet twice as much on the next
one, and we lose that revenue. So it’s pretty easy for
us to make a cost benefit analysis for business continuity
when we’re putting in place different channels or products,
or something that affects the betting system.”
Beason’s worst nightmare –
an outage with the potential to disrupt the HKJC’s lucrative
betting systems – came home to roost twice in 1999,
once as a result of bad weather and the second time because
of a bug in the computer operating system. The first incident
saw the enormous light-board displaying the odds, which looms
over the track at Happy Valley racecourse, simply fizzle and
die. For a split second, it seemed that the HKJC’s entire
betting system had gone down. The punters certainly thought
so. But Beason knew better. Drizzle and a strong wind aided
by exhaust fans intended to cool the light-board had driven
moisture into it and shorted its circuits. “Electricity
and water don’t like each other much,” comments
Beason, wryly. It quickly became obvious with multiple light-boards
and CCTVs still operational throughout the racecourse that
it was the failure of one piece of equipment – easily
fixed – rather than the entire betting system, which
would probably have created riots among Hong Kong’s
gambling-mad populace.
The second “disaster” saw
the failure of some betting terminals (for the selling and
payout of wagers) controlled by a bug-stricken computer operating
system. “It wasn’t an entire system outage, just
an outage of that specific box,” says Beason. “We
try and spread the terminals between boxes according to how
much we want to be affected by an outage of multiple components
that share these terminals,” he adds. Unfortunately,
even if half the racecourse terminals are on one system and
half on another, having an outage in one is still going to
be seen as a major situation. But in this instance, with half
an hour between races and a 15-minute postponement of the
next race, Beason’s team was able to get the system
back up without losing a cent of revenue.
Belt And Suspenders
The HKJC’s ability to recover from
a situation is based around the service level agreements (SLAs)
that Beason’s IT group, in its role as an IT services
provider to the rest of the organization, has with the HKJC’s
business users, for example, the executive director of betting.
These include the requirements from that user for business
continuity in the event of an outage of some nature. “I
try and put into my SLAs what my recovery times will be for
different outages and what number of components would have
to go down for an outage to take place,” says Beason.
He adds that he has a “belt and
suspenders mentality” over the transaction processing
and results entry of the betting system that says no single
outage will cause service disruption. In other words, he has
to maintain 99.9 percent uptime – at a significant cost
– and replicate every “mission critical”
piece of data or application in at least two separate locations
in order to be able to run a back-up in the event of failure.
The HKJC’s two sites are at its racecourses –
Happy Valley and Shatin – and operations can run out
of either. “We can suffer single outages within each
site and remain running. If there’s a secondary outage
we have anything up to 30 seconds downtime depending on what
component fails,” says Beason. Communication lines are
on separate circuits and never co-located; even cross-harbor
connections are in different locations, so “if someone
drops an anchor in the harbor and digs up one of our lines,
we have another as a back-up in a physically different location,”
he adds.
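What a figure like 99.9 percent uptime means in practice is easier to see as a downtime budget. The sketch below assumes the target is measured over a full year of round-the-clock operation, which the club does not specify; on that assumption, 99.9 percent leaves roughly 8.76 hours of downtime a year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def downtime_budget_hours(uptime_percent: float) -> float:
    """Hours per year a service may be down while still meeting the target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100


# Compare a few common targets; only 99.9 comes from the article.
for target in (99.0, 99.9, 99.99):
    hours = downtime_budget_hours(target)
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime a year")
```

Each extra "nine" cuts the budget tenfold, which is one reason the cost of redundancy climbs so steeply.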
The data replication in the betting system
is split into two areas – the transactional replication
of data; and the replication of all the results, pay-outs,
etc. Both are backed up onto a storage area network infrastructure
based on EMC’s Symmetrix™ facility and TimeFinder™
software. Sitting between the two transactional areas is what
Beason calls a command and control box that pulls totals from
all the different processors that busily calculate the odds
coming in via a distributed system. This happens every 12
seconds and puts a phenomenal amount of pressure on the box
sitting in the middle of the processors. The “mission
critical” nature of that operation is that the odds
have to be correct all the time. When the club calculates
how much a winner is paid, this figure is sent out to all
terminals on the distributed system so that when that pay
transaction comes up, the terminals know how to pay out and
for how much money. The box at the center of the transactions
is backed up locally with another and then remotely over a
network link to Shatin. Both boxes are “hot”,
meaning that the data is constantly updated in real time.
“We have the data stored in two locations before we
finish any transactions,” says Beason. “That ensures
my ability to take them over in real time to another site
and know that my data is updated as of the minute.”
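The idea of storing data in two locations before finishing a transaction can be sketched in a few lines. This is a toy, in-memory illustration and not the HKJC's or EMC's implementation: the Site class, the commit_bet function and the choice to reject a bet when the second site is unreachable are all assumptions made for the example.

```python
class Site:
    """Stand-in for one storage location (hypothetical, in-memory only)."""

    def __init__(self, name):
        self.name = name
        self.records = []
        self.online = True

    def write(self, record):
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        self.records.append(record)


def commit_bet(record, primary, replica):
    """Confirm a transaction only once both sites hold a copy of it."""
    primary.write(record)
    try:
        replica.write(record)
    except ConnectionError:
        primary.records.pop()  # undo the first write so the sites never diverge
        raise
    return True


happy_valley = Site("Happy Valley")
shatin = Site("Shatin")
commit_bet({"ticket": 1, "stake_hkd": 100}, happy_valley, shatin)
print(happy_valley.records == shatin.records)
```

Because nothing is confirmed until the second write succeeds, either site can take over knowing it holds every completed transaction. A production system would also need a policy for carrying on when one site is down, which this sketch leaves out.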
The EMC software has another trick. It
allows Beason to take a snapshot of all critical data –
both transactional and command and control – so that
he can go back to any point in time and do a spot financial
analysis. While the system cuts the data required for analysis,
it continues processing as normal, not stopping operations
while the snapshot is taken. It’s used for end-of-day
accounting purposes, serving as a reference point for sales
for that day, and it makes money for the HKJC because, says
Beason: “We can continue operating. We tie in our back-up
and our roll-over from one day to another so that we back
up the data that is relevant to that financial day and don’t
have to stop during the snapshot.” A common shared system
from EMC’s storage area network allows data to be backed
up to the physical medium – tape – automatically.
One set of tapes is stored onsite in Happy Valley or Shatin,
and another offsite in a secret location.
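The snapshot idea, freezing a view of the data for accounting while sales carry on, can be illustrated with a simple in-memory ledger. This is only a conceptual sketch: it takes a full read-only copy, which is not a claim about how the EMC software works internally, and the Ledger class and pool names are hypothetical.

```python
from types import MappingProxyType


class Ledger:
    """Toy running total of sales per betting pool (hypothetical)."""

    def __init__(self):
        self._totals = {}

    def record_sale(self, pool, amount):
        self._totals[pool] = self._totals.get(pool, 0) + amount

    def snapshot(self):
        """Return a read-only copy frozen at the moment of the call."""
        return MappingProxyType(dict(self._totals))


ledger = Ledger()
ledger.record_sale("race_1_win", 500)
end_of_day = ledger.snapshot()          # reference point for the day's accounts
ledger.record_sale("race_1_win", 250)   # sales continue after the snapshot

print(end_of_day["race_1_win"], ledger.snapshot()["race_1_win"])
```

The accounting view stays fixed at the moment it was taken, while the live totals keep moving.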
Silvana Cheng, EMC’s Hong Kong-based
financial controller for Asia Far East, says that the TimeFinder™
software was a Y2K development, intended to help customers
forward-project their mission critical data from 31 December
1999 to 1 January 2000 to establish whether it would continue
running trouble-free. The ability to freeze the data at a
point in time while still continuing to run operations and,
therefore, back-ups, was an added bonus, as was the speed
of information retrieval. “Real time data helps in reporting
numbers. Your system is updated, always in real time, and
if you want to know how you’re doing this month, this
quarter, this year, you can access the information readily,”
she says. An uninterrupted system also means real time data
365 days a year, so it also assists forward planning because
it helps Cheng forecast much further ahead. The software does
various analyses using historic and real time data for comparison
by the month, quarter or year, and allows rolling forecasts
based on these expected revenues and expenses. And it cuts
expenses. “If I have four downtimes in one month, my
staff are sitting there idle. Real time data and application
mirroring means that there is not a lot of downtime,”
Cheng comments.
The Y2K Disaster
Look at Y2K as the pinnacle of the business
world’s romance with disaster recovery. Worldwide spending
on software applications and network storage facilities for
disaster recovery and back-up purposes grew steadily during
the 1990s to peak in mid-2000, according to September 2002
research from Gartner in the US. In the aftermath of a disaster
– for example, the 1993 World Trade Center bombing –
enterprises rushed to invest in their disaster recovery systems
and practices, spending on mainframe data centers. Spending
on mainframes as a percentage of data center budget grew from
2 percent in 1993 to peak at 4.8 percent in mid-2000. Interestingly,
the research also shows a widespread decline in spending on disaster
recovery technology since September 11, 2001 –
a decrease of 5.2 percent for mainframes, 12.2 percent for
Windows NT and 15.1 percent for Unix systems.
“Organisations’ focus was
higher, their attention was higher after September 11, but
I never saw that translated into major expenditure,”
says Phil Sargeant, research director, Servers and Storage
for Gartner in Sydney. “Companies revisited the plans
they had made but suddenly the vendors that were hyping up
disaster recovery weren’t actually seeing the revenue
flow,” he says.
Some vendors, particularly those on the
software side like Veritas, have bucked the downslide and
are still experiencing revenue growth in excess of 25 percent
year-on-year in Asia, partly as a result of being able to
partner with server or storage platform vendors like EMC.
But Sargeant says that instead of plunging wholesale into
new disaster recovery software or hardware platforms, many
CFOs and CIOs are becoming more concerned about the preservation
and integrity of their information. “I’ve seen
a lot of activity in things like replicating data and backing
up data more regularly,” he says. Hence the HKJC’s
HK$36 million spent on consolidating its storage area networks
in the past three years. But, says Sargeant, “business
continuity is more than the information. It’s about
people and communications, and I’ve yet to see huge
expenditure in those areas.”
What has changed, and what has perhaps
mitigated spending since September 11 is that organisations
have become more prudent about what processes they’re
going to provide disaster recovery and business continuity
plans for. They’re looking more carefully at the cost
of their downtime and the cost of downtime on an application
basis. In a stagnant economy it makes sense to look at the
“mission critical” applications – the betting
systems at the HKJC, for example – and provide for disaster
recovery and business continuity application by application
rather than for the organisation in its entirety. The nature
of a critical application depends on each organisation but
revenue-generating and customer-facing applications tend to
be earmarked as crucial.
“Many applications in organisations
don’t need the same sort of mechanisms in place to restore
them quickly. If they’re restored in a day, maybe several
days or a week, that’s fine,” says Sargeant, “the
company still continues. CFOs are going through a process
of understanding what is important. Once they do that they
can get a more realistic plan, and one that’s better
from a price perspective.”
Offshore Escapes
There are those, however, daring to challenge
the trend in terms of spending by setting up their own disaster
recovery facilities. Not surprisingly, they are IT vendors.
Infosys, the Bangalore-based IT services giant, recently announced
the establishment of its first disaster recovery center outside
India, on the island of Mauritius. At a cost of US$25 million
in terms of capital expenditure, you could view this move
as something of a desert island risk, a facility for 1,500
software developers lying practically vacant until the hour
of need, on a tropical island several thousand kilometers
east of Africa. Then again, you could just label it sound
planning. Mauritius has good relationships with India on a
government level. It’s close to Europe and India, with
direct flights to Bangalore taking about five hours. It has
a good technology infrastructure and telecommunications facilities
and, most importantly, its government was willing to organise
work permits for Infosys in advance. Should nuclear war break
out between India and Pakistan over border tensions involving
Kashmir, Infosys can command an armada of aircraft to airlift
its 1,500 developers and engineers to Mauritius in a matter
of hours.
“In a disaster recovery situation,
you need to have the space available,” says S Gopalakrishnan,
chief operating officer and deputy managing director of Infosys.
“That means it has to be unused till a disaster happens.
It’s empty unless there is a disaster,” he says.
He maintains that once the capex investment
is made, there is little to spend in terms of ongoing costs.
The servers used to store programs and data, the network and
connectivity are in place. The theory is that in the event
of an emergency that cuts India off, the 1,500 developers
will arrive and hook up their notebooks to the network. Voilà.
Instantaneous and effective business communication with customers.
Infosys already has eight data recovery
centers, any one of which can take over from another. But
they’re all based in India. After September 11 and the
border tensions, customers started asking how the company
could protect their data and operations. Mauritius is the
first attempt to satisfy these questions, but Gopalakrishnan
says there may be more similar recovery centers once Mauritius
is operational in January 2003.
“This whole plan is for Infosys
to serve its customers,” says Gopalakrishnan. “In
the event of a disaster, Mauritius, along with people already
based outside India, should help us keep all the customer
projects running.”
Slightly less exotic, but still an island
located outside India, Singapore is providing a haven for
Polaris Software Lab, the US$200 million market cap Indian
IT company. Polaris’ S$1.5 million (US$770,000) 150-seat
Business Continuity Center was set up in Singapore in August
2002 to provide recovery services for Polaris’ mainly
banking and insurance clients. It is linked to the company’s
five development centers in India through a dedicated private
link, used exclusively for replicating data to the server
installed at the Singapore BCC and directly accessible by
key clients. “Disaster recovery and business continuity
are imperative to our clients, given the mission critical
nature of their business,” says N Vaidyanathan, CFO
of Polaris. “With the BCC, clients will have direct
access to their data, and the assurance that the back-up data
will remain intact.”
Unlike Infosys or Polaris, India’s
Tata Consultancy Services (TCS), a division of the US$11.3
billion business conglomerate Tata Group, is using its development
centers worldwide as back-up facilities for customers’
critical business processes. In conjunction with a well-documented
handbook detailing everything a client needs to know in the
event of having to negotiate the critical steps to recovery,
the centers – 24 in India plus six in the US, and one
each in the UK, Hungary, Melbourne, Yokohama, Hangzhou in
China, and Uruguay – are production support centers
whose cost is underpinned by software development and customer
servicing onsite.
“I’m not a great believer
in creating the one Fort Knox and assuming that it is the
be-all and end-all of crisis management,” says Girija
Pande, regional director for TCS in Singapore. “We have
to give customers more flexibility and a global crisis resolution
story because some of them will require help in different
places.”
Pande, who has a banking background, says
part of TCS’ competitive advantage for its clients is
in the geographical spread of its development centers –
effectively across all time zones – and in its inclusion
of disaster recovery into already existing facilities. “Disaster
recovery is a process and a mindset, not a physical environment
only,” he says. “The right processes, documented,
are the key. People have to be aware that unless disaster
recovery and business continuity are thoroughly detailed,
quality-controlled processes, there’s no point in having
the expensive fixed assets only,” he says.
The Cost Of Recovery
Of the companies that lack contingency
plans and suffer two to five days of network downtime,
25 percent will go bankrupt immediately, 40 percent will close
their doors within two years and, of the remaining 35 percent,
virtually none will exist five years on. Vendors might try
to scare CFOs into acting on the US-based Contingency Planning
Organization statistics and the oft-quoted cliché that:
“The day after disaster recovery plans are created,
they are out of date”. But what’s worth bearing
in mind is that a disaster recovery plan’s effectiveness
in protecting an organization fades progressively with time.
CFOs should at least consider how to prevent this from happening
as well as the merits of keeping an organisation upright in
the face of disaster by investing in business continuity strategies
of some nature.
But is investing against a disaster, which
may or may not happen, a simple leap of faith or a poorly
executed back flip? The HKJC’s Beason says that for
the club’s new off-course betting network, he and his
peers looked at the ROI in terms of whether the investment
was an opportunity cost or an opportunity lost. They then
looked at the kind of pricing involved to make the network
reliable, versus having it redundant. The basic issue was
whether to have two physical telephone lines going into every
off-course betting center as opposed to one. They looked at how
many outages might occur over five years and what that additional
cost would be. “We said if it happens once a year it
will pay for itself,” says Beason. “If for five
minutes we go down at the wrong time – and it’s
happened once in the last five years – we’ll make
up all that money within that time. If it never happens and
we never have an outage we’ll have spent another HK$10
million that we didn’t need to spend. But our historic
information says that it’s likely to happen, and it
has happened before. And if it happens twice then we’re
saving money. We made the call that way. We’re forced
to make all of our decisions that way.”
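Beason's reasoning amounts to a simple break-even calculation. In the sketch below, the HK$10 million cost of the second line is his figure; the loss per outage is an assumption, set equal to that cost because he says a single badly timed outage would "make up all that money".

```python
REDUNDANCY_COST_HKD = 10_000_000          # extra spend quoted by Beason
ASSUMED_LOSS_PER_OUTAGE_HKD = 10_000_000  # assumption: one outage costs about as much


def net_saving(outages: int,
               loss_per_outage: int = ASSUMED_LOSS_PER_OUTAGE_HKD,
               redundancy_cost: int = REDUNDANCY_COST_HKD) -> int:
    """Revenue protected by the second line, minus what the line cost."""
    return outages * loss_per_outage - redundancy_cost


# Zero outages: the money was not needed. One: break-even. Two: a saving.
for outages in range(3):
    print(f"{outages} outage(s): net HK${net_saving(outages):,}")
```

Under these assumptions the pattern matches the quote: no outages means HK$10 million spent unnecessarily, one outage pays for the line, and a second puts the club ahead.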
Malcolm Harkins, director of information
security and business continuity at Intel in the US, acknowledges
that there’s a spectrum of cost trade-off, depending
on how much a CFO is prepared to pay to mitigate a
potential disaster. But he also notes that there are some
things in a business continuity sense that lend themselves
to a clean ROI. “For back-up solutions or desktops in
the office environment, for example, you can work out a strong
ROI just by looking at industry average hard-disk failure
rates, determining the cost of that, your cost of replacement,
your IT support costs, and your end-user issues. Then you
ask, ‘how much does it cost me to provide them with
back up solutions?’ That’s a very clear ROI that’s
a no-brainer if you go through those calculations,”
says Harkins.
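The calculation Harkins describes can be roughed out as follows. Every number here is a placeholder rather than Intel or industry data, and the function lumps replacement-related disruption, IT support and end-user time into a single assumed cost that a back-up avoids per failure.

```python
def backup_roi(desktops: int,
               annual_failure_rate: float,
               avoidable_cost_per_failure: float,
               backup_cost_per_desktop: float) -> float:
    """Return ROI as a ratio: (loss avoided - back-up spend) / back-up spend."""
    expected_failures = desktops * annual_failure_rate
    loss_avoided = expected_failures * avoidable_cost_per_failure
    spend = desktops * backup_cost_per_desktop
    return (loss_avoided - spend) / spend


# Hypothetical inputs: 1,000 desktops, 3 percent of disks failing a year,
# US$5,000 of avoidable cost per failure, US$50 per desktop for back-up.
roi = backup_roi(1_000, 0.03, 5_000, 50)
print(f"ROI: {roi:.0%}")
```

Swapping in a company's own failure rates and support costs turns the sketch into the kind of clear-cut case Harkins has in mind.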
His approach when ROI is less clear is
to look at the strategic value and competitive advantage inherent
in contingency planning like the reduction in customer and
employee uncertainty. “If you have the ability to operate
continuously and in a nonstop fashion, in the event of a disaster
or disruption, when your other competitors may not be able
to do that, you’ve got a great competitive advantage,”
says Harkins.
It’s also possible to quantify the
cost of downtime and the cost of disruption. For example,
take an online brokerage company with millions of transactions
a day on its website for web trading. The cost of that site
being down is enormous. There’s a clear
cost benefit trade-off to make and it’s not what you
might think, says Harkins. “Because the business is
automated and trading is on the web, business continuity is
also cheaper than if that brokerage had 500 people taking
phone calls that they then needed to replicate in another
location,” he says.
In other words, if disaster strikes, be
it a burst pipe or a terrorist attack, replicating
a call center is vastly more expensive in the long run than
replicating a server environment and some internet connections
in another location. And the cost of disaster recovery and
business continuity could dwindle further as more and more
critical business processes are transacted over the internet
and backed up on storage area networks connected via high-speed
network links to servers offshore.
Karen Winton is executive editor of eCFO
and a senior writer at CFO Asia based in Hong Kong.