
The Facts, not the Fluff and Puff: What Really Happened with Amazon’s Cloud Outage | Gaining Altitude in the Cloud

In the more than 1,000 articles written about Amazon’s April 21 cloud outage, I found myriad “lessons,” “reasons,” and “perspectives,” along with a “black eye,” a “wake up call” and a “dawn of cloud computing.”

What I struggled to find was a simple, but factual, explanation of what actually happened at Amazon. Thankfully, Amazon did eventually post a lengthy summary of the outage, which is an amazing read for IT geeks, a group to which the author of this blog proudly belongs, but may induce temporary insanity in a more normal individual. So this blog is my attempt to describe the event in simple, non-technical terms. But if I slip once or twice into geek-speak, please show some mercy.

The Basics of Amazon’s Cloud Computing Solution

When you use Amazon to get computing resources, you’d usually start with a virtual server (aka EC2 instance). You’d also want to get some disk space (aka storage), which the company provides through Amazon Elastic Block Store (EBS). Although EBS gives you what looks like storage, in reality it’s a mirrored cluster of two nodes … oops, sorry … two different physical pieces of storage containing a copy of the same data. EBS won’t let you work using only one piece of storage (aka one node) because if it goes down you’d lose all your data. Thus in Amazon’s architecture, a lonely node without its second half gets depressed and dedicates all its effort to finding its mate. (Not unlike some humans, I might add.)

Amazon partitions its storage hardware into geographic Regions (think of them as large datacenters) and, within Regions, into Availability Zones (think of them as small portions of a datacenter, separated from each other). This is done to avoid losing the whole Region if one Availability Zone is in trouble. All this complexity is managed by a set of software services collectively called the “EBS control plane” (think of it as a Dispatcher).

There is also a network that connects all of this, and yes, you guessed right, it is also mirrored. In other words, there are two networks… primary and secondary.
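For the geeks among us, here is a minimal sketch of how these pieces look from the API side, using the modern boto3 SDK (which did not exist in 2011). The Region, Availability Zone, volume size, and instance ID are made-up placeholders, and the two-node mirroring the blog describes happens invisibly inside EBS, not in your code:

```python
# A minimal, illustrative sketch using the modern boto3 SDK (not the 2011-era API).
# Region, Availability Zone, size, and instance ID are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # a Region (the large datacenter)

# Create a 100 GiB EBS volume pinned to one Availability Zone within that Region.
# The mirroring across two physical nodes happens inside EBS and is not visible here.
volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=100, VolumeType="gp2")

# Wait until the EBS control plane (the "Dispatcher") reports the volume as available...
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# ...then attach it to a running EC2 instance (the virtual server) as a block device.
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",  # hypothetical instance ID
    Device="/dev/sdf",
)
```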

Now, the Play-by-Play

At 12:47 a.m. PDT on April 21, Amazon’s team started a significant, but largely non-eventful, upgrade to its primary network, and in doing so redirected all the traffic to another segment of the network. For some reason (not explained by Amazon), all user traffic was shifted to the secondary network. Now you may reply, “So what…that’s what it’s there for!” Well, not exactly. According to Amazon, the secondary network is a “lower capacity network, used as a back-up,” and hence it was overloaded instantly. This is an appropriate moment for you to ask “Huh?” but I’ll get back to this later.

From this point on, the EBS nodes (aka pieces of storage) lost connections to their mirrored counterparts and assumed their other halves were gone. Now remember that EBS will not let the nodes operate without their mates, so they started frantically looking for storage space to create a new copy, which Amazon eloquently dubbed the “re-mirroring storm.” This is likely when reddit.com, the New York Times and a bunch of others started noticing that something was wrong.

By 5:30 a.m. PDT, the poor EBS nodes started realizing the harsh reality of life, i.e., that their other halves were nowhere to be found. In a desperate plea for help, they started sending “WTF?” requests (relax, in IT terminology WTF means “Where is The File?”) to the Dispatcher (aka the EBS control plane). So far, the problems had only affected one Availability Zone (remember, a Zone is a partitioned piece of a large datacenter). But the overworked Dispatcher started losing its cool and was about to black out. Realizing the Dispatcher’s meltdown would affect the whole Region, causing an outage with even greater reach, at 8:20 a.m. PDT Amazon made the tough, but highly needed, decision to disconnect the affected Availability Zone from the Dispatcher. That’s when reddit.com and others in that Availability Zone lost their websites for the next three days.

Unfortunately, although Amazon sacrificed the crippled Availability Zone to allow other Zones to operate, customers in other Zones started having problems too. Amazon described them as “elevated error rates,” which brings us to our second “Huh?” moment.

By 12:04 p.m. PDT, Amazon got the situation under control and completely localized the problem to one Availability Zone. The team started working on recovery, which proved to be quite challenging in a real production environment. In essence, these IT heroes were changing tires on a moving vehicle, replacing bulbs in a live lamp, filling a cavity in a chewing mouth… and my one word for them here is “Respect!”

At this point, as the IT folks started painstakingly finding mates for lonely EBS nodes, two problems arose. First, even when a node was reunited with its second half (or a new one was provided), it would not start operating without sending a happy message to the Dispatcher. Unfortunately, the Dispatcher couldn’t answer the message because it had been taken offline to avoid a Region-wide meltdown. Second, the team ran out of physical storage capacity for new “second halves.” Our third “Huh?”

Finally, at 6:15 p.m. PDT on April 23, all but 2.2 percent of the most unfortunate nodes were up and running. The system was fully operational on April 24 at 3:00 p.m. PDT, but 0.07 percent of all nodes were lost irreparably. Interestingly, this translated into the loss of 0.4 percent of all database instances in the Zone. Why the difference? Because when a relational database resides on more than one node, losing any one of them may destroy data integrity for the whole instance. Think of it this way: if you lose every second page in a book, you can’t read the entire thing, right?
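To see how 0.07 percent of nodes can balloon into 0.4 percent of databases, here is a back-of-envelope calculation. The number of nodes assumed to sit behind a single database instance is my own illustrative guess, not a figure from Amazon’s post:

```python
# Back-of-envelope: if a database instance spans several EBS nodes, losing *any one*
# of them can corrupt the whole instance. The node count per instance is an assumed
# figure for illustration, not something Amazon reported.
p_node_lost = 0.0007            # 0.07% of nodes lost irreparably
nodes_per_db_instance = 6       # assumption: one database spread across six nodes

# Probability that at least one of the instance's nodes was among the lost ones.
p_instance_lost = 1 - (1 - p_node_lost) ** nodes_per_db_instance
print(f"{p_instance_lost:.2%}")  # ~0.42%, in the ballpark of the reported 0.4%
```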

Now let’s take a quick look at our three “Huh?s”:

  1. Relying on SLAs alone (which in Amazon’s case is 99.95 percent uptime) can be a risky strategy for critical applications. But as with any technological component of a global services solution, you must understand the supplier’s policies.
  2. Cloud platforms are complex, and outages will happen. One thing we can all be certain of in IT: if it cannot happen, it will happen anyway.
  3. When you give away your architectural control, you, well, lose your architectural control. Unfortunately, Amazon did not have the needed storage capacity at the time it was required. But similar to my first point above, all technological components of a global services solution have upsides and downsides, and all buyer organizations must make their own determinations on what they can live with, and what they can’t.

An interesting, final point: after being down for almost three days, reddit.com stated, “We will continue to use Amazon’s other services as we have been. They have some work to do on the EBS product, and they are aware of that and working on it.” Now, folks, what do you think?

Expect Changes in the IT Security Landscape | Sherpas in Blue Shirts

The worldwide IT security market is already quite sizeable, exceeding US$25 billion. And all industry analysts are predicting 20-30 percent growth in the next three years. Multiple drivers will fuel this growth, including the increasing complexity of IT solutions – which makes them ever harder to secure – and the much higher value assigned to proprietary information.

Yet, I believe the structural nature of demand will drive quite an important shift in customer buying preferences going forward. As large enterprise clients recover from the global economic crisis of 2008, they are increasing their emphasis on costs. And despite increased willingness to pay, IT security cost is not immune to this pressure. In order to avoid separate management costs associated with standalone IT security service agreements, enterprises prefer to bundle IT security support with either large IT outsourcing deals or existing telecommunications contracts, as the network is still perceived as the most security-exposed element of IT delivery. Moreover, large corporate clients prefer to deal with a single point of responsibility for actual IT delivery and corresponding security support, which eliminates any risk of finger pointing, and streamlines their governance activities.

So what are the implications of buyer preferences for the existing provider landscape? I believe they will be game-changers primarily for niche IT security service providers and traditional security software vendors. Under the threat of missing their portion of anticipated incremental demand, they will be actively seeking alternative distribution channels and experimenting with different forms of industry cooperation. Everest Group also expects to see increased M&A activity in the IT security industry as large, integrated IT suppliers will be seeking ways to further enhance their capabilities in efforts to capitalize on the rapid growth of this market.

So let’s check back in a year on the state of the IT security industry; there is no doubt it will look different from what we see now.

Size Does Matter – The Real Pecking Order of Indian IT Service Providers | Sherpas in Blue Shirts

Earlier today, Cognizant reported its financial results for the first quarter of 2011, bringing to an end the earnings season for the Big-5 Indian IT providers – affectionately referred to as WITCH (Wipro, Infosys, TCS, Cognizant, and HCL). Cognizant’s results were yet again distinctive: US$1.37 billion in revenues in 1Q11, which represents QoQ growth of 4.6 percent and YoY growth of 42.9 percent. The latest financial results reaffirmed Cognizant’s growth leadership compared to its peers and are a testament to its superb client engagement model.

Q1 2011 financial highlights for WITCH:

[Exhibit: WITCH Q1 2011 financial highlights]

In a recent blog post, my colleague Vikash Jain commented on the changes in the IT services leaderboard, and especially the questions and speculation on the relative positions of Wipro and Cognizant in the Indian IT services landscape. Cognizant’s 1Q11 revenues are now just US$29 million below Wipro’s IT services revenues, and based on current momentum, Cognizant could overtake Wipro as early as 2Q11, making it the third largest Indian IT major in quarterly revenue terms. The guidance provided by the two companies for the next quarter – Cognizant (US$1.45 billion) and Wipro (US$1.39-1.42 billion) – provides further credence to the projected timelines.

How important is this upcoming change in the relatively static rank order of the Indian IT industry (the last change happened in January 2009, following the Satyam scandal)? Not very, in our opinion. When it happens, the event will indeed create news headlines and the occasional blog entry, but the change in rankings does not imply a meaningful change to the overall IT landscape. Further, other than providing Wipro with even more conviction to make the changes required to recapture a faster growth trajectory, the new rank order does not suggest any changes in the delivery capabilities of either of these organizations.

As we advise our clients on selecting service providers, we believe that it is more important to understand the service provider’s depth of capability and experience in the buyer organization’s specific vertical industry. While total revenues and financial stability are important enterprise-level criteria, performance in the vertical industry bears greater relevance and significance as buyers evaluate service providers. In our 1Q11 Market Vista report, we examine the CY 2010 revenues of the WITCH group to determine the pecking order in three of the largest verticals from a global sourcing adoption perspective – banking, financial services and insurance (BFSI); healthcare and life sciences; and energy and utilities (E&U).

While we recognize there are differences in the way these providers segment results, for simplicity we are relying on their reported segmentation (which we believe does not meaningfully alter the results). The exhibit below summarizes the results of our assessment:

Industry leaderboard for WITCH:

[Exhibit: WITCH industry vertical leaderboard]

Our five key takeaways:

  1. The ranking of WITCH based on enterprise revenues has limited correlation to industry vertical rankings. The leader in each of the three examined industries is different.
  2. In BFSI, while TCS is the clear leader, Cognizant is rapidly closing in on Infosys for the second spot. (Note: Wipro is already #4 in this vertical).
  3. In Healthcare and Life Sciences, Cognizant emerges as the clear leader with 2010 revenues greater than those of Wipro, TCS, and HCL combined. (Note: Infosys does not report segment revenues for Healthcare).
  4. In E&U, Wipro leads the pack and is expected to widen the gap through its acquisition of SAIC’s oil and gas business. TCS achieved the highest growth in 2010 to move to third position ahead of HCL (TCS was #4 in 2009) and narrow the gap with Infosys (Note: Cognizant does not report E&U revenues).
  5. Finally, the above ranks are going to change quickly. Based on the results announced for the first calendar quarter of 2011 alone, we anticipate a change in the second position for each of the three examined verticals:
    • Cognizant’s Q1 BFSI revenue of US$570 million is nearly identical to that of Infosys’ US$572 million
    • TCS’ Q1 Healthcare and Life Sciences revenue of US$119 million is higher than Wipro’s US$111 million (which also includes services)
    • TCS reported Q1 E&U revenues of US$103 million, versus Infosys’ US$93 million

While it will be interesting to see the impact on a full year basis, the above changes in momentum already indicate further changes in the industry leaderboard before the end of the year.

On an unrelated note, by the time we revisit the Wipro versus Cognizant debate when the Indian majors announce their Q2 results starting mid-July, WITCH will assume an additional meaning – the last installment of the Harry Potter movies is due for release on July 15, 2011!

The Smart Metering Wave and Its Impact on Utilities’ Meter-to-Cash Process | Sherpas in Blue Shirts

Meter-to-Cash (M2C) is a significant process for utility companies as it not only represents their revenue cycle but also touches the end customer directly. Essentially M2C is the utility industry’s version of the generic Order-to-Cash (O2C) process. These are times of change for the utility industry due to a variety of reasons, the advent of disruptive technologies such as smart metering being one of them.

While the future of smart metering is still being debated, many facts suggest that it’s no longer an “if” but a “when.” For example, the United States in November 2009 directed US$3.4 billion in federal economic stimulus funding to smart grid development. The European Union in September 2009 enacted a “Third Energy Package,” which aims to see every European electricity meter smart by 2022. A recent study by ABI Research projects the global deployment of smart meters to grow at a CAGR of nearly 25 percent from 2009 to 2014. A combination of factors – including regulatory push, intense competition (particularly in deregulated markets), and the business benefits that smart meters offer to utilities’ operations in terms of tightened revenue cycles and increased customer satisfaction – is driving the adoption of smart meters.

However, deploying a smart metering infrastructure is no small task for a utility, as it brings many fundamental changes to M2C operations. Managing the cutover from traditional to smart meters, dealing with new network technologies, the diminished role of field services, and the upgrades required to meter data management systems (MDMS) are just some of the key challenges that utilities undergoing smart metering implementation need to overcome. But the even bigger challenge arrives after implementation – how does a utility manage the massive explosion in meter data in the “smart” world? Instead of meter reads every month or two, we are now talking about hourly reads streaming back from thousands of smart meters to the utility’s MDMS. Even more importantly, how should a utility leverage that data for meaningful business intelligence purposes?
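To put the data volumes in perspective, here is a small, purely illustrative sketch (hypothetical meters and randomly generated reads, not any utility’s actual MDMS schema) that rolls hourly interval reads from a thousand meters up into the monthly consumption figures a traditional billing cycle expects:

```python
# Illustrative only: hypothetical meter IDs and randomly generated hourly reads,
# aggregated into monthly billing totals. Not any real utility's MDMS schema.
import numpy as np
import pandas as pd

hours = pd.date_range("2011-01-01", "2011-03-31 23:00", freq="H")   # ~2,160 reads/meter
meters = [f"meter_{i:04d}" for i in range(1000)]                     # a thousand meters

reads = pd.DataFrame({
    "meter_id": np.repeat(meters, len(hours)),
    "timestamp": np.tile(hours, len(meters)),
    "kwh": np.random.gamma(shape=2.0, scale=0.6, size=len(meters) * len(hours)),
})

# Monthly consumption per meter: the traditional "one read every month or two"
# becomes a simple aggregation over millions of hourly interval reads.
monthly = (
    reads.set_index("timestamp")
         .groupby("meter_id")["kwh"]
         .resample("M")
         .sum()
)
print(monthly.head())
```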

Many utilities have found the answer in outsourcing some of the M2C functions to external service providers that have jumped on the smart metering wave with services ranging from pre-implementation advisory to post-implementation services such as smart analytics offerings. For example, Capgemini’s smart energy services offering focuses on the requirements of utilities undergoing smart metering implementation. It recently launched a new smart metering management platform – labeled the Smart Energy Services Platform – for utilities to support all the end-to-end business processes necessary for the deployment and ongoing operation of a smart meter estate.

If you’re an M2C BPO provider that hasn’t yet considered including smart metering services in your portfolio, are you still debating the future of smart metering?

Learn more about M2C BPO at Everest Group’s May 10 webinar.

Will the Sun Come out Tomorrow? | Gaining Altitude in the Cloud

Cloud computing promises increased flexibility, faster time to market, and drastic reduction of costs by better utilizing assets and improving operational efficiency. The cloud further promises to create an environment that is fully redundant, readily available, and very secure. Who isn’t talking about and wanting the promises of the cloud?

Today, however, Amazon’s cloud suffered significant degradation in its Virginia data center, following an almost flawless record of more than a year. Yes, the rain started pouring out of Amazon’s cloud at about 1:40 a.m. PT, when it began experiencing elevated latency and error rates in the east coast U.S. region.

The first status message about the problem stated:

1:41 AM PT We are currently investigating latency and error rates with EBS volumes and connectivity issues reaching EC2 instances in the US-EAST-1 region.

Seven hours later, as Amazon continued to feverishly work on correcting the problem, its update said:

8:54 AM PDT We’d like to provide additional color on what we’re working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

No! Say it’s not so! A cloud outage? The reality is that cloud computing remains the greatest disruptive force we’ve seen in the business world since the proliferation of the Internet. What cloud computing will do to legacy environments is similar to what GPS systems did to mapmakers. And when was the last time you picked up a map?

In the future, businesses won’t even consider hosting their own IT environments. It will be an automatic decision to go to the cloud.

So why is Amazon’s outage news?

Only because it affected the 800-pound gorilla. Amazon currently has about 50 percent of the cloud market, and its competitors can only dream of this market share. When fellow cloud provider Coghead failed in 2009, did anyone know? We certainly didn’t. But when Amazon hiccups, everybody knows it.

Yes, the outage did affect a number of businesses. But businesses experience outages, disruptions, and degradation of service every day, regardless of whether the IT environment is legacy or next generation, outsourced or insourced. In response, these businesses scramble, putting in place panicked recovery plans and having their IT folks work around the clock to get things fixed… but rarely do these service blips make the news. So with the spotlight squarely shining on it because of its position in the marketplace, Amazon is scrambling, panicking, and working to get the problem fixed. And it will. Probably long before its clients would or could in their own environments.

Yes, it rained today, but really, it was just a little sprinkle. We believe the future for the cloud is so bright, we all need to be wearing shades.

“The Gambler” and Developing Win-Win Contractual Relationships in the Healthcare Industry | Sherpas in Blue Shirts

As a country music fan, several lines in Kenny Rogers’ hit song “The Gambler” tend to make me think about outsourcing relationships, especially in the healthcare industry, as that’s where I’ve spent the bulk of my career.

“If you’re gonna play the game, boy, ya gotta learn to play it right”

In the 1995 to 2004 timeframe there was a proliferation of outsourcing among healthcare provider and health plan companies. The outsourcing advisory community began to cater to large and complex Integrated Delivery Networks (IDNs) and Academic Medical Centers to ensure they received the same type of world-class outsourcing services that Fortune-rated companies in other industries had already been receiving. As a result, a large number of ITO, APO and BPO contracts were inked, and healthcare provider and health plan organizations came to depend on third-party service provision to more efficiently manage their middle-office business services and IT needs.

Unfortunately, the healthcare firms got caught in the same conundrum as do all organizations that enter into long-term service delivery contracts. Once SLAs and joint governance models are agreed upon, service providers have little incentive to do anything but satisfy the contractual commitment in the most cost-effective manner. They also rarely get any clear understanding of what more the client may want. On the flip side, as technologies, markets, competitive drivers and growth objectives dynamically evolve over time, service recipients require, and expect, additional value from their providers to meet their continually changing business needs. But they rarely articulate what more they want from their service providers.

This disconnect ultimately leads to a mutual loss, as the original metrics of the contract are quickly outdated, value cannot be measured or realized, the incentives for both parties are misaligned, and a tension-riddled relationship develops.

“You have to know when to walk away and know when to run” (or maybe not)

A case in point: In 2005, a major healthcare provider contracted with a global provider of healthcare technology infrastructure and application sourcing services to support critical Electronic Medical Record, Operating Room, Scheduling and Billing services, all of which are essential for providing patient care and revenue functions. SLAs and a governance structure were negotiated, resulting in a complex, 10-year relationship with delivery defined in the traditional structure. However, as the contract didn’t allow for dynamic and ever-changing needs dictated by the marketplace, enhanced technologies and changes in regulatory compliance requirements, the business value proposition was lost and the relationship was ultimately dissolved. This could have been avoided simply by creating a flexible contracting mechanism that the service provider and service recipient could continually update to meet necessary changes. Yet, when a relationship gets to this point, many buyers believe they must rebid the contract, change providers or bring the services back in-house.

“You got to know when to hold ’em”

But there is another solution that can result in a win-win situation for both parties. We think of it as service effectiveness, which is a big step up the value chain from traditional service efficiency models. Rather than focusing on things such as unit prices, process output, service levels and delivery risk, service effectiveness addresses those things that are the real priorities for service recipients – business value, process impact and receiving what they truly require, not just what is specified in the contract.

To come to mutual understanding on what service effectiveness means in a given outsourcing engagement – and change the way third-party services are perceived and accepted in the buyer organization – all delivery and recipient stakeholders should provide assessable input on three different dimensions: 

  • Objectives, which include cost savings, improved service quality, focus on core/strategic issues, currency of technology, capital expenditure avoidance, expertise/skills/innovation, and time to delivery.
  • Priorities, which include building trust and confidence, service quality, ease of communication, focus on business objectives, end user satisfaction, win-win collaboration orientation, and strategic involvement.
  • Performance, which includes end user satisfaction, service quality, price competitiveness, relationship effectiveness, and relationship value.

By coming to an agreement on what service effectiveness means to both the provider and the recipient, there’s an immediate return on investment to the bottom line created by keeping current models and relationships in place. This, in turn, avoids organizational upheaval and transitional costs, and creates a mutually beneficial business arrangement via an efficient set of services that provide flexibility, measureable value and terms that ensure success.
