Apple's live stream outage was easily preventable: here's how!

The live stream of Apple's announcement of the Apple Watch was marred by technical problems. Users saw messages about "could not load movie" and "you don't have permission to access".

As we read Dan Rayburn's excellent technical analysis of what went wrong, we couldn't help but think how easily preventable their problems were.

The problem was that Apple introduced a new feature that had unknown resource requirements and (oops!) they didn't have enough resources. For example, suppose a thousand website visitors require a certain number of computers (resources) to serve the website. Some websites are "heavier" and require the same work to be spread over more computers; others require fewer resources per thousand users.

For previous events the live stream page was a static page that embedded the live stream video. This made the page highly cacheable and required very few resources.

However, this event added a new interactive feature: a live scrolling display of tweets which created a live summary of what was being presented. The idea was brilliant and did (when it was working) create a richer user experience.

The problem was that this "tweet feed" went from having 0 to millions of users in a matter of minutes. That's much too fast to add new capacity if the number of resources had been underestimated.

Normally a web service starts small and grows over many months. For example, you might start a new website for trading Beanie Babies. At first you get a few dozen or hundred visitors each day. Soon your site becomes more popular and you have a constant flood of visitors. You add capacity to handle the flood and all is good. Eventually you notice you are growing at, for example, 10% per month. That gives you a good sense of how much capacity you have to add each week to handle the additional users.
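The arithmetic behind that kind of steady-state planning is simple enough to sketch. The example below is ours, not from any real site: the starting user count and the users-per-server figure are made-up assumptions, and the only number taken from the text is the 10% monthly growth.

```python
import math


def servers_needed(users: int, users_per_server: int) -> int:
    """Round up: a partially loaded server is still a whole server."""
    return math.ceil(users / users_per_server)


def project_capacity(current_users: int, monthly_growth: float,
                     months: int, users_per_server: int):
    """Compound the growth rate month by month.

    Returns a list of (month, projected_users, servers_needed) tuples.
    All inputs are hypothetical planning numbers, not measurements.
    """
    plan = []
    users = float(current_users)
    for month in range(1, months + 1):
        users *= 1 + monthly_growth
        projected = round(users)
        plan.append((month, projected,
                     servers_needed(projected, users_per_server)))
    return plan


if __name__ == "__main__":
    # Assumed: 100,000 users today, 5,000 users per server.
    for month, users, servers in project_capacity(100_000, 0.10, 3, 5_000):
        print(f"month {month}: {users} users -> {servers} servers")
```

With measured growth you can rerun a projection like this every week and order hardware ahead of need. With no history at all, every input is a guess, which is the situation the next paragraph describes.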

When growth is spread out over a long period of time, it is easy to stay ahead of the curve. However, what if on "day 1" you were going to have a million users? You would have to accurately predict how much capacity will be required with zero prior experience, only engineering estimates.

You can do some tests. You can easily simulate 1,000 users and multiply it up, but that kind of projection is rarely accurate. When dealing with growth in the hundreds of thousands you can't predict what problems will crop up.

For example, Apple may have tested their new interactive tweet feed with thousands of users, but until they had millions of users there was no way to know what their weakest link was. It could have been bandwidth, a software thread locking problem, not enough CPU, or a myriad other issues.

So how do you handle this kind of situation? Companies like Google and Facebook have developed ways to perform tests that more accurately predict the resources needed. They had no choice. Google can't announce a new feature without millions of people trying the new feature the moment it is available.

One technique is to slow down the rate of growth artificially. When Gmail was new, Google required new users to get an invitation to join. They controlled the rate at which invitations were distributed, thus controlling the rate of growth. Did a new shipment of hard disks arrive late? Delay the next batch of invites. Did a code optimization make the system more efficient? Hand out more invitations.
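To make the idea concrete, here is a minimal sketch of an invitation throttle. It is not how Google did it (we have no inside knowledge of that); the function name, the safety margin, and the batch limit are all our own assumptions. The point is only that the number of invitations is derived from capacity, never the other way around.

```python
def next_invite_batch(capacity_users: int, active_users: int,
                      safety_margin_percent: int = 20,
                      max_batch: int = 10_000) -> int:
    """Decide how many invitations to send in the next batch.

    capacity_users: users the current hardware can serve (an estimate).
    active_users: users already on the system.
    safety_margin_percent: capacity held in reserve (assumed value).
    max_batch: cap on any single batch, so growth stays gradual.
    """
    usable = capacity_users * (100 - safety_margin_percent) // 100
    headroom = usable - active_users
    return max(0, min(max_batch, headroom))


if __name__ == "__main__":
    # Disks arrived late: no headroom, so no invitations go out.
    print(next_invite_batch(capacity_users=100_000, active_users=90_000))
    # An optimization doubled effective capacity: send a full batch.
    print(next_invite_batch(capacity_users=200_000, active_users=90_000))
```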

Another way to artificially control growth is to enable the feature without announcing it. This is called a "soft launch." Word of mouth only spreads so fast. This may slow down the growth enough that it can be monitored by engineers who can fix the problems as they are discovered. In the worst case, you can turn off the feature since it hasn't been officially announced yet. Users will be disappointed but the truth is that removing the feature might just create more hype for when it does return.

Apple's situation was a little different. Literally zero to millions in just a few minutes. For that, you need to do a dark launch. As described in our new book, The Practice of Cloud System Administration:

The term dark launch was coined in 2008 when Facebook revealed the technique was used to launch Facebook Chat. The launch raised an important issue: How to go from zero to seventy million users overnight without scaling issues. An outage would be highly visible. Long before the feature was visible to users, Facebook pages were programmed to make connections to the chat servers, query for presence information and simulate message sends without a single UI element drawn on the page. This gave Facebook an opportunity to find and fix any issues ahead of time. If you were a Facebook user back then, you had no idea your web browser was sending simulated chat messages but the testing you provided was greatly appreciated.

Apple could have added some code to their homepage that would have queried the tweet stream but not displayed it. This would have enabled them to use the millions of people that visit their homepage to help them test this feature. In fact, they could have started small: enabling this hidden feature for 1% of all visitors and "turning up the dial" slowly over time to see how the tweet stream system reacted. Once it was at 100%, they'd have developed confidence in the system.
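A "dial" like that can be sketched in a few lines. This is our illustration, not Apple's or Facebook's code: the visitor ID, the `fetch_tweets` callback, and the placeholder page are hypothetical. Each visitor is hashed into a stable bucket, so raising the percentage only ever adds visitors to the test, and the hidden query is never allowed to affect the page the user sees.

```python
import hashlib

# Placeholder for the real, cacheable live stream page.
PAGE = "<html>static live stream page</html>"


def bucket(visitor_id: str) -> int:
    """Map a visitor to a stable bucket in the range 0..9999."""
    digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 10_000


def in_dark_launch(visitor_id: str, percent: float) -> bool:
    """True if this visitor falls inside the current dial setting (0-100)."""
    return bucket(visitor_id) < percent * 100


def handle_page_view(visitor_id: str, percent: float, fetch_tweets) -> str:
    """Serve the normal page; silently exercise the new backend for some."""
    if in_dark_launch(visitor_id, percent):
        try:
            fetch_tweets()  # generate real load; the result is discarded
        except Exception:
            pass  # a dark launch must never break the visible page
    return PAGE


if __name__ == "__main__":
    visitors = [f"visitor-{i}" for i in range(10_000)]
    for dial in (1, 10, 50, 100):
        hits = sum(in_dark_launch(v, dial) for v in visitors)
        print(f"dial at {dial}%: {hits} of {len(visitors)} visitors enrolled")
```

In a real deployment the hidden query would run in the visitor's browser rather than on the server, and failures would be logged for the engineers watching the dial. The bucketing logic stays the same either way.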

In the process, they would probably have found what is suspected to be a bug. The code was querying for updates 1000 times per second instead of a more reasonable ten times per second. This alone might have created 100 times the pressure on Apple's resources.
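The scale of that multiplier is easy to work out. The viewer count below is an assumption we picked for round numbers; the two polling rates are the ones quoted above.

```python
def backend_requests_per_second(viewers: int, polls_per_second: int) -> int:
    """Total load when every viewer's browser polls at the same rate."""
    return viewers * polls_per_second


VIEWERS = 1_000_000  # assumed audience size, for illustration only

buggy = backend_requests_per_second(VIEWERS, 1000)   # suspected bug
intended = backend_requests_per_second(VIEWERS, 10)  # "more reasonable" rate

if __name__ == "__main__":
    print(f"buggy:    {buggy:,} requests/second")     # 1,000,000,000
    print(f"intended: {intended:,} requests/second")  # 10,000,000
    print(f"ratio:    {buggy // intended}x")          # 100x
```

The ratio is the same whatever the audience size, which is why a dark launch at even 1% of traffic would have exposed it.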

All of this could have been done weeks before the actual event.

What makes this such a shame is that the media spent time discussing the technical problems of the launch, even though this had no real bearing on the new Apple Watch itself. This detracted from the positive press that Apple was trying to achieve.

We're not saying that if Apple's engineers had read The Practice of Cloud System Administration the entire fiasco would have been prevented. In fact, we have to assume someone at Apple knows these techniques. The problem is that the right people didn't know. This is why we wrote this book: to spread these kinds of ideas to more people.

It isn't just big companies like Apple that need to understand these techniques. Even small startups have unexpected success.

Big launches are high-stakes tests of a company's IT team. You have to get it right the first time or you'll end up like the successful restaurant Yogi Berra once described: "Nobody goes there anymore. It's too crowded."

For more information about The Practice of Cloud System Administration, please visit https://the-cloud-book.com.

Posted by Tom Limoncelli in The Practice of Cloud System Administration
