I’ll always remember the shiny gold star stickers I earned for getting 100% scores on math tests at school. The stars were proudly displayed on our family’s fridge door alongside my siblings’ achievements. Something in me still wants to earn more of those stars.
But in our technology jobs today, striving for 100% service availability and earning that metaphorical gold star isn’t, as it turns out, as good a goal as it seems. These days, the gold standard for service uptime is deliberately set short of 100%. Meeting 99.99% or 99.999%, but nothing more, earns us the gold star at the end of the day.
Recent outages at Facebook and AWS remind us of the wide-ranging impact service disruptions have on end customers. One could be forgiven for thinking outages are inevitable, since modern applications are built with modularity in mind and, as a result, increasingly rely on externally hosted services for tasks such as authentication, messaging, computing infrastructure, and many more. It’s clearly hard to guarantee everything works perfectly, all the time. Yet end users may depend on your service’s availability and may even be so unforgiving that they will not return if they feel the service is unreliable.
So why not strive for 100% reliability, if anything less can cost you business? It may seem like a worthy goal, but it has shortcomings.
Over-reliance on service uptime
One pitfall with services that deliberately set out to achieve, or even do achieve, 100% reliability is that other applications become overly reliant on the service. Let’s be real: services eventually fail, and there’s a cascading effect on applications that aren’t built with the logic to withstand failures of external services. For example, services built with a single AWS AZ (Availability Zone) in mind could end up depending on 100% availability of that AZ. As I’m writing this blog post, I have in mind the power outage in Northern Virginia that affected an AZ in the us-east-1 region and numerous global Web services. While AWS has proven to be extraordinarily reliable over time, assuming it would always be up 100% of the time proved to be unreasonable.
Maybe some of the applications that failed were built to survive the failure of an AZ, but were built within a single region. In recent memory, AWS has also suffered regional service outages. This illustrates the need to design for region-wide failures, or to have other contingency plans.
When it comes to building applications with uptime in mind, it’s responsible to assume your service uptimes will fall short of 100%. It’s up to SREs and application developers to use server uptime monitoring tools and other products to automate infrastructure, and to develop applications within the boundaries of realistic SLOs (Service Level Objectives).
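To see why assuming less than 100% is the realistic starting point, here is a minimal sketch of how availability compounds when an application needs several external services to be up at the same time. It assumes the services fail independently, which is a simplification, and the three 99.99% figures are hypothetical rather than measurements of any real provider.

```python
from math import prod


def combined_availability(dependency_availabilities):
    """Availability of an app that needs every dependency up at once.

    Assumes failures are independent, so the probabilities multiply.
    """
    return prod(dependency_availabilities)


# Hypothetical dependencies: authentication, messaging, compute.
dependencies = [0.9999, 0.9999, 0.9999]
print(f"Combined availability: {combined_availability(dependencies):.4%}")
```

Under these assumptions, three dependencies that each meet 99.99% leave the application at roughly 99.97%, so the application’s own SLO cannot reasonably be set higher than what its dependencies allow.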
System resiliency suffers
Services built on the assumption of 100% uptime from the services they depend on implicitly lack resiliency as a countermeasure to service interruptions. But a service that counts on compute infrastructure failures has the logic to fail over to other available resources while minimizing or eliminating customer-facing interruptions. That failure mitigation design could take the form of failing over to a different Availability Zone (for AWS developers) or distributing an application’s infrastructure across different Availability Zones. Regardless of the approach, the overall goal is to build resiliency into application services on the assumption that no service is 100% available.
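The failover idea can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular cloud SDK does it: the zone names and the two handler functions are hypothetical stand-ins for real calls to infrastructure in different Availability Zones, and a `ConnectionError` stands in for whatever failure a real client would raise.

```python
from typing import Callable, List, Tuple


class AllZonesFailed(Exception):
    """Raised when every configured zone fails to serve the request."""


def call_with_failover(zone_handlers: List[Tuple[str, Callable[[], str]]]) -> Tuple[str, str]:
    """Try each zone in order; return (zone_name, response) from the first success."""
    errors = []
    for zone_name, handler in zone_handlers:
        try:
            return zone_name, handler()
        except ConnectionError as exc:
            # Record the failure and move on to the next zone.
            errors.append(f"{zone_name}: {exc}")
    raise AllZonesFailed("; ".join(errors))


def unavailable_zone() -> str:
    """Hypothetical handler simulating a zone that is down."""
    raise ConnectionError("zone unreachable")


def healthy_zone() -> str:
    """Hypothetical handler simulating a zone that is serving traffic."""
    return "200 OK"


zone, response = call_with_failover([
    ("zone-a", unavailable_zone),
    ("zone-b", healthy_zone),
])
print(f"Served from {zone}: {response}")
```

The caller never sees the failure of the first zone. It only sees an error when every zone is down, and that is the case the SLO has to account for.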
Downtime table for various Service Level Agreements
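The downtime allowed by a given availability target is simple arithmetic, and a short script can reproduce it. This is a minimal sketch that assumes a 365-day year and measures downtime in minutes; the list of targets is chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999, 100.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:>7}% -> {minutes:10.2f} minutes per year")
```

A 99.99% target works out to about 52.6 minutes of downtime per year, 99.999% to a little over 5 minutes, and 100% to none at all.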
There are, of course, some systems that have achieved 100% uptime. But that’s not always good. A perfectly reliable system leads to complacency, especially among consumers of the product. It’s best for SLOs to include maintenance windows, which keep users of the service practiced at keeping their entire system running even when a reliable component suffers an outage.
Services are slow to evolve
Another shortcoming of striving for 100% uptime is that there is no opportunity for major application maintenance. An SLO with 100% uptime means there’s 0% downtime. That means zero minutes per year to perform large-scale updates like migrating to a more performant database or modernizing a front end when a complete overhaul is called for. Thus, services are constrained from easily evolving to the next best version of themselves.
Consequently, robust SLOs with enough built-in downtime provide the breathing room services need to recover from unplanned downtime and to use planned downtime for maintenance and improvements.
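One way to make that breathing room concrete is to check whether a planned maintenance window still fits once unplanned outages are counted against the downtime the SLO allows. The sketch below assumes a 365-day year, and the outage and maintenance figures are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # leap years ignored


def remaining_downtime_minutes(slo_percent: float,
                               unplanned_minutes: float,
                               planned_minutes: float) -> float:
    """Downtime still available this year; a negative result means the SLO is breached."""
    allowed = MINUTES_PER_YEAR * (1 - slo_percent / 100)
    return allowed - unplanned_minutes - planned_minutes


# Hypothetical year: 120 minutes of outages plus a 240-minute database migration.
for slo in (99.9, 100.0):
    left = remaining_downtime_minutes(slo, unplanned_minutes=120, planned_minutes=240)
    print(f"SLO {slo}%: {left:.1f} minutes remaining")
```

With a 99.9% SLO, the migration fits with about 165 minutes to spare. With a 100% SLO, the same year is 360 minutes over, because there was never any room to begin with.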
Developers building today’s applications can take advantage of many different services and bolt them together like building blocks, and they can achieve remarkable reliability. However, striving for 100% uptime is an unreasonable expectation for the application and for the other services the application counts on. It’s far more responsible to develop applications with built-in resiliency.
We’d love to hear what you think. Ask a question or leave a comment below.
And stay connected with Cisco DevNet on social!