No, you don’t have to run like Google
Years ago, Google struggled with how to pitch its cloud offerings. Back in 2017 I suggested that the company should help mainstream enterprises “run like Google,” but a senior Google Cloud product executive told me in conversation that the company shied away from this approach.
The concern? That maybe mainstream enterprises didn’t share Google’s needs, or maybe Google would simply intimidate them.
For the mere mortals that run IT within such mainstream enterprises (read: almost everyone), fear not. It turns out there are many things that Google might do that make no sense for your own IT needs.
Just ask Colm MacCárthaigh, AWS engineer and one of the authors of the Apache HTTP Server, who asked for “examples of technical things that don’t make sense for everyone just because Amazon, Google, Microsoft, Facebook” do them. The answers—excessive uptime guarantees, site reliability engineering, microservices, and mono-repos among the highlights—are instructive.
Excessive uptime guarantees
“Five or five-plus nines availability guarantees,” says Pete Ehlke. “Outside of medicine and 911 call centres, I can’t think of anything shy of FAANG [Facebook, Amazon, Apple, Netflix, and Google] scale that actually needs five nines, and the ROI pretty much never works out.”
I remember this one well from the variety of start-ups for which I worked, as well as when I was at Adobe (whose service-level commitments tend not to be five nines, but are arguably higher than necessary). Are you going to be OK if the multi-player game goes down? Yep. What about Office 365 for a few minutes, or even hours? Yes and yes.
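To put numbers on what each extra nine buys, here is a minimal back-of-the-envelope sketch. It assumes a 365-day year and treats any outage as counting against the budget; the function name is purely illustrative and is not taken from any vendor's SLA tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an N-nines availability target."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    # Print the yearly budget for two through five nines.
    for nines in range(2, 6):
        availability = 100 * (1 - 10 ** -nines)
        budget = allowed_downtime_minutes(nines)
        print(f"{nines} nines ({availability:.3f}%): {budget:,.1f} minutes/year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime for the entire year, while three nines allows nearly nine hours. Paying to close that gap only makes sense if the business actually loses more than the fix costs.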
Site reliability engineering
A bit of a spin on devops (though it predates the devops movement), SRE (named in multiple replies to MacCárthaigh) came out of Google in 2003, and was designed to infuse engineering with an operational focus. A few core principles guide SRE:
- Embrace risk
- Utilize service level objectives (SLOs)
- Eliminate toil
- Monitor distributed systems
- Leverage automation and embrace simplicity
Or, as Ben Treynor, who developed Google’s SRE practice, describes it:
SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labour. In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
SREs spend much of their time on automation, with the ultimate goal being to automate away their job. They spend considerable time on “operations/on-call duties and developing systems and software that help increase site reliability and performance,” says Silvia Pressard.
This sounds important, and even more so if you equate “site reliability” with “business availability.” But do most companies really need their developers to become operational experts? SRE might be critical at Google or Amazon, but it’s arguably a heavy lift for most enterprises, tasking developers with too much of an operational load for them to manage it successfully.
Microservices
As commentator “Buzzy” tells it, “Definitely microservices. The number of 20-staff-in-total companies I’ve had to talk down from that ledge….” Nor is he the only one to call out microservices as a needless complication for most enterprises. Many of the replies to MacCárthaigh’s tweet mentioned microservices.