Site Reliability Engineering (SRE) is a hot topic and job title, but what exactly is SRE? Are we just putting a new name on an old concept? Did we actually knock down the wall between Dev and Ops, or are we still tasking a single team with making the rest of the company look good?
Google invented the role, so let’s take a look at their definition:
“[A team that] is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).” Google – Site Reliability Engineering
That definition says both a lot… and very little. Plus, some SREs don’t align with this standard. So with the variety of ways SREs function, one could ask, “Is SRE just Ops with a new name?”
Although I’d like to share a crisp, universally accepted definition of Ops, no such definition exists (despite Ops being around for a lot longer than SRE!). So you are left with my definition:
Ops is a team that is responsible for making sure that the software written by their colleagues is available to customers, through the management of servers, networking, and incident response.
As you can see, my definition of Ops sounds a bit like Google’s definition of SRE, so I wanted to get to the bottom of the difference and form a conclusion. To do this, I reached out to four experts to see what they thought about the question and hear how they define SRE.
Opinion 1: No, SRE is not Ops with a Different Name
First, I reached out to Arup Chakrabarti, knowing that he led engineering teams at Amazon.com, Netflix, and PagerDuty. Arup didn’t mince words when I asked “Is SRE just Ops with a new name?”:
“My initial knee-jerk reaction is, ‘F*ck that, it’s disrespectful to many operators out there and belittles the under-recognized work that they do every day to keep the lights on.’
“But my more reasonable and nuanced opinion would be, ‘SRE is an evolution of how to operate software. Just like how we don’t use punch cards anymore to develop applications, we can no longer SSH into individual machines to operate the software.’”
“An evolution of how to operate software.” I like that, but isn’t it still Ops? Arup firmly rejected the premise of the question, but there was still more to learn.
Opinion 2: Yes, SRE is Ops with a Different Name
Next, I talked with Paige Cruz, who started her engineering career in application development at New Relic before serving in SRE roles at InVision, Lightstep, and Weedmaps. Here is what she had to say when I asked, “Is SRE Ops with a different name?”:
“Two years ago I would’ve defended the SRE role as emerging and interdisciplinary. But today, I agree with that statement.
In a world where “DevOps” has become a job title, we’ve reinvented the wall between Dev and Ops.
We need a more nuanced discussion, however, because everyone has their own flavor and twist on the role of SRE – from company to company and even SRE to SRE.”
I have to admit, it does seem that at some companies, application developers throw software (and problems) over the wall for SREs to deal with.
Opinion 3: Not Sure Yet
Dave Bresci, Sr Manager of SRE at PagerDuty, is also contemplating the differences (or lack thereof):
“It’s a fair question to ask. SRE is a hot name to call things, but is it really SRE in the way Google defined it?
Ops gets labeled as running servers, networking, and core infrastructure services. SRE is going beyond, with infrastructure as code, developer tooling & enablement, and owning reliability for the entire org, not just servers.
Ops has been unfairly turned into a bad word for some reason, but the ops function will never go away. You still have to operate services, but now that responsibility is distributed to more people in the company. The last `S` in SaaS is for “service” and that means operating software for someone else. At a SaaS company, everyone is responsible for operating reliably.”
He is right: we are all responsible for reliability. We don’t (or at least shouldn’t) throw it over the wall to SREs, and we are lucky to have SREs who make delivering reliable software easier and more efficient.
So maybe my question presents a false dichotomy?
Opinion 4: Who Cares?
I can always count on Adam Panzer, a senior software engineer formerly of Zendesk, Optimizely, and PagerDuty, for a deep thought masquerading as a hot take. He helped me understand that names are just names; it’s how you practice your discipline that matters:
“It’s like saying Jackson Pollock and Van Gogh were both “just painters” and that Abstract Expressionism and Dutch Post-Impressionism were just the same things but with different labels. We often use labels to denote changes in thinking or practice.
Operations Engineers were originally racking and stacking physical servers, installing applications and operating systems, and more. Today, the people practicing DevOps may not do any of that.
Who cares what it’s called? The name isn’t important, it’s the outcome.”
Like I said, deep thoughts disguised as hot takes.
While I instigated the debate, I also must admit that I think it is in some ways a “silly question.” The software industry is constantly changing, and the discipline of operating software is constantly changing, too.
Today, delivering reliable software takes more than a single team. It also takes a completely different skill set and mindset than it did in the past (especially if we aspire to the Google standards of SRE).
SRE is a subset of DevOps and one way of approaching reliability. How it works in practice differs from one organization to the next. So Ops and SRE can have similarities, and they can have differences. In some cases, we are rebuilding walls between development and operations, but that’s not what SRE is, and when it happens, it’s not because of SRE.
There isn’t a “yes,” “no,” or even “maybe” answer to this question, but it is a useful question to consider because it helps us think about the way we use our terminology. The fields of SRE and DevOps are still growing, and it’s important that we all stay on the same page.
We hope you enjoyed this exploration of the state of SRE, Ops, and DevOps. And we hope that you got some insight into the nuances of terminology and functionality that lie within this evolving, exciting conversation.