We had a great time at DevGuild: Incident Response and learned a lot from some amazing speakers who shared their experiences from Spotify, Zendesk, Salesforce, Honeycomb, Snyk, and more. I wanted to recap the experience and share some takeaways from the conference.
And don't just take my word for it: you can watch the replay of the conference here before November 30th and here after November 30th.
Day One
The first day focused on observability and started off with my talk, Modern Monitoring & Detection. The conference then hosted a panel discussion on Observability & Early Warning Detection with Liz Fong-Jones, Linda Ypulong, Jesse Robbins, and Metrist Co-Founder Ryan Duffield.
- Incident response is a somewhat controversial subject. How to respond to incidents, how to prepare for them, and even how to define the terms involved are all contested. Since incident response is an emerging (and increasingly critical) field, its specifics and best practices are still being hashed out, which is one reason the conference was held in the first place.
- Cloud dependencies also have their own dependencies, making incident response even more complex. Cloud dependencies make a unique contribution to incidents: they are essentially outages you can't control, and they make up 70% of outages. On top of that, those cloud dependencies have cloud dependencies of their own, which compounds the complexity of incident response.
- Stop making observability about pillars. Discussions of observability often fixate on a specific number of pillars, especially when observability is defined in a particular way: is it three? Four? Ten? While those pillars and the data they provide are important, the count matters far less than the data itself and what you do with it to address incidents.
- Cloud dependency health is rarely as simple as "up" and "down." Assessing a binary up-or-down status isn't enough. That's why cardinality matters, along with personalized data about how your app actually interacts with your vendor.
- You cannot operate a 4 nines service on human intervention alone. Today, service operation and incident response are human-driven efforts. But given how complex our systems have become, purely manual maintenance simply isn't possible, so integrating AI and other forms of automation is critical to achieving the best reliability possible.
- Use a "How We Got Here" for post-incident review. Linda pointed out that people should review not only the factors involved in an incident but also the factors that led up to it, including the human systems and interventions. That way, we can work not only to resolve incidents better but also to prevent them.
- MTTD is the one and only thing that is always in our control! It pays to invest in reducing Mean Time to Detect (MTTD) because it is the one incident response metric we can directly control, and it has a direct impact on Mean Time to Resolve (MTTR). (There's a quick illustration after this list.)
- Include cloud dependencies in your directory. Apps are built on apps, with third-party cloud services as a top source of downtime, so it is critical to understand our dependencies on other software in order to highlight risks and opportunities.
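To make the MTTD/MTTR relationship concrete, here's a minimal sketch in Python. It's my own illustration, not something shown at the conference, and the incident records, timestamps, and field names are assumptions for demonstration only.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: when the issue started, when we detected it,
# and when it was resolved. All values are made up for illustration.
incidents = [
    {
        "started": datetime(2022, 11, 1, 9, 0),
        "detected": datetime(2022, 11, 1, 9, 25),
        "resolved": datetime(2022, 11, 1, 10, 40),
    },
    {
        "started": datetime(2022, 11, 8, 14, 0),
        "detected": datetime(2022, 11, 8, 14, 5),
        "resolved": datetime(2022, 11, 8, 14, 50),
    },
]

def mean_delta(pairs):
    """Average the (end - start) gap across a collection of timestamp pairs."""
    pairs = list(pairs)
    total = sum((end - start for start, end in pairs), timedelta())
    return total / len(pairs)

# MTTD: how long incidents go undetected; MTTR: how long until they're resolved.
mttd = mean_delta((i["started"], i["detected"]) for i in incidents)
mttr = mean_delta((i["started"], i["resolved"]) for i in incidents)

print("MTTD:", mttd)  # 0:15:00 for this sample data
print("MTTR:", mttr)  # 1:15:00 for this sample data
```

Since an incident can't be resolved before it's detected, MTTD is effectively a floor on MTTR, so every minute shaved off detection comes straight off resolution time as well.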
Day Two
The next day focused on incident response. Nora Jones started the day off with “Your Incident Response Playbook.” A panel discussion on incident response with Nora, Erin McKeown, Dylan Bochman, and Sarah Butt followed.
- Kill the term "root cause"; incidents are rarely caused by one thing. Nora Jones, founder & CEO of Jeli, made a great point that incidents are usually the result of multifaceted issues, so labeling a single "root cause" can be simplistic and unhelpful.
- Your incident review report isn't the end. Nora also pointed out that good incident reviews are part of good incident response. Each review becomes another data point that can drive chaos events and meta-analysis across incidents. Along those lines: the context behind metrics is almost as important as the metrics themselves.
- A lot of incident response data can be mined from Slack channels. Since understanding the context and human element of incidents is important, a lot of valuable data can be gleaned from Slack. From code to communication before, during, and after the event, the messaging platform captures a great deal of useful information.
- Use Howie for post-incident review: Similar to what Linda said on day one, instead of the five whys or a root cause analysis, use "How We Got Here" as your guide to incident review. Or, as Jeli calls their free guide, Howie.
- COVID has changed incident response. We aren't in the same room as much anymore, and we may be working with people we've never met in person. With workplaces becoming more remote and incidents becoming more complex, people who have never shared a room, or perhaps never even worked together, may need to address a common incident. That makes empathy and relationship building with our colleagues more important than ever.
Day Three
The final day of DevGuild: Incident Response was all about the future. Mandi Walls started off with her talk, "Minimizing Human Intervention in Incident Response," and the sessions concluded with "A Fireside with Guy Podjarny" hosted by Jesse Robbins.
- The future of incident response is automation. As mentioned earlier, humans can only do so much, especially given our ever-more-complex systems, so it's important, and increasingly necessary, to automate as much incident response as possible. We're probably still a long way from true AIOps, but we can take advantage of our data to match like with like and automate most of the rote tasks.
- Seven-day on-call rotations are a recipe for burnout. Although these rotations are common, they should become a thing of the past. Tired, burnt-out people are not at their best in complex, stressful situations, so it's not a good idea to put them on schedules that are likely to leave them tired and burnt out.
- We get too hung up on names like DevOps and DevSecOps. Definitions and delineations are good, but sometimes they cause confusion. For example, DevOps does not preclude security, and DevSecOps isn't different from DevOps; it's a more inclusive way to talk about the DevOps movement.
- Compliance does not equal security. Being compliant with certain security standards does not mean your system is secure. Don't let compliance make you less vigilant or too trusting of vendors claiming compliance.
- Unlike Operations, Security is tough to measure. Measuring security by the number of breaches is like measuring ops only by the amount of downtime. Availability and MTTR are clear metrics for operations, but we simply don't have equivalents for security. As a result, it's harder to set goals and measure impact.
After DevGuild: Incident Response
I learned a lot, was encouraged, and was inspired by this year’s DevGuild: Incident Response. And I’m glad to be part of the solution for creating more reliable systems. (And a more reliable world!)
If you’d like to watch any of the sessions, you can see them here, including my talk and Ryan’s panel.
You can also check out Learning from Incidents from Jeli and our weekly What Went Down newsletter.