Intro to SRE
Reliability is a critical feature of most software, and maintenance, rather than initial development, dominates the cost of software. Yet many development teams treat operations as an afterthought instead of integrating it into their development processes.
Error budgets and Site Reliability Engineering practices can improve the reliability, maintainability, and, yes, feature velocity of products. This talk is an introduction to the basics of bringing SRE practices into your organization: who to hire, how to organize, what projects to work on, how to measure reliability, and how to assess reliability risks.
Also presented at Code As Craft at Etsy (slides), PDX Women Talking Tech meetup, Toronto and Chicago Google Cloud Summits, and as private training for dozens of current and prospective Google Cloud Platform customers. Co-developed with Alesia Braga.
When tens or hundreds of microservices provide an application's critical functionality, diagnosing which interaction between components is causing an outage can be challenging. Engineers invest heavily in building dashboards to improve monitoring, yet still spend a lot of time figuring out what’s going on and how to fix it when they get paged. Building more dashboards isn’t the solution; using dynamic query evaluation and integrating tracing is. Learn how SREs discover and debug problems at Google during outages, and hear real stories about our experiences.
Making your team safe and inclusive doesn’t end with unconscious bias training and learning to defuse harmful interpersonal interactions. Your codebase, design documents, and technical communications are likely littered with pitfalls that prevent everyone from feeling included. Liz discusses common inclusivity anti-patterns in code and technical communication and how to avoid them.
Effective Service Level Objectives
Service level objectives and error budgets are the cornerstone of Site Reliability Engineering and a critical tool for organizations to find an appropriate balance between reliability and rates of feature development. In this talk, you will learn how to set and measure useful service level indicators and objectives for needs ranging from interactive, latency-sensitive, query-based systems to batch throughput-oriented systems. You will learn how to set high-signal-to-noise-ratio alerting based on the error budget, and how to make longer-term changes to development priorities if your budget is overspent or underspent.
Co-developed with the CRE team at Google, including Kristina Bennett, Alex Bramley, David Ferguson, and Marie Cosgrove-Davies.
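As a rough illustration of the ideas in this abstract (not material from the talk itself), the sketch below computes an availability SLI, the remaining error budget, and a burn rate from request counts, then pages only when both a short and a long window are burning budget quickly. All names here (`SloWindow`, `burn_rate`, `should_page`) and the thresholds in the usage note are hypothetical, and real systems will define "good" requests and window lengths differently.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one measurement window (hypothetical type)."""
    total_requests: int
    good_requests: int


def _check_target(slo_target: float) -> None:
    # A target of exactly 1.0 leaves no error budget, so the ratios below are undefined.
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")


def availability_sli(window: SloWindow) -> float:
    """Fraction of requests in the window that were good."""
    if window.total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return window.good_requests / window.total_requests


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly as fast as allowed.
    """
    _check_target(slo_target)
    return (1.0 - availability_sli(window)) / (1.0 - slo_target)


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Unspent fraction of the error budget; negative means overspent.

    Assumes `window` covers the whole SLO period.
    """
    return 1.0 - burn_rate(window, slo_target)


def should_page(short: SloWindow, long: SloWindow,
                slo_target: float, threshold: float) -> bool:
    """Page only if both windows exceed the burn-rate threshold.

    Requiring the long window cuts noise from brief blips; requiring the
    short window lets the alert clear soon after the problem stops.
    """
    return (burn_rate(short, slo_target) > threshold
            and burn_rate(long, slo_target) > threshold)
```

For example, with a 99.9% target, `error_budget_remaining(SloWindow(1_000_000, 999_500), 0.999)` comes out at roughly half the budget, and `should_page(short, long, 0.999, threshold=10.0)` fires only when both windows show error rates above about 1%. The threshold of 10 is an arbitrary value chosen for illustration.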
Relieving Tech Debt w/ Interrupt Reduction Projects
It's easy to plan out month-long or year-long projects, or to have an interrupts rotation for dealing with oncall/tickets, but how do you make sure you're also doing the short week-long projects that can relieve your technical debt? I'll cover a planning approach my team found that makes room for all three kinds of work and, over the long term, reduces the operational burden of the services we run.
Concepts co-developed with John Tobin and Dave O'Connor.
Managing Up and Sideways
Ever have a bad manager? Or have a project go off the rails but feel powerless to stop the trainwreck? I'll talk about why knowing a little about management can help you as an individual contributor or tech lead, and cover a few ways you can help yourself and your team without ever formally becoming a manager yourself.
Build skills through hobbies! Bring them to work!
Building technical and leadership skills doesn’t only happen in the workplace! I became a better technical leader and Site Reliability Engineer by playing games such as Puzzle Pirates, World of Warcraft, EVE Online, and Factorio. I will share what I learned from these experiences, and how both hiring managers and employees can talk about non-traditional forms of experience.