I make developers, operators, and workers as a whole more productive and empowered.
Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 15+ years of experience. She is an advocate at Honeycomb.io for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
She lives in Brooklyn with her wife, metamours, and a Samoyed/Golden Retriever mix, and in San Francisco and Seattle with her other partners. She plays classical piano, leads an EVE Online alliance, and advocates for transgender rights as a board member of the National Center for Transgender Equality.
Intro to SRE
Reliability is a critical feature of most software, and maintenance rather than initial development dominates the cost of software. Yet many development teams treat operations as an afterthought instead of integrating operations into their development processes.
Error budgets and Site Reliability Engineering practices can improve the reliability, maintainability, and, yes, feature velocity, of products. This talk is an introduction to the basics of bringing SRE practices into your organization -- who to hire, how to organize, what projects to work on, how to measure reliability, and how to assess reliability risks.
Also presented at Code As Craft at Etsy (slides), PDX Women Talking Tech meetup, Toronto and Chicago Google Cloud Summits, and privately as training to dozens of current and prospective Google Cloud Platform customers. Co-developed with Alesia Braga.
When using tens or hundreds of microservices to provide an application's critical functionality, diagnosing which interaction between components is causing an outage can be challenging. Engineers spend a lot of time building dashboards to improve monitoring, yet when they get paged they still spend a lot of time figuring out what’s going on and how to fix it. Building more dashboards isn’t the solution; using dynamic query evaluation and integrating tracing is. Learn how SREs discover and debug problems at Google during outages, and hear real stories about our experiences.
Making your team safe and inclusive doesn’t end with unconscious bias training and learning to defuse harmful interpersonal interactions. Your codebase, design documents, and technical communications are likely littered with pitfalls that prevent everyone from feeling included. Liz discusses common inclusivity anti-patterns in code and technical communication and how to avoid them.
Organizing for Your Ethical Principles
Our job as engineers does not stop with eliminating technical defects and ensuring high reliability. Engineers of all kinds must ensure their work serves the public good. A service that reliably harms, exacerbates injustices, or excludes marginalized groups is not a service worth building and maintaining. Learn how to effectively accomplish change in your working conditions or your employer's products through grassroots employee advocacy.
Effective Service Level Objectives
Service level objectives and error budgets are the cornerstone of Site Reliability Engineering and a critical tool for organizations to find an appropriate balance between reliability and rates of feature development. In this talk, you will learn how to set and measure useful service level indicators and objectives for needs ranging from interactive, latency-sensitive, query-based systems to batch throughput-oriented systems. You will learn how to set high-signal-to-noise-ratio alerting based on the error budget, and how to make longer-term changes to development priorities if your budget is overspent or underspent.
Co-developed with the CRE team at Google, including Kristina Bennett, Alex Bramley, David Ferguson, and Marie Cosgrove-Davies.
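To make the error-budget idea in the abstract above concrete, here is a minimal Python sketch. It assumes an event-based SLO (good events out of total events in a window). The function names, the 99.9% target, and the sample counts are illustrative assumptions, not material from the talk.

```python
"""Illustrative error-budget arithmetic for an event-based SLO.

All names and numbers are hypothetical and chosen only for illustration.
"""


def _check_target(slo_target: float) -> None:
    """Reject targets that leave no budget or that are not a valid fraction."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")


def error_budget(slo_target: float, total_events: int) -> float:
    """Return how many bad events the SLO tolerates in the window."""
    _check_target(slo_target)
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the unspent budget; a negative value means it is overspent."""
    return error_budget(slo_target, total_events) - bad_events


def burn_rate(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the observed error rate divided by the allowed error rate.

    A value of 1.0 spends the budget exactly over the SLO window;
    values above 1.0 would exhaust it early.
    """
    _check_target(slo_target)
    if total_events == 0:
        return 0.0  # no traffic, so nothing has been spent
    return (bad_events / total_events) / (1.0 - slo_target)


if __name__ == "__main__":
    target = 0.999  # assumed SLO: 99.9% of requests succeed
    print("budget:", error_budget(target, 1_000_000))
    print("remaining:", budget_remaining(target, 1_000_000, 250))
    print("burn rate:", burn_rate(target, 10_000, 20))
```

In a sketch like this, an alert could fire when the burn rate over a recent window stays well above 1.0, and a negative remaining budget could prompt a shift in development priorities toward reliability work. The thresholds are a policy choice that this example does not prescribe.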
Relieving Tech Debt w/ Interrupt Reduction Projects
It's easy to plan out month-long or year-long projects, or to have an interrupts rotation for dealing with oncall/tickets, but how do you make sure you're also doing the short week-long projects that can relieve your technical debt? I'll cover a planning approach my team found that makes room for all three kinds of work and, in the long term, reduces the operational burden of the services we operate.
Concepts co-developed with John Tobin and Dave O'Connor.
Managing Up and Sideways
Ever have a bad manager? Or have a project go off the rails but feel powerless to stop the trainwreck? I'll talk about why knowing a little bit about management can help you as an individual contributor or tech lead, and cover a few ways that you can help yourself and your team without ever formally becoming a manager yourself.
Build skills through hobbies! Bring them to work!
Building technical and leadership skills doesn’t only happen in the workplace! I became a better technical leader and Site Reliability Engineer by playing games such as Puzzle Pirates, World of Warcraft, EVE Online, and Factorio. I will share what I learned from these experiences, and how both hiring managers and employees can talk about non-traditional forms of experience.
Publications & Videos
"SRE vs. DevOps: competing standards or close friends?" w/ Seth Vargo on the GCP Blog
"Intersections between Operations and Social Activism" w/ Emily Gorcenski in Seeking SRE
“Jeff Bezos is wrong, tech workers are not bullies” w/ Laura Nolan et al. in the Financial Times
“Our Executives Engaged in Abuse. Don’t Let Kink and Polyamory Be Their Scapegoats.” in Medium Featured Stories
“Google Workers Lost a Leader, But the Fight Will Continue” in Medium Featured Stories
"Interrupt Reduction Projects" w/ John Tobin and Betsy Beyer in USENIX ;login:
"A Hierarchy of SRE Needs" (blog)
Forthcoming in 2019: Considered Harmful: A Memoir.
Interviews & Podcasts
GCP Podcast Episode 127 with Seth Vargo, Melanie Warrick, and Mark Mandel
GCP Podcast Episode 139 with Melanie Warrick and Mark Mandel
Screaming in the Cloud Episode 19 with Corey Quinn
Fireside Chat at FutureStack NYC with Matthew Flaming
DevOps/SRE AMA with Charity Majors and Adam Jacob, hosted by Andrew Smirnov of Catchpoint
o11ycast Episode 6 with Charity Majors and Rachel Chalmers
I also frequently sit on panels about management, SRE, and ethics.
Technical Press & Citations
"Site Reliability Engineering: Philosophies, Habits, and Tools for SRE Success" (blog by New Relic)
"Accelerate: State of DevOps Report: Strategies for a New Economy" (by DevOps Research and Assessment)
Beth Pariseau, TechTarget, October 16, 2018
"Grumpy humans are really bad at running systems," said Liz Fong-Jones, developer advocate at Google and former leader of the Google SRE team responsible for Bigtable. Fong-Jones spoke from experience about how to optimize human labor at an SRE conference here last week. "Unfair distribution of work prevents system scale," she said.
Caroline Donnelly, Computer Weekly, July 24, 2018
Google has used the statement “class SRE implements DevOps” to title a new (and growing) video playlist by Liz Fong-Jones and Seth Vargo of Google Cloud Platform, showing how and where these disciplines connect, while nudging DevOps practitioners to consider some key SRE insights.
Stephen Shankland, CNET, July 19, 2018
At the conference, engineers from Facebook and other tech companies, like Amazon, Shopify, Lyft, Google and Yahoo gave talks and asked questions of their peers. The profusion of management tools shows how complex it is to run suites of services on hundreds or thousands of servers. Over and over, engineers spoke of completely overhauling their technology every few years as massive growth overwhelmed the earlier system.
Increasingly sophisticated tools spotlight problems and help people trace their origins, said Google site reliability engineer Liz Fong-Jones.
Joab Jackson, The New Stack, July 3, 2018
As your system grows more complex, and your knowledge of what can go wrong increases, you may be tempted to expand a dashboard with more metrics representing outages. This is a bad idea, advised Google Site Reliability Engineer (SRE) Liz Fong-Jones. Too many dashboards leads to cognitive overload, and as the SRE just blindly looks through a set of visualized queries, looking for patterns. It’s wasted time, she warned.
Matt Santamaria, ITOpsTimes, March 27, 2018
“Site Reliability Engineering is a specialized job function that focuses on the reliability and maintainability of large systems,” said Liz Fong-Jones, staff Site Reliability Engineer at Google. “SREs couple operational responsibility with the competence and agency of software engineering to guide system architecture. They aim to strike the right balance between reliability and development speed by engineering solutions to operational problems.”
TC Currie, The New Stack, October 24, 2017
“It’s really about communication, humility and trust,” said Google engineer Liz Fong-Jones of the emerging practice of site reliability engineering, at New Relic’s FutureStack New York 2017 last month.
Nitasha Tiku, Wired, January 26, 2018
Outspoken diversity advocates at Google say that they are being targeted by a small group of their coworkers in an effort to silence discussions about racial and gender diversity.
In interviews with WIRED, 15 current Google employees accuse coworkers of inciting outsiders to harass rank-and-file employees who are minority advocates, including queer and transgender employees.
Kate Conger, Gizmodo, February 21, 2018
Google’s practice of formally reprimanding—and in at least one case, firing—employees for comments the company deemed discriminatory toward white men suggests that Google made an effort to moderate speech by its liberal employees as well as its conservative ones. These efforts have left some Google employees concerned that they will face professional consequences if they voice support for Google’s diversity and inclusion efforts and wondering if the company’s HR system is being gamed by employees who want to stamp out diversity initiatives.
Emily Chang and Mark Bergen, Bloomberg Technology, June 20, 2018
Liz Fong-Jones, Google staff site reliability engineer, reacts to Google's diversity report. She speaks with Bloomberg's Emily Chang and Mark Bergen on "Bloomberg Technology."
2017-present: Global Steering Committee Member, SREcon (USENIX)
SREcon Americas 2016: Program Co-Chair
SREcon Europe 2016: Program Committee Member
SREcon Americas 2017: Program Co-Chair
SREcon EMEA 2017: Program Committee Member
SREcon Americas 2018: Program Committee Member
SREcon EMEA 2018: Program Committee Member
SREcon Asia/Australia 2018: Program Committee Member
Velocity New York 2018: Program Committee Member
Google Cloud Next SF 2018: Proposal Reviewer
SREcon Americas 2019: Program Co-Chair
Grants & Investments
I engage in angel investing in social-benefit-focused, for-profit startups, and do targeted grant-making to enable non-profits to scale. My areas of competency and focus are problems faced by transgender people (especially trans people of color), including policy work, impact litigation, poverty alleviation, violence prevention, suicide prevention, and addressing online/offline harassment.
National Center for Transgender Equality (also a board member)
For-Profit Seed Investments
February 2019 to present
Staff Developer Advocate, SRE/DevOps/Infra&Ops
August 2018 to January 2019 (on sabbatical November-December 2018)
Staff Site Reliability Engineer, Customer Reliability Engineering
July 2017 to July 2018
Site Reliability Engineering Manager, Bigtable
June 2015 to June 2017
Senior Site Reliability Engineer [Google Play Books, GFE, Google Flights]
June 2012 to May 2015
Site Reliability Engineer [HR Info Systems, Developer Infrastructure, Bigtable]
January 2008 to May 2012
Technical Operations Manager, Puzzle Pirates Support Tools & Anti-Cheating (contract)
Three Rings Design
March 2005 to December 2007
OS X Systems Administrator
College Preparatory Mathematics
June 2004 to August 2005
SB Computer Science and Engineering (course 6-3)
Massachusetts Institute of Technology
National Center for Transgender Equality
December 2017 to present
UNIX System Administrator, Undergraduate Computer Science Lab
California Institute of Technology
February 2006 to December 2007
Skills & Languages
US8656465B1 - "Userspace permissions service"
US8694791B1 and US9015827B2 - "Transitioning between access states of a computing device" (w/ Florian Rohrweck)