Liz Fong-Jones
@lizthegrey
 
 

I make developers, operators, and workers as a whole more productive and empowered.

Liz is a developer advocate, activist, and site reliability engineer (SRE) with 14+ years of experience based out of Brooklyn, New York and San Francisco, California. She has worked across 8 different teams spanning the stack from Google Flights to Cloud Bigtable in her 10+ years at Google. She lives with her wife, metamour, and a Samoyed/Golden Retriever mix. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates for transgender rights as a board member of the National Center for Transgender Equality.

Connect

Public Key: 1F77 14D7 EC34 41D2 CECC  2460 6A3F 8B00 FBDD D2A4

Talks

 

Intro to SRE

Reliability is a critical feature of most software, and maintenance, rather than initial development, dominates the cost of software. Yet many development teams treat operations as an afterthought instead of integrating operations into their development processes.

Error budgets and Site Reliability Engineering practices can improve the reliability, maintainability, and, yes, feature velocity, of products. This talk is an introduction to the basics of bringing SRE practices into your organization -- who to hire, how to organize, what projects to work on, how to measure reliability, and how to assess reliability risks.

Video: The Lead Developer NYC 2018 (slides)

Also presented at Code As Craft at Etsy (slides), PDX Women Talking Tech meetup, Toronto and Chicago Google Cloud Summits, and privately as training to dozens of current and prospective Google Cloud Platform customers. Co-developed with Alesia Braga.

An enterprise-flavored version of this talk was co-presented with Dave Rensin at Velocity NYC 2018 (slides).


Debugging Microservices

When using tens or hundreds of microservices to provide an application's critical functionality, diagnosing what interaction between components is causing an outage can be challenging. Engineers spend a lot of time building dashboards to improve monitoring but still spend a lot of time trying to figure out what’s going on and how to fix it when they get paged. Building more dashboards isn’t the solution; using dynamic query evaluation and integrating tracing is. Learn how SREs discover and debug problems at Google during outages, and hear real stories about our experiences.

Video: Systems at Scale 2018 (slides), All Day DevOps 2018 (slides)

Also presented at QCon NYC 2018, DevOpsDays NYC 2018, Gluecon 2018, and SREcon Americas 2018. Co-developed with George Talbot and Adam Mckaig.


Reliable Inclusion

Making your team safe and inclusive doesn’t end with unconscious bias training and learning to defuse harmful interpersonal interactions. Your codebase, design documents, and technical communications are likely littered with pitfalls that prevent everyone from feeling included. Liz discusses common inclusivity anti-patterns in code and technical communication and how to avoid them.

Presented at Flawless Hacks 2018 (slides), Velocity NY 2016, and privately as training within Google.


Organizing for Your Ethical Principles

Our job as engineers does not stop with eliminating technical defects and ensuring high reliability. Engineers of all kinds must ensure their work serves the public good. A service that reliably harms, exacerbates injustices, or excludes marginalized groups is not a service worth building and maintaining. Learn how to effectively accomplish change in your working conditions or your employer's products through grassroots employee advocacy.

Video: SREcon EMEA 2018 (slides) as a joint keynote with Emily Gorcenski

Also presented at Write/Speak/Code 2018 (slides), QCon NYC 2018 (video), and privately within Google.


Effective Service Level Objectives

Service level objectives and error budgets are the cornerstone of Site Reliability Engineering and a critical tool for organizations to find an appropriate balance between reliability and rates of feature development. In this talk, you will learn how to set and measure useful service level indicators and objectives for needs ranging from interactive, latency-sensitive, query-based systems to batch throughput-oriented systems. You will learn how to set high-signal-to-noise-ratio alerting based on the error budget, and how to make longer-term changes to development priorities if your budget is overspent or underspent.

Video: Datadog Dash 2018 (slides) and Google Cloud Next SF 2018 (slides)

Also presented at Code As Craft at Etsy (slides).

Co-developed with the CRE team at Google, including Kristina Bennett, Alex Bramley, David Ferguson, and Marie Cosgrove-Davies.


Relieving Tech Debt w/ Interrupt Reduction Projects

It's easy to plan out month-long or year-long projects, or to have an interrupts rotation for dealing with oncall/tickets, but how do you make sure you're also doing the short week-long projects that can relieve your technical debt? I'll cover a planning approach my team found that makes room for all three kinds of work, reducing the long-term operational burden of the services we operate.

Presented at BoSRE in Boston, MA (slides), SREcon Europe 2016 (video), and internal Google summits.

Concepts co-developed with John Tobin and Dave O'Connor.


Managing Up and Sideways

Ever have a bad manager? Or have a project go off the rails but feel powerless to stop the trainwreck? I'll talk about why knowing a little bit about management can help you as an individual contributor or tech lead, and about a few ways that you can help yourself and your team without ever formally becoming a manager yourself.

Video: Lesbians Who Tech NYC 2018 keynote (slides, a11y notes); also delivered at SREcon Europe 2016


Build skills through hobbies! Bring them to work!

Building technical and leadership skills doesn’t only happen in the workplace! I became a better technical leader and Site Reliability Engineer from playing games such as Puzzle Pirates, World of Warcraft, EVE Online, and Factorio. I will share what I learned from these experiences, and how both hiring managers and employees can talk about non-traditional forms of experience.

Video: !!con NYC 2018 keynote (slides)

Publications & Videos

SRE and DevOps video series w/ Seth Vargo on the GCP YouTube channel

"SRE vs. DevOps: competing standards or close friends?" w/ Seth Vargo on the GCP Blog

"How SRE relates to DevOps" w/ Betsy Beyer and Niall Murphy in the Site Reliability Workbook

"Intersections between Operations and Social Activism" w/ Emily Gorcenski in Seeking SRE

“Jeff Bezos is wrong, tech workers are not bullies” w/ Laura Nolan et al. in the Financial Times

"Interrupt Reduction Projects" w/ John Tobin and Betsy Beyer in USENIX ;login:

"A Hierarchy of SRE Needs" (blog)

"The Myth of Psychological Safety" (blog)

Forthcoming in 2019: Considered Harmful: A Memoir.

Interviews & Podcasts

GCP Podcast Episode 127 with Seth Vargo, Melanie Warrick, and Mark Mandel

GCP Podcast Episode 139 with Melanie Warrick and Mark Mandel

Screaming in the Cloud Episode 19 with Corey Quinn

Fireside Chat at FutureStack NYC with Matthew Flaming

DevOps/SRE AMA with Charity Majors and Adam Jacob, hosted by Andrew Smirnov of Catchpoint

o11ycast Episode 6 with Charity Majors and Rachel Chalmers

I also frequently sit on panels about management, SRE, and ethics.

Technical Press & Citations

Citations

"Site Reliability Engineering: Philosophies, Habits, and Tools for SRE Success" (blog by New Relic)

"Accelerate: State of DevOps Report: Strategies for a New Economy" (by DevOps Research and Assessment)

Press

SRE model requires technical, organizational optimization skills

Beth Pariseau, TechTarget, October 16, 2018

"Grumpy humans are really bad at running systems," said Liz Fong-Jones, developer advocate at Google and former leader of the Google SRE team responsible for Bigtable. Fong-Jones spoke from experience about how to optimize human labor at an SRE conference here last week. "Unfair distribution of work prevents system scale," she said.

Google Cloud Next '18: What datacentre operators can learn from how Google SRE teams operate

Caroline Donnelly, Computer Weekly, July 24, 2018

Google has used the statement “class SRE implements DevOps” to title a new (and growing) video playlist by Liz Fong-Jones and Seth Vargo of Google Cloud Platform, showing how and where these disciplines connect, while nudging DevOps practitioners to consider some key SRE insights.

How Facebook operations got 10 times faster while getting 10 times bigger

Stephen Shankland, CNET, July 19, 2018

At the conference, engineers from Facebook and other tech companies, like Amazon, Shopify, Lyft, Google and Yahoo gave talks and asked questions of their peers. The profusion of management tools shows how complex it is to run suites of services on hundreds or thousands of servers. Over and over, engineers spoke of completely overhauling their technology every few years as massive growth overwhelmed the earlier system.

Increasingly sophisticated tools spotlight problems and help people trace their origins, said Google site reliability engineer Liz Fong-Jones.

Debugging Microservices: Lessons from Google, Facebook, Lyft

Joab Jackson, The New Stack, July 3, 2018

As your system grows more complex, and your knowledge of what can go wrong increases, you may be tempted to expand a dashboard with more metrics representing outages. This is a bad idea, advised Google Site Reliability Engineer (SRE) Liz Fong-Jones. Too many dashboards leads to cognitive overload, and as the SRE just blindly looks through a set of visualized queries, looking for patterns. It’s wasted time, she warned.

Defining the role of a Site Reliability Engineer

Matt Santamaria, ITOpsTimes, March 27, 2018

“Site Reliability Engineering is a specialized job function that focuses on the reliability and maintainability of large systems,” said Liz Fong-Jones, staff Site Reliability Engineer at Google. “SREs couple operational responsibility with the competence and agency of software engineering to guide system architecture. They aim to strike the right balance between reliability and development speed by engineering solutions to operational problems.”

No Grumpy Humans and Other Site Reliability Engineering Lessons from Google

TC Currie, The New Stack, October 24, 2017

“It’s really about communication, humility and trust,” said Google engineer Liz Fong-Jones of the emerging practice of site reliability engineering, at New Relic’s FutureStack New York 2017 last month.

Press

 

The Dirty War Over Diversity at Google

Nitasha Tiku, Wired, January 26, 2018

Outspoken diversity advocates at Google say that they are being targeted by a small group of their coworkers in an effort to silence discussions about racial and gender diversity.

In interviews with WIRED, 15 current Google employees accuse coworkers of inciting outsiders to harass rank-and-file employees who are minority advocates, including queer and transgender employees.


Google Fired and Disciplined Employees for Speaking Out About Diversity

Kate Conger, Gizmodo, February 21, 2018

Google’s practice of formally reprimanding—and in at least one case, firing—employees for comments the company deemed discriminatory toward white men suggests that Google made an effort to moderate speech by its liberal employees as well as its conservative ones. These efforts have left some Google employees concerned that they will face professional consequences if they voice support for Google’s diversity and inclusion efforts and wondering if the company’s HR system is being gamed by employees who want to stamp out diversity initiatives.


Alphabet Execs Need to Do More to Improve Diversity

Emily Chang and Mark Bergen, Bloomberg Technology, June 20, 2018

Liz Fong-Jones, Google staff site reliability engineer, reacts to Google's diversity report. She speaks with Bloomberg's Emily Chang and Mark Bergen on "Bloomberg Technology."

Community

2017-present: Global Steering Committee Member, SREcon (USENIX)

SREcon Americas 2016: Program Co-Chair

SREcon Europe 2016: Program Committee Member

SREcon Americas 2017: Program Co-Chair

SREcon EMEA 2017: Program Committee Member

SREcon Americas 2018: Program Committee Member

SREcon EMEA 2018: Program Committee Member

SREcon Asia/Australia 2018: Program Committee Member

Velocity New York 2018: Program Committee Member

Google Cloud Next SF 2018: Proposal Reviewer

SREcon Americas 2019: Program Co-Chair

Grants & Investments

I engage in angel investing in social-benefit-focused, for-profit startups, and do targeted grant-making to enable non-profits to scale. My areas of competency and focus are problems faced by transgender people (especially trans people of color), including policy work, impact litigation, poverty alleviation, violence prevention, suicide prevention, and addressing online/offline harassment.

Non-Profit Grantees

For-Profit Seed Investments

Resume

Paid Experience

Staff Developer Advocate, SRE/DevOps/Infra&Ops
Google LLC
August 2018 to present (on sabbatical November-December 2018)

Staff Site Reliability Engineer, Customer Reliability Engineering
Google LLC
July 2017 to July 2018

Site Reliability Engineering Manager, Bigtable
Google Inc
June 2015 to June 2017

Senior Site Reliability Engineer [Google Play Books, GFE, Google Flights]
Google Inc
June 2012 to May 2015

Site Reliability Engineer [HR Info Systems, Developer Infrastructure, Bigtable]
Google Inc
January 2008 to May 2012

Technical Operations Manager, Puzzle Pirates Support Tools & Anti-Cheating (contract)
Three Rings Design
March 2005 to December 2007

OS X Systems Administrator
College Preparatory Mathematics
June 2004 to August 2005

Education

SB Computer Science and Engineering (course 6-3)
Massachusetts Institute of Technology
February 2014

Volunteer Experience

Board Member
National Center for Transgender Equality
December 2017 to present

UNIX System Administrator, Undergraduate Computer Science Lab
California Institute of Technology
February 2006 to December 2007

Skills & Languages

Go

C++

Java

Python

Distributed Systems

Incident response

Patents

US8656465B1 - "Userspace permissions service"

US8694791B1 and US9015827B2 - "Transitioning between access states of a computing device" (w/ Florian Rohrweck)

English

Spanish


Technical Communication

Livetweeting/liveblogging
