Driving a "long incident" as an engineer

| categories: operations, process

Archiving a Twitter thread:

Wrote a few notes for a colleague about driving a "long incident" as an engineer.

That is, one of those "important AND urgent" things that's going to take weeks and multiple cooperating teams to get done.

Focus here is on senior IC behaviour, but cf. Managing Incidents from the SRE book.


If you're running a meeting, have a clear agenda, a plan for what you want out of it. Take notes. If you need to switch audience, write something separately.

Maintain notes in Slack of what's going on in Zoom.

Summarize and update status frequently.

If you feel like you're communicating too much, it might be enough. :o)

Communicate clearly to your different audiences.

What do your team, or a sibling engineering team, or your support colleagues, or the execs need to know?

Think about their points of view, and frame things for them.

Engineer to engineer communications are vital. Keep managers and other parties informed, but focus on assembling and directing the engineering team who will solve the problem.

If necessary, fight for the people and resources you need to make progress.

Treat the problem with urgency, but push back on "we don't have enough time / we can't test X".

Take an engineering perspective: get as many facts on the table as you can.

Test assumptions about those facts.

Work constantly to reduce uncertainty.

Map out areas of risk: lean on your experts and help them identify questions we don't have answers for yet.

Push on getting answers to the tractable questions, given the time and resources available.

Keep an eye out for anyone spinning their wheels.

Stay calm and focused. Help everyone else to do the same. Always worth a re-read: Good Medics Don't Run.

If you've been somewhat insulated from ops, there's lots of literature out there to help reflect on incidents and long-running issues, and build up those production muscles.

I'd start probably with the Google SRE books, with the usual caveats about $megacorp vs. $tinycorp.

It comes down to people skills - hard skills, and the ones senior engineers most need to cultivate.

Since this is largely about communications, I must include my theme song: Write It Down.

I literally never tire of this video. Sorry / not sorry.

A little negativity about OKRs

| categories: process, thoughts

Archiving a Twitter thread:

Niall's thread is as thoughtful, judicious, and balanced as I'd expect. Allow me to be a bit more negative about Objectives and Key Results. :o)

First, the planning horizon. At $megacorp and anywhere that follows that playbook, OKRs are a quarterly process.

"Be careful what you measure": if your planning cadence is quarterly, there's a strong chance it'll become your delivery cadence too. That is too slow.

I love having a plan. Honestly, I enjoy the chaos as we watch it crash into reality, particularly in teams that are constrained by operational concerns (read: basically everyone with real customers).

Plans go out the window. That's fine, they're still useful.

But a quarter - for most teams and orgs I've worked with, including at $megacorp - is too big a chunk to let us respond well to inevitable change.

So OKRs become a big, expensive activity that swallow a bunch of team and management bandwidth without allowing room to react.

Second, "KRs considered harmful" (with caveats).

Making good KPIs is a known hard problem, right? As an engineering example, consider the difficulty of producing meaningful SLOs.

Coming up with a brand new set of good KPIs each quarter is high overhead and prone to failure.

"KR 1: produce the metrics we're going to use to track this objective." :o)

It can feel really hard to tie quantitative measurement of KRs - often proxy metrics - to the real outcomes we're trying to drive.

Finally though, it's not all gloom.

Objectives are a good tool for driving alignment. They can build a coherent narrative around strategy and tactics, from execs through to delivery teams.

I like to de-emphasize the measurement part though.

Where we do have KRs, ideally they're tied to longer-term KPIs that we can iterate on and come to understand better over time, on the order of months or years.

SLOs are a good example. Or from a team perspective, the DORA metrics. Or "north star" product / business metrics.

As for planning horizons: I am strongly attracted to ~6 week "bets" like Basecamp's or Intercom's.

Where I've worked like this, it's felt lighter-weight and more agile in the best senses.

All that said, whatever your particular team, org and/or company are doing to plan their time: get involved, and try to make it work as well as possible.

Channel your reservations into creativity. Use and improve the system. Try experiments. Share them.

Planning is good!