Driving a "long incident" as an engineer

| categories: operations, process

Archiving a Twitter thread:

Wrote a few notes for a colleague about driving a "long incident" as an engineer.

That is, one of those "important AND urgent" things that's going to take weeks and multiple cooperating teams to get done.

Focus here is on senior IC behaviour, but cf. Managing Incidents from the SRE book.

Overcommunicate.

If you're running a meeting, have a clear agenda and a plan for what you want out of it. Take notes. If you need to address a different audience, write a separate update for them.

Maintain notes in Slack of what's going on in Zoom.

Summarize and update status frequently.

If you feel like you're communicating too much, it might be enough. :o)

Communicate clearly to your different audiences.

What do your team, or a sibling engineering team, or your support colleagues, or the execs need to know?

Think about their points of view, and frame things for them.

Engineer-to-engineer communication is vital. Keep managers and other parties informed, but focus on assembling and directing the engineering team who will solve the problem.

If necessary, fight for the people and resources you need to make progress.

Treat the problem with urgency, but push back on "we don't have enough time / we can't test X".

Take an engineering perspective: get as many facts on the table as you can.

Test assumptions about those facts.

Work constantly to reduce uncertainty.

Map out areas of risk: lean on your experts and help them identify questions we don't have answers for yet.

Push on getting answers to the tractable questions, given the time and resources available.

Keep an eye out for anyone spinning their wheels.

Stay calm and focused. Help everyone else to do the same. Always worth a re-read: Good Medics Don't Run.

If you've been somewhat insulated from ops, there's lots of literature out there to help reflect on incidents and long-running issues, and build up those production muscles.

I'd probably start with the Google SRE books, with the usual caveats about $megacorp vs. $tinycorp.

It comes down to people skills: they are hard skills, and the ones senior engineers most need to cultivate.

Since this is largely about communications, I must include my theme song: Write It Down.

I literally never tire of this video. Sorry / not sorry.