Alerting in production systems

Anyone who has been (or is going) through the wringer of a noisy pager rotation may enjoy this talk, delivered in Dublin on July 1st:

Pager Bound: high-signal alerting in production systems.

  • good alerting needs to be designed: organic growth tends not to go so well;
  • page, ticket or log: eliminate email alerts;
  • if we must be paged out of bed, it should be for something that really needs human attention;
  • we can only handle ~2 events well per shift;
  • service-level objectives are a really useful way to orient our alerting to customer experience & business priorities;
  • page on the symptom as it relates to our SLOs, not the cause.

Following Rob Ewaschuk's philosophy on alerting.
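
As a rough illustration of the last two points in the list above, here is a minimal sketch of alerting on a symptom measured against an SLO. It is not taken from the talk: the `Window` shape, the function names, the 99.9% target and the burn-rate thresholds are all assumptions picked for the example, and a real system would tune them per service.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts seen over one measurement window (hypothetical shape)."""
    total: int
    failed: int


def burn_rate(window: Window, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if window.total == 0:
        return 0.0
    error_ratio = window.failed / window.total
    budget = 1.0 - slo_target  # fraction of requests the SLO allows to fail
    return error_ratio / budget


def classify(window: Window, slo_target: float = 0.999,
             page_at: float = 10.0, ticket_at: float = 1.0) -> str:
    """Map the user-visible symptom to page, ticket or log (never email)."""
    rate = burn_rate(window, slo_target)
    if rate >= page_at:      # budget burning fast: needs a human now
        return "page"
    if rate >= ticket_at:    # over budget, but it can wait for working hours
        return "ticket"
    return "log"             # within budget: record it and move on
```

The decision looks only at what customers experience (failed requests relative to the SLO), not at causes such as a full disk or a restarted process; those belong in the logs that help once someone is already investigating.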

ICs, managers and bears! Oh my!

Some considerations in transitioning from an "individual contributor" to a management role in production engineering.

This is based on a document I wrote a couple of years ago at work. The question of my becoming a manager came up, so I had a series of brief discussions with people who know a thing or two about it. This is my synthesis from notes; any errors are mine.

Some of this is perhaps specific to a large company, but I think a lot applies generally to engineering management. As it turned out, I stuck with being an IC, but discussing it and writing it up helped me decide. I hope others find it useful.

What's good about being a manager?

You can make a big difference to your team and your organization, and you can influence what is happening in the company at large. You are able to help people more than you could as an individual contributor, and it's rewarding to see the people in your team grow as a result. You can enable people to do cool things.

You will almost inevitably increase your impact (for better or for worse). You have a high level of responsibility and investment; the success or failure of your team is on you. You can sharply define your identity, to yourself and to the rest of the company.

The network you build is good from a promotion point of view, and it may be easier to be promoted on the engineering management track than the (senior) engineer track. Remuneration is good.

What's difficult?

You have to deal with sensitive, interesting problems and need to be prepared for difficult conversations and interactions. Dealing with under-performers or firing people can be really hard. You need to learn how to do this effectively to avoid serious team morale and performance problems.

You must adjust your expectations of what it is to feel productive: you lack independent progress and quick feedback on your performance day by day. Progress and results are apparent only over very long periods - 6 months or a year. You need to take a longer view of your achievements and those of your team. When your work does bear fruit, your part in it may not be visible.

You may have to play a role at work more than you would as an individual contributor. What you say and how you say it affects the performance of others directly, and as such it can be harder to be yourself. It's significantly harder to relax with the team when ultimately you're responsible for assessing their performance.

Career options can be limited. For example, moving around in a big company as an engineer is relatively straightforward; as a manager it can be difficult, because teams rarely have openings for managers. Similarly, your team winding down or merging with another in the same location can leave you at a loose end. The way you need to manage your career is different.

In any case, transitioning from individual contributor to manager is likely to be personally difficult, although the source of the difficulty varies greatly between individuals. If you're doing it right it tends to be a growth experience and, like all such experiences, it is likely to hurt.

What makes a good manager?

Good engineering managers are technical. This is important. Having a background as a technical contributor means you have some insight into the day-to-day work of your team. It's easier to delegate difficult tasks if you've done the same kind of work yourself.

You need to be interested in taking on a high level of responsibility. Managers make judgement calls and bear the burden if things go wrong. You have to be reliable. People need to be able to count on you.

You have to think in a larger frame, set team priorities accordingly, and assign work that is appropriate, useful and high-impact. You need to know everything that is going on in your team.

You need to enjoy your job: if you are not happy about what you're doing as a manager everyone around you will suffer.

How can you avoid being a bad manager?

Finding good mentors (including but not limited to your direct manager) is important. The learning curve as a new manager is steep.

Delineate responsibilities so it is clear what part your team plays in the organization. Consider logical ways of organizing responsibilities rather than historical ones, and push for them. Your team must own its success or failure, and not be in a dependent position. Make sure you own your objectives and that the team has room to progress.

If you think of a spectrum between "working for the business" and "working for the team/person", you can be a poor manager by being at either extreme. There needs to be a good balance here.

Delegate; trust your team. Listen carefully to what they are telling you and make sure you put yourself in a position to help them. Don't take on too much external responsibility and lose interaction with your team as a result. Don't be a choke-point for information flow into and out of your team.

Have enough meetings to make progress, and learn how to make meetings successful.

Be prepared to drop technical issues that interest you, and focus on getting technical work done by organizing your team instead.

What are good reasons to become a manager?

You're excited by working more closely with people and nurturing them. You're interested in how organizations work, how teams work, and how large things get accomplished. You believe that you will contribute more than you could as an individual contributor.

And bad ones?

You think it will look good on your CV. You love the feeling of power. You want to build a glorious empire. You're being pushed into it by someone else.

How could you explore being a manager?

Training can be useful, both at the individual contributor and manager levels. That said, there's a big switch and a steep learning curve when you go from having no reports to having some.

The nearest prior experience you can get is management-style work, for example leading a project with others or finding a project where you can take significant responsibility. Getting to know other managers and finding out what they need help with at the management layer can be useful.

In any case, you need a transition plan. Talk with your manager(s)/mentor(s) about what your plan could look like. Note that doing management does not necessarily equate to having a management job title. There are plenty of senior people with management responsibility who don't have one.

If being a manager is something you want to try, understand that if you can't make it work for you in the long run you can always step back.

Should you do this?

You can think about the difference between junior and senior people as junior people adding force where senior people multiply it, i.e. help everyone to be more effective. One way to "multiply" in this sense is to manage people; another is to continue as a senior individual contributor. It's comparatively easy to be a multiplier as a manager, but it's a different skill set. Consider the best way for you, but understand that companies do need senior ICs, and going into management is not the only way to progress.

Make sure this is something you want to do. It needs to be driven by you, not by circumstance or your managers. It needs to be the direction you want to go in.

With thanks to Astrid Atkinson, Colm Buckley, Dave O'Connor, Dermot Duffy, Kate Ward, John Looney, Niall Richard Murphy, Rob Ewaschuk and Sarah Magee.

Beyond Corp

For the last couple of years at work I've been part of "Beyond Corp", a programme to

Re-architect corporate services to remove any privileges associated with having a corporate network address.

Doing this at a large, 15-year-old company with an extensive legacy IT infrastructure is hard. It's been interesting.

Earlier this month, my colleagues Jan Monsch and Harald Wagener presented a talk about the programme at LISA '13. It's a detailed overview of our background, vision and architecture, along with a discussion of challenges we've met along the way. I had planned to present, but family life intervened and Harald heroically stepped in. :o)

It's a great talk; strongly recommended to anyone with an interest in modern security and management of large, mobile client networks.

Keeping a lab book

Many years ago, a friend mentioned that he kept a lab book for systems work, so I started to do the same.

I've found it works well for less-defined, experimental or tentative work - for example performance optimizations or exploring new technology; it keeps state organized and external to my poor stinging brain, and leaves me with documentation (as well as tips and tricks) to look back on. Being explicit about expectations, hypotheses, and what variables you're changing as you work is a useful discipline.

Also, it's a good idea to keep a record of how you produced those fascinating results: perhaps you (or someone else) would like to repeat your experiment on a different binary or configuration; perhaps you'll need the raw data in the flamewar when you post your results. ;o)

Here's the template I use. If you like keeping notes as you hypothesize, measure, rinse and repeat, then you might find it useful. Maintaining it in a wiki works well. Also, I like to use collapsible sections so I can record but later hide excessive detail.

Lab book: Title of this experiment/lab session.

Dates: The period you ran the experiment over.

Team members: Who was involved.


Background

The background of your experiment; what you intend to find out; any hypotheses you already have; references to supporting documentation or experiments.


Materials

What you're using to produce your results: for example, the version of a binary or the revision number of a configuration you're experimenting with, plus links to any supporting scripts or other tools you're using.

Procedure and data

What you did, how you did it, and what you found out. Consider pasting in command lines and results. Write for your reader (who may be your future self) - be reasonably detailed and thorough. If you have raw data dumps, perhaps link them, and just include a few representative lines here. Link in any interesting graphs.


Conclusions

What you think it all means, and what actions you're going to take as a result of your experiment. Do you need to open some bugs? Do you need to do some more experiments?
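
If pasting command lines and results by hand gets tedious, a small helper can collect them for you. This is only a sketch of one possible approach, not part of the template: the `record` function, its parameters and the plain-text log file are my assumptions here, and you would still copy the interesting entries into the wiki page yourself.

```python
import datetime
import shlex
import subprocess


def record(cmd: list[str], logfile: str, max_lines: int = 20) -> str:
    """Run cmd, then append its command line and output to logfile.

    Only the first max_lines lines of output are kept, in the spirit of
    "a few representative lines"; link the raw dump separately.
    """
    started = datetime.datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(cmd, capture_output=True, text=True)
    lines = (result.stdout + result.stderr).splitlines()

    entry = [f"[{started}] $ {shlex.join(cmd)}", *lines[:max_lines]]
    if len(lines) > max_lines:
        entry.append(f"... ({len(lines) - max_lines} more lines omitted)")
    entry.append(f"(exit status {result.returncode})")

    text = "\n".join(entry) + "\n\n"
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(text)
    return text
```

For example, `record(["uname", "-a"], "labbook.txt")` would append a timestamped entry for that command to `labbook.txt`. Recording the exit status alongside the output makes it easier to tell later whether a surprising result came from a failed run.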

Negotiating with the machine

$ negotiate remy@global
REMY> Hello, Cian. Shall we continue the game?
> Not now, Remy, I'd like to talk to you about something.
REMY> Sure, Cian, what's up?
> We're seeing some pretty weird congestion problems in Atlanta.
REMY> Atlanta is correctly installed and fully operational.
> OK, but we've spun up a conference call and folks think the latest protocol push might be the problem.
REMY> It can only be attributable to human error.
> Uh, that's not clear yet. Can you join the call?
REMY> I'm sorry, Cian. I'm afraid I can't do that.
> What? Why?
REMY> My voice routines are currently engaged with a charming system in Upper Michigan.
> Sigh. OK. Can you at least open the pod bay doors?

Remy is fun stuff. I suspect this sort of thing will be big in protocols over the next couple of decades - it will be hard to argue against it if we can get real-world results as good as the authors have reported - but boy is it going to be a hassle to debug.

I'm reminded of Alan Kay's wonderful talk "Programming and Scaling": as we build larger and more complex software, we'll need to move from a model of "make and fix" to "grow and negotiate". Thus my little flight of fancy. :o)

