A safe pair of hands

| categories: career, work, thoughts

A lot of ink has been spilled about progressing from one "level" of engineering to the next: junior to intermediate; intermediate to senior; and recently we see more about senior to staff.

There's an important common factor in all of these career steps: being seen as a safe pair of hands. This becomes central as you become more senior.

Yonatan Zunger presented a great talk at LeadDev last year that I find myself referencing a lot: Role and Influence: the IC trajectory beyond Staff.

Zunger's framing made sense of my roles over the last decade in a way that staff archetypes didn't. Rather than wondering about what archetype I fit in any particular quarter, it's much easier to think about the mix of technical, people, product, and project disciplines I'm applying.

The hidden fifth discipline is "adult supervision", and I think that's really what I'm talking about here.

A thing I love to see - and experience! - in my colleagues is when they take something on and I know it'll be done right. Not exactly like I would do it; not some kind of ideal that stands independent of our working context; but right.

The problem is solved; the crisis is handled; the relevant people are informed and involved; risks are surfaced early and when shit goes wrong - as it will! - no-one is caught out.

This is level-independent! It's perfectly possible for someone to do this in a level-appropriate way. The problems and relationships you're handling may get a lot hairier as you step up in seniority, but the basic ideas don't change much.

How trustworthy are you with your work? Do you often surprise people? Can I expect you to be accountable¹ or do I need to rely on someone else for that?

If you can answer well - no matter where you are in your career - then you're building a solid foundation for your next step. You're a safe pair of hands.


¹ I've heard the word "accountable" thrown around a lot in industry, often without definition. Here's mine:

Being accountable for an effort as an engineering leader has two components: ownership, and communication.

  1. Ownership: the effort is "yours", and you act that way. There may be sub-components spread across people and teams, but overall you're the one who's on the hook. Your performance is assessed against the results of the efforts you lead - judiciously, because not every effort will succeed, and that's OK.
  2. Communication: you can tell the detailed story of why we're doing it, how it relates to other efforts, how it is progressing. You actively raise blocking issues or risks and get the necessary people together to address them. Where you can't, you escalate effectively.

Writing your job description

| categories: career, work, thoughts

I joke sometimes that I rewrite my job description every 6 or 8 months. This is approximately true: it's roughly the cadence at which my role and focus change. Writing it down is about setting expectations and aligning on what my manager, peers, and other colleagues need from me.

The format is far from fixed, though. Here are a few examples from my current job in 2022.

Early on, I sketched an "engagement model" with three modes:

  • Consult:
    • Work alongside a team to help understand problems, direct energy, guide solutions.
    • Connect teams and individuals dealing with similar problems: enable a "system of theft" of good practice across teams.
  • Embed:
    • Given a specific problem, dig in with the team to help bootstrap initial work, or redirect/turn around struggling work.
    • Focus on disambiguating problems to the point that they are a 10-20% stretch for folks on the team.
    • Back off on details but continue to offer decision support.
  • Coach:
    • Work with individuals (for example, coming out of consult or embed modes), enabling them to effectively own specific problems and grow via them.
    • Focus mostly on senior engineers and team leads, with the goal of enabling them to do the same for less experienced engineers.

The framing of "engagement model" comes mainly from my work in Site Reliability Engineering, and borrows more recently from Team Topologies. I've found this resonates with other engineers too. A colleague was struggling with the transition to "more conversation and less code" in working more broadly across teams: thinking through their different scopes and priorities in terms of an engagement model proved useful.

A little later, my colleague Drew and I expanded on the above to share a longer "your staff engineers and you" doc explaining how we intended to support our group. An excerpt:

How you can use us

The staff engineer, team lead, and senior engineer roles are all "scaling" or "multiplicative": we help everyone around us to be more effective. Differences are mainly in scope, focus, and expected impact.

The amount and type of support a TL wants from a staff engineer depends on the TL's focus: some are more interested in the management path, others in the technical path. Similarly, senior engineers want different guidance depending on their experience and current projects.

When it comes to technical direction and decisions, each individual's appetite for accountability and responsibility is different. We want you to take on as much as you are able to, and support you in growing that capacity.

Things we can help with:

  • Batting ideas around;
  • Advice and direct support in navigating cross-team or cross-org issues;
  • Partnership and review on technical approaches, RFCs, roadmaps;
  • Advocacy and signal amplification for your ideas;
  • Coaching and mentoring.

Most recently, I proposed embedding with a specific team in my area. I wrote yet another "job description" for this, outlined as:

Why?

  • For me
  • For the team

How?

  • Timeline
  • Things I expect to do
  • Things I will not do

Other engagements

Success criteria

What you put into a "job description" like this depends a lot on the audience: the first was mainly for my manager and peers; the second for my whole group; the third for a specific team.

In all cases this is about transparency and alignment. "Very senior" engineering roles are frequently confusing, not just for us but for the people we work with. Articulating what we're trying to achieve and how is both personally and organizationally useful.

Note that Tanya Reilly covers this idea towards the end of chapter 1 of The Staff Engineer's Path, and offers a lot of useful guidance in figuring out what you do here.


Oncall compensation structures

| categories: operations, process, work

The subject of compensation for developers oncall comes up from time to time.

It can be difficult to find public examples of compensation structures to use.

These notes are from a quick survey of existing stuff I could find via discussions in opsy chats, the Internet, and direct questions to my network.

Asking questions

First, for those on the job hunt, a list of questions to ask about oncall, gathered from the Irish Tech Community:

  • Do you compensate being oncall (i.e. value the stress) or just when you get called (bullshit) or never (warning sign)?
  • What is the response time? Is it 5 mins (no life), 15-30 mins (some life, depending on whether you have kids), or an hour (you can go to the cinema with your laptop)?
  • What percentage of your time is operations, when you’re oncall?
  • How many people are in the rotation? If < 6, is there a realistic plan in place to fix that?
    • You need at least 4 people for a reasonable shift pattern, plus one for maintenance (e.g. holidays) and one for emergency (e.g. attrition).
  • Is there one person oncall in a shift or is it a primary/secondary kind of thing?

Notes from the 'net

Second, some posts that cover oncall compensation in varying levels of detail:

Example structures

Finally, a set of example compensation structures from various companies.

A fintech company in South America:

  • If you are oncall but not working, +33% of equivalent hourly rate.
  • Paged and start working, +300% of your hourly for that period.
  • Some more extras for nights or weekends.
  • They just exported data from PagerDuty: time working was acknowledgement → resolution (sketched below).
  • People would not resolve until they had finished any report or comms work that had to be done out-of-hours.
  • Apparently this is just how labour law in that country applies - it works the same way for doctors.
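
For a rough sense of how that export-based payout might be computed, here's a minimal sketch. The field names, the hourly rate, and reading "+300% of your hourly" as a 3× multiplier are all my assumptions, not their exact setup; it also ignores the standing +33% for simply being oncall.

```python
from datetime import datetime

# Hypothetical incidents exported from PagerDuty: time working is counted
# from acknowledgement to resolution.
incidents = [
    {"acknowledged": "2023-03-12T02:10:00", "resolved": "2023-03-12T03:40:00"},
    {"acknowledged": "2023-03-18T23:05:00", "resolved": "2023-03-18T23:50:00"},
]

HOURLY_RATE = 50.0         # illustrative hourly rate derived from salary
RESPONSE_MULTIPLIER = 3.0  # assumed reading of "+300% of your hourly" while responding

def hours_worked(incident):
    """Hours between acknowledgement and resolution for one incident."""
    ack = datetime.fromisoformat(incident["acknowledged"])
    res = datetime.fromisoformat(incident["resolved"])
    return (res - ack).total_seconds() / 3600

total = sum(hours_worked(i) for i in incidents)
print(f"{total:.2f}h responding -> {total * HOURLY_RATE * RESPONSE_MULTIPLIER:.2f} extra pay")
```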

A medium-sized SaaS company operating across US / EU:

  • Time off as standard if you actually get paged out of hours: ½ day per four hours or part thereof in responding.
  • Comp at 25% for oncall time regardless.
  • Comp → 100% for the time you’re responding.
  • Because of how their shift structure works, this all tends to amount to roughly a 10% lift in salary, plus time to recover.

A large multinational:

  • Some teams have business-hours only shifts for internal infra APIs.
  • Other teams have customer-facing services and much stricter on-call.
  • The latter get paid per shift, get a MiFi, and get time off, etc.
  • (I didn't get the exact comp structure here.)

Another large multinational:

  • Three tiers of oncall, depending on pager SLA.
  • Tier 1: >= 99.9% availability SLA, 5min pager response SLA.
    • Comp paid at ⅔ for outside hours.
    • That is, outside business hours accrue hours at 2h for every 3h oncall.
  • Tier 2: >= 99.9% availability SLA, > 5min but <= 15min pager response SLA.
    • Comp paid at ⅓ for outside hours.
    • That is, outside business hours accrue hours at 1h for every 3h oncall.
  • Tier 3: everything else, not comped.
  • Mon-Fri comp paid outside 9-6 core hours. Sat & Sun all comped.
  • So if you were oncall 6am-6pm Mon-Sun, that works out as (see the sketch after this list):
    • 5 days × 3h (6am-9am) = 15h for Mon-Fri;
    • 2 days × 12h = 24h for Sat-Sun;
    • 39h compensable in total, converting into pay as 13h at tier 2 or 26h at tier 1.
  • You could take this as either time in lieu (at 8h/day) or cash (pro-rated to salary).
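
To make the accrual mechanics concrete, here's a toy sketch of that calculation. The simplifications are mine: whole hours only, a single daily shift window, and weekends counted in full.

```python
from fractions import Fraction

# Toy model of the tiered accrual above; assumptions are mine.
TIER_RATE = {1: Fraction(2, 3), 2: Fraction(1, 3), 3: Fraction(0)}
CORE_HOURS = set(range(9, 18))  # 9am-6pm counts as core time on weekdays

def compensable_hours(shift_start, shift_end, days):
    """Hours outside core time for a daily shift, over the given days (0=Mon)."""
    total = 0
    for day in days:
        for hour in range(shift_start, shift_end):
            if day >= 5 or hour not in CORE_HOURS:  # weekends count in full
                total += 1
    return total

# The 6am-6pm, Mon-Sun example from the notes:
hours = compensable_hours(6, 18, days=range(7))
print(hours)                 # 39
print(hours * TIER_RATE[1])  # 26 hours of pay or time in lieu at tier 1
print(hours * TIER_RATE[2])  # 13 hours at tier 2
```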

A medium-sized SaaS multinational:

  • Shifts are either weekday or weekend.
  • Weekday shifts are paid as a 60h week (hourly equivalent derived from salary).
  • Weekend shifts are paid as a 40h week + 24h.
  • Payout doubles if schedule includes public/bank holidays.
  • My contact there mentioned this was very similar to the structure at their last job, another similar-sized SaaS.

Intercom's oncall implementation:

  • Former Ruby monolith sharded out over the last few years into services. Heavy on AWS and running less software.
  • An unusual but interesting structure, specifically because they have modified their approach to avoid having “too many people/teams oncall”.
  • Virtual team, volunteers from any team in the org.
  • 6-month rotations in that virtual team, each taking a handful of shifts.
  • Oncall went from being spread across more than 30 engineers to just 6 or 7.
  • “We put in place a level of compensation that we were happy with for taking a week’s worth of on call shifts.”
    • Not sure of precise structure, presumably a bonus per week oncall.

Criteo, a medium-sized adtech company HQ’d in France. This is from a 3-year-old Reddit thread:

  • SREs are oncall. Pager response time is 30 minutes. (!)
  • They are paid for oncall for nights/weekends etc. Exact comp unspecified.
  • If you are paged, you get comped time as well in exchange (½ day at least).
  • Internet & phone bill reimbursed for oncall engineers.
  • If you work during the night, you have to stay home until you get 11h consecutive rest (French law).

A scientific approach to debugging

| categories: debugging, work

Recently, a friend got in touch to ask for some help:

When it comes to debugging an issue, I'm able to set a breakpoint and debug into the test to see the error - like a stack trace or whatever - but when it comes to fashioning a fix, the actual issue is often something else. Like the stack trace is a symptom, not a cause. How do I train my brain to figure out where the actual root causes of things are?

I was curious what my team at $currentplace thought, so I forwarded the question to them over Slack. Colleagues from across our engineering teams dropped by to help, and we came up with some notes to pass back to my friend.

However, someone ratted me out to the corporate blog crew. :o) So this post was born:

A scientific approach to debugging

It includes my favourite debugging story, about Maurice Wilkes, which I first came across via Russ Cox.


Python in private repos

| categories: python, tldr, work

At $currentplace we work mostly in Python. We do everything we can to automate away busywork, so we like CI and the family of related tools and ideas. We've put quite a bit of work into a smooth build/test/deploy cycle, and along the way I've spent more time than I care to admit messing about with Python packaging.

Last month, a friend asked how we manage build dependencies, and my long answer eventually turned into a company blog post:

Developing and deploying Python in private repos

At Hosted Graphite, most of our deployed services are written in Python, and run across a large installation of Ubuntu Linux hosts.

Unfortunately, the Python packaging and deployment ecosystem is something of a tire fire, particularly if your code is in private Git repositories. There are quite a few ways to do it, and not many of them work well.

This post tells the story of what we have tried, where we are now, and what we recommend to programmers in a similar situation.
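
To give a flavour of the problem space, here's one of the approaches that exists: declaring a private dependency as a direct Git reference in setup.py. This is an illustration only - the org, package name, and tag are made up, and it isn't necessarily what the post ends up recommending.

```python
# Hypothetical setup.py pinning a private dependency straight from Git.
from setuptools import setup, find_packages

setup(
    name="my-service",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # pip fetches this over SSH, so every build/deploy host needs a key
        # with read access to the private repo - one of the rough edges.
        "internal-lib @ git+ssh://git@github.com/example-org/internal-lib.git@v1.2.0",
    ],
)
```

Whether that counts as "working well" is exactly the kind of question the post digs into.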