Oncall compensation structures

| categories: operations, process, work

The subject of compensation for developers oncall comes up from time to time.

It can be difficult to find public examples of compensation structures to use.

These notes are from a quick survey of existing stuff I could find via discussions in opsy chats, the Internet, and direct questions to my network.

Asking questions

First, for those on the job hunt, a list of questions to ask about oncall, gathered from the Irish Tech Community:

  • Do you compensate for being oncall (i.e. value the stress) or just when you get called (bullshit) or never (warning sign)?
  • What is the response time? Is it 5 mins (no life), 15-30 mins (some life, depending on if you have kids), or an hour (you can go to the cinema with your laptop)?
  • What percentage of your time is operations, when you’re oncall?
  • How many people are in the rotation? If < 6, is there a realistic plan in place to fix that?
    • You need at least 4 people for a reasonable shift pattern, plus one for maintenance (e.g. holidays) + one for emergency (e.g. attrition).
  • Is there one person oncall in a shift or is it a primary/secondary kind of thing?

Notes from the 'net

Second, some posts that cover oncall compensation in various detail:

Example structures

Finally, a set of example compensation structures from various companies.

A fintech company in South America:

  • If you are oncall but not working, +33% of equivalent hourly rate.
  • If you are paged and start working, +300% of your hourly rate for that period.
  • Some more extras for nights or weekends.
  • They just exported data from PagerDuty: time working was counted from acknowledgement → resolution.
  • People would not resolve until they had finished any report or comms work that had to be done out-of-hours.
  • This apparently was just how labour laws in that country apply - works the same way for doctors.
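
A rough sketch of how that calculation might look, to make the numbers concrete. The percentages are ambiguous as reported, so this assumes "+33%" means 0.33x the hourly rate for each standby hour and "+300%" means 3x the hourly rate for each hour between acknowledgement and resolution. The function names are made up for illustration, and the night/weekend extras are not modelled.

```python
from datetime import datetime

ONCALL_RATE = 0.33  # assumption: "+33%" read as 0.33x hourly rate per standby hour
PAGED_RATE = 3.00   # assumption: "+300%" read as 3x hourly rate per hour worked


def hours(delta):
    """Convert a timedelta to fractional hours."""
    return delta.total_seconds() / 3600


def shift_pay(hourly_rate, shift_start, shift_end, incidents):
    """Oncall pay for one shift.

    incidents is a list of (acknowledged_at, resolved_at) datetime pairs,
    all assumed to fall inside the shift and not to overlap.
    """
    worked = sum(hours(resolved - acked) for acked, resolved in incidents)
    standby = hours(shift_end - shift_start) - worked
    return hourly_rate * (ONCALL_RATE * standby + PAGED_RATE * worked)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 18, 0)
    end = datetime(2024, 1, 2, 8, 0)
    incidents = [(datetime(2024, 1, 2, 2, 0), datetime(2024, 1, 2, 4, 0))]
    print(round(shift_pay(40.0, start, end, incidents), 2))
```

Under those assumptions, a 14-hour overnight shift with one two-hour incident at a hypothetical rate of 40 per hour comes to roughly 398.40. It also shows why the acknowledgement → resolution window matters: every extra hour before resolving is paid at the higher rate.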

A medium-sized SaaS company operating across US / EU:

  • Time off as standard if you actually get paged out of hours: ½ day per four hours or part thereof in responding.
  • Comp at 25% for oncall time regardless.
  • Comp rises to 100% for the time you’re responding.
  • Because of how their shift structure works, this all tends to amount to roughly a 10% lift in salary, plus time to recover.
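
As a sketch of the rules above: half a day off per four hours (or part thereof) spent responding, 25% of the hourly rate while oncall, and 100% while responding. I'm assuming the 100% replaces the 25% for those hours rather than stacking on top of it, which the source doesn't spell out. The function names are hypothetical.

```python
import math


def time_off_days(response_hours):
    """Half a day per four hours, or part thereof, spent responding out of hours."""
    return 0.5 * math.ceil(response_hours / 4)


def oncall_pay(hourly_rate, oncall_hours, response_hours):
    """Pay for a shift: 25% for standby hours, 100% for hours spent responding.

    Assumption: response hours are paid at 100% instead of 25%, not in addition.
    """
    standby_hours = oncall_hours - response_hours
    return hourly_rate * (0.25 * standby_hours + 1.00 * response_hours)


if __name__ == "__main__":
    print(time_off_days(4.5))        # a 4.5h response rounds up to two half days
    print(oncall_pay(40.0, 100, 4))  # 96h standby plus 4h responding
```

The "or part thereof" clause is the generous bit: a ten-minute page out of hours still earns half a day off.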

A large multinational:

  • Some teams have business-hours only shifts for internal infra APIs.
  • Other teams have customer-facing services and much stricter on-call.
  • The latter get paid per shift, get a MiFi, and get time off, etc.
  • (I didn’t get the exact comp structure here.)

Another large multinational:

  • Three tiers of oncall, depending on pager SLA.
  • Tier 1: >= 99.9% availability SLA, 5min pager response SLA.
    • Comp paid at ⅔ for outside hours.
    • That is, outside business hours accrue hours at 2h for every 3h oncall.
  • Tier 2: >= 99.9% availability SLA, > 5min but <= 15min pager response SLA.
    • Comp paid at ⅓ for outside hours.
    • That is, outside business hours accrue hours at 1h for every 3h oncall.
  • Tier 3: everything else, not comped.
  • Mon-Fri comp paid outside 9-6 core hours. Sat & Sun all comped.
  • So if you were oncall 6am-6pm Mon-Sun, that would be:
    • 5 x 3h for Mon-Fri (6am-9am each day, the part outside core hours)
    • 2 x 12h for Sat-Sun
  • So 39h compensable, converting into pay as 13h at tier 2 or 26h at tier 1.
  • You could take this as either time in lieu (at 8h/day) or cash (pro-rated to salary).
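
The worked example above can be checked with a few lines of code. This is only a sketch: it assumes the same daily oncall window every day of the week, whole-hour boundaries, and core hours of 9am-6pm Mon-Fri as described. The names are made up for illustration.

```python
from fractions import Fraction

CORE_START, CORE_END = 9, 18  # Mon-Fri core hours, 9am-6pm
TIER_RATE = {1: Fraction(2, 3), 2: Fraction(1, 3), 3: Fraction(0)}


def compensable_hours(shift_start, shift_end):
    """Compensable hours in one week for a daily shift_start..shift_end window."""
    shift = set(range(shift_start, shift_end))
    core = set(range(CORE_START, CORE_END))
    weekday = len(shift - core)  # Mon-Fri: only hours outside core count
    weekend = len(shift)         # Sat-Sun: every oncall hour counts
    return 5 * weekday + 2 * weekend


def accrued_hours(tier, shift_start, shift_end):
    """Hours accrued at the tier's rate, as an exact fraction."""
    return TIER_RATE[tier] * compensable_hours(shift_start, shift_end)


def days_in_lieu(accrued):
    """Convert accrued hours to time in lieu at 8h per day."""
    return accrued / 8


if __name__ == "__main__":
    print(compensable_hours(6, 18))  # 39
    print(accrued_hours(2, 6, 18))   # 13
    print(accrued_hours(1, 6, 18))   # 26
```

Taken as time in lieu, the tier 2 figure of 13h works out to a bit over a day and a half at 8h/day.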

A medium-sized SaaS multinational:

  • Shifts are either weekday or weekend.
  • Pay according to a 60h week (hourly equiv. from salary) for a weekday shift.
  • Pay according to a 40h week + 24h for a weekend shift.
  • Payout doubles if schedule includes public/bank holidays.
  • My contact there mentioned this was very similar to the structure at their last job, another similar-sized SaaS.

Intercom's oncall implementation:

  • Former Ruby monolith sharded out over the last few years into services. Heavy on AWS and running less software.
  • An unusual structure, but interesting: specifically because they have modified their approach to avoid having “too many people/teams oncall”.
  • Virtual team, volunteers from any team in the org.
  • 6-month rotations in that virtual team, with each engineer taking a handful of shifts in that time.
  • Oncall went from being spread across more than 30 engineers to just 6 or 7.
  • “We put in place a level of compensation that we were happy with for taking a week’s worth of on call shifts.”
    • Not sure of precise structure, presumably a bonus per week oncall.

Criteo, a medium-sized adtech company HQ’d in France. This is from a 3-year-old Reddit thread:

  • SREs are oncall. Pager response time is 30 minutes. (!)
  • They are paid for oncall for nights/weekends etc. Exact comp unspecified.
  • If you are paged, you get comped time as well in exchange (½ day at least).
  • Internet & phone bill reimbursed for oncall engineers.
  • If you work during the night, you have to stay home until you get 11h consecutive rest (French law).

Driving a "long incident" as an engineer

| categories: operations, process

Archiving a Twitter thread:

Wrote a few notes for a colleague about driving a "long incident" as an engineer.

That is, one of those "important AND urgent" things that's going to take weeks and multiple cooperating teams to get done.

Focus here is on senior IC behaviour, but cf. Managing Incidents from the SRE book.

Overcommunicate.

If you're running a meeting, have a clear agenda and a plan for what you want out of it. Take notes. If you need to switch audience, write something separately.

Maintain notes in Slack of what's going on in Zoom.

Summarize and update status frequently.

If you feel like you're communicating too much, it might be enough. :o)

Communicate clearly to your different audiences.

What do your team, or a sibling engineering team, or your support colleagues, or the execs need to know?

Think about their points of view, and frame things for them.

Engineer-to-engineer communications are vital. Keep managers and other parties informed, but focus on assembling and directing the engineering team who will solve the problem.

If necessary, fight for the people and resources you need to make progress.

Treat the problem with urgency, but push back on "we don't have enough time / we can't test X".

Take an engineering perspective: get as many facts on the table as you can.

Test assumptions about those facts.

Work constantly to reduce uncertainty.

Map out areas of risk: lean on your experts and help them identify questions we don't have answers for yet.

Push on getting answers to the tractable questions, given the time and resources available.

Keep an eye out for anyone spinning their wheels.

Stay calm and focused. Help everyone else to do the same. Always worth a re-read: Good Medics Don't Run.

If you've been somewhat insulated from ops, there's lots of literature out there to help reflect on incidents and long-running issues, and build up those production muscles.

I'd probably start with the Google SRE books, with the usual caveats about $megacorp vs. $tinycorp.

It comes down to people skills - hard skills, and the ones senior engineers most need to cultivate.

Since this is largely about communications, I must include my theme song: Write It Down.

I literally never tire of this video. Sorry / not sorry.