Who has the best serverless platform?

Back in August 2017 I wrote a post about: Who has the serverless advantage? AWS Lambda vs GCP Cloud Functions vs Azure Functions. We were three years into commercial serverless offerings and there were still a lot of limitations. My main observation was:

What really matters is the availability and consumption of other services within the cloud provider ecosystem.

Being able to execute functions in response to events is only as useful as what you can actually do within the execution pipeline. This is where the services differ — their ability to to pass data to backend services, perform calculations, transform data, store results, and quickly retrieve data.

AWS benefits from being the leader in cloud by the sheer size of its product portfolio. The core services of compute, storage and networking are commodities — the differentiation is what’s built on top of them.

At the time, Azure Functions and AWS Lambda were similar in functionality and significantly ahead of Google Cloud Functions, which was still in Beta. So how have things progressed for each provider since then, and what does this say about their approach to serverless?

AWS Lambda

It is an interesting coincidence that both Alexa and Lambda were released in Nov 2014, which suggests that the requirements of one product may have influenced the other. Alexa, a deployed on the Echo devices and in other hardware, in an inherently event-driven product. It is only doing things when you trigger it. It fits the Lambda execution model exactly.

If we assume AWS created Lambda to service Alexa as the first customer, this is important because it means the product development is driven by internal stakeholders who provide the initial demand and use case validation well in advance of input from the general market, rather than copying competitors.

Given how well formed Lambda was on release, I think this is a reasonable assumption. The continued development of the product since 2014 means it has a significant headstart, and the internal use case meant there was always an incentive to invest even in advance of adoption by external customers.

This means that Lambda essentially had no competition for 2 years until Azure Functions and Google Cloud Functions came out in 2016, and Google wasn’t really in the game until March 2017 due to the gap between the Feb 2016 Alpha and March 2017 Beta.

Today, Lambda supports runtimes in Java, Go, PowerShell, Node.js, C#, Python, and Ruby and has added complex functionality such as Step Functions and integration into other AWS products. Many other AWS products generate events which can be processed in Lambda which is proving how you can tie together the full ecosystem to deal with logging, metrics and security in particular. The original Alexa use case is still valid, but many of these improvements are clearly driven by external customers.

Lambda@Edge is also interesting because it allows you to execute your functions in CloudFront CDN locations, allowing for low latency applications close to the user. However, it does have some major limitations (such as 1MB response limits and a maximum of 25 deployed functions) so is clearly still an immature product. But what we know about AWS suggests this will progress rapidly. As nicely visualised by CloudZero, serverless is truly a spectrum:

Serverless spectrum

Azure Functions

Azure offers very similar functionality when compared with Lambda, with more language support in experimental runtimes e.g. Bash and PHP.

A big difference is the availability of on-premise functions, where you can run the workers wherever you wish (so long as you have a Windows server). Until the recent release of AWS Outposts, this was a major difference in product philosophy between AWS and Azure. The former was focused on pure cloud, with functionality to help you move there whereas the latter adopted a hybrid approach from the beginning. This makes sense for Azure given Microsoft’s history of on-premise software. Now we see AWS expanding to try and include the relatively small but still significant share of workloads which will always run on-premise.

A famous use case for Azure Functions is Troy Hunt’s Have I Been Pwned (HIBP) project, entirely delivered using Cloudflare (who have their own serverless product, covered below) and Azure.

But it’s not just functions being used – as I mentioned in the original article, the rest of the Azure product portfolio is just as important. HIBP makes extensive use of Azure Table Storage, which you could call another “serverless” product because it is a database as a service – no need to deal with running SQL Server. This is where AWS has often had a lead in the past because of the range of products they have. That’s less relevant today because both Azure and Google have products in all the key areas: storage, compute and databases.

Google Cloud Functions

GCF has only been Generally Available since July 2018 which makes it the youngest platform of the major three. However, it has actually been available in public since 2016. This is something I’ve called App Engine Syndrome in the past:

Cloud Functions seems to be suffering from App Engine syndrome — big announcements of Alpha/Beta features, followed by silence/minimal progress until the next big announcement the following year. The focus of Google’s serverless ambitions seems to be Firebase, not Cloud Functions.

The release notes show regular updates and new functionality during the beta but there was nothing between July 2018 and November 2018.

This is somewhat frustrating because at Server Density we made extensive use of Google Cloud Platform and it is my preferred vendor of the big three – they have the best console UX, documentation and APIs, and I’ve found all their GA products to be well designed and robust.

Indeed, GCF is a good product that is easy to work with through the command line and console. It has good integration into the rest of GCP whether as direct triggers, through pub/sub, storage, monitoring and most recently scheduled functions. The lack of development on the core product is somewhat misleading because of how it continues to be built into the rest of the platform, but it is concerning that the development velocity of what is a growing technology in the industry is seemingly being neglected. At least publicly, and in comparison to the rapid progress of AWS Lambda.

My criticism of Google Cloud has long been that not enough of Google’s own products use it. Of course, they use the underlying technologies and the physical infrastructure, but unlike Amazon.com using AWS for their critical retail operations, it is unclear how much of Google is using the same platform that GCP customers are using. If Google had a use case similar to AWS and Alexa, that might provide incentives to increase the velocity of development in addition to any GCP customers. Maybe they do. But we still see Google falling further and further behind.

Other platforms

AWS, Azure and Google have the position of being the main three cloud providers. Indeed, they are the only ones that matter for general use cases. Their advantage is the vast resources each company can invest in infrastructure and product development. However, that doesn’t mean they are the only players in serverless. There are other, specialist providers who have particular use cases in mind.

Cloudflare Workers

I mentioned Cloudflare in the context of Have I Been Pwned above. Whilst all of the main processing for HIBP happens on Azure, over 99% of traffic is actually being served by Cloudflare’s CDN and their Workers product.

Moving caching and request response serving to the edge reduced HIBP’s cost from $9 per day to $0.02 per day. This was not just by being able to serve many requests within Cloudflare’s free tier but also by eliminating almost 500GB of Azure network traffic:

StackPath EdgeEngine

As HIBP shows, avoiding network traffic is a second order benefit of using serverless. Colocating other services on the same platform eliminates a large amount of traffic so all you need to deal with is internal networking. Serverless certainly means not having to deal with scaling server infrastructure and so spending only based on what you use. But it also means that where you use other products like CDN, you can avoid costly traffic back to the origin.

This is a big part of the use case for why we launched EdgeEngine at StackPath – we have huge volumes of CDN traffic where we serve as the edge delivery provider, but requests still have go to back to the origin for dynamic processing. One of the major selling points of StackPath CDN is the diversity of PoP locations around the world. Your origin might be in AWS US East but if you are serving traffic to Spain, you will still have a significant volume of requests needing to go back to the centralised cloud provider.

Network costs at cloud providers are one of the biggest hidden taxes on using public cloud, especially if you use third party services like a CDN. So if you can eliminate those requests entirely you can not only provide better application latency but also save on your bill. Now we have our EdgeEngine product, you can do things like API token validation without it ever having to hit that central infrastructure.

AWS Lambda vs Azure Functions vs Google Cloud Functions

Lambda created the market for serverless and continues to innovate and lead on functionality. It benefits from the vast AWS product portfolio so is often a default choice for those already on AWS.

Azure Functions are just as mature as Lambda. It’s the Lambda equivalent for Microsoft fans, so is an easy default if you use Azure. .NET languages are well supported, as you would expect.

Google Cloud Functions still has the same problems of slow product development but the overall platform portfolio has improved significantly. It’s likely to the first choice for developers starting from scratch, just because the overall experience of working with Google Cloud is better than AWS or Azure. Google is also innovating in other areas, such as with the Cloud Spanner database product. GCF may benefit from those who want something in the platform they chose for other reasons.

Ultimately, serverless is not just about functions. If you want more than just simple request manipulation or one time processing then you need to be able to connect to a datastore and other services like logging and monitoring. It’s been possible to build full applications using only serverless, like HIBP, for a long time. How sophisticated those applications get will now depend on what other services appear to support serverless functions, and what the role for low latency edge deployment plays in adding to the use cases. To quote Troy Hunt of HIBP:

So, in summary, the highlights here are:

  1. Choose the right storage construct to optimise query patterns, performance and cost. It may well be one you don’t expect (as blob storage was for me).
  2. Run serverless on the origin to keep cost down and performance up. Getting away from the constraints of logical infrastructure boundaries means you can do amazing things.
  3. The fastest, most cost-effective way to serve requests is to respond immediately from the edge. Don’t hit the origin server unless you absolutely have to!

The organisational commit log

As a company grows, I’ve found communication to be the most difficult challenge. Whether it is explaining the company vision to new hires or finding the reasoning behind a decision made 6 months ago, recording knowledge and keeping it up to date requires a lot of effort.

When everyone is in the same office, informal chats are the norm. Even if everyone is remote, a small team can be in almost-perfect sync with simple one to one chats or a basic chat room. But problems occur as more people join the team.

  • Who is working on what? Who might be able to help with current challenges or problems and what experience and knowledge can be shared?
  • What tools, systems, processes and technologies are being considered, implemented or used? Is there relevant experience elsewhere in the company that can help?
  • What decisions have been made about products, customer issues, sales pitches, technical approaches? And why?

Just as a code commit log shows the reasons behind a specific change, where is the organisational commit log?

Email has the advantage of being asynchronous, user customisable (your own clients, filters, etc) and easily searchable, but it shouldn’t be a database for important information and is only useful for the participants of written discussions. Mailing lists could address that and are well understood particularly for open source projects. Stripe has successfully used email with internal mailing lists to improve transparency but have had to made adjustments as they scaled.

Group chat is a good way to communicate with individuals or small groups but has a lot of major issues, particularly with interruptions, and requires everyone to pay attention to every message and decide if it’s relevant to them.

Unlike email which has subjects and forced threading, chat tends to be more of a stream, very often with multiple discussions interleaved into the same chat. If you have teams in multiple timezones, it’s very easy to wake up to hundreds of messages across many different subjects (because people don’t actually use Slack threads!) which may or may not be relevant. I like the ability to pipe messages from other system so there’s only 1 place to catch up e.g. Zendesk tickets, but it doesn’t seem like a sustainable way to manage a large organisation.

Meetings are another place for knowledge to become lost. Unless you have a disciplined approach to note taking, recording decisions and the reasons behind them is easy to neglect. We’re starting to see some “AI” driven attempts to solve this, particularly with the Microsoft Teams Meeting Transcription product. Having meetings transcribed means that discussion now becomes searchable and can be referenced in commits, designs, comments and other documentation.

Documentation itself has to become a core part of how a company operates. Something like Confluence, a central wiki that everyone has access to, is more efficient than having thousands of individual documents in various cloud storage systems. I find the daily summary of changes of the StackPath confluence a valuable way to passively see what other teams are doing. But it’s only really effective if every team is deliberate in writing everything up and keeping it up to date.

Wasn’t this supposed to be a problem that was solved by the Google Search Appliance? And then Google Desktop. And Google Drive. And now Google Cloud Search.

These are all things which tend to only become an issue as a company grows, by which time it’s almost too late to have writing organisational history be part of the culture. Forcing everyone into the mindset later on is difficult, which is probably why communication remains the biggest challenge every organisation faces.

So you want more employee company ownership? The answer seems to be less government intervention – not more

Originally written for ConservativeHome.

Employee ownership of company shares is a good idea. As an employer, you want your team to be incentivised towards the success of the business. That’s why it’s fairly standard for startup companies to offer generous stock options as part of the compensation package when competing in the hiring market. Linking employee effort to shareholder returns seems like a reasonable goal.

Unfortunately, it’s not quite as simple as that.

To begin with, options are just that: the option to purchase stock in the future at a price fixed today. You don’t actually own anything – only the potential to own something in the future. Options are used by new firms because the actual valuation of the company is often uncertain, and startups are typically unprofitable for some time, so dividends are not paid.

They are also used to encourage employees to stay with the business. It is usual to have vesting terms whereby the employee must stay with the company for at least a year, and then a certain proportion of the options vest over a period of years. This encourages employees to stay, which is good for the company. But if they do want to leave, they only have 90 days to exercise those options – assuming they have the cash to exchange them for stock. In the US, that limit is ten years.

Options are a necessary instrument because ,if you do decide to give someone stock, that stock is an asset which has a value. And that value incurs a tax charge in a similar way that giving someone cash would incur income tax. It’s simply considered an employment-related benefit, and so HMRC says it must be taxed.

But what if you’re given stock in a company that isn’t yet public? A large company might have a private valuation, but not have floated on the stock market. The stock is not easily tradable and has no public value, so employees are left with a tax charge against a theoretical valuation that must be paid in cash.

Of course, there is a mechanism to protect employees from that tax charge today, deferring it to the future – the so-called s431 election that must be signed within 14 days of the grant. If you forget, or don’t know, you might find you have to pay a significant amount. Hopefully your accountant is paying attention. Assuming you have an accountant?

This is just the beginning of the complex rules around employee company ownership, all built up over time, no doubt, to try and balance the benefit of employee stock ownership with the challenges of tax avoidance through elaborate corporate structuring. Perhaps the reason why employee ownership is so low is because of the unreasonable complexity?

Simplifying the tax code is where we should be focused. Not with a possible future Labour Government forcing companies to issue up to 10 per cent of stock to a collectively managed fund. A ten per cent stake is significant, particularly for the largest companies, and would mean material dilution for other shareholders. The ownership makeup of UK shareholders is not just individuals (who only make up only 12 per cent) but unit trusts, pension funds, insurance providers, and over 50 per cent of stock market ownership is foreign investment. When there is a sudden, unexpected transfer of wealth forced by central government, how would investor confidence be affected?

The structure of Labour’s proposed ownership fund itself is also unclear. How would the managers of the fund get elected? What voting rights would they have? What kind of employee is interested in getting involved, and what would that mean for the kind of actions they pursue? Shareholders rarely have all the information available to the executive team, even if they are part of the ordinary workforce: indeed, that is why operational responsibility is delegated to executives. Active management by activist shareholders is unusual and often drastic. A more pragmatic proposal to involve employees in key decisions is through worker representation on boards – then they at least have information rights to understand and participate in those decisions. 

Linking employee benefits to the success of the business is a good incentive and gives workers a meaningful stake in the output of their efforts. But if this is indeed the goal, why is there a £500 annual cap? This is a clearly misaligned incentive: why put any more effort in once you hit the bonus level? There is a big altruistic assumption that employees would work more just for the benefit of the state, an assumption already disproved with the exodus of high earners during the period of extreme income tax bands in the 1970s.

When something isn’t working, the solution is rarely for the Government to force a solution that only involves first-order thinking. Employees don’t own enough of their company? Let’s have the government force them!? No. We must think deeper and consider the root cause. If the rules were a lot easier to understand and the tax implications less onerous, aligning incentives would be much more appealing, making employee ownership easier and more popular. Rather than more government intervention, the answer seems to be the opposite.

Things they don’t tell you about M&A

Startups exiting through M&A is incredibly rare. Indeed, you can expect at least 70% of companies that raise money to either fail or become self sustaining. There are also different stages where you can sell for an optimum outcome – miss the stage and it becomes more challenging to achieve valuation goals.

But if you are lucky and do go through an M&A transaction, it’s worth being aware of what’s going to happen. The process is set up for the buyer’s advantage. Just like buying a house, the founder/CEO probably only experiences M&A once or twice in their lifetime, but the other parties do it regularly. With the possibility of a life changing financial outcome, the psychology of selling is important to keep under control.

There’s quite a lot of M&A advice online, but it tends to focus on the vagaries of “how to get acquired” or what the formal process is. Having gone through this earlier in 2018, I thought I’d share some of the specific things I’ve experienced that might help other CEOs. This post is about the transaction itself. It’s too early to comment on post-close integration but this post by Steven Sinofsky with lessons from Microsoft is great.

Signing the LOI/termsheet is just the beginning

It’s easy to fall into the trap of thinking that once you’ve negotiated and signed the termsheet, the rest is just a formality. Actually, it’s just the beginning.

Due diligence is about the buyer confirming what you’ve said is true, and trying to find reasons not to do the deal. Of course, everyone is acting in good faith but this is a formal legal discovery stage to understand risk.

We only hear about the successful transactions. Deals fall apart all the time, even at the very last moment.

If there is anything unusual in your business it will be discovered in due diligence, but that is the worst place for surprises. Disclosing anything upfront that you think might worry the buyer allows you to control the messaging and avoids surprises and questions about why you didn’t reveal it sooner. The whole exercise is about trust, which is all too easy to break.

You have to be a project manager

There are so many people involved in a transaction that it can be easy to assume someone else is driving things. You will have teams of lawyers from both sides, tax advisors, executive and technical team members doing due diligence and probably your own M&A advisor as well. There’s a lot of keep track of.

However, just like when you are buying or selling a house, it is up to the principals to keep on top of everyone and make sure things are progressing speedily. Advisors are paid by the hour so their incentives are somewhat misaligned with your own.

Your goal as the seller is to close as quickly as possible so you must keep things moving at all times. Be relentless in chasing your own team as well as the buyer.

Lawyers will only follow your instructions

Your lawyers are paid to execute your instructions. They are usually good problem solvers but will not necessarily come up with creative or wildly different solutions.

We had an instance where our lawyers discovered a problem with a particular aspect of the transaction. They spent half a day going back and forth with the buyer’s lawyers and came to the conclusion we needed to engage additional external advisors, pushing back close for several weeks. This was on the Friday before we intended to close the following week!

Upon being informed, I was able to instantly change the instructions to our lawyers because this wasn’t actually important to resolve right away, and we could deal with it in the post-close process.

The lawyers were doing their job – trying to follow instructions – but they can’t adjust those instructions in the same way the principle can.

Maintain open communication with your counterpart

I was in constant communication with the main person on the buyer side by phone and text. Providing status updates and asking/answering quick questions using informal messaging really helped us keep things moving.

This helped when there was a particular check the lawyers were doing to verify signing authority from one of our investors. It was clear that this investor had authority by looking at the public company records, but it wasn’t quite sufficient to prove authority for the lawyers. Strictly speaking they were probably correct but I was able to get the buyer to instruct their lawyers to lighten up the requirements thereby unblocking a sticking point.

There will be lots of little questions and problems that should be dealt with directly and not via lawyers or advisors. Be careful not to become too friendly because you are still in a transaction, but having a trusted counterpart you can work with closely really helped us close quickly.

Investors vary

You will need to engage with all of your shareholders because you have to get them all to sign, probably physical papers as well as electronically.

We did this in stages.

A few weeks after the LOI was signed as we were getting towards close, I sent out a general message to let everyone know that a transaction was in progress and what the intended close date was. Everyone was then kept up to date because the last thing you want is for someone to disappear on holiday or be away from connectivity.

You also want to minimise any unnecessary involvement, especially from minority investors. It’s difficult enough to close a transaction without random investors trying to negotiate with their own agenda.

This is where it is helpful to have professional investors who have seen transactions before, and angels who have been CEOs. They know what it’s like and keep out the way, being quick to reply and helpful when needed.

Others cause problems. They think they should be involved as if they were board members or majority investors. All that does is make you never want to work with them again. The last thing you want to do is use your drag-along to force the deal: it’s a red flag to the buyer, creates legal risk and ramps up your fees.

The lesson is: reference check your investors. Find out what they were like for the entire lifecycle of an investment.

How to learn product management

For the last few months since the StackPath acquisition, I have been shedding all the administrative tasks of a CEO of a small startup and focusing more and more of my time on product.

This has been initially scoped to integrating Server Density monitoring into the StackPath platform but has been broadening to multiple products across the platform.

I am used to shifting between many different tasks and responsibilities so focusing entirely on product has been a new experience for me. As a result, I have been spending as much time as possible learning about what it means to do product management.

Learning something new is a great time to write about the experience. There are valuable insights that can be shared from a beginner mindset. Once you “know” something, you think about problems in a different way.

So this post is a collection of the resources I’ve found useful in learning about running product engineering ~6 months into the role.

Books for product managers

Product management podcasts

I have yet to find a good podcast that is just about product management, so here are some specific episodes from more general podcasts that I’d recommend listening to.

  • Masters of Scale: Marissa Mayer – I find this podcast series very difficult to listen to because it is incredibly over-produced, but this one made me do further research into how Google runs product management and so was valuable in that sense!
  • The A16Z podcast as a whole is worth listening to, but specifically related to product I would suggest High Growth in Product (and tech) which is a podcast interview with Elad Gil of the High Growth Handbook mentioned above. Also listen to The Basics of Growth Part 1 and Part 2.

Events for product managers

Generally I don’t find attending conferences to be a good use of time. The travel, disruption to routine and low signal to noise ratio of talks means I’d usually much rather watch the videos after. However, I have found these to be worth the time:

  • Mind the Product Conference is the main conference but I attended only the Leadership Forum, which was worth it because of the small number of attendees.
  • Every industry has their own niche conference which is worth attending just to understand the overall landscape. For monitoring, it’s Monitorama. And for SaaS in general, it’s SaaStr. Be very picky and very specific.

Product management blogs

Where specific articles but not the whole blog is useful, they’re listed in the next section. These blogs are worth subscribing to in their entirety.

Articles for product managers

Product management videos and talks

  • Customer Obsession – from ProductTank San Francisco, this talk outlines: the balancing act of delighting customers in hard-to-copy margin-enhancing ways; how “customer obsession” helped Netflix to create a highly personalized experience; and the principles of customer obsession through a case study — “Should Netflix send a free trial reminder to its customers at the end of their four-week trial?”
  • Mastering the problem space for product/market fit – this is a framework covering the universal conditions and patterns that have to hold true to achieve product/market fit. Each layer in the pyramid is a key hypothesis that you need to get right in order to build the next layer and ultimately achieve product/market fit.

Good quotes on product management

Some select quotes from the linked content above that’s worth highlighting by itself.

In the 10+ years since AWS’s debut, Amazon has been systematically rebuilding each of its internal tools as an externally consumable service. A recent example is AWS’s Amazon Connect — a self-service, cloud-based contact center platform that is based on the same technology used in Amazon’s own call centers. Again, the “extra revenue” here is great — but the real value is in honing Amazon’s internal tools.

If Amazon Connect is a complete commercial failure, Amazon’s management will have a quantifiable indicator (revenue, or lack thereof) that suggests their internal tools are significantly lagging behind the competition. Amazon has replaced useless, time-intensive bureaucracy like internal surveys and audits with a feedback loop that generates cash when it works — and quickly identifies problems when it doesn’t. They say that money earned is a reasonable approximation of the value you’re creating for the world, and Amazon has figured out a way to measure its own value in dozens of previously invisible areas.

Why Amazon is eating the world

Perhaps most importantly, the product manager is the voice of the customer inside the business, and thus must be passionate about customers and the specific problems they’re trying to solve. This doesn’t mean the product manager should become a full-time researcher or a full-time designer, but they do need to make time for this important work. Getting out to talk to customers, testing the product, and getting feedback firsthand, as well as working closely with internal and external UX designers and researchers, are all part of this process.

Product Leadership

Many books emphasize the first two points—corporate strategy and culture setting. However, you will find that in practice you have little time in a high-growth, rapidly scaling company to think deeply about those points until you hire a strong executive team and manage your own time properly.

High Growth Handbook

There’s no point in defining what to build if you don’t know how it will get built. This doesn’t mean a product manager needs to be able to code, but understanding the technology stack — and most importantly, the level of effort involved — is crucial to making the right decisions.

Product Leadership

Another lesson that I learned from Brian Chesky—one way to think about when to upgrade executives—is that a really great executive is about six to twelve months ahead of the curve. They’re already planning for and acting on things that are going to be important six to twelve months in the future. A decent executive is delivering in real time, now to one to three months in advance.

High Growth Handbook

The trick to creating a great product team is to think of them as the product. This is not an objectification but rather a thought exercise. After all, they are the product that creates the product. Without them, there is no product. Amazing teams make amazing products. Seen from this perspective, the task of how to hire, onboard, train, and develop them becomes another product design problem. The approach that successful leaders take to creating great product is the same approach they take to creating great product teams.

Product Leadership

Often the hardest part of the communication is communicating the “why” behind the product road map, prioritization, and sequencing. Part of this will be creating a framework that establishes why some things are prioritized higher than others—and it’s important that all other functions buy into this framework.

High Growth Handbook

Out of the goals will come the specific features for development. Like a ripple effect with the vision at the center, the objectives or goals are generated and they in turn generate the features that support those goals. Never start with features. Even if your business or product is based on a “feature concept,” ask yourself what the bigger problem is and why it needs solving. Any feature shouldn’t be considered, prioritized, or delivered in a vacuum. Without a vision to guide the product creation, a project can quickly become a collection of cool solutions lacking a core problem to guide them. Features need to be directly tied to the product or organization’s strategic goals.

Product Leadership

For example, if you as the designer/manager discover that you as the worker can’t do something well, you need to fire yourself as the worker and get a good replacement

^ Principles: Life and Work

If you are not evolving your organizational design, it might be an indicator that your product strategy is getting stale. In our experience, most rigid organizational structures are built to create processes for predictability, not successful outcomes.

Product Leadership

As GV’s Ken Norton says, “I like to start with the problem. Smart people are very solution-oriented, so smart people like to talk about what the solution is going to look like to the problem. But successful people think about the problem. Before I talk about this product, or this feature, or this device I’m going to build, I must understand the problem at a deep level. Then success is easy to articulate, because you know what it’s like when that problem is solved.”

Product Leadership

“By-and-large” is the level at which you need to understand most things in order to make effective decisions. Whenever a big-picture “by-and-large” statement is made and someone replies “Not always,” my instinctual reaction is that we are probably about to dive into the weeds—i.e., into a discussion of the exceptions rather than the rule, and in the process we will lose sight of the rule.

^ Principles: Life and Work

How to hire engineers: the interview process

Originally written for the Seedcamp resources website.

Earlier this year, I wrote about the first step in hiring – how to source candidates. Once you have applications, then you need to evaluate them to decide who you might want to hire.

Regardless of how urgent the need is to fill the position, finding the right people, not just for the role today but for how your business will change in the future, is crucial to success. This post will take you through how to create a robust selection process for hiring engineers.

The goals of the process

You have to remember that you are still in a sales process. You are not just trying to match applications against your person spec but you are also trying to convince them to accept the offer you might make at the end. This means there are several goals to consider:

  1. Evaluate applications against what you are looking for in team members now, and in the future. You need to balance the requirements of the job today with an ability to adapt as the business changes. This is particularly important in early-stage startups. Past experience may be relevant to demonstrate ability to execute, but knowledge of specific technologies is probably not – the best engineers can learn new skills, languages, frameworks and systems.
  2. Continue to demonstrate why your business is a great place to work. This comes in multiple parts, the first of which is well before you even get applications. Building your profile and supporting website materials is important for getting applications in the first place. It is just as important that the interview process runs smoothly, the candidates always know where they are at, what they need to do next and what the timeline is. You need to provide regular updates and fast responses. Their time must be valued more than your own and you need to explain to them why they should be joining the company if you make them an offer. You can never take for granted that just because they have applied to you, they will actually accept any offer.
  3. Build a diverse team. This is assisted by the design of the process but also requires you to have the appropriate HR policies in place e.g. flexible working, generous holiday allowances, clear maternity/paternity policies, etc. Thinking about this from the beginning and designing your processes to consider the challenges of diversity means you do not need to do things like positive discrimination, which I do not think is a good way to tackle the diversity problem in tech. The goal is to increase the diversity of the application pool and run an unbiased process to select the best candidates from that pool. Google has some useful guides on diversity in general and there are several good resources for working on gender diversity.

The basic foundation for running a good engineering interview process is valuing the time of the candidates. They likely have full time jobs and/or consulting gigs, so you cannot ask candidates to spend many hours on the phone, doing coding tasks or building projects. Of course they will need to give up some time to dedicate to the process but you should work hard to minimise it.

Step 1: Application

The usual application is a simple form which asks the candidate to submit their basic details, a CV/resume and a short cover letter explaining why they are interested in the job. The cover letter is the most important aspect and the only element that is actually examined at this stage.

In the job ad I include an instruction which asks the applicant to mention a keyword in their cover letter. If the keyword isn’t present then the application is instantly rejected. This is specifically to filter out mass, shotgun-type applications and to test for attention to detail.

The best people will usually only ever apply to a small number of positions. You want to find people who take the time to consider the company and role well in advance of ever applying, which means reading the full job ad and description.

Where possible, this step should be automated. Only collecting the minimum amount of information e.g. email, cover letter means you can systematically ignore any other details of the application, such as the CV, name, which might introduce bias. Be aware of protected characteristics and things you cannot ask.

Just like college degrees being mostly irrelevant for engineering positions (unless you have some very specific scientific knowledge you require), some companies are now excluding CV submission entirely. This is worth considering as another way to remove potential for bias. The only thing I find CVs useful for is to research interview questions in advance, but everything you need to know you can simply ask the candidate when you speak to them later.

Step 2: Writing exercise

I have found there is a good correlation between ability to write well and coding ability. Programming is all about clear and accurate communication, whether that is directly in code itself or communication about the project with real people!

I test this by requiring candidates to do a short writing exercise whereby they have an hour to research the answer to a particular question, and write up the response. The question should be relatively easy because the focus is on their written answer. You are simply looking for accurate spelling and grammar. Any mistakes should mean an instant rejection – if they are unable to write such a short piece without mistakes or proper proofreading then that indicates a lack of care and attention.

The task should take no more than an hour and you are not looking for technical accuracy of the response. This is purely an assessment of clear and accurate communication.

Step 3: Coding exercise

Designing a good coding exercise is tricky. It needs to be representative of the kind of skills you need for the role. It should allow the candidate to demonstrate a wide range of skills, from writing clear code to tests and documentation. And it should be straightforward to build in a short period of time – a couple of hours is ideal.

One of the more successful exercises I have used in the past is to ask the candidate to build a simple client for a public API. This tests many things such as working with real world systems, understanding credential management and dealing with network issues and error handling.

Whatever you pick, you want the candidate to be able to create a self contained package or repository, with some basic installation and setup documentation so that you can evaluate both whether it works, and the implementation itself.

Before starting this, as an engineering team you need to create a list of objective criteria that you can score the exercise against. These can include things like checking the documentation is accurate, test coverage, code linting, etc. You can determine your own criteria but they should be as objective as possible so that each evaluator can compare their conclusions.

Once the candidate sends you their completed exercise, the code should be given to several of your engineers to evaluate. This should be done blind so the evaluators only see the code, and they do not discuss the details with each other. This gives you several independent evaluations and avoids any bias. Be sure to instruct the candidate not to include any identifying information in the package e.g. a Github URL or their name in an auto-generated copyright code comment.

Step 4: In-person pair programming

At this point you have done most of the evaluation and believe the candidate has the skills you’re looking for. The final stage is to evaluate actually working alongside you in a more realistic situation. For this, I prefer to meet candidates in person and have them work alongside their potential colleagues.

I have done this stage remotely in the past but have found that it is more effective to meet someone in person. You can then evaluate what they are like as a person. However, this is also the stage where there is most risk of bias. You can mitigate this by involving multiple people from your team so that one person doesn’t have a veto.

In the interests of speed and efficiency, I try and schedule all final interviews within the same week. This may not always be possible but I try to batch them as closely as possible. This makes the best use of your team’s time and means that candidates can get a response quickly.

You should cover all travel costs for the candidates, booking tickets for them rather than making them pay with reimbursement – they shouldn’t have to loan your company their own money! If they have to travel a long distance, offer overnight accommodation, transfers and food. Also ensure they have a direct contact who is available 24/7 in an emergency. You want candidates focused on the interview, not worrying about logistics.

Again, you need to determine what the best approach to evaluating their capabilities is. I have found that getting them to actually work on your codebase is a good way to see how they deal with an unfamiliar environment and start to learn a new system. You can ask them to fix a known bug, or introduce a simple bug into the code and work with them to fix it. You are not testing them on their knowledge, but on how they approach the problem. Whether or not they fix the problem isn’t important.

Remember that this continues to be a sales process. Take the time to introduce them to key members of team, show them around the office and, if they’re not local, the area where they’ll be working. Be sure to show off and explain why you want them to join. This is the job of everyone on the team – multiple people telling them about the company is a lot better than just the hiring manager or CEO!

Step 5: The response

Anyone who gets past step 1 should receive a response to their application whether they are successful or not. One of the worst things about applying for a job is not knowing what the decision was.

The challenge with giving a negative result is that candidates will often ask for feedback and may argue with it. It is up to you whether you want to do this at all, but I usually offer detailed feedback only if a candidate reaches step 3 or 4. Failing step 2 is only for poor spelling/grammar, which you can build into an auto-generated response.

If you are going to make an offer, do it as quickly as possible. Include the key information about the compensation package, start date and anything else you need from the candidate. Be sure to review the legal requirements for a formal job offer first.

Don’t use exploding offers and don’t pressure the candidate. During the step 4 interview, you may want to ask them what their evaluation criteria are and whether they are looking elsewhere. Asking them when they think they will be able to reply to you is probably fine, but  don’t ask about salary expectations.

What not to do

You may notice that certain things are not present in the above process.

  • No questions about their background and experience. It is not necessary – you are evaluating them based on their skills and how they apply to them today, not what they claim to have done in the past. That said, in step 4, you may want to ask a few questions about how they may have tackled similar problems in the past, or what interesting challenges they have solved if you are hiring for a very specific problem area. But really, you want to put as much time as possible into designing your coding exercises so they are representative of the problems the candidate would have to solve if they were working at your company. Let them demonstrate their ability, not talk about it.
  • No knowledge questions or puzzles. The ability to recall function definitions or solve theoretical problems is not particularly useful for evaluating whether someone can write good software.
  • No whiteboarding. You may want to use a whiteboard to explain specific system architecture but there is no place for actually coding on a whiteboard, on paper, or anywhere that isn’t a modern IDE or code editor of the candidate’s choice. Nobody codes in isolation without access to the internet to look things up. Everyone has their own preferred coding environment and the coding interview will likely place them in an unfamiliar setup without their usual shortcuts and window layout, so be sure to make allowances for this too.
  • No phone interviews. Again, get the candidate to demonstrate their ability through real tasks, not be explaining what they might do or have done.

Applying HumanOps to on-call

Originally written for the StackPath blog.

One of the two core foundations of SaaS monitoring is alerting (the other being metric visualization and graphing). Alerting is designed to notify you when things go wrong in your data center, that there’s a problem with your website performance, or if you’re experiencing server downtime. More specifically, infrastructure monitoring and website monitoring are designed to notify you in such a way that you can respond and try to fix it. That often means waking people up, interrupting dinners, and taking people away from their family to deal with a problem.

When the very nature of a product deliberately has a negative impact of the quality of life of your customers, it is your responsibility as the vendor to consider how to mitigate that impact. Trying to understand how StackPath Monitoring impacts our customers through their on-call processes was why we started HumanOps.

So how do you apply HumanOps principles to (re)designing your approach to on-call?

HumanOps is made up of 4 key principles. These are explained in more detail in the What is HumanOps post, but essentially it boils down to:

  1. Humans build & operate systems that have a critical business impact.
  2. Humans require downtime. They get tired, get stressed, and need breaks.
  3. As a result, human wellbeing directly impacts system operations.
  4. As a result, human wellbeing has a direct impact on critical business systems.

These can be applied through considering some key questions about how on-call processes work.

How is on-call workload shared across team members?

It’s standard practice to have engineers be on-call for their own code. Doing so provides multiple incentives to ensure the code is properly instrumented for debugging, has appropriate documentation for colleagues to debug code they didn’t write, and, of course, to rapidly fix alerts which are impacting your own (or your colleagues) on-call experience. If you’re being woken up by your own bad code, you want to get it fixed pretty quickly!

With the assumption that engineers support their own code, the next step is to share that responsibility fairly. This becomes easier as the team grows but even with just 2-3 people, you can have a reasonable cycle of on/off call. We found that 1-week cycles Tuesday – Tuesday work well. This is a long enough period to allow for a decent “off-call” time and has a whole working day buffer to discuss problems that might have occurred over the weekend.

You also want a formal handoff process so that the outgoing on-call engineer can summarize any known issues to the person taking over.

How do you define primary and secondary escalation responsibilities?

The concept of primary/secondary is a good way to think about on-call responders and the Service Level Agreement they commit to with each role.

The primary responder typically needs to be able to acknowledge an alert and start the first response process within a couple of minutes. It means they have to be by an internet connected computer at all times. This is not a 24/7 NOC, which is a different level of incident response.

Contrast this with a secondary who may be required to respond within 15-30 minutes. They are there as a backup in case the primary is suddenly unreachable or needs help, but not necessarily immediately available. This is an important distinction in smaller teams because it allows the secondary to go out for dinner or be on public transport/driving for a short period of time (i.e. they can live a relatively normal life!). You can then swap these responsibilities around as part of your weekly on-call cycle.

What are the expectations for working following an incident?

An alert which distracts you for 10 minutes early evening is very different from one which wakes you up at 3 a.m. and takes 2 hours to resolve, preventing you from going back to bed again because it’s now light outside.

In the former situation, you can still be productive at work the next day, but in the latter, you’re going to be very fatigued.

It’s unreasonable to expect on-call responders to be completely engaged the day after an incident. They need to have time to recover and shouldn’t feel pressured to turn up and be seen.

The best way I’ve seen to implement this is to have an automatic “day off” policy which is granted without any further approval, and leave it to the discretion of the employee to decide if they need a full day, work from home, or just to sleep in for the morning.

Recovery is necessary for personal health but also to avoid introducing human errors caused by fatigue. Do you really want someone who has been up all night dealing with an incident committing code into the product or logging into production systems?

This should be tracked as a separate category of “time off” in your calendar system so that you can measure the impact of major on-call incidents on your team.

It also applies if there is a daytime alert which takes up a significant amount of time during a weekend or holiday. The next work-day should be taken as vacation to make up for it.

Having the employee make the decision, but with it defaulting to “time off allowed” avoids pressure to come in to work regardless. Reducing the cultural peer pressure is more challenging, but managers should set the expectation that it is understood that you will take that time off, and make sure that everyone does.

How do you measure whether your on-call process is improving?

Metrics are key to HumanOps. You need to know how many alerts are being generated, what percentage happen out of hours, what your response times are, and whether certain people are dealing with a disproportionate number of alerts.

These metrics are used for two purposes:

  1. To review your on-call processes. Do you need to move schedules around for someone who might have had more of their fair share of alerts? Are people taking their recovery time off? Are people responding within the agreed SLAs? If not, why not?
  2. To review which issues should be escalated to engineering planning. If alerts are being caused by product issues they need to be prioritized for rapid fixes. Your engineers should be on-call so they will know what is impacting them, but management needs to buy into the idea that any issues that wake people up should be top priority to fix.

Eliminating all alerts is impossible, but you can certainly reduce them. You can then track performance over time. You’ll only know how you’re doing if you measure everything though!

How are you implementing HumanOps?

We’re interested in hearing how different companies run their on-call so we can share the best ideas within the community. Let me know how you’re implementing the HumanOps principles. Also, we encourage you to come along to one of our HumanOps events to discuss with the community. Email me or mention me on Twitter @davidmytton.

Configuring for security, privacy and convenience

Balancing security, privacy and convenience is not easy. I’ve spent quite a lot of time figuring out how to configure my various computer systems with this goal in mind.

Computers are supposed to make our lives more convenient and you sometimes have to trade privacy for convenience e.g. Outlook processing emails to allow you to use Focused Inbox. AI is going to bring a lot of productivity improvements but I always prefer when that is processed on device, as with Siri Suggestions for things like when to leave for an event.

You also have to consider your adversary.  There are reasonable steps you can take without seriously damaging convenience to provide safeguards against criminals and data profiling. But if you are trying to evade active government surveillance rather than just avoid being swept up in mass snooping, then things get significantly more difficult.

Targeted surveillance is, and should be, allowed (with appropriate legal safeguards). That is not what I’m trying to protect against here. Good security should be expected by all. Privacy is about having choice and control over your personal data.

Here’s how I approach it as of Oct 2018. I expect these practices to change over time. In no particular order:

  • Only use Apple mobile devices. They are the only company that builds privacy by design into their products. Their business model is to sell high priced hardware, not to sell your data. They have 5 year lifecycles on software updates which are delivered regularly, unlike Android which requires updates to go through carriers (usually delays by months, or forever). Buying direct from Google means giving up all your privacy. And the Apple model is to run as much computation on-device, whereas Google is the opposite – all processing is in their cloud environment, which is secure, but has no privacy.
  • Don’t get an Alexa device or Google Home. If you want a voice assistant, Apple’s HomePod with iOS 12 Shortcuts works very well.
  • iOS is the only secure OS that achieves the security, privacy and convenience balance. Any sensitive work should be restricted to iOS devices only. macOS is the next best option. If you don’t need convenience, use Tails.
  • Configure macOS and iOS for privacy. In particular, this means using full disk encryption and strong passwords.
  • Don’t use any Google services and be sure to pay for key services like email, calendar and file storage. If you’re not paying then your data is the product – you want a vendor who has a sustainable business model in selling the service/product itself, not your data. Running your own systems significantly reduces the security aspect of the balance, so it’s better to use either iCloud (if you don’t want your own domain), Microsoft Office365 (which is what I use) or Fastmail. For £10/m I get access to 1TB of OneDrive storage, Mail, Calendar and the full suite of Office products. I pay an extra £1.50/m on top of that for Advanced Threat Protection. Microsoft allows you to select the country where data is stored, has privacy by design and has a good record of defending against government access requests. The Outlook iOS app is actually very good but the Exchange protocol is supported by every client, so you have a good choice. Focused Inbox is great. Bigger corporates like Microsoft have significantly more resources to invest in security (which is why I prefer Office365 over Fastmail).
  • Unfortunately, Apple Maps is still rubbish compared to Google Maps. They’re generally comparable in major cities so I always prefer Apple Maps until the last-mile destination directions, where Apple Maps is regularly inaccurate. At that point I switch to Google Maps on iOS.
  • Don’t store anything unencrypted on cloud storage providers that you would be concerned about leaking if someone gained access. Encrypt these files individually. You can use gpg on Mac but it’s not especially user friendly. I prefer Keybase but it still requires using the command line. These files will be inaccessible on mobile so you may want to consider using 1Password document storage instead, for small files (they have a total storage limit of 1GB). Office files can be password protected themselves, which uses local AES encryption.
  • Delete files you don’t need any more and aren’t required to keep for tax records. In particular, set your email to delete all messages after a period – the shorter the better. I delete all my emails after 1 year. Configure macOS Finder Preferences to remove items from the Trash automatically after 30 days.
  • Don’t send attachments via email. You might delete your emails after a time but the recipients probably don’t. Instead, share them using an expiring link to online cloud storage.
  • Use a password manager and 2 factor authentication. These are just security basics.
  • Don’t use Google Chrome. Only use Safari or Firefox. Configure your browser to auto clear your history on a retention period that allows convenience but also privacy. I set mine to clear after 1 week. I’ve never needed to go back any further. Be sure the “Prevent cross-site tracking” option is configured in Safari settings.
  • Set up DuckDuckGo for your search provider on macOS and iOS. I’ve not used Google search for years.
  • Buy 1Blocker X for iOS and 1Blocker for macOS (see a comparison of other options) to block trackers and ads in Safari.
  • Set up Little Snitch outbound firewall and be sure you know which apps you’re approaching outbound internet access for.
  • Set up Micro Snitch to be notified whenever your mic and camera are in use. Cover your device cameras as a backup.
  • Don’t use SMS – disable fallback in iOS settings. WhatsApp encryption is good but all the metadata about who you are communicating with is shared with Facebook. Unfortunately, it has built up a considerable network effect so it is necessary to use it to communicate in the Western world. Few people use Signal, which is the best so  follow this guide to maximise WhatsApp privacy. iOS allows you to configure deleting iMessages after a period of time. I have mine set to delete after 30 days. You have to manually clear your WhatsApp conversations.
  • Don’t plug anything directly into any USB charging port in airports, hotels, or anywhere else. Use a USB data blocker adapter first.
  • Back up your files to cloud storage but only if they are encrypted locally first. Arq is a good tool to do this. Don’t use the same cloud storage as your main files e.g. I use OneDrive for my files and Amazon S3 for Arq backups.
  • Always use a VPN when connected to public wifi, or any network you don’t control, but don’t use a free VPN. This site has a good comparison but I use Encrypt.me on macOS and iOS. Encrypt.me is owned by StackPath, my current employer, so I know how all the internal infrastructure is set up i.e. we don’t log traffic. However, I also used it prior to joining StackPath and before Encrypt.me itself was acquired. Encrypt.me is a great consumer VPN but if you want more control and configuration options e.g. OpenVPN support, StrongVPN is another product from StackPath.
  • Change your DNS servers to use a privacy-first DNS provider, such as Cloudflare DNS. Do not use your default ISP DNS or Google DNS. If you have an OpenWRT router, configure it to use Cloudflare DNS over TLS because otherwise your ISP can still sniff your DNS requests.
  • Better yet, buy a router that allows you to configure DNS over TLS and connect to a VPN directly. I have a GL-AR750S configured to force all DNS over Cloudflare DNS over TLS and it is permanently connected to StrongVPN. This means all connections from home are encrypted before they even hit my ISP. The only downside is having to disconnect the VPN when using BBC iPlayer, because it detects the VPN. My wifi uses Mac whitelisting so only specific devices are allowed to connect.
  • Pay for Cifas protective registration and register your phone numbers on the TPS list.
  • Use Apple Pay wherever possible. The vendor doesn’t get access to any information about you and can only identify your payment information from a token specific to each transaction. This protects privacy and if the vendor is breached, your card details are safe. The usual contactless limit doesn’t apply to Apple Pay, which is limited only by your card limit.
  • Don’t buy Samsung TVs. There’s no need for any TV to connect to the internet so don’t connect them in the first place. Use a dedicated device like an Apple TV for your TV interface, it has a better UI anyway.
  • Be mindful of sharing photos online directly from your phone. They usually embed the location of the photo in the EXIF data.

Have I missed something? Let me know what else you’re doing.

Leaving the policing of the internet up to Google and Facebook

Consumers typically don’t want to pay for services that the internet has taught them should be “free”. Social networking, email, calendars, search, messaging…these are all “free” on a cash basis, but have a major cost to your privacy.

The best analogy I have heard to describe how these services work was in an episode of Sam Harris’ podcast with Jaron Lanier.

To paraphrase: imagine if when you viewed an article on Wikipedia, it customised the content based on thousands of variables about you based on such things as where you are, who you are friends with, what websites you visit and how often, how old you are, your political views, what you read recently, what your recent purchases are, your credit rating, where you travel, what your job is and many other things you have no idea about. You wouldn’t know the page had changed, or how it differed from anyone else. Or even if any of the inferred characteristics were true or not.

That’s how Google and Facebook work, all in the name of showing ads.

I don’t have a problem with trading privacy for free services per se. The problem is the lack of transparency with how these systems work, and the resultant lack of understanding by those making the trade off (ToS;DR). For the market mechanism to work, you have to be well informed.

We’re starting to see this with how governments are trying to force the big platforms to police the content they host but leaving the details to platforms themselves. Naturally, they are applying algorithms and technology to the problem, but how the rules are being applied is completely opaque. There’s no way to appeal. By design, the hueristics constantly change and there’s no way to understand how they have been applied.

Policing content is a problem that has been solved in the past through the development of Western legal systems and the rule of law. The separate powers of the state – government, judiciary and legislature – counter-balance each other with various checks and stages to allow for change. It’s not perfect, but it has had hundreds of years of production deployment and new version releases!

What has changed is the scale. And the fact that governments are delegating the responsibility of the implementation to a small number of massive, private firms.

It’s certainly not that the government could do a better job at solving this. Indeed, they would likely make even more of a mess of it e.g. EU cookie notice laws. But private companies can’t be allowed to do it by themselves.

The solution requires open debate, evidence based review, a robust appeals system, transparency into decision making and the ability for the design to be changed over time. But it also needs to be mostly automated and done at internet-scale. Unfortunately, right now I’m not sure such a solution exists.

Regulation always favours the large incumbents, stifling innovation and freedom of expression. Perhaps it is time for the legislative process to adopt a more lightweight, agile process with a specific, long term goal that successive governments can work towards. There tends to be a preference for huge, wide-ranging regulatory schemes which try to do everything in one go. Instead, we should be making small changes, focusing on maximum transparency and taking the time to measure and iterate. The tech companies need to apply good engineering processes to how they are developing their social policy, in public.

But without any incentive to do so, we risk ending up with a Kafka-esque system that might achieve the goal at a macro level, but will have many unintended consequences.

A practical guide to HumanOps – what it is and how to get started

Originally written for the StackPath blog.

Humans are a critical part of operating systems at scale, yet we rarely pay much attention to them. Most of the time, energy and investment goes into picking the right technologies, the right hardware, the right APIs. But what about the people actually building and scaling those systems?

In 2016, Server Density launched HumanOps. It started with an event in London to hear from some of the big names in tech about how they think about the teams running infrastructure.

How can you reach your high availability goals without a team that is able to build reliable systems, and respond when things go wrong? How does sleep and fatigue affect system uptime? System errors are tracked, but what about human error? Can it be measured, and mitigated?

With the acquisition of Server Density by StackPath, I am pleased that HumanOps now has a team dedicated to continuing to build the community. We’re open to anyone taking on responsibility for a local meetup but will also be running our own series of events in major cities around the world. The first of these kicked off this week in San Francisco.


What is HumanOps?

The problem

A superhero culture exists within technical systems operations.

Being woken up to fix problems, losing sleep to make an amazing fix live in production and then powering through a full day of work is considered to be heroic effort.

There is little consideration for the impact this approach has on health, family and long term well-being.

The aim

Running complex systems is difficult and there will sometimes be incidents that require heroic effort. But these should be rare, and there should be processes in place to minimise their occurrence, mitigating the effects when they do happen.

HumanOps events are about encouraging the discussion of ideas and best practices around how to look after the team who look after your systems.

It considers that the human aspects of designing high availability systems are just as important as the selection of technologies and architecture choices.

It’s about showing that mature businesses can’t afford to sacrifice their teams and how the best managed organisations achieve this.

If Etsy, Facebook, Spotify and the UK Government can do this. So can you.

How to implement HumanOps

The first step to implementing HumanOps is to understand and accept the key principles.

Key principles

  1. Humans build & operate systems that have critical business impact.
  2. Humans require downtime. They get tired, get stressed and need breaks.
  3. As a result, human wellbeing directly impacts system operations.
  4. As a result, human wellbeing has a direct impact on critical business systems.

HumanOps systems and processes follow from these principles.

HumanOps systems & processes

There are many areas of operations where HumanOps can be applied, but there are a few core areas which are worth starting with first. Each one of these could be a separate blog post so here are a series of questions to start thinking about your own process design.

  • On call
    This is where the most impact occurs. Being woken up to deal with a critical incident has a high impact, so it is important to design the on-call processes properly. Some key questions to ask: how is the workload shared across team members? How often is someone on-call and how long do they get off-call? What are the response time expectations for people at different escalation levels (e.g. do you have to stay at home by your computer or can you go out but with a longer response time?). Do you get time off after responding to an incident overnight? If so, is there any pressure to forgo that e.g. it should be automatic rather than requiring an active request. Do managers follow the same rules and set an example? Do you expect engineers to support their own code? Do you consider additional compensation for each on-call incident or is it baked into their standard employment contract? Do you prioritise bugs that wake people up?
  • Metrics
    You can’t improve something without measuring it. Critical out of hours incidents will happen, but they should be rare. Do you know your baseline alert level and whether that is improving? Do you have metrics about the number of alerts in general, number of alerts out of hours? Do you know if one person is dealing with a disproportionate number of alerts? Do you know which parts of the system are generating the most alerts? How long does it take for you to respond and then resolve incidents? How does this link to the business impact – revenue, user engagement, NPS? Are these metrics surfaced to the management team?
  • Documentation
    Only the smallest systems can be understood by a single person. This means writing and keeping documentation up to date needs to be a standard part of the development process. Runbooks should be linked to alerts to provide guidance on what alerts mean and how to debug them. Checklists must form a part of all human performed tasks to mitigate the risk of human error. How do you know when documentation is out of date? Who takes responsibility for updating it? How often do you test?
  • Alerts
    Most system operators know the pain of receiving too many alerts which are irrelevant and don’t contain enough information to resolve the problem. This is where linked documentation comes in but the goal should be that alerts don’t reach humans except as a last resort. Interrupting a human should only happen if only a human can resolve the problem. This means automating as much as possible and triggering alerts based on user-impacting system conditions, not just on component failures where the system can continue to operate. Are your alerts actionable? Do they contain enough information for the recipient to know what to do next? Are they specific enough to point to the failure without resulting in a flood if there is a major outage?
  • Simulation
    A large part of the stress of incidents is the uncertainty of the situation coupled with the knowledge that it is business / revenue impacting. Truly novel outages do happen but much of the incident response process can be trained. Knowing what you and each of your team members need to do and when will streamline response processes. Emergency response teams do this regularly because they know that major incidents are complex and difficult to coordinate ad-hoc. Everyone needs to know their role and what to do in advance. War gaming scenarios to test all your systems, people and documentation helps to reveal weaknesses that can be solved when it doesn’t matter as much, and teach the team that they can apply haste without speed. How is the incident initially triaged? What are the escalation processes? How does stakeholder communication work? What happens if your tools are down too e.g. is your Slack war room hosted in the same AWS region as your core infrastructure?

The idea behind HumanOps principles is to provide a framework for focusing on the human side of infrastructure.

What’s the point of spending all that time and money on fancy equipment if the people who actually operate it aren’t being looked after? Human wellbeing is not just a fluffy buzzword – it makes business sense too.

The idea behind HumanOps events are to share what works and what doesn’t, and demonstrate that the best companies consider their human teams to be just as important as their high tech infrastructure.

Over the coming months I’ll be writing more about each of these topics and sharing the videos of other organisations explaining how they do it, too.

If you’re interested in attending, speaking or even running a HumanOps event near you, check out the website event listings and get in touch if there’s nothing nearby.