I recently came across two statements about ITSM’s place in the world of DevOps and Continuous Delivery.
The first quotation comes from the blog of ITSM consultancy iCore. Andy Roberts observes that the emergence of DevOps is forcing the Change Manager to shift their role towards:
“…governance of the change processes, rather than the management of individual changes, evolving from a policeman to a supervisor”
This viewpoint is increasingly common, and it can seem reasonable at face value. The ITSM community is generally aware of its reputational issues. It knows that its own bulky Change Management processes have historically contributed to that problem. Roberts’s blog, in common with many similar commentaries, argues for a future version of Change Management in which DevOps teams must:
“…prove, to the Change Manager, that the processes that define their deployment pipeline can be certified to be categorised as being standard changes.”
I’ve written before on the cultural challenge undermining this viewpoint, but let’s set that aside and move on to the other quotation, which comes from a very different perspective: the world of cloud-native software development. Tony Hansmann is a platform architect with Pivotal. In an operations-focused episode of that company’s own “Conversations” podcast (timestamp 43:00) he argued the case for complete automation:
“There are no untestable things. You must first accept that if you’re dealing with computers you can always test an assertion. Once you come to accept that, then ITIL can be entirely subsumed. ITIL does not require a person to validate. If you have to have X, Y and Z processes, great. Just write the assertions for that process, or automate that process: You can keep your ITIL.”
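In that spirit, a change-process check that ITIL would assign to a human approver can be sketched as automated assertions. Everything here — the field names, the checks, the canary threshold — is my own hypothetical illustration, not Hansmann’s or Pivotal’s actual tooling:

```python
# Hypothetical sketch: ITIL-style change validation expressed as
# assertions, in the spirit of Hansmann's argument. All field names
# and thresholds are illustrative.

def validate_release(change):
    """Run the checks a human Change Manager would otherwise perform."""
    assert change["tests_passed"], "test suite must be green"
    assert change["peer_reviewed"], "change must be peer reviewed"
    assert change["rollback_plan"], "a rollback procedure must exist"
    assert change["canary_error_rate"] < 0.001, "canary error rate too high"
    return True

release = {
    "tests_passed": True,
    "peer_reviewed": True,
    "rollback_plan": True,
    "canary_error_rate": 0.0002,
}
validate_release(release)  # raises AssertionError if any check fails
```

The point is not the specific checks, but that each one is machine-evaluable: the gate runs on every deployment, with no meeting required.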
Each of the statements here reflects the same goal, of course: Everyone wants to be able to implement stuff quickly, without breaking things. However, we can’t use this alone as an argument that Continuous Delivery can be aligned easily with existing ITSM norms. In fact, these two quotes highlight some stark differences.
To illustrate those differences, it is first important to understand that “Standard Change” has a very specific definition (well, of course it does) in ITIL. To be classed as “standard”, a change must be recurrent, well-known, and “proceduralized to follow a pre-defined, relatively risk-free path”. It must also be the “accepted response to a specific requirement or set of circumstances”. If these conditions are met, ITIL allows changes to be approved in advance. Hooray — no Change Advisory Board meeting!
ITSM, defined this way, puts the Change Manager in a position of overarching governance. Roberts describes centralisation: a single certifying authority, embodied in a human role, with sign-off responsibility over each pipeline. Hansmann, by contrast, advocates decentralisation and automation. There is, he argues, no need for validation by a human “Change Manager”.
Continuous Delivery effectively states: “Yes, we govern releases so that they don’t break things. But not your way.”
It’s enlightening to illustrate this difference with real-world examples. Let’s look at a state-of-the-art delivery methodology: Google’s “Site Reliability Engineering”. The result of over a decade of evolution and innovation, SRE is described with great detail and clarity in this interview with Ben Treynor, a Google VP of Engineering. Here’s a summary of some of the key characteristics:
- If you don’t quite make it through the Google software engineering interview process, but you came close, Google might hire you instead as a Site Reliability Engineer.
- SREs are in an operations role, but they are developers. Google seeks to hire people who “are inherently both predisposed to, and have the ability to, substitute automation for human labor”.
- SRE teams are responsible for “availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning”.
- At least 50% of an SRE’s time is expected to be dedicated to coding.
- Site Reliability Engineers act in a supporting role to software engineers, but also in an enforcing role.
To incentivise developers not to break things (and therefore create extra work for SREs), Google has developed a concept called “Error Budget”:
“The business or the product must establish what the availability target is for the system. Once you’ve done that, one minus the availability target is what we call the error budget; if it’s 99.99% available, that means that it’s 0.01% unavailable. Now we are allowed to have .01% unavailability and this is a budget. We can spend it on anything we want, as long as we don’t overspend it.”
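Treynor’s arithmetic is simple enough to sketch in a few lines of Python. The function name and the 30-day reporting period are my own illustration, not Google’s:

```python
# Illustrative error-budget arithmetic: one minus the availability
# target, applied to a reporting period, gives the permitted downtime.

def error_budget_minutes(availability_target, period_minutes):
    """Minutes of permitted unavailability within the period."""
    return (1 - availability_target) * period_minutes

# A 30-day month is 30 * 24 * 60 = 43,200 minutes.
budget = error_budget_minutes(0.9999, 30 * 24 * 60)
# A 99.99% target therefore allows roughly 4.3 minutes of downtime
# per month - that is the entire budget the team gets to "spend".
```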
If its error budget is not being eroded by bugs, the development team gains significant autonomy. They can take risks: if they want to release a new feature early, with a reduced test cycle, they can do so. If something breaks in the process, that’s not held against them, provided they stay within their error budget. However, once that budget is overdrawn, SREs withdraw their support of the development of that application. At that point, the developers can only action critical fixes, until their error budget is back in the black.
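The enforcement rule described above can be sketched as a simple release gate. The names and structure here are hypothetical; Google’s actual tooling is not public in this level of detail:

```python
# A minimal sketch of the error-budget gate: normal releases proceed
# only while budget remains; once it is overdrawn, only critical
# fixes go out. All names are illustrative.

def release_allowed(downtime_spent, budget, is_critical_fix=False):
    """Return True if this release may proceed under the budget rule."""
    if downtime_spent < budget:
        return True          # budget in the black: ship anything
    return is_critical_fix   # overdrawn: critical fixes only
```

Note that the gate never inspects *what* is being released — only whether the team still has budget to spend, which is exactly the contrast with “standard change” drawn below.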
In contrast to “standard change”, the permission-to-proceed given to Google’s software engineers is not limited to “accepted responses to specific requirements”. SREs are the change managers, but even in that role they don’t care what the requirement is, or what the response to it is. You are allowed to do anything you want, provided you haven’t already broken too much, and you don’t break too much in future.
With no centralised certification of the deployment pipeline, each application development team is instead accountable to targets that are measured and enforced by its assigned SREs. SREs are even free to move from application to application (which is another incentive to build solid applications: make a good SRE’s job hard, and they can move elsewhere).
Of course, the deployment pipeline itself is undergoing continuous development, because the SREs themselves are coders, spending more than half of their time improving it. Hence, we don’t even have “a pre-defined path”.
The methodology is therefore clearly differentiated, on a number of grounds, from even a liberal reading of ITIL’s “standard change”.
I remain convinced that IT Service Management has a huge amount to offer a digitally transforming enterprise shifting towards agile methodologies, software innovation, and continuous delivery. I am very sceptical, as it happens, of Hansmann’s assertion that every validation of a change can be automated, because in practice this relies on being able to quantify not just every technical check, but also many pieces of tacit human knowledge which will be much more difficult to express as code. Real life isn’t always Turing Complete.
However, the fact that two processes have similar goals does not mean that they are the same process. As Google, and countless other organisations, have illustrated, continuous integration and delivery is evolving, rapidly, past the semantic constraints of ITIL’s “standard change”. Perhaps that’s a problem for ITIL.