Service Management still matters in a Microservices architecture
In July 2022, I saw an interesting and lively Twitter discussion in which a number of people debated how architecture affects the burden of being on-call. Is the alert response workload heavier in a microservices-based environment, or in a monolithic one?
“The first item on the list was to consolidate the now over 140 services into a single service. The overhead from managing all of these services was a huge tax on our team. We were literally losing sleep over it since it was common for the on-call engineer to get paged to deal with load spikes” — Goodbye Microservices
Orosz argued that this problem leads to consolidation of a microservices architecture, first into “tiers” of microservice importance, and hence into a broader, consolidated service structure.
“it is very easy to re-draw ownership boundaries in monoliths, hence re-orgs are frequent. Teams don’t have knowledge to maintain buggy code they have just inherited. Severe incidents and awful oncalls as a result”
Maybe, then, the architecture is not the fundamental problem here? Reading the Segment blog’s description of engineers being paged as a result of load spikes took me back to the late 1990s. Back then, I was involved in connecting a newly acquired monitoring tool to our home-built Service Management application. The initial assumption was that we should convert detected events directly into ITSM incidents, ensuring that there was a single place-of-record for both manually raised and automatically detected incidents.
You may already have predicted the result. Many simple issues generated multiple events. Significant issues were being drowned out by a torrent of low-grade incidents. We learned, quickly, that asking humans to provide service response based on raw events is a bad idea. And yet here we are in 2022, effectively seeing the same problem.
The problem isn’t architecture. Some architectures are good, some are bad, but there will always be arguments in favour of and against any particular approach. The problem being described actually results from a disconnect between the technical components assembled to provide an outcome, and the outcome itself. We can’t escape fundamental principles of Service Management, and we can’t forget that most of our users really don’t care what the backend architecture looks like.
Regardless of our architectural approach, we have to understand the user’s perspective as well as the technical one. This is hard, but necessary if we wish to understand the real priority of the issues that are being detected in the technical layer.
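To make that idea a little more concrete, here is a minimal Python sketch of one way raw technical events might be grouped by the user-facing service they affect and then ranked by that service's priority, rather than each being raised as its own incident. Everything in it is an assumption made for illustration: the component names, the service map, the priority values and the `correlate` function are all hypothetical, and a real implementation would draw this data from a service model or CMDB.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical mapping from technical components to the user-facing
# services that depend on them. All names are invented for illustration.
SERVICE_MAP = {
    "queue-worker-17": "order-checkout",
    "queue-worker-18": "order-checkout",
    "report-cache": "internal-reporting",
}

# Hypothetical priority of each service from the user's perspective
# (1 = most urgent).
SERVICE_PRIORITY = {"order-checkout": 1, "internal-reporting": 3}
DEFAULT_PRIORITY = 4  # used when a component maps to no known service


@dataclass(frozen=True)
class Event:
    component: str
    message: str


def correlate(events):
    """Collapse raw events into one incident per affected service,
    sorted so the most user-critical service comes first."""
    grouped = defaultdict(list)
    for event in events:
        service = SERVICE_MAP.get(event.component, "unmapped")
        grouped[service].append(event)
    incidents = [
        {
            "service": service,
            "priority": SERVICE_PRIORITY.get(service, DEFAULT_PRIORITY),
            "event_count": len(items),
        }
        for service, items in grouped.items()
    ]
    return sorted(incidents, key=lambda incident: incident["priority"])


raw_events = [
    Event("report-cache", "cache miss rate high"),
    Event("queue-worker-17", "load spike"),
    Event("queue-worker-18", "load spike"),
    Event("queue-worker-17", "load spike"),
]
incidents = correlate(raw_events)
for incident in incidents:
    print(incident)
```

In this sketch, four raw events become two incidents, and the one affecting the checkout service is listed first because of its user-facing priority, not because of anything in the events themselves. The grouping key and the priority table are the design choices that matter here; both encode the user's perspective rather than the technical one.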