Don’t overcategorise incidents!

I read a discussion today in an ITSM-focused group on Facebook in which the initial poster sought to define the difference between “network incidents” and “security incidents”. “I have been asked to explain the difference”, the poster stated, wondering whether a cybersecurity incident such as a ransomware attack should be considered a subset of either network or IT incidents.

My immediate thought was “why?”.

Ransomware is a good example of something that transcends any specific pigeon-holing. Thinking of such events in “exclusive OR” terms (“is it an IT issue XOR a network issue…?”) seems far too arbitrary.

Events at this scale are unquestionably major incidents, impacting many aspects of not just infrastructure but also of the wider business. They require a significant and challenging multi-modal response. There is unlikely to be a single root cause, a single event, or a single step to resolution. We can immediately see this level of complication when we expand on the impact, response, and causal analysis of ransomware:

The impact of ransomware is multi-faceted: there will be immediate loss of access to client devices, and hence multiple user and customer-facing services will be significantly impacted simply by the removal of access to the means of delivering them. There are countless examples of this. Significant areas of the network may need to be shut down. Unaffected services need to be segmented and secured. Data loss may be immediate.

The response is also complicated, requiring significant inputs from multiple stakeholders. Services may need to be prioritised for restoration. The impacted organisation may need to undertake a large-scale rebuilding process for both client and back-end infrastructure. Data needs to be repaired, and the loss of unrecoverable data mitigated. Security experts need to ensure that there will not be an immediate recurrence when services are restarted. Executives need to be involved in coordination and communications (and may need to take the difficult decision of whether or not to pay the ransom). Regulators and shareholders need to be informed. The customer service department may need to handle a flood of angry consumers.

Causal analysis must account for the fact that it is highly unlikely that one single failing allowed the attack to happen. The review needs to be wide-ranging, focusing on aspects such as system architecture and resilience, local device security, email security, USB device policies, user training, and the vulnerability of the organisation to specific threat actors.

There is some value in categorising incidents (or other representative records of work), particularly for simple situations for which the resolution is a predetermined set of checks and actions. However, beyond a certain level of breadth or complexity, the desire to over-categorise can become meaninglessly semantic at best, and is likely to damage the organisation’s flexibility and capability to respond.

Part of the problem here is probably the industrial and organisational heritage of our own industry. Relatively recently, many services were genuinely simpler than they are today. For a typical service interaction, the user probably picked up the phone or spoke face-to-face with an agent, who worked on a PC on the client-end of a simple client-server stack, using an application that probably ran in a server room in the corporate office, on a single server (probably named after a Lord of the Rings character).

Today, the same customer will engage in very different ways. The same service may now be delivered via a modern mobile app, and any given service interaction may engage a web of interconnected infrastructure components, in-house and external, public and cloud. In a world that looks like this, it is often completely invalid to define an issue as “a server issue” or “an operations issue” or “a client issue”. From the technical viewpoint, it may be many of those things and more. From the customer’s viewpoint… well, why would they care what kind of technical classification it is being forced into?

Complex systems fail in complex ways. If the organisation’s processes (or tools) are enforcing rigidity in categorisation and response, then this is likely to lead to significant difficulty when major issues arise in the complex socio-technical systems underpinning modern business. Standardisation works where it works, but complexity requires more open and broad thinking.
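
For anyone who has to express this in a tool, here is a minimal sketch of the difference between a single exclusive category and a record that can belong to several domains at once. It is purely illustrative: the `Domain` names, the `Incident` class and its `involves` method are hypothetical and not taken from any ITSM product.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Domain(Flag):
    """Hypothetical impact domains; the names are illustrative only."""
    NETWORK = auto()
    SECURITY = auto()
    CLIENT = auto()
    SERVER = auto()
    DATA = auto()
    BUSINESS = auto()


@dataclass
class Incident:
    summary: str
    domains: Domain  # a combination of flags, not one exclusive choice

    def involves(self, domain: Domain) -> bool:
        """Return True if this incident touches the given domain."""
        return bool(self.domains & domain)


# A ransomware attack is a network issue AND a security issue AND more.
ransomware = Incident(
    summary="Ransomware outbreak",
    domains=(
        Domain.NETWORK
        | Domain.SECURITY
        | Domain.CLIENT
        | Domain.DATA
        | Domain.BUSINESS
    ),
)

# Every domain the incident touches, rather than a single forced label.
domains_involved = [d.name for d in Domain if ransomware.involves(d)]
```

The point is not the data structure itself, but that nothing here forces the “XOR” question: the record can be many things at once, and the response can be assembled from whichever domains are actually involved.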



Jon Stevens-Hall


The intersection of digital transformation, DevOps, and ITSM. Articles by a senior Product Manager in the enterprise service management space. Personal views.