Lessons from the TSB failure: a perfect storm of waterfall failures
It was interesting to read recently that TSB, the British bank which made headlines for the wrong reasons with a cataclysmic IT migration failure in 2018, has now effectively thrown in the towel and outsourced its entire IT banking systems operation to IBM.
In this article, I’m going to walk through some of the key points in the executive summary of the independent report, carried out by Slaughter and May, into the TSB migration failure which is likely to have prompted this outsourcing (IBM, incidentally, were parachuted into the bank after it struggled to recover).
The report, commissioned and published by TSB themselves, revealed a near-perfect storm of issues which led to the company going live with a new banking system long before it was ready for production, if indeed it was ever going to be. There are many lessons to be learned here, particularly about the fundamental difficulties of executing such a dramatic single change using linear, waterfall project methods.
Firstly, some important background: Lloyds TSB had been formed in 1995 with the merger of Lloyds Bank (one of the UK’s “big four” high street banks) with the Trustee Savings Bank (which was one of the bigger banks outside that top four). However, the 2008 financial crisis ultimately led to the UK government taking a major stake in the company. As this qualified as state aid, European law later forced the bank to sell a significant part of its assets. The name TSB was resurrected for a chunk of the Lloyds TSB business, which was acquired by a Spanish bank, Sabadell.
For some time afterwards, TSB rented space on the Lloyds Bank IT platform. However, Sabadell was understandably keen for this awkward arrangement to end, for a number of reasons — not least the financial and opportunity costs it imposed on them. They initiated a project to move TSB to the Sabadell banking platform, Proteo. Their tech organisation, SABIS, was given the task of implementing a UK instance of that platform, and handling the migration.
Things did not go well. In fact, they went spectacularly badly. Parliamentary enquiry badly. The migration failed, locking TSB customers out of their online accounts for an extensive period, and creating huge pressure on the bank’s customer channels. The company lost 80,000 customers and £330 million, and the CEO’s seven-year tenure at the bank came to an abrupt end, perhaps not surprising given his admission, a week into the crisis as IBM were parachuted in, that the company was “on our knees”.
The Slaughter and May report is hugely detailed, but the executive summary alone reveals problem after problem:
Right from the start, the decision was made to do a “big bang” migration and there was a lack of consideration of other options:
2.18… TSB did not give sufficient consideration as to whether a largely single event migration was the right choice, what the risks of this approach would be, or how those risks would be mitigated. This choice was not substantively discussed by the TSB Board. In addition, it appears that neither the TSB Board nor the Executive requested or received any advice on this issue from their external advisers.
It seems like the piloting work that was done was inadequate and probably a waste of effort:
2.20… TSB sought to de-risk the Main Migration Event (the “MME”) through a number of Transition Events (Early Cutovers and Live Proving) that migrated or piloted parts of the functionality ahead of Go Live. While the Early Cutovers allowed some problems with the Proteo4UK Platform to be identified and resolved, the functionality put into live use constituted only a small part of the Proteo4UK Platform, and therefore did not significantly de-risk the MME. The Live Proving was not carried out at a sufficient scale to allow TSB to identify the problems that would arise when the Proteo4UK Platform was taken live to TSB’s entire customer base.
Both the bank and the supplier (SABIS) seemed to be keen to settle on the same target date. The date was defined without up-front understanding of *how* they’d hit it. It was just assumed that they could. This set the whole project up for trouble from an early stage:
2.23… At the time of its offer for TSB in March 2015, and without detailed knowledge of TSB’s requirements, Sabadell set an expectation that the Programme would be completed by the end of 2017. By as early as 1 July 2015 (the day after Sabadell’s offer for TSB was declared unconditional), Sabadell had identified Sunday 5 November 2017 as the intended date for Go Live.
2.24… Nine months of planning within TSB culminated in the March 2016 Integrated Master Plan, which envisaged the same Go Live date of Sunday 5 November 2017. This timetable was ambitious and unrealistic. The pattern of setting a desired end date and then creating a plan to fit that date, whether or not it was realistic or involved taking too much risk, was set for the remainder of the Programme.
The arbitrary date selection seems to have been based on assumptions driven by past projects undertaken by SABIS. (The platform was built in 2000 in part to handle acquisitions and migrations). But they didn’t account for TSB’s particular requirements:
2.25… Taking confidence from Sabadell’s track record, TSB accepted the timetable set by Sabadell. Sabadell (and SABIS) had previously integrated 12 separate banks onto its Proteo platform. TSB did not consider the extent to which this migration experience differed from what was required to meet the challenges presented by the Programme. There were some significant differences, which could have been discovered if TSB had conducted an appropriate due diligence exercise to understand SABIS’ capability to deliver the Proteo4UK Platform.
This was in part due to underestimating the need to customise for TSB. The board understood neither the extent of the work involved nor SABIS’s capability to deliver it. The report asserts they’d have behaved differently if they’d known (although I can’t see much basis for that conclusion!):
2.52… The TSB Board understood that the Programme was a significant undertaking. However, there were gaps in the TSB Board’s understanding of the Proteo4UK Platform’s scope and complexity, particularly regarding the extent of the new software being developed to support TSB’s digital, telephone and branch channels.
2.53… In addition, the TSB Board did not understand the extent to which SABIS’ experience was different from what was required to design, build, test and operate the Proteo4UK Platform.
Whatever protections were put in place didn’t prevent the very simple issue of trying to migrate before things were ready:
2.26… TSB also believed it had various protections in place (including retention of the Carve-out Exit Option, being insulated from cost overruns, and the Assurance Matrix) to ensure that it would not Go Live before it was ready. In the event, these protections did not safeguard TSB from migrating to the Proteo4UK Platform before being ready.
Functional testing suffered a significant overrun, leading to unplanned parallel functional and non-functional testing. This may have been complicated by several factors: firstly, the handover of functional tests from SABIS to TSB, but also the narrow time window between that handover and the intended end of functional testing, seven months before the arbitrarily determined live date of November 2017:
2.28… Functional Testing was used to confirm that the Proteo4UK Platform’s functionality worked as intended… This was originally SABIS’ responsibility, but TSB took over both test case design and execution in September 2016. SABIS remained in charge of fixing any defects that were detected by Functional Testing.
2.29… Functional Testing took 17 months, which was much longer than anticipated. Delays were mainly due to the volume of defects found and the pace at which SABIS could fix them. In addition, SABIS continually missed deadlines for the delivery of functionality into Functional Testing. As a result, significant amounts of functionality started Functional Testing late.
2.30… TSB had planned for a seven month period prior to Go Live, following the completion of Functional Testing, when Live Proving and Non-functional Testing would be carried out… However the delays in Functional Testing meant that it ran in parallel with Non-functional Testing, and did not finish until shortly before Go Live.
Around September or October 2017, a delay was finally agreed from the original November 2017 target date, but replanning wasn’t comprehensive. No lessons learned were carried into the new plan:
2.32… The Replan presented an opportunity for TSB to take a realistic view of the status of the Programme and to learn from the Programme’s first 18 months. This opportunity was missed. The Replan was not a comprehensive, ‘bottom-up’ process and there was little attempt to investigate the technical causes of the delays that had been faced to date. In particular, TSB failed adequately to assess SABIS’ capabilities and its ability to deliver to the new plan.
Internal audit of the project seems to have failed badly. The project governance teams drew pointed, if implied, criticism in the report:
2.34… Risk Oversight (which was responsible for providing an independent opinion on the risks associated with the Programme) and Internal Audit (which was responsible for providing an independent and objective assurance of the risk management activities of the Programme Team) broadly concluded that the assumptions underlying the Defender Plan were “reasonable” and “satisfactory overall”. We find this surprising in light of the flaws that we have identified with the Defender Plan.
However, these governance failures were not limited to the hands-on audit and governance teams. The board also failed to heed clear warnings that the project was overstretched and that the platform wasn’t ready:
2.35… While the TSB Board asked a number of pertinent questions regarding the Defender Plan, there were certain additional, common sense challenges that the TSB Board did not put to the executive (including why it was reasonable to expect that TSB would be “migration ready” only four months later than originally planned, when certain workstreams were as much as seven months behind schedule).
These oversights enabled further overcommitments to be made, again communicating arbitrary decisions on timescales without due consideration of whether those milestones were actually achievable:
2.36… TSB also imposed an unnecessary constraint on the time available to complete the Programme by choosing to announce on Friday 29 September 2017 that Go Live would be replanned “into Q1 2018”. TSB made this announcement nine days after formally commencing the replanning process, and therefore without any proper assessment of the volume of work remaining or when the Proteo4UK Platform was likely to be ready.
All of this was compounded by miscommunication by the bank about the real state of readiness. They overstated the readiness of the deliverable platform, and hence painted themselves into a corner when it came to further communication, months later, about the real state of the initiative:
2.37… In addition, TSB was not transparent about the real reasons behind the decision to delay Go Live (namely that the Programme was several months behind schedule) and instead suggested that the Proteo4UK Platform was almost ready. Having suggested, incorrectly, in September 2017 that the Platform would be ready by the end of November 2017, this would have made it more difficult to announce that the Platform was not in fact ready in the lead up to the new Go Live date of April 2018.
Of course, these compressed timelines and public overcommitments had a significant operational impact. The predictable and damaging effect was that functional testing was shortened, over-parallelised, and incomplete. Work was deferred, scope reduced, and a planned regression testing phase was even dropped altogether:
2.39… The intention had been to deliver and test the core functionality of the Proteo4UK Platform by the end of 2017, to avoid parallelisation in the Programme and to allow three months for additional assurance testing and proving. The reality was very different. Functional Testing continued into April 2018, and was only brought to a close by deferring and de-scoping significant elements of functionality for completion after Go Live. The dedicated regression testing phase included in the Defender Plan did not take place.
As a result, non-functional testing, such as performance testing, ended up being rushed:
2.40… The Programme had run out of time and, as a result, the majority of Non-functional Testing (which was needed to confirm that the Proteo4UK Platform as a whole could operate at the service levels expected) was conducted in a highly compressed period shortly before Go Live.
A lack of communication and investigation of these problems perhaps spoke to a culture of obfuscation, or of people being afraid to speak up about what was apparent to them:
2.41… Certain areas (in particular Non-functional Testing) received disproportionately less reporting than others, and updates on the Replan risks often lacked any commentary explaining the significance of the facts being reported. Consequently, as the programme strayed further from principles, assumptions, dependencies and milestones set out in the Defender Plan, TSB did not sufficiently interrogate those deviations.
The quality of the non-functional testing (particularly performance testing) was badly inadequate, perhaps primarily due to the timescale compression. At some stage it was decided that performance testing would focus on only one of the two data centres, and as a result a significant problem with their configuration went undiscovered. The report cites this inadequacy as directly contributing to the issues experienced by customers:
2.43… There were issues with the configuration of the two data centres, which contributed significantly to the problems experienced by TSB digital customers immediately after Go Live. As the decision was taken to conduct Performance Testing on a single data centre, it was impossible to identify these issues before Go Live.
The second part of this bullet point contains a truly jaw-dropping revelation: When performance tests failed, the targets were downgraded, which allowed the tests to pass… at a level lower than the actual production load at go-live.
2.43 (cont’d)… During execution of the Performance Testing for Internet Banking and the Mobile App, test targets were lowered after tests did not pass at the original target load. The summary of Non-functional Testing results did not make it clear that the targets had been changed, and that the actual volumes following Go Live exceeded the lowered test targets.
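To see why this is so dangerous, here is a minimal sketch with entirely invented numbers (the report doesn’t disclose the actual load figures): a “pass” against a lowered target tells you nothing about the load the platform will actually face after go-live.

```python
# Hypothetical illustration of the target-lowering trap described in 2.43.
# All numbers are invented for the sketch; none come from the report.

def passes(target_rps: int, sustained_rps: int) -> bool:
    """A performance test 'passes' if the system sustains the target load."""
    return sustained_rps >= target_rps

original_target = 1000   # load (requests/sec) the platform was meant to handle
system_capacity = 600    # what the platform could actually sustain under test
lowered_target = 500     # target quietly reduced until the test passed
go_live_load = 900       # real customer traffic after Go Live

assert not passes(original_target, system_capacity)  # honest test: FAIL
assert passes(lowered_target, system_capacity)       # reported result: "pass"
assert go_live_load > lowered_target                 # real load exceeds the tested target...
assert go_live_load > system_capacity                # ...and the platform's actual capacity
```

The summary reports hid the middle step, so readers saw only the “pass” and none of the context that made it meaningless.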
The board missed key chances prior to go-live to reflect on the bad situation:
2.44… In the lead up to Go Live, the TSB Board should have done more to assess, and should have provided a stronger challenge to, the Executive’s explanation of the adequacy of testing… There were clear indications, which the TSB Board should have interrogated, of the frantic pace at which key Programme activities were being finalised so close to the proposed date for Go Live.
However, the board was probably not helped by the highly inaccurate reports it apparently received, even on basic measures such as defect count:
2.45… The TSB Board was not provided with an accurate view of the defects outstanding in the Platform at the point of Go Live; the actual number of defects was at least two and a half times the “around 800 defects” reported to the TSB board.
Note: I was curious about that defect count, and drilled into the main body of the report. The internal communication was apparently obfuscating and misleading. It implied that most of a total 85,000 defects had been fixed, and hence only 1% of defects were being carried forward:
18.21… “of the c.85,000 defects raised during the programme we will be taking around 800 defects into live”
However, the quoted figure was misleading: the 85,000 and 800 numbers were not counting the same thing. In truth, the 800 represented separate pieces of functionality identified as having one or more defects, not individual defects:
18.23… Accordingly, the number of Defects that were actually present in the Platform at Go Live was 3,317, a significantly greater number than that reported to the TSB Board in the T3 Memo on 18 April… TSB did not at the time of Go Live, and did not during the course of our Review, have a comprehensive view of the Defects raised during the Programme and remaining open.
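The counting trick is easy to illustrate. In this sketch (with invented sample data, not the report’s), counting *affected pieces of functionality* systematically understates the number of *open defects*, because each piece can carry several:

```python
# Hypothetical sketch of the ambiguity in 18.21/18.23: "around 800" counted
# functional areas with defects, not the defects themselves. Sample data invented.
from collections import defaultdict

# (functional_area, defect_id) pairs for defects still open at go-live.
open_defects = [
    ("payments", 1), ("payments", 2), ("payments", 3),
    ("login", 4), ("login", 5),
    ("statements", 6),
]

by_area = defaultdict(list)
for area, defect_id in open_defects:
    by_area[area].append(defect_id)

areas_with_defects = len(by_area)       # the flattering metric -> 3 here
total_open_defects = len(open_defects)  # the metric the Board needed -> 6 here

assert total_open_defects > areas_with_defects
```

In TSB’s case the gap was 800 reported versus at least 3,317 actual defects; whichever metric is chosen, reporting it without saying which one it is invites exactly this misreading.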
SABIS’s lack of readiness to support the platform was a key issue identified in the report. They had already struggled through smaller preliminary go-lives:
2.46… SABIS had been operating some limited live services prior to Go Live. It had struggled for a period after each Transition Event put new services live, failing to satisfy agreed service levels.
2.47… A report by Deloitte Spain (commissioned by SABIS) which identified deficiencies in SABIS’ internal controls was not shared with TSB.
Furthermore, after Go Live, two different reports, published after the disastrous events, concluded that SABIS had not had sufficient support and remediation capacity in place:
2.49… It is clear from statements made, and remediation work completed, after Go Live that SABIS had not been ready to operate the Proteo4UK Platform. For example:
(A) a September 2018 report prepared by the Sabadell Group COO stated that SABIS “did not have sufficient capability to respond and to resolve incidents” at Go Live; and
(B) a November 2018 report prepared by the TSB CIO stated that SABIS’s “insufficient ability to operate the new IT platform” had exacerbated TSB’s problems after Go Live.
Get those jaws ready to drop again. Prior to go-live, it seems TSB had simply relied on a written assurance from SABIS’s UK MD that they were ready. They received no evidence to back this up, but apparently proceeded on the basis of this letter alone:
2.50… In addition, instead of a formal, evidenced attestation from either SABIS or Sabadell in the lead up to Go Live, TSB received only a letter from the SABIS UK Managing Director. …The statements in that letter plainly relate to (his) expectations regarding the performance of the Proteo4UK Platform, rather than his expectations regarding the ability of SABIS to operate it.
There were major warning signs but the supplier appeared to get a soft ride from both TSB and auditors. For example, 15.13 notes a report from KPMG Spain just two months before the major failure. A lack of capacity management controls was brushed off as minor:
15.13… (KPMG) was “unable to evaluate controls related to capacity management” as “there were no controls operating” at the time of its review. TSB has pointed out that this finding was rated “Low”… in TSB’s view this was “not a material risk”.
The bank’s own due diligence was strongly criticised too: The report states that the bank didn’t perform adequate upfront due diligence, nor exercise its own contractual rights to audit the work being carried out by SABIS.
2.56… TSB did not conduct a comprehensive due diligence exercise to understand SABIS’s capability to deliver the Proteo4UK Platform. In addition, TSB did not adequately exercise its contractual audit rights during the Programme in order to understand:
(A) the quality and completeness of the platform being designed, built and tested by SABIS; and
(B) SABIS’ capacity to operate that platform and to deliver the service required by TSB and its customers.
The report suggests the bank was over-integrated with the supplier… directing operations more than a typical “customer” of such a relationship. I’m not sure about this one — was the relationship too close or too distant? The report seems to be suggesting both:
2.57… The TSB CIO’s role meant that he was the customer of SABIS’ services to TSB, but also (having previously been the Chief Process and Information Officer at Sabadell and in effect run SABIS) continued in practice to direct much of SABIS’ work as supplier. In our view, this made it difficult to manage SABIS on an arm’s length basis. Without an appropriate arm’s length relationship, there was insufficient clarity between the parties as to who was assuming the risk in certain key decisions (including, for example, the decision to conduct Performance Testing on only one data centre).
Although this was a project of unprecedented complexity for both TSB and in the context of UK banking as a whole, the bank didn’t seek sufficient independent advice to validate the capability and work of its supplier and its own executives:
2.62… the TSB Board did not take independent advice on the Programme as a whole, nor appoint advisers with appropriate mandate. This was required to provide an objective and authoritative view on the progress of the Programme and to assist the TSB Board in appropriately challenging the Executive.
Finally, the report asserts that independent advice might have led to different decisions on supplier, platform, approach, and also would have prevented the project carrying on regardless when it was failing:
2.64… Had the TSB Board obtained independent advice on the evidence available to it to support the decision to Go Live, it might have identified, for example, the weaknesses in the Non-functional Testing evidence and in the assurances of SABIS’ readiness, and therefore that TSB was not ready to Go Live on the weekend of 20–22 April 2018.
If nothing else, this is a fine piece of evidence for the “there is no single root cause” camp.
My feelings on this: it seems like a classic failure of a big waterfall project. Maybe circumstance forced the decision to structure it that way, but it’s a good example of why not to do it. The lessons learned here can only reinforce the argument for newer, more agile ways of working.
Finally, let’s spare a thought for the individuals involved at ground level in this crisis. One can barely imagine the sum of the individual human cost along the way, in terms of horrible stress and workloads. This alone should be a good reason never to make the same mistakes again.