The backstory: why do we have technical debt?
We really value good coding practices, so why do we have technical debt? Well, believe it or not, Moonpig.com is now over 17 years old! Moonpig was founded before the importance of good technical practices was widely understood. The idea of a “definition of done” had not entered the mainstream, and it’s fair to say our founders had little understanding of building and maintaining websites. Our engineering head, Dan Bachmann, has been here throughout the 17-year journey; he recalls the founders telling him his role was temporary.
“You do realize that in two years’ time we won’t need developers anymore. The web site will be completed by then and there won’t be any need to change it.”
This comment goes a long way to explaining why, 14 years later, we found ourselves drowning in technical debt. “Technical debt” is a broadly used term, so I’ll clarify exactly what I mean:
- Terrible architecture
- Monolithic systems
- Poorly written code – classes containing thousands of lines of code, for example
- Very limited test coverage at every level
- Limited, and unhelpful, logging
- Very limited monitoring
The list goes on, but that gives you a flavour. When I joined 3 years ago, our website consisted of just two architectural components: the web solution and the database. If you wanted to update the website, you had to deploy one of those two components. Imagine the size of two components built up over 17 years!
The case for change
I joined the company as a ScrumMaster, and arrived keen to make improvements. I’d not even completed my probation before I realised that agile management practices could only deliver limited improvements. If we wanted to improve speed and quality, we had to invest in our infrastructure and code base. But how to persuade the wider business this investment was required? Demonstrating or quantifying the cost of technical debt and legacy is notoriously difficult.
When I joined, the CTO was under huge pressure to improve the speed and quality of releases. He regularly asked me how we could improve, and my answer was always the same: we spent 14 years creating a mess, and we need to invest to improve it.
From anecdotes to hard data
After a few months, I decided to try to provide some hard data to prove the problem. By then I had some metrics available, and a bit of analysis instantly showed why the business were unhappy. We were only spending about half our time on feature development! Fixing bugs and getting releases out consumed the rest of the time. No wonder it seemed like we never got anything done!
Translating the impact of our antiquated system into hard data made a significant difference. This was no longer about grumbling developers; the data proved we really had a problem. Crucially, it gave the CTO the foundation to make a business case for investing in our systems. As a result, 20% of all development effort was dedicated to a technical backlog, prioritised by the engineering team. At the time we used Scrum and estimated in points, so 20% of points each sprint came from the technical backlog. It also led to dedicated people focused on getting us to continuous deployment (which I’ll write a separate post about).
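To make the 20% rule concrete, here is a minimal sketch of how a sprint's capacity could be divided between the two backlogs. The function name, the rounding choice, and the capacity figure are illustrative assumptions; this is not a description of any tooling we actually used.

```python
# Hypothetical helper: split one sprint's capacity between the product
# backlog and the technical backlog. The 20% default comes from the text;
# the name and the rounding behaviour are assumptions for illustration.
def split_sprint_points(capacity_points: int, tech_share: float = 0.20) -> tuple[int, int]:
    """Return (product_points, tech_points) for a single sprint."""
    if capacity_points < 0:
        raise ValueError("capacity_points must not be negative")
    if not 0.0 <= tech_share <= 1.0:
        raise ValueError("tech_share must be between 0 and 1")
    # Round the technical allocation to whole points; the product backlog
    # gets whatever remains, so the two always add up to the capacity.
    tech_points = round(capacity_points * tech_share)
    return capacity_points - tech_points, tech_points


# Example with a made-up capacity of 40 points.
product_points, tech_points = split_sprint_points(40)
print(f"product: {product_points}, tech: {tech_points}")
```

The useful property is that the split is decided up front from total capacity, so the technical backlog is not simply whatever is left over at the end of sprint planning.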
Use it or lose it
So now we had the time, how to use it effectively? The first year we took a very simple approach. The tech team leads met every two weeks and decided what tasks each team should focus on in the coming weeks. This worked well enough, but didn’t deliver dramatic improvements. We made tweaks here and there, but it was an ad hoc approach that didn’t deliver massive impact anywhere.
The second year we decided we needed more focus. We chose two epic level projects and focused purely on those. That delivered better results.
Make it count
This year marks the third year, and we’ve dramatically optimised our process, inspired by our product development approach. Firstly, we asked everyone on the engineering team to nominate ideas for the tech backlog. What did they think our biggest problems were? What should be our priorities? Around 30 ideas were submitted, from which we identified 6 core themes. For each of these themes we were then able to identify clear goals, for example: 100% of logs can be traced to cause. At Moonpig we split the year into thirds, or “Ts” as we call them. For each T we set specific OKRs (measurable goals) to focus our product development efforts. So why not do the same for our technical aspirations?
We asked the teams to choose which themes they wanted to focus on for the first T, and to set a goal to achieve within that T. For example, in the last T we had an OKR to reduce the “no-changes” build time in Visual Studio by 60%. Instead of trying to complete that goal with 20% of our effort, we simply made it another OKR for the team to achieve during that third. That removes the complication of measuring how much time we spend on the product and tech backlogs and trying to maintain the 80%/20% split. A team has product and technical OKRs to focus on each T, and they have equal importance. Crucially, the product owner owns the same OKRs, so there is no temptation to prioritise the product backlog over the tech backlog.
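A goal like the build-time OKR is easy to check because it is a percentage reduction against a baseline. As a rough sketch, assuming build times are recorded in seconds (the function names and the sample timings below are made up for illustration, not our real measurements):

```python
# Hypothetical check for a "reduce X by N%" style OKR.
# The 60% target comes from the text; everything else is illustrative.
def reduction_achieved(baseline_seconds: float, current_seconds: float) -> float:
    """Return the fractional reduction relative to the baseline (0.25 means 25% faster)."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (baseline_seconds - current_seconds) / baseline_seconds


def okr_met(baseline_seconds: float, current_seconds: float,
            target_reduction: float = 0.60) -> bool:
    """Return True once the measured reduction reaches the target."""
    return reduction_achieved(baseline_seconds, current_seconds) >= target_reduction


# Made-up timings: a 100-second baseline, measured twice during the T.
for current in (50.0, 35.0):
    status = "met" if okr_met(100.0, current) else "not yet met"
    print(f"{current:.0f}s build: OKR {status}")
```

Because the goal is defined by the outcome rather than by hours spent, the team does not need to track how its effort was divided to know whether it succeeded.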
The boy scout rule
The boy scout rule applied to your codebase is that you always leave it a little bit better than you found it. We apply this thinking at a more global level. When we’re optimising existing features, or writing new features, we look at how we can use these as opportunities to improve the codebase and architecture. Everyone, including the product team, knows that this might mean it takes us a bit longer. However, everyone now recognises the importance of good code quality and architecture. Our product and business partners have experienced the problem of a poor quality codebase, and they have no wish to return to it. When hearing that something will take 3 weeks, they no longer ask if there’s a “quick hack” we can use instead.
We are still a long way from perfect. 14 years of mess won’t be cleaned up in 3 years, especially not when you’re trying to hit ambitious growth targets. But we are getting better all the time. We’ve gone from having two key components of our website to 20+ services which we can continuously deploy. We have decent automated test coverage, so we no longer need hours of manual testing, and fewer bugs get through. We’ve gone from releasing once every 3 weeks to an average of 4 releases a day. Our investment in architecture and the codebase hasn’t cost the business; it’s provided a huge benefit. These improvements have enabled us to become the fastest growing brand within the Photobox group.