Never stop making Process Improvements

Have you ever been on a project where, because of the tight schedule or tight budget, the focus was only on delivering business stories? How did that work for you? Did you ever manage to pay all the technical debt incurred? What about the process debt?

I think that in most cases, this is a false economy. We’re gaining a small benefit now (maybe not even that), but we’re paying a much larger cost in the future. This is because, as Mike Rother says, a process that does not improve degrades over time. For example, if you’re not continuously improving the feedback through the deployment pipeline, your 20-minute test suite will grow into a 1-hour test suite. If you’re not constantly fixing brittle tests, people will get used to ignoring them. This is not a people problem. It’s a system problem. It’s much harder to make people do something than to put the required controls in the process. As an example, fail the build if the tests take longer than 30 minutes.
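To make that last control concrete, here is a minimal sketch of a build step that enforces a time budget on the test suite. The function names, the default budget constant and the exit codes are my own assumptions for illustration; a real pipeline would more likely use the timeout feature of its CI tool.

```python
import subprocess
import sys
import time

# Assumed budget, taken from the 30-minute example in the text.
DEFAULT_BUDGET_SECONDS = 30 * 60


def exceeds_budget(elapsed_seconds, budget_seconds=DEFAULT_BUDGET_SECONDS):
    """Return True when a test run took longer than the agreed budget."""
    return elapsed_seconds > budget_seconds


def run_with_budget(command, budget_seconds=DEFAULT_BUDGET_SECONDS):
    """Run the test command and return the exit code the build step should use."""
    start = time.monotonic()
    completed = subprocess.run(command)
    elapsed = time.monotonic() - start

    # A genuinely failing suite keeps its own exit code.
    if completed.returncode != 0:
        return completed.returncode

    # A passing but slow suite fails the build as well.
    if exceeds_budget(elapsed, budget_seconds):
        print(
            f"Tests passed but took {elapsed:.0f}s (budget: {budget_seconds:.0f}s)",
            file=sys.stderr,
        )
        return 1
    return 0
```

A build script could end with something like sys.exit(run_with_budget(["pytest"])), where the test command is only a placeholder. The point of the design is that a slow suite becomes a red build that someone has to act on, rather than a habit that people get used to.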

Why is this still a problem

What’s interesting is that this is not a new problem. Yet most of us still find ourselves in this position. Why? How come we keep regressing? How come the deadline never allows for any improvement time? How come we still don’t have enough slack?

One problem is that the cost of process degradation is hidden. So what if a test fails from time to time? It only takes 5 minutes to run again. So what if we don’t have automated tests? We only regression test the app every 2 sprints. But these process inefficiencies introduce a lot of waste (like wait time and rework).

Another problem is that the process degrades slowly. The regression test suite didn’t double overnight. It got there slowly, over the course of several months. We’re the frog that gets slowly boiled alive.

Yet another issue is that, many times, we optimize locally instead of globally. Yes, maybe one team delivered on a tight deadline, breaking all the rules and incurring a lot of technical debt on the way. People are celebrating the value that the team delivered. But what about the impact of the technical debt on the other 5 teams that now have to live with it? Is the organization aware of the real price that they will need to pay?

What to do about it

It feels like organizations don’t really learn from other people’s mistakes. They make the same mistakes, regardless of all the books and case studies. So, many times, in order to learn, organizations need to feel the pain. It’s like people are saying “It can’t happen to us. We’ll definitely keep the debt under control”. But then it does happen. So, what can we, as software professionals, do about it?

Back-of-the-envelope cost-benefit analysis

The first thing that we can do is to make the cost clear. As I said previously, many times this cost is hidden. So, in order to improve, we need to inform the stakeholders about how much waste each inefficiency causes. This doesn’t have to be a perfect calculation. Many of these things are hard to estimate. A quick, back-of-the-envelope calculation should suffice. Even if you spend more time analyzing the cost, the result will probably not be much more accurate. Let’s see some examples in practice.

A slow test suite

Let’s say that the automated regression test suite takes 2 hours to run. The team thinks that they can cut this in half if they run more tests in parallel. They estimate that this will take roughly 10 days. Cost = 10 man days.

That was simple. What about the benefit?

Lead time: If the regression step was the bottleneck, lead time has now improved by 1 hour. This means that we can fix critical bugs 1 hour faster. We can also potentially ship business value 1 hour faster.

Downstream dependencies: If there are other steps that come right after the regression step, those steps will benefit from this improvement. Let’s say that after a successful regression test run, we can deploy to a UAT environment. We do this about 6 times per sprint. So now users will wait 1 hour less for each deployment. Benefit = 6 hours/sprint (1 man day/sprint). We’re assuming that 1 man day has 6 hours of work.

Feedback for developers: Although, in theory, people start working on something else immediately after merging their code, in practice this is not that simple. If the definition of done for a user story states that the change should not cause any regressions, you’ll want to check that it doesn’t. So you’ll follow the build through the deployment pipeline and make sure that it doesn’t break anything. If it breaks something, you’ll need to do some context switching. You’ll drop whatever you’ve started and investigate the failing build. This is why, in practice, people aren’t 100% focused on the new piece of work until the old piece of work is done. So, each developer could potentially gain 1 hour of focused work per merge. If you have a team of 4 developers, and each developer merges his code at least daily, then you could gain 4 (developers) * 9 (working days in a sprint) * 1 (hours) = 36 hours/sprint (6 man days/sprint).

So, after this quick back-of-the-envelope calculation, the estimated Cost is 10 man days and the estimated Benefit is 7 man days/sprint. So, you’ll break even after about 1.5 sprints.
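The arithmetic above can be written down as a short sketch, so that the assumptions are easy to change. The variable and function names are mine; the numbers are the ones used in this example.

```python
HOURS_PER_MAN_DAY = 6  # assumption stated in the text


def break_even_sprints(cost_days, benefit_days_per_sprint):
    """Number of sprints after which the improvement has paid for itself."""
    return cost_days / benefit_days_per_sprint


cost_days = 10                 # estimated effort to parallelize the suite
hours_saved_per_run = 1        # suite goes from 2 hours to 1 hour

uat_hours = 6 * hours_saved_per_run       # 6 UAT deployments per sprint
dev_hours = 4 * 9 * hours_saved_per_run   # 4 developers, 9 working days, 1 merge per day

benefit_days = (uat_hours + dev_hours) / HOURS_PER_MAN_DAY
sprints = break_even_sprints(cost_days, benefit_days)

print(f"Benefit: {benefit_days:.1f} man days/sprint")
print(f"Break-even after {sprints:.2f} sprints")
```

The exact result is 10 / 7, a bit over 1.4 sprints, which is the figure rounded to "about 1.5" above. At this level of precision the rounding doesn’t matter. What matters is that the improvement pays for itself in a couple of sprints.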

Brittle Tests

Let’s get back to our regression test suite. It now runs in half the time, so 1 hour. But it also contains a couple of brittle tests. These are tests that fail randomly, without any change to the code or the environment. This happens in about 10% of the test runs. A team member had a quick look and she thinks she can solve the problem in 7 days. Cost = 7 days.

What about the benefit?

Downstream dependencies: As we said in the previous example, we can deploy to UAT after a successful regression test run. This happens about 6 times per sprint. There’s a 10% chance that the regression step is red when we need to deploy. This means that subject matter experts lose, on average, 0.6 hours/sprint (0.1 man days/sprint) waiting to deploy to UAT because of a couple of brittle tests.

Feedback for developers: If the test fails, someone needs to drop what he was doing, run the entire test suite again and then hope it passes. With two context switches (one to trigger the step and one to check the result), we can estimate that the developer is losing 1 hour of productive work. The regression suite runs about 36 times per sprint and it fails 3.6 times (10% out of 36) because of the brittle tests. So this means 3.6 hours/sprint (0.6 man days/sprint).

So, the estimated Cost is 7 days and the estimated Benefit is 0.7 days/sprint. So we break even in about 10 sprints. This might not look like much, but there’s another systemic problem lurking around. By rerunning the tests, we are fixing the symptom (tests failing), not the problem (brittle tests). This, in time, will make the problem worse (more brittle tests). This means that the tests will fail more often. If the tests failed 20% of the time instead of 10%, then we would break even in 5 sprints.
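Since the break-even point depends so heavily on the failure rate, it helps to treat that rate as a parameter. The sketch below does this with the numbers from this example. The function names and default arguments are my own, and it assumes that every failed run costs 1 hour, both for the people waiting on UAT and for the developer who reruns the suite.

```python
def brittle_benefit_days(failure_rate, uat_deploys=6, suite_runs=36,
                         hours_lost_per_failure=1.0, hours_per_man_day=6):
    """Man days per sprint currently lost to brittle tests."""
    uat_hours = uat_deploys * failure_rate * hours_lost_per_failure
    dev_hours = suite_runs * failure_rate * hours_lost_per_failure
    return (uat_hours + dev_hours) / hours_per_man_day


def break_even_sprints(cost_days, benefit_days_per_sprint):
    """Number of sprints after which the fix has paid for itself."""
    return cost_days / benefit_days_per_sprint


cost_days = 7  # estimated effort to fix the brittle tests

for failure_rate in (0.10, 0.20):
    benefit = brittle_benefit_days(failure_rate)
    sprints = break_even_sprints(cost_days, benefit)
    print(f"{failure_rate:.0%} failures: {benefit:.1f} man days/sprint, "
          f"break-even after about {sprints:.0f} sprints")
```

Running the loop with a few more failure rates shows how quickly the payback period shrinks as the suite gets flakier.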

Position yourself as a Trusted Advisor

Software is eating the world. Many businesses are now IT businesses. The gap between business and IT is getting smaller. This is a great chance for us software developers to step up and position ourselves as trusted advisors. How can we do that? Here are a few simple ideas:

We should understand the business domain that we are working in. We need to talk with business people using the same ubiquitous language. If you’re working in the accounting domain, you should learn basic accounting. Understand the underlying business need and propose alternative solutions. Maybe even try to solve problems without writing code.
We should care about the business problem that we are solving. We should make suggestions during sprint reviews. We should delight our customers with small UX improvements. These might be easy to do, but will make the Product Owner trust the team even more (thanks Victor Rentea for the tip).

By doing these small things, we’ll gain the trust of the business. If everybody knows that everybody’s doing what’s best for the product, then you might gain time to improve the process.

Be careful what you measure

There’s an old saying: you get what you measure. This is another classic systemic problem: Seeking the wrong goal. If upper management looks at story points delivered, you’ll get more story points. But quality might start to drop. If a high code coverage threshold is imposed on the team, you’ll have high code coverage. But you might also have many useless tests. These metrics might be useful for the team. But when they’re used by someone outside of the team, they can produce unwanted results.

So what can you do instead? Focus on value. Don’t confuse effort with results. Find a metric for results (customer value), even if it’s much harder than just measuring effort (story points). Do a Value Stream Map to understand where your bottleneck is. Eliminate the bottleneck.

Be a professional

We need to act as professionals. If a client tells a building architect that he doesn’t want a strong foundation for a 30-story building, would the architect listen? If you only repaired the visible dents on your car and ignored the engine (since it’s not visible), do you think you would drive that car for long? It’s kind of the same in IT. We have a job because someone values our knowledge and skills. So we should act like professionals. There are lines that we should not cross. And I’m the first to admit that I’ve crossed those lines too many times. Now, I’m not saying that incurring technical debt is always bad. Debt can help you, as long as you keep it in check and pay it off as soon as possible.

Conclusion

Sometimes how we do the work is more important than the work itself. A process that does not improve degrades over time. We’ve all seen this when implementing a simple change request takes months. Or when releasing a new version of the product takes too long. So identify waste, make it visible and eliminate it. Rely on data, rather than hunches. Become a trusted advisor to the business. Don’t confuse effort with results. Be a professional.

Simple Oriented Architecture

Victor Chircu
