Finding Adequate Metrics for Outer, Inner, and Process Quality in Software Development
- There are different domains of quality and they are owned by different stakeholders. Outer quality is owned by the product people (e.g. product owners, testers), inner quality is owned by the developers, and process quality is owned by managers.
- Measuring quality may have severe side-effects: when metrics are gamed, or when people over-focus on them, other important things may fall short. Performance measurements can also make people less creative and discourage collaboration.
- It is desirable but not always possible to find metrics with little or no side-effects. Sometimes it might be useful to add a second metric that counters the side-effects of the first.
- A good metric should always answer a question that helps to reach a clear goal, so it can be questioned and adjusted should other questions arise or goals change.
- Several commonly used quality metrics are not answering the most relevant questions and there are many lesser known metrics that are much more adequate.
PO: « Hey folks, we need to add this new feature as fast as possible! It will bring that revenue boost we need. »
Devs: « But we need time to refactor the code! »
Testers: « And to test it properly! »
PO: « Hm … how much time? »
Devs: « Uhm … 3 weeks? »
Testers: « Rather 4! »
PO: « OK, I’ll give you 2. See you then. »
I had a lot of similar discussions in my career. I think they are a symptom of a deeper problem. Implementing a feature can be measured: it is there. Quality is much harder to measure. How much quality do we need anyway? What for? Which metrics truly tell us the quality?
Why Measure Quality?
There are very different motivations for measuring quality. Often people skip specifying the goal of the measurement because they assume they know it already. I think that’s a big problem, as a different goal may lead to very different questions, which in turn call for very different metrics.
Goals I heard about included:
- I want to improve/maintain quality continuously
- I want to know where we stand
- I want to introduce gamification as a motivation for better quality
- I want to increase pride of work
- I want to detect problems (early)
- I want to control development teams as I don’t trust them
Note that the « I » here are very different people. They might be managers, product owners, developers, testers, architects, stakeholders, … and they are usually not talking about the same « quality ». To find fitting measurements, we need to be clear about what the goal is.
This article is mainly about the goal of being able to balance improving quality and adding new features.
Risks in Measuring Quality
There are two quite severe risks when attempting to measure quality.
First, there’s Goodhart’s Law, which is often quoted as « When a measure becomes a target, it ceases to be a good measure. » I’ve experienced this law a lot in my career. For example, when my team’s performance was judged by velocity, our estimates crept up. We didn’t even want to cheat; it simply happened because we subconsciously didn’t want to justify a dropping velocity to anyone outside of the team. Another time, we were asked to increase code coverage of a legacy code base to 85%. As a result, a lot of worthless tests were added to the test suite, making the code even less maintainable. In both examples the intentions were good, but we ended up with meaningless estimates and a rigid test suite that made changes unnecessarily hard.
When choosing quality metrics, we should be aware of Goodhart’s Law. Using a good metric as a target for the wrong group can very easily end badly.
A second risk is that the measurements have a negative impact on collaboration. Metrics on quality can easily be understood or used as performance metrics. Now, when my performance is measured, I will be more careful with my time and resources. Even when the performance of my team as a whole is measured, we will be less likely to help other teams. This will definitely harm the overall performance.
For the same reason, creativity may suffer. When I feel watched and judged, I avoid potential failure. But creative work – like software development – often requires experimentation, which – quite by definition – can fail.
Daniel Pink describes in « Drive » that extrinsic motivation – like performance measurements – will actually have a negative effect on people doing creative work.
So we need to be very careful not to choose metrics that will be perceived as performance metrics to avoid these negative impacts!
How to Find Good Metrics
To find good metrics there’s a very simple goal-oriented approach named GQM (short for goal, question, metric). You can find this basic idea in a lot of other frameworks.
The basic idea is to first explicitly state a goal, e.g. we, as a development team, want to know if the quality of our code is good enough, so we can commit to new features.
This goal leads to a lot of questions, which then may lead to metrics providing (partial) answers.
The beauty in this approach is that we can safely replace a metric with another that answers the question in a better way. Or remove metrics when we realize that a question is no longer relevant to reach the goal. Or we can review our questions after the goal needs to be adjusted.
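To make the goal, question, metric chain a bit more concrete, here is a minimal sketch of how it could be written down as data. The class names (`Goal`, `Question`, `Metric`) and the example questions are my own illustrative assumptions; GQM itself does not prescribe any particular representation.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str


@dataclass
class Question:
    text: str
    metrics: list[Metric] = field(default_factory=list)


@dataclass
class Goal:
    statement: str
    questions: list[Question] = field(default_factory=list)

    def metrics_in_use(self) -> set[str]:
        """Names of all metrics that still answer at least one question."""
        return {m.name for q in self.questions for m in q.metrics}


# Hypothetical example: one goal, two questions, one metric each.
goal = Goal(
    statement="Know if our code quality is good enough to commit to new features",
    questions=[
        Question("How maintainable is the code?",
                 [Metric("cyclomatic complexity")]),
        Question("How well is it protected against unintended changes?",
                 [Metric("code coverage")]),
    ],
)

# Replace a metric with one that answers the same question better.
goal.questions[1].metrics = [Metric("surviving mutations")]

print(sorted(goal.metrics_in_use()))
```

Because every metric hangs off a question, and every question off the goal, dropping a question automatically drops the metrics that only existed to answer it.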
Domains of Software Quality
Inspired by Dan Ashby’s « 8 Perspectives of Quality », I came up with three basic domains of quality:
Outer quality: the quality users and – as a proxy – testers care most about.
This quality is quite obviously important for overall success. If the product is not attractive to users, it will probably not be successful at all.
Inner quality: this domain of quality is not directly perceivable by users, but very important to developers trying to maintain or change the product.
Inner quality is not necessary for current success. You can create a fantastic product that fulfills all desires of your users, but has terrible inner quality. However, eventually there will be new requirements, shifts in your business model, or changes in the market you want to react to. Poor inner quality might make you too slow to do that.
Process quality: another aspect is the quality of the process of creating, developing and maintaining the product. Process quality is especially important to managers and stakeholders.
There are a lot of factors influencing process quality: bad inner quality will have a huge impact on it, but also losing important members of the team, or increasing the overall workload. Success and failure of organizational restructurings should clearly show in process quality metrics.
These three domains are not distinct, but each has a different owner who is mainly responsible and accountable. For example, inner quality can be ruined by development teams eager to please demanding stakeholders who are concerned about outer quality. Ultimately, only that development team is fully aware of inner quality and its effects on the maintainability of the product. The stakeholders, on the other hand, might know that inner quality is ultimately important, but their first priority is to demand that attractive features be implemented. Hence, it needs to be the development team that objects to these feature demands whenever they see inner quality erode too much. The stakeholders should trust that judgment, as ultimately outer quality will suffer when inner quality gets so bad that the product becomes unmaintainable.
Outer quality might be the hardest aspect to measure. After all, this is ultimately about how people perceive the product, and the perception of people is heavily influenced by more than just the product itself. To get to the questions, I tried to take the perspective of a tester who tries to judge if the product should go to production, a user who is asked for their opinion, or a product manager trying to decide on further development.
How Defective is the Product?
One of the most common metrics especially for testers is the number of found defects. This makes a lot of sense, as a product with defects will probably not be regarded as having great quality.
However, simply taking the number of found defects during tests is quite easy to game: don’t test. But even if you do testing, it is quite impossible to tell if you’ve done enough of it. Even if you did a lot of testing, you might have spent a lot of time testing the wrong thing.
We can improve this by comparing the number of bugs found before production to the number of defects found in production. This rewards thorough testing, but depends a lot on how defects are found/reported in production.
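As a rough sketch, this comparison can be expressed as the share of all known defects that were only found in production. The function name `escaped_defect_ratio` and the numbers are made up for illustration.

```python
def escaped_defect_ratio(found_before_production: int,
                         found_in_production: int) -> float:
    """Share of all known defects that escaped into production."""
    total = found_before_production + found_in_production
    if total == 0:
        raise ValueError("no defects recorded, ratio is undefined")
    return found_in_production / total


# Hypothetical release: 45 defects caught in testing, 5 reported in production.
ratio = escaped_defect_ratio(found_before_production=45, found_in_production=5)
print(f"{ratio:.0%} of known defects escaped to production")
```

Note that the denominator only contains defects somebody actually reported, so the ratio inherits every weakness of the production reporting channel.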
I think the question answered by defect statistics is usually less about the product, and rather about the testing process. « How effective is my testing? » is a valid question, but it has a rather indirect impact on outer quality.
Another way to detect defects is the monitoring of service level objectives (SLO) as described in Site/Service Reliability Engineering (SRE). The basic idea is to think from the users/business perspective back to the software components. For example, we can simply count bad responses to the users. « Bad » can mean displaying an error message, but also slower than required. We can simply monitor and alert for these things. Either closer to the user to make the measure reflect the actual user experience, or closer to the backend to reduce possible root causes and minimize implementation effort. There’s a lot more to say about this technique and it certainly does not replace testing, but I think it is a very objective and sustainable source of information regarding the general reliability of the product.
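A minimal sketch of counting bad responses could look like the following. The `Response` type, the 300 ms latency limit, and the 99% objective are assumptions chosen for the example, not values taken from any SRE guideline.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int        # HTTP status code sent to the user
    latency_ms: float  # time until the response was delivered


def is_bad(response: Response, max_latency_ms: float) -> bool:
    """A response is bad if it is a server error or slower than required."""
    return response.status >= 500 or response.latency_ms > max_latency_ms


def good_ratio(responses, max_latency_ms: float) -> float:
    """Fraction of responses that were neither errors nor too slow."""
    responses = list(responses)
    if not responses:
        raise ValueError("no responses observed")
    good = sum(1 for r in responses if not is_bad(r, max_latency_ms))
    return good / len(responses)


# Hypothetical sample: one slow response and one server error out of four.
sample = [
    Response(200, 120.0),
    Response(200, 950.0),
    Response(503, 80.0),
    Response(200, 210.0),
]

ratio = good_ratio(sample, max_latency_ms=300.0)
objective = 0.99  # assumed SLO target
print(ratio, ratio >= objective)
```

In practice these numbers would come from a monitoring system rather than a list in memory, but the definition of « bad » stays the central design decision either way.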
While SLOs are trying to take the users’ perspective, there’s a risk of not measuring the right things or not in the right way. Just like in testing, we may have huge blind spots.
Think about SRE as a product within the product, or as an additional fixed requirement to the product. It requires constant reevaluation, adjustment and development. A lot of organizations will probably see these efforts as overhead, but as they bring us much closer to measuring the current true quality of the product, I think it is simply necessary to make good decisions. I think in a few years it will be hard for us to remember how we did software development without something like SRE.
Do Users Like the Product?
A rather obvious criterion for outer quality is whether the users like the product.
If your product has customer support, you could simply count the number of complaints or contacts. Additionally, you can categorize these to gain more information. While this is in fact a lot of effort and far from trivial, it is a very direct measure and might yield a lot of valuable information on top.
One problem here is selection bias. We are only counting those who are getting in contact, ignoring those who are not annoyed enough to bother (yet).
Another similar problem is survivorship bias. We ignore those users who simply quit due to an error and never bother to get in contact.
Both biases may lead us to over-focus on issues of a complaining minority, while we should rather further improve things users actually like about the product.
Besides these issues, the complaint rate can also be gamed: simply make it really hard to contact customer support by hiding contact information or increase waiting time in the queue.
To get less biased feedback, we could also send feedback surveys to our users. This method at least avoids the focus on negative feedback, but still has some selection bias as not everybody will take the time to fill a survey.
With extending the feedback form to an annoying effort, asking questions in a confusing way, or setting defaults to a preferred answer (see default effect) there is a lot of potential for gaming here as well.
To lower the effort on the users’ side, we might rely on platform ratings like app stores. This might lower the selection bias, but the numbers are getting very ambiguous as it is really hard to tell why a user leaves a one or a five star rating.
Another obvious way of answering the question is to measure the number of new and/or returning users. These numbers can be deceptive, though. Marketing campaigns usually yield massive spikes, while actual improvements take time to be noticed by the users.
No metric I know of answers the question of whether the users like the product really well. Despite the weaknesses pointed out above, these might be the best options we have right now.
But the fact that it is hard shouldn’t be an excuse for not caring at all. By combining some of these imperfect metrics and keeping their shortcomings in mind, we might still be able to get a solid idea of whether our users like our product.
How Well is the Product Working for Users?
So the question of whether users like the product is pretty hard to answer. It might be easier to figure out if users can work with it effectively.
To answer this question, we might facilitate user experience (UX) tests. We invite actual users to perform a number of tasks with the product while we observe carefully. This is a great way to answer the question! Of course there’s some gaming potential here. We can simply ruin it by giving too detailed instructions, choosing the wrong tasks, having a biased selection of participants, or influencing the users in some way to get the results we are looking for. So doing UX testing successfully takes some serious expertise, practice, and also some setup to guarantee proper lab conditions.
In my experience these tests are not happening very frequently due to the sheer effort of facilitating them. Also, they yield extremely valuable insights, but few countable metrics.
A more pragmatic way might be adding user tracking/observability. This is often done close to the user, e.g. in the frontend via sophisticated tracking libraries. These provide quite interesting insights, like heat maps of the most clicked/hovered items on a page. This is nice, but not necessary. Answering questions like « How much time does a user need to do X? », « How many users use X? », or « Which path does a user take to do X? » can also be answered via simple backend metrics.
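As an illustration of such simple backend metrics, the sketch below derives « How much time does a user need to do X? » and « How many users finish X? » from a plain event log. The event names (`task_started`, `task_completed`) and the tuple layout are hypothetical.

```python
from statistics import median

# Hypothetical backend event log: (user, event name, timestamp in seconds).
events = [
    ("alice", "task_started", 0.0),
    ("alice", "task_completed", 40.0),
    ("bob", "task_started", 10.0),
    ("bob", "task_completed", 70.0),
    ("carol", "task_started", 20.0),  # carol never finished the task
]


def task_durations(events):
    """Seconds between start and completion for each user who finished."""
    started = {}
    durations = {}
    for user, name, timestamp in sorted(events, key=lambda e: e[2]):
        if name == "task_started":
            started[user] = timestamp
        elif name == "task_completed" and user in started:
            durations[user] = timestamp - started.pop(user)
    return durations


durations = task_durations(events)
starters = {user for user, name, _ in events if name == "task_started"}

print(median(durations.values()))      # typical time needed for the task
print(len(durations) / len(starters))  # share of users who completed it
```

Users who start but never complete the task are often the more interesting signal, so it is worth keeping them visible instead of silently dropping them from the average.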
A user tracking/observability system can simply be gamed by not implementing it. It is not trivial to implement it, provide the necessary infrastructure, or to check it for completeness. However, once in place, it can be used to get answers to very urgent unforeseen questions.
For inner quality there is a massive amount of tools out there and they are full of various metrics, but do they actually answer our most important questions? To get to the questions I tried to take the perspective of a developer who should take over further development of an existing product.
How Maintainable is the Product?
One of the first things I’d like to know in that situation is how big is it? Simply looking at the lines of code might seem a bit dumb, but actually it is not the worst idea. Potentially I’ll need to know about each of these lines, so their sheer number is pretty directly related to my ability to maintain it.
However, lines of code is a dangerous metric especially when developers actively optimize for it and produce very dense code that can be extremely hard to read.
More sophisticated complexity metrics like cyclomatic complexity are still easy to measure and reveal compacted complex code quite well.
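To show why such a metric sees through dense code, here is a deliberately simplified approximation of cyclomatic complexity for Python source: one plus the number of decision points. It is a sketch built on the standard `ast` module, and real tools count more constructs than this one does.

```python
import ast

# Node types that add one decision point each (a simplified selection).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approx_cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths
            complexity += len(node.values) - 1
    return complexity


simple = "def g(a):\n    return a + 1\n"
dense = "def f(a, b, c):\n    return 1 if a and b else (2 if c else 3)\n"

# Both functions have a one-line body, but very different complexity.
print(approx_cyclomatic_complexity(simple))
print(approx_cyclomatic_complexity(dense))
```

Both functions look identical in a lines-of-code count, while the complexity figure separates them clearly.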
Complexity will rise naturally when we add more features to a product, but we should constantly check if the added complexity is adequate for the added feature.
Another important driver for the maintainability of a product is its compliance to standard and good practices. One popular way to measure this is static code analysis. By analyzing the code, we can easily recognize a lot of bad patterns called code smells. For example, long methods and huge classes that don’t fit into the developer’s head can quickly become a problem. Strong coupling that makes it impossible to change a part of the code without making a corresponding change somewhere else can be even worse.
There are a lot more of these smells. In general, smelly code is a lot harder to work with. Hence, it is a very important aspect of inner quality. There are a lot of tools that automatically measure the smelliness, giving us a good indication of bad code, but they miss some very important aspects like misleading or inconsistent naming. So, bear in mind that 0 code smells found does not mean that the code is flawless.
How Well is it Protected against Unintended Changes?
Changing code is dangerous. By adding new features we can easily break an existing one. Unit tests are a very good practice to prevent that from happening, and the metric very closely connected to that is code coverage. It is simply the number of code lines executed during a unit test divided by the total number of code lines.
However, the fact that a line was executed during a test does not mean that its effects were checked. If you take a test suite with a code coverage of 82% and remove all the assertions from it, the coverage is still exactly the same!
The most valuable question code coverage answers is how many lines of code are not checked at all?
A much better tool to answer our actual question is mutation testing. It mutates the code of your program, e.g. it will replace a « + » with a « – », or set some integer variable to 0 or some object to null. Then it executes all the tests that execute the mutated line. If any of these tests fail, the mutation is regarded as killed. If all tests succeed despite the mutation, it is regarded as surviving. The number of surviving mutations is a metric that answers our original question much better than code coverage alone.
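The following toy example sketches the idea for a single mutation. It is not how real mutation testing tools are implemented; the function `add` and both checks are invented for the illustration. One check only executes the code, the other one also asserts the result.

```python
import ast

SOURCE = "def add(a, b):\n    return a + b\n"


class PlusToMinus(ast.NodeTransformer):
    """Mutation operator: replace every '+' with '-'."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


def load_add(tree):
    """Compile a module AST and return its 'add' function."""
    namespace = {}
    exec(compile(tree, "<generated>", "exec"), namespace)
    return namespace["add"]


def check_without_assertion(add):
    add(2, 3)  # executes the line (counts as covered) but checks nothing


def check_with_assertion(add):
    assert add(2, 3) == 5


def survives(check, add) -> bool:
    """True if the check still passes, i.e. the mutation was not killed."""
    try:
        check(add)
    except AssertionError:
        return False
    return True


original = load_add(ast.parse(SOURCE))
mutated_tree = ast.fix_missing_locations(PlusToMinus().visit(ast.parse(SOURCE)))
mutant = load_add(mutated_tree)

print(survives(check_without_assertion, mutant))  # mutation survives
print(survives(check_with_assertion, mutant))     # mutation is killed
```

Both checks produce the same line coverage for `add`, but only the one with an assertion kills the mutant.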
The downside of mutation testing is that it requires many additional test executions and hence results in a longer feedback cycle. I’d therefore recommend to run mutation testing only in nightly builds and – if possible – only for parts of the code that actually changed since the last run.
How Confident is the Team with the Product?
A much disregarded fact about code is that there is hardly a comprehensive standard on what good code looks like. There are so many styles, patterns and concepts you can follow and hardly ever two developers agree on all of them. There are also many ways of organizing a code repository: how and what to document where and in what way.
One team’s greatest project can be just awful for the next, because they are used to a completely different style.
That’s one reason why handovers from one team to another are never as smooth as assumed. Often code is officially handed over, but once the new owners are required to make a simple change on their own, a lot of open questions occur and development is not nearly as fast as with the former team.
We can mitigate the problem with (code) style checkers and strict guidelines, but in my experience these fail to prevent the problem and can be quite obstructive for fluent development at the same time.
Instead of very imperfect automated metrics, simply asking the developers for their opinion on the project can be quite effective. Simple team survey questions like « How effectively can you work with the code? » (1 = not effectively at all, 4 = very effectively), or « How much time would you need to deploy a simple change into production? » can give us a very clear answer on how confident the team is with the product.
Combined with a question like « What would you need to improve the above answers? », we can simultaneously generate ideas to improve the situation.
I find the survey method very fitting for this question, as there are a lot of different reasons why a team is not confident with the code: pressured development in the past, handed-over code, major requirement changes, loss of an important team member, … it will always be revealed in the survey.
When a group of managers approached me (the quality guy) and asked to validate the list of metrics they wanted to look into for each development team, most of these were metrics of inner quality. This left me with a bad feeling. I was pretty sure that this kind of monitoring would make the metrics a target and hence, according to Goodhart, would be ruined. Luckily they were reading Accelerate (review on InfoQ), and the basic metrics (also known as DORA metrics) described in that book seemed a very good solution to their problem. They wanted to prevent running into problems by over-pressuring development, understaffing teams, or other macro management decisions that are ultimately harmful to the process. Metrics of outer and inner quality may be used as indicators for the process quality as well, but they are often too detailed and we might fail to see the bigger picture.
Thinking backwards from the metrics in the book, we need three questions answered: « How fast is the process? », « How productive is the process? », and « How secure is the process? » Questions about the cost or efficiency of development might also be interesting, but are often directly related to these three.
How Fast is the Process?
The one metric I stumbled upon early in agile software development is velocity. It measures how many story points a team completes per sprint or time frame. The story points are the result of an estimation per user story or feature done by the team. So the team has a direct influence on the metric through the estimation process. As the estimate is otherwise completely arbitrary, this is a highly gameable metric. Even if the team does not really want to cheat, as soon as you start measuring a team’s performance by velocity, the estimates will always go up. This phenomenon is known as story point inflation.
Velocity does not really answer the question of how fast the process is. It answers the question of how good the team is at predicting its own performance. This is not entirely worthless to know, but if used as a performance metric will not work at all.
Another quite similar metric originates in the lean theory: lead time. It measures the time between the initiation and completion of a process in general. For practical reasons we can define the process to start at the first commit done by the development team and to end when the code was deployed to production. So basically we end up with velocity minus story points.
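Under this practical definition, lead time is a simple subtraction per change. The sketch below assumes we already have, for each change, the timestamp of its first commit and of its production deployment; the data is invented.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical changes: (first commit, deployed to production).
changes = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 9, 9, 0)),
    (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 13, 9, 0)),
    (datetime(2024, 1, 15, 9, 0), datetime(2024, 1, 25, 9, 0)),
]


def lead_times(changes):
    """Time from first commit to production deployment, per change."""
    return [deployed - first_commit for first_commit, deployed in changes]


print(median(lead_times(changes)))
```

I use the median here rather than the mean so that a single long-running change does not dominate the picture, but that is a choice, not part of the definition.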
Working faster is not the only way to shorten our lead time; we can also reduce the size of our deliverables. Smaller deliverables give us earlier feedback and make us work more focused. These are good side-effects! But we could also reduce our testing efforts, which might lead to a decrease in quality in production. Ultimately, this bad quality will lead to incidents and bug reports, which will ruin our lead time again, but clearly lead time cannot be our only target.
How Productive is the Process?
A popular metric for productivity is capacity utilization. It is basically the ratio of actual output to potential output. There are several similar metrics.
This makes a lot of sense when monitoring machines in a factory. The capacity of a machine is pretty fixed. Even down times for repairs can generally be taken into account for this. However, when we are looking at the productivity of a software development process, lots of things are different. Most obviously, unplanned/unplannable work is much more likely, investing in engineers’ knowledge and skills will probably decrease utilization for a short period of time, but can result in a massive productivity boost later, and people are simply not machines. A drop in utilization can very simply mean that a good product team just waits for the results of their last change and has some time to tidy up and improve inner quality.
In my personal experience, when 100% utilization is the (implicit) goal, what you actually get is the opposite of productivity: either people pretend to be at 100% doing nonsense busy work, or capacity drops massively due to burnout. Probably both can happen at the same time.
Instead of utilization, we can keep an eye on batch size, which is basically the number of things in progress that have not yet been delivered to the customer. In software development these things are basically any changes that are not deployed to production. Hence, we can use a slightly easier to measure proxy: deployment frequency. Whenever we start to work on several things at once, the consequence is a longer period of time where we won’t deploy to production. Either we can only deploy once everything is done, or we still have open branches that need to be merged eventually.
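Deployment frequency is easy to derive from a list of deployment dates. A minimal sketch, with invented dates, grouping deployments by ISO calendar week:

```python
from collections import Counter
from datetime import date

# Hypothetical production deployments.
deployments = [
    date(2024, 1, 8),
    date(2024, 1, 9),
    date(2024, 1, 11),
    date(2024, 1, 17),
]


def deployments_per_week(deploy_dates):
    """Count deployments per (ISO year, ISO week)."""
    return Counter(tuple(d.isocalendar()[:2]) for d in deploy_dates)


for (year, week), count in sorted(deployments_per_week(deployments).items()):
    print(f"{year}-W{week:02d}: {count}")
```

Weeks without any deployment do not show up in the counter at all, so a real report would need to fill those gaps with zeros, since they are exactly the weeks we want to notice.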
When measuring batch size or deployment frequency, it doesn’t matter what gets deployed to production. It can be a big refactoring that will make the development of the next feature faster, it can be the implementation of diagnostic features that will make the product more observable and lead to much better decisions, or it can be a new feature.
The easiest way of gaming this metric would be to deploy a lot of tiny changes that don’t make much sense. Cutting actual work into smaller yet meaningful pieces and simply reducing work in progress are the much more likely side-effects, which will improve performance and productivity greatly.
How Secure is the Process?
So, in order to make the process fast, we can sacrifice certain checks. And maybe we should, but there’s probably a point where we are really fast in delivering defects instead of working software. So, we need to also care about the safety of the process. One quite simple way of measuring this is to count the number of failed changes and compare it to the total number of changes: the change fail rate.
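Expressed as code, the change fail rate is just a ratio. The sketch assumes each change has already been classified as failed or not, which in practice is the hard part; the outcomes below are invented.

```python
def change_fail_rate(outcomes) -> float:
    """outcomes: one boolean per change, True if the change failed."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no changes recorded")
    return sum(outcomes) / len(outcomes)


# Hypothetical history of ten changes, two of which failed in production.
outcomes = [False, False, True, False, False, False, False, True, False, False]
print(f"{change_fail_rate(outcomes):.0%}")
```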
The most obvious way of improving this number would be to add a lot of checks before actually applying the change. That is perfectly valid, but will likely increase batch size and lead time. If the checks are automated, only applied if really necessary, and kept short but productive, the change fail rate will be low without harming our performance metrics.
Trying to prevent failures is certainly a good thing, but some last risks are often quite persistent. Preventing these events can take a lot of effort and might even prove (economically) impossible. So, instead of preventing these failures at all costs, we might sometimes also invest in being really good at noticing and fixing them. The mean time to recovery is a good way of measuring just that. We simply start the clock when something went wrong and stop it when everything is working again.
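Starting and stopping the clock translates directly into a small calculation. The incident timestamps below are made up, and what exactly counts as « went wrong » and « working again » is an assumption each team has to define for itself.

```python
from datetime import datetime, timedelta

# Hypothetical incidents: (something went wrong, everything works again).
incidents = [
    (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 30)),
    (datetime(2024, 3, 9, 22, 0), datetime(2024, 3, 9, 23, 30)),
]


def mean_time_to_recovery(incidents) -> timedelta:
    """Average duration between failure and recovery."""
    durations = [recovered - failed for failed, recovered in incidents]
    if not durations:
        raise ValueError("no incidents recorded")
    return sum(durations, timedelta()) / len(durations)


print(mean_time_to_recovery(incidents))
```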
This rewards investment in monitoring, modern deployment techniques, (again) small = low risk batch sizes, and other desirable things. The worst kind of gaming would be to hide failures, but in my experience this will eventually show in very solid metrics like revenue.
In my experience, all three kinds of quality are important.
Outer quality might be the most obvious one. It is hard to measure, though. We tend to measure what we can instead of what we should. We should be very conscious about how far away our quality metrics are from the actual interesting questions and ideally come up with new metrics that are at least closer.
To answer how well the product works for our users, we should implement some user tracking, ideally guided by results from UX tests. To detect defects and keep an eye on them, we should also adopt SRE and have a decent set of constantly monitored SLOs, but also pay close attention to user complaints, as we might have some loopholes in our monitoring. As for the question of whether our users like the product, there’s probably no simple metric and we should be aware of this gap.
In inner quality there’s really no shortage of metrics. Static code analysis gets us very far and it is easy to get lost in all these numbers. But we should not claim that any of these numbers represents inner quality as a whole. They are far too detailed and opinionated and leave the abilities of the development team completely out of sight. In my experience, measuring inner quality requires a constant review of the measurements, which can only be done by the development team itself. As a baseline, code complexity and code smells give us a quite good idea of how maintainable the code base is. Mutation testing gives us a very good idea about how well the code’s functionality is protected, and by adding an occasional team survey, we also keep track of the team’s overall confidence.
But the team’s work must never be judged by inner quality metrics as that will inevitably lead to undesirable side-effects due to Goodhart’s Law.
Process quality is what managers should monitor instead. The questions and metrics should ideally not require the development team to work in a certain way, as that way might very well not be ideal for all the teams or products. The DORA metrics are a very good set for this.
Keeping the team flexible and trusting in their judgment is maybe one of the most important factors of a truly great software development company.
In any measurement, I think we need to be aware of the difference between the question we want answered and the one the metrics actually answer. That knowledge allows us to eventually find a better metric, or find additional metrics to counter or at least mitigate the negative side-effects of the ones we have.