Mutation testing

Tarik Kilic
5 min read · Jan 9, 2022

Unit tests are great. They're probably the fastest way of catching the catastrophic consequences of a change to your software, pair programming aside. As codebases grow more and more complex (and we all know they unfortunately do), unit tests allow you to be slightly more confident making fairly complex changes and moving on. We all learn this very early in our engineering tenure.

The moment you start writing unit tests, you have a couple of options. Some of us are rigid followers of test-driven development, which pretty much means you make your test fail first and then write the code to make it pass. For some of us, unit tests are an afterthought, probably written to make that PR merge criterion pass. I'm not here to discuss which is better; whatever floats your boat, as far as this article is concerned. Whichever school we follow when writing tests, we all probably stop when the code coverage hits 100%. That's the universal go-to metric for unit tests. It also sometimes serves as a good indication of a healthy codebase, or helps set a target line if you're catching up on your code coverage.

There's a caveat though. Not all 100% code coverage is the same. There are myriad ways to reach that 100% for the same piece of code. Hate to break it to you, but some of those ways are far better than others, given coverage's purpose. I'm even going to take it one step further: not all 100% coverage guarantees that unit tests are meeting their purpose, which is catching a change that breaks unit logic before it makes its way to production. Shocking, right? Let's look at an example of "100%" code coverage that is not useful for its purpose.

We have a simple piece of code here, as dumb as it gets, owing to my lack of imagination and for simplicity of demonstration. It takes some arguments and returns a boolean according to its logic.
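The original post showed this code as an image, so here is a minimal sketch of what such a controller could look like. The class and method names (`SimpleController`, `someLogic`) come from the article; the method body and the parameter names `count1` and `count2` are assumptions based on the description later in the post.

```java
// Hypothetical reconstruction; the article's actual snippet was an image.
public class SimpleController {

    // Returns true only when both counts are strictly positive.
    public boolean someLogic(int count1, int count2) {
        return count1 > 0 && count2 > 0;
    }
}
```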

Now let's look at the following tests, written for the someLogic method.
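The original tests were also shown as an image, so the JUnit 5 sketch below is an assumption: it reaches 100% line and branch coverage of a `someLogic` that requires both counts to be strictly positive, yet never uses the boundary value zero. The nested `SimpleController` copy is inlined only to keep the sketch self-contained.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class SimpleControllerTest {

    // Assumed implementation under test, inlined for self-containment.
    static class SimpleController {
        boolean someLogic(int count1, int count2) {
            return count1 > 0 && count2 > 0;
        }
    }

    private final SimpleController controller = new SimpleController();

    @Test
    void returnsTrueWhenBothCountsArePositive() {
        assertTrue(controller.someLogic(5, 5));
    }

    @Test
    void returnsFalseWhenFirstCountIsNegative() {
        assertFalse(controller.someLogic(-5, 5));
    }

    @Test
    void returnsFalseWhenSecondCountIsNegative() {
        assertFalse(controller.someLogic(5, -5));
    }
}
```

Every line and both branch outcomes are exercised, but no test input sits at zero, the one value where `>` and `>=` disagree.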

The tests above all pass and indeed result in 100% coverage for SimpleController. There's no problem with that. But there's a fundamental issue at play here. Those unit tests stink; they're really bad ones. Let's showcase this by changing the SimpleController logic slightly, as follows.
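A sketch of the changed version, assuming the hypothetical implementation where someLogic requires both counts to be strictly positive; the only difference is `>` becoming `>=` on the count2 check.

```java
public class SimpleController {

    public boolean someLogic(int count1, int count2) {
        // Was: count2 > 0. The relaxed check now also accepts zero.
        return count1 > 0 && count2 >= 0;
    }
}
```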

Now, as part of this change, our logic also accepts a count2 parameter of zero. There's a change in the unit logic. How do our tests do? They all pass. They think this is fine. But when you think about this change (let's call it a mutation, for the sake of our broader topic), it could be the very first introduction of a sweeping bug in your codebase. You want your unit tests to complain about this mutation, so you can also adapt them to the new logic. You want this mutation in the code to break your tests, but it hasn't.

Unfortunately, not all tests are good tests and not all 100% coverages are good coverages. Sometimes they're outright bad coverages that blind you to problems, real big problems.

It is very obvious that these tests are bad ones, and it would also be very obvious that the change I've introduced is buggy if it went through a code review process. But that's only because it is stupidly simple to see what's going on. When you scale this issue to the real world, where we deal with very complex contexts, it may not be that straightforward to see that the change or the tests are stupid.

So now, as someone who wants to establish good engineering practices in a complex project, you wonder: how will I know our test coverage is one that sustains a healthy codebase? There's an answer to that. It's called mutation testing.

Mutation tests are the best kind of tests, because you don't write them (just as the best kind of code is the code not written). More or less, different libraries in different ecosystems follow the same approach: they make one single targeted change to your covered codebase. Each of these changed versions of the covered code is called a mutation. Then they run your unit test suite against each mutation and see if it fails. Any mutation that survives, i.e. the unit tests still pass, practically means that your unit tests are not good ones. They're not sturdy enough to catch breaking changes to the unit logic, and I don't know why you would need them if that's not what they do.

That, I hope, gives the gist of what mutation testing is in theory. We've talked the talk, but let's also walk the walk. In the following section, I'll try to make the bad unit tests above better using mutation testing.

For Java, I've used PIT, which integrates with Gradle quite easily, as follows.
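A minimal build.gradle sketch using the community gradle-pitest-plugin; the plugin version, JUnit 5 plugin version, and package name below are placeholders, not taken from the article.

```groovy
plugins {
    id 'java'
    // Community Gradle plugin for PIT; pin a version that suits your build.
    id 'info.solidstate.gradle-pitest-plugin' version '1.7.0'
}

repositories {
    mavenCentral()
}

pitest {
    targetClasses = ['com.example.*']   // packages whose classes get mutated
    junit5PluginVersion = '0.15'        // only needed for JUnit 5 test suites
}
```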

After configuration, it's as simple as running ./gradlew pitest, which prints out the following for our tests above.

====================================================================
- Mutators
====================================================================
> org.pitest.mutationtest.(...).ConditionalsBoundaryMutator
>> Generated 2 Killed 0 (0%)
(...)
====================================================================
- Statistics
====================================================================
>> Generated 7 mutations Killed 5 (71%)
>> Mutations with no coverage 0. Test strength 71%
>> Ran 11 tests (1.5 tests per mutation)

We can also see similar information about the mutation test run in the generated test report.

If you follow the trail, it leads to this dashboard, which is very useful for getting to the bottom of the issue with the tests.

The PIT documentation has a section dedicated to helping developers make sense of surviving mutations, which in our case is described as "changed conditional boundary". This particular mutator replaces the > operator with >= on line 14 of SimpleController. It's the exact same mutation we applied ourselves above, which made none of the unit tests fail; in this context it's called a surviving mutation.

This just makes sense. Looking at our tests, we're not quite testing our code rigidly. Let's test line 14 a bit more thoroughly.
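Sticking with the assumed implementation (`count1 > 0 && count2 > 0`), a sketch of the extra boundary tests: zero is exactly the value where `>` and `>=` behave differently, so pinning it down kills the changed-conditional-boundary mutations. As before, the nested `SimpleController` is inlined only to keep the sketch self-contained.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class SimpleControllerBoundaryTest {

    // Assumed implementation under test.
    static class SimpleController {
        boolean someLogic(int count1, int count2) {
            return count1 > 0 && count2 > 0;
        }
    }

    private final SimpleController controller = new SimpleController();

    @Test
    void returnsFalseWhenFirstCountIsZero() {
        // Mutating "count1 > 0" to "count1 >= 0" flips this to true.
        assertFalse(controller.someLogic(0, 5));
    }

    @Test
    void returnsFalseWhenSecondCountIsZero() {
        // Mutating "count2 > 0" to "count2 >= 0" flips this to true.
        assertFalse(controller.someLogic(5, 0));
    }

    @Test
    void returnsTrueAtTheSmallestAcceptedValues() {
        assertTrue(controller.someLogic(1, 1));
    }
}
```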

This new suite of unit tests results in all mutations being killed. The tests also satisfy the logic by explicitly covering all the cases that can occur around line 14. ./gradlew pitest now produces the following output.

>> Generated 8 mutations Killed 8 (100%)
>> Mutations with no coverage 0. Test strength 100%
>> Ran 13 tests (1.62 tests per mutation)

Before closing, it's also worth mentioning that you can configure the mutations that are applied to your code. There are pre-defined sets of mutators, such as DEFAULTS, STRONGER and ALL.

Setting it from build.gradle as follows
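A sketch of such a configuration, using the gradle-pitest-plugin's mutators property; the group names come from the PIT documentation, and the package name is a placeholder.

```groovy
pitest {
    targetClasses = ['com.example.*']
    // One of the pre-defined groups: 'DEFAULTS', 'STRONGER' or 'ALL'.
    mutators = ['STRONGER']
}
```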

leads to more mutations than the default set.

>> Generated 9 mutations Killed 9 (100%)
>> Mutations with no coverage 0. Test strength 100%
>> Ran 18 tests (2 tests per mutation)

Mutation testing personally filled a gap I'd been thinking about constantly while brainstorming with teams on different ways of writing unit tests. I must admit I'd struggled to find an objective criterion as effective as mutation testing. Hope it does the same for you as well!
