The Siren Call of Test Metrics

This article was originally published in Testing Circus.

Reporting on software testing using metrics can have unintended consequences. Which test metrics do we regularly produce for reporting, and what information will stakeholders interpret from those metrics? Let’s examine some real-world examples in detail.

 

95% Tests Pass, 1% Fail, 4% Untested

Test pass/fail metrics are popular in traditional test progress reports. Consider for a moment that you are the product owner and you have final responsibility for deciding whether to release a product. The test manager presents you with this graph:

Test Progress chart showing 95% pass, 1% fail and 4% untested
There’s certainly a lot of green in this graph, and green is usually good. So, does this graph make you feel more inclined to release the product?

When I read this graph I have more questions than answers. For example:
Failed tests
– Which tests have failed?
– What was the cause of failure and what’s the severity and status of the related bugs?

Untested tests
– Which tests have not yet been tested?
– What’s the risk of not doing those tests?

Passed tests
– How well have the major risk areas been tested?
– How were those major risk areas identified?

Risks
– What else has not been tested, aside from whatever is represented by the graph, that the test team feels should have been tested? Examples include user scenarios, end-to-end tests, error-handling, performance, security testing and so on.
– What is the test team’s confidence level in product quality?

While the above graph looks more promising than another showing 50% pass and 50% untested, it’s still only a small part of the overall testing story. When viewed in isolation this graph is virtually meaningless, and can provide stakeholders with a false sense of confidence in product quality.
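To illustrate how little the summary reveals, here is a minimal Python sketch. The test IDs, risk labels and the `build` helper are all invented for illustration; they are not taken from any real project or tool. It builds two hypothetical test runs that produce an identical 95/1/4 graph, yet differ completely in where the failed and untested tests sit.

```python
from collections import Counter

def summarise(results):
    """Return the percentage of tests in each status, as the graph would show it."""
    counts = Counter(status for _, _, status in results)
    total = len(results)
    return {status: 100 * n / total for status, n in counts.items()}

def build(fail_risk, untested_risk):
    """Build 100 hypothetical results: 95 pass, 1 fail, 4 untested.

    Each result is a (test ID, risk area, status) tuple. The risk labels
    are an assumption made purely for this example.
    """
    results = [(f"T-{i}", "low", "pass") for i in range(95)]
    results.append(("T-95", fail_risk, "fail"))
    results += [(f"T-{i}", untested_risk, "untested") for i in range(96, 100)]
    return results

def open_high_risk(results):
    """List tests in high-risk areas that have not passed."""
    return [test for test, risk, status in results
            if risk == "high" and status != "pass"]

release_a = build(fail_risk="low", untested_risk="low")
release_b = build(fail_risk="high", untested_risk="high")

# Both runs draw exactly the same graph...
print(summarise(release_a) == summarise(release_b))
# ...but only one of them has open questions in the high-risk areas.
print(len(open_high_risk(release_a)), len(open_high_risk(release_b)))
```

The percentages are the same for both runs; the difference only appears once the results are broken down by something the graph does not show.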

100% Requirements Coverage

This metric sounds reassuring, and looks impressive when presented as a pie chart.

Requirements coverage pie chart showing 100% green

What information is represented here? The data source for this metric is usually a requirements traceability matrix (RTM). A set of tests is compared to a business requirements document (BRD), and every requirement is ‘covered’ by at least one test. The RTM does not – and cannot – take implicit requirements into account; it is limited to explicitly stated requirements.

Some teams report 100% requirements coverage without ever having executed a test, as the RTM matches tests to requirements irrespective of whether those tests have been performed. A team on a traditional project who have painstakingly written tests for each requirement – without ever having seen working software – may report 100% requirements coverage based on the BRD.
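As a rough sketch of that gap, the Python below models an RTM as a simple mapping from requirements to tests. The requirement IDs, test IDs and status values are hypothetical, and real test management tools will store this differently. It compares 'traced' coverage, which only asks whether a test exists, with coverage counted only where a test has actually been run.

```python
# Hypothetical RTM: requirement ID -> list of (test ID, status) pairs.
# All IDs and statuses are invented for illustration.
rtm = {
    "REQ-1": [("T-1", "not run"), ("T-2", "not run")],
    "REQ-2": [("T-3", "passed")],
    "REQ-3": [("T-4", "not run")],
}

def traced_coverage(rtm):
    """Percentage of requirements with at least one test written against them."""
    covered = sum(1 for tests in rtm.values() if tests)
    return 100 * covered / len(rtm)

def executed_coverage(rtm):
    """Percentage of requirements with at least one test that has been run."""
    executed = sum(
        1 for tests in rtm.values()
        if any(status != "not run" for _, status in tests)
    )
    return 100 * executed / len(rtm)

print(f"Traced:   {traced_coverage(rtm):.0f}%")
print(f"Executed: {executed_coverage(rtm):.0f}%")
```

With this sample data the traced figure is 100% while only one requirement in three has had any test run against it. Even the second figure says nothing about how thoroughly each requirement was tested.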

During the development and testing cycle, features invariably start to differ from the original requirements. This may be due to implementation limitations, de-scoping of features, changes to the design, and so on. Requirements and written tests quickly become out-of-date. At that point, how useful is this metric?

On a hypothetical highly-organised and well-documented project, if the requirements and tests are constantly kept up-to-date and we perform all of the planned tests, then can we report on 100% requirements coverage? The RTM shows which requirements the testers consider to be covered by at least one test case. It doesn’t show the extent to which the requirement will be tested. As we see in this tutorial from Cem Kaner, 100% test coverage of the system is impossible to achieve in any realistic software testing scenario. We cannot test every input, every combination of inputs, every path through the code.

If you report on this metric, do so with caution, and with multiple disclaimers included in your report. Again, this metric can provide a false sense of confidence to stakeholders.

12 Bugs Raised Today

The ‘bugs raised per day’ metric can be enhanced by providing more details, such as a breakdown by severity and trends over time. The major risk in simply reporting this data to stakeholders in a graph is that it may be interpreted in any number of ways.

Graph showing bugs raised each day grouped by severity

Here are some real-world interpretations of this metric from different stakeholders:

  • Developers aren’t testing their own work enough, “Testers are finding too many bugs”
  • Testers should have found these bugs sooner, “Why was a Sev1 only found yesterday?”
  • Testers are working hard at the moment, “They’re raising lots of issues”
  • Testers aren’t working hard enough, “They could be raising at least 20 bugs every day”
  • Product quality is poor, “Look how many bugs were raised in the last two weeks”
  • Product quality is improving, “There’s less bugs now”

Some of these interpretations may coincidentally be true, but could just as easily be incorrect. If your project stakeholders are presented with a graph similar to this one, are they likely to ask for more information, or leap to their own conclusions?

If testers notice stakeholders equating high bug volume to increased testing effort, how might that affect their behaviour? For example, if a tester in this situation found three spelling errors on one page, they might decide to raise them all individually to increase the volume of bugs raised that day. Then there would be an increase in overhead to process those through the bug tracking system. This in turn could be seen as wasting other people’s time, and fracture relationships between the developers and testers.

0 x Sev1, 3 x Sev2 Bugs Outstanding

There are no showstopper Severity 1 bugs waiting to be fixed for this release, and only three critical Severity 2 bugs.

This metric is a great conversation starter, leading to the obvious question: What are the three Sev2 bugs and do they need to be addressed before the product can be released?

There are some less obvious questions to consider:

  • Has the severity of the bugs been reviewed by more than one person?
  • Are these bugs regressions from the current released version?
  • Have the high-risk areas of the product been tested?
  • What other bugs are outstanding?
  • Were any Sev2 bugs downgraded to Sev3 for the purpose of producing this metric?

Focusing only on the high-severity bugs may mask high volumes of minor bugs. If there is a group of minor bugs relating to a particular feature or aspect of the product, they can amount to the equivalent of a critical issue. For example, a number of minor bugs in the back-end UI caused staff to spend longer on tasks which they performed frequently, slowing down overall productivity and causing frustration with the system.
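One way to surface such clusters is to count the minor bugs by feature area alongside the headline figure. The Python sketch below uses made-up bug data, and it assumes that Sev3 and Sev4 count as 'minor' and that three or more in one area is worth a closer look. Both of those are arbitrary choices for illustration, not a recommended rule.

```python
from collections import Counter

# Hypothetical open bugs: (bug ID, severity, feature area).
open_bugs = [
    ("B-1", 2, "payments"),
    ("B-2", 2, "search"),
    ("B-3", 2, "reports"),
    ("B-4", 4, "back-end UI"),
    ("B-5", 4, "back-end UI"),
    ("B-6", 3, "back-end UI"),
    ("B-7", 4, "back-end UI"),
    ("B-8", 3, "search"),
]

def headline(bugs):
    """The metric as usually reported: counts of Sev1 and Sev2 only."""
    counts = Counter(severity for _, severity, _ in bugs)
    return f"{counts[1]} x Sev1, {counts[2]} x Sev2"

def minor_bug_clusters(bugs, threshold=3):
    """Feature areas with at least `threshold` open minor (Sev3+) bugs.

    The Sev3+ cut-off and the threshold are assumptions for this example.
    """
    counts = Counter(area for _, severity, area in bugs if severity >= 3)
    return {area: n for area, n in counts.items() if n >= threshold}

print(headline(open_bugs))
print(minor_bug_clusters(open_bugs))
```

The headline reads "0 x Sev1, 3 x Sev2", while the breakdown shows four minor bugs concentrated in the back-end UI. Whether that cluster matters is still a judgement call for the people who know the product.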

Misuse of Metrics

For the most part, reporting on these metrics alone is the sign of an inexperienced, frustrated or unempowered test lead, and is done with good intentions. I know test leads who produce the metrics requested of them (usually by project managers) while feeling frustrated that the metrics don’t accurately reflect their opinion of product quality, because metrics only tell part of the whole story.

A real-world example of this is a project where there was at least one test per written requirement, 95% of tests were passing, and there were no Sev1 bugs outstanding. Based on those metrics, the project manager was satisfied that the exit criteria had been met, and that no further time was needed for testing. As we’ve seen above, each of those metrics is flawed if taken in isolation. In this case, the test manager had an uphill battle explaining why further testing time was required to test the major risk areas more thoroughly, while all metrics graphs were essentially reporting a green light.

I have seen a case of metrics being produced which were deliberately misleading. On one project I joined, the previous test manager had asked the team to prepare two test cases per business requirement: one positive test case and one negative test case. Here’s an example of a written requirement from that project:

Staff access to the system is limited according to their role.

It’s entirely possible to write two statements which appear to cover this requirement:

  1. Verify that staff users can access the system functions they should be able to perform 
  2. Verify that staff users can’t access system functions which they shouldn’t be able to perform

While the above requirement is clearly lacking in detail, such as which functions the users should and shouldn’t be able to perform, it’s a real-world example. (The BRD had even been reviewed and signed-off by project stakeholders.) Rather than raising a red flag, the test manager had prepared test artefacts based on the limited information available at the time, and had traced those back to the requirements. This masked the fact that more information was needed before this requirement could be tested.

The test manager was reporting 100% requirements coverage before test execution had begun, based on the RTM. Having now read just one requirement and the associated coverage of that requirement, how much value would you place on that metric? What do you anticipate will happen when the test team attempt to verify whether this requirement has been met by the system?

Moral of the Story

Test metrics and graphs can enhance the testing story, but cannot replace it. Do not present or judge testing progress by test metrics alone; use qualitative information as well. Speak with the testing team regularly to learn the whole story.

These graphs are like the siren songs of Greek mythology. They are appealing, hollow, and will lead to a great deal of sadness and destruction if followed unquestioningly. For alternative reporting methods, see the Reporting on Test Progress section of my article in the Testing Trapeze 2015 August Edition.

 
Posted by on February 4, 2018 in Software Testing

 
