The Siren Call of Test Metrics


This article was originally published in Testing Circus.

Reporting on software testing using metrics can have unintended consequences. Which test metrics do we regularly produce for reporting, and what information will stakeholders interpret from those metrics? Let’s examine some real-world examples in detail.

 

95% Tests Pass, 1% Fail, 4% Untested

Test pass/fail metrics are popular in traditional test progress reports. Consider for a moment that you are the product owner and you have final responsibility for deciding whether to release a product. The test manager presents you with this graph:

Test Progress chart showing 95% pass, 1% fail and 4% untested
There’s certainly a lot of green in this graph, and green is usually good. So, does this graph make you feel more inclined to release the product?

When I read this graph I have more questions than answers. For example:
Failed tests
– Which tests have failed?
– What was the cause of failure and what’s the severity and status of the related bugs?

Untested tests
– Which tests have not yet been executed?
– What’s the risk of not running those tests?

Passed tests
– How well have the major risk areas been tested?
– How were those major risk areas identified?

Risks
– What else has not been tested, aside from whatever is represented by the graph, that the test team feels should have been tested? Examples include user scenarios, end-to-end tests, error handling, performance, security testing and so on.
– What is the test team’s confidence level in product quality?

While the above graph looks more promising than another showing 50% pass and 50% untested, it’s still only a small part of the overall testing story. When viewed in isolation this graph is virtually meaningless, and can provide stakeholders with a false sense of confidence in product quality.
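
To make that flattening concrete, here is a minimal sketch (in Python, with invented test identifiers and statuses rather than data from any real project) of how such a graph is typically derived. Every test is reduced to a single status before the percentages are calculated, so none of the questions above can be answered from the chart alone.

    from collections import Counter

    # Hypothetical execution results; the headline metric only sees the status field.
    test_results = [
        {"id": "TC-001", "status": "pass"},
        {"id": "TC-002", "status": "fail"},      # which bug? what severity?
        {"id": "TC-003", "status": "untested"},  # how risky is skipping this one?
        {"id": "TC-004", "status": "pass"},
    ]

    counts = Counter(result["status"] for result in test_results)
    total = len(test_results)

    for status in ("pass", "fail", "untested"):
        print(f"{status}: {100 * counts[status] / total:.0f}%")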

100% Requirements Coverage

This metric sounds reassuring, and looks impressive when presented as a pie chart.

Requirements coverage pie chart showing 100% green

What information is represented here? The data source for this metric is usually a requirements traceability matrix (RTM). A set of tests is compared to a business requirements document (BRD), and every requirement is ‘covered’ by at least one test. The RTM does not – and cannot – take implicit requirements into account; it is limited to explicitly stated requirements.

Some teams report 100% requirements coverage without ever having executed a test, as the RTM matches tests to requirements irrespective of whether those tests have been performed. A team on a traditional project who have painstakingly written tests for each requirement – without ever having seen working software – may report 100% requirements coverage based on the BRD.
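
As a rough illustration of why that is possible, here is a minimal Python sketch (with invented requirement and test identifiers) of how an RTM-style coverage figure is usually derived: a requirement counts as ‘covered’ as soon as at least one test is mapped to it, regardless of whether that test has ever been executed or how deeply it exercises the requirement.

    # Hypothetical requirements traceability matrix: requirement -> mapped test cases.
    rtm = {
        "REQ-001": ["TC-001", "TC-002"],
        "REQ-002": ["TC-003"],
        "REQ-003": ["TC-004"],  # mapped, but perhaps never executed
    }

    covered = sum(1 for tests in rtm.values() if tests)
    coverage = 100 * covered / len(rtm)

    print(f"Requirements coverage: {coverage:.0f}%")  # 100%, before a single test is run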

During the development and testing cycle, features invariably start to differ from the original requirements. This may be due to implementation limitations, de-scoping of features, changes to the design, and so on. Requirements and written tests quickly become out-of-date. At that point, how useful is this metric?

On a hypothetical highly-organised and well-documented project, if the requirements and tests are constantly kept up-to-date and we perform all of the planned tests, can we then report 100% requirements coverage? The RTM shows which requirements the testers consider to be covered by at least one test case. It doesn’t show the extent to which each requirement will be tested. As we see in this tutorial from Cem Kaner, 100% test coverage of the system is impossible to achieve in any realistic software testing scenario. We cannot test every input, every combination of inputs, every path through the code.
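
A back-of-the-envelope calculation shows the scale of the problem. The figures below are my own illustration rather than anything taken from Kaner’s tutorial: a single function that accepts two 32-bit integers, checked at an optimistic million tests per second.

    # Exhaustively testing one function with two 32-bit integer inputs.
    combinations = 2**32 * 2**32      # every possible input pair
    tests_per_second = 1_000_000      # an optimistic execution rate
    years = combinations / tests_per_second / (60 * 60 * 24 * 365)

    print(f"{combinations:.2e} input pairs, roughly {years:,.0f} years to test them all")
    # 1.84e+19 input pairs, roughly 584,942 years to test them all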

If you report on this metric, do so with caution and with multiple disclaimers included in your report. Again, this metric can provide a false sense of confidence to stakeholders.

12 Bugs Raised Today

The number of bugs raised per day metric can be enhanced by providing more details, such as a breakdown by severity and trends over time. The major risk in simply reporting this data to stakeholders in a graph is that it may be interpreted in any number of ways.

Graph showing bugs raised each day grouped by severity
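
For illustration, here is a minimal Python sketch (with invented bug records) of how a graph like this is usually built. Each bug is reduced to a raised date and a severity; its cause, the feature it affects, and whether it is a duplicate are all discarded before anyone interprets the picture.

    from collections import Counter

    # Hypothetical bug records; the chart keeps only the date and severity.
    bugs = [
        {"id": "BUG-101", "raised": "2018-02-01", "severity": "Sev3"},
        {"id": "BUG-102", "raised": "2018-02-01", "severity": "Sev1"},
        {"id": "BUG-103", "raised": "2018-02-02", "severity": "Sev2"},
        {"id": "BUG-104", "raised": "2018-02-02", "severity": "Sev3"},
    ]

    raised_per_day = Counter((bug["raised"], bug["severity"]) for bug in bugs)

    for (day, severity), count in sorted(raised_per_day.items()):
        print(f"{day}  {severity}: {count}")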

Here are some real-world interpretations of this metric from different stakeholders:

  • Developers aren’t testing their own work enough, “Testers are finding too many bugs”
  • Testers should have found these bugs sooner, “Why was a Sev1 only found yesterday?”
  • Testers are working hard at the moment, “They’re raising lots of issues”
  • Testers aren’t working hard enough, “They could be raising at least 20 bugs every day”
  • Product quality is poor, “Look how many bugs were raised in the last two weeks”
  • Product quality is improving, “There’s less bugs now”

Some of these interpretations may coincidentally be true, but could just as easily be incorrect. If your project stakeholders are presented with a graph similar to this one, are they likely to ask for more information, or leap to their own conclusions?

If testers notice stakeholders equating high bug volume with increased testing effort, how might that affect their behaviour? For example, if a tester in this situation found three spelling errors on one page, they might decide to raise them all individually to increase the volume of bugs raised that day. That would increase the overhead of processing them through the bug tracking system, which in turn could be seen as wasting other people’s time and could fracture relationships between the developers and testers.

0 x Sev1, 3 x Sev2 Bugs Outstanding

There are no showstopper Severity 1 bugs waiting to be fixed for this release, and only three critical Severity 2 bugs.

This metric is a great conversation starter, leading to the obvious question: What are the three Sev2 bugs and do they need to be addressed before the product can be released?

There are some less obvious questions to consider:

  • Has the severity of the bugs been reviewed by more than one person?
  • Are these bugs regressions from the current released version?
  • Have the high-risk areas of the product been tested?
  • What other bugs are outstanding?
  • Were any Sev2 bugs downgraded to Sev3 for the purpose of producing this metric?

Focusing only on the high severity bugs may mask high volumes of minor bugs. If a group of minor bugs relates to a particular feature or aspect of the product, together they can amount to the equivalent of a critical issue. For example, on one project a number of minor bugs in the back-end UI caused staff to spend longer on tasks they performed frequently, slowing down overall productivity and causing frustration with the system.
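
One way to surface that kind of cluster is to slice the outstanding bugs by feature as well as by severity. The sketch below (again in Python, with invented bug and feature names) shows the idea; three minor issues in the same back-end screen tell a different story from three unrelated cosmetic niggles.

    from collections import Counter

    # Hypothetical outstanding bugs. Counting by severity alone hides the fact that
    # the minor issues cluster around a single feature.
    outstanding = [
        {"id": "BUG-201", "severity": "Sev3", "feature": "back-end UI"},
        {"id": "BUG-202", "severity": "Sev3", "feature": "back-end UI"},
        {"id": "BUG-203", "severity": "Sev3", "feature": "back-end UI"},
        {"id": "BUG-204", "severity": "Sev2", "feature": "checkout"},
    ]

    by_severity = Counter(bug["severity"] for bug in outstanding)
    by_feature = Counter(bug["feature"] for bug in outstanding)

    print(by_severity.most_common())  # [('Sev3', 3), ('Sev2', 1)]
    print(by_feature.most_common())   # [('back-end UI', 3), ('checkout', 1)]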

Misuse of Metrics

For the most part, reporting on these metrics alone is the sign of an inexperienced, frustrated or unempowered test lead, and is done with good intentions. I know test leads who produce the metrics requested of them (usually by project managers) while feeling frustrated that the metrics don’t accurately reflect their opinion of product quality, because metrics only tell part of the whole story.

A real-world example of this is a project which had at least one test per written requirement, 95% of tests passing, and no Sev1 bugs outstanding. Based on those metrics, the project manager was satisfied that the exit criteria had been met and that no further time was needed for testing. As we’ve seen above, each of those metrics is flawed when taken in isolation. In this case, the test manager had an uphill battle explaining why further testing time was required to test the major risk areas more thoroughly, while all of the metrics graphs were essentially reporting a green light.

I have seen a case of metrics being produced which were deliberately misleading. On one project I joined, the previous test manager had asked the team to prepare two test cases per business requirement: one positive test case and one negative test case. Here’s an example of a written requirement from that project:

Staff access to the system is limited according to their role.

It’s entirely possible to write two statements which appear to cover this requirement:

  1. Verify that staff users can access the system functions they should be able to perform 
  2. Verify that staff users can’t access system functions which they shouldn’t be able to perform

While the above requirement is clearly lacking in detail (such as which functions the users should and shouldn’t be able to perform), it’s a real-world example. (The BRD had even been reviewed and signed-off by project stakeholders.) Rather than raising a red flag, the test manager had prepared test artefacts based on the limited information available at the time, and had traced those back to the requirements. This masked the fact that more information was needed before this requirement could be tested.

The test manager was reporting 100% requirements coverage before test execution had begun, based on the RTM. Having now read just one requirement and the associated coverage of that requirement, how much value would you place on that metric? What do you anticipate will happen when the test team attempt to verify whether this requirement has been met by the system?

Moral of the Story

Test metrics and graphs can enhance the testing story, but cannot replace it. Do not present or judge testing progress by test metrics alone; use qualitative information as well. Speak with the testing team regularly to learn the whole story.

These graphs are like the siren songs of Greek mythology. They are appealing, hollow, and will lead to a great deal of sadness and destruction if followed unquestioningly. For alternative reporting methods, see the Reporting on Test Progress section of my article in the Testing Trapeze 2015 August Edition.

 
Posted on February 4, 2018 in Software Testing

 


9 responses to “The Siren Call of Test Metrics”

  1. Michael Bolton

    February 7, 2018 at 7:32 AM

    “One positive test case and one negative test case per requirement” means that, for some chunk of text that someone wrote that referred to a requirement in some fashion, we should be doing one set of indeterminate things covering indeterminate conditions, once, on the premise that every explicitly specified precondition should be in place (even though a bunch of necessary ones aren’t explicitly specified); and one other set of indeterminate things covering indeterminate conditions, once, focused on leaving at least one (but maybe more) explicitly specified precondition(s) unfulfilled.

    Wait, that’s a little long. Let me try again. “One positive test case and one negative test case per requirement” means “a shallow investigation of one potential failure and a set of possible surprises.” That’s a little better.

  2. Michael Bolton

    February 7, 2018 at 7:33 AM

    (This was a very good post, Kim.)

    —M.

  3. Oskar Kindbom

    February 7, 2018 at 1:25 PM

    Yes!
    For a while I was using exactly those metrics and I was made to fudge the results (increase pass/fail ratio by creating more passing test cases, reduce number of critical bugs by lumping them together) without being asked what the quality of the product was. I knew the metrics were lying, but I couldn’t figure out which metrics to use instead.
    It was a great relief when I realised that I don’t have to use metrics. I can write a report without numbers and tell the testing story instead. “This is what we tested, these are the bugs we found, here’s what we haven’t tested yet and at this point in time we are confident that the system is good enough to be released”.

  4. Catherine Karena (@cakarena)

    March 12, 2018 at 7:40 AM

    I like this post, it’s very good. I was thinking of a really good story that was shared with me recently https://www.zentao.pm/blog/idealtesting-124.mhtml – that somehow adds another view to the picture, of those that reduce defects before they go into production. How does that impact metrics when it’s an increase in the lack of bugs? (No sleep last night, somehow that reads funny.)

  5. Anne-Marie

    March 12, 2018 at 8:15 AM

    Metrics are not evil per se. It’s when a metric becomes the Soprano – the one voice. Instead, we need a choir that’s constantly singing to provide any meaningful data. We look for anomalies in trends as opposed to one metric at one point in time.

  6. Ali Khalid

    May 1, 2018 at 1:50 AM

    I agree with all you’ve mentioned there; I have been on the hunt for great metrics to use myself, but in vain.

    The problem I struggle with is that we need to have some quantitative measures. They can come at any stage of the SDLC and be prepared by anyone, but something needs to quantify the result of a testing effort. It could be the quality of the product, the time-to-market impact of testing, code coverage, or a mixture of all of these in some way.

    The goal: to get paid in dollars and cents, we need numbers that speak for themselves.

