Paul Holland's interstitial Lightning Talk at CAST 2011 was a combination of gripe session, comic relief, and metrics wisdom. The audience in the Emerging Topics track proffered various metrics from their own testing careers for the assembled testers to evaluate informally.
Although I attended CAST remotely via the UStream link, I live-tweeted the Emerging Topics track sessions and was able to contribute my own metric for inclusion in the following list, thanks to the person monitoring Twitter for @AST_News:
- number of bugs estimated to be found next week
- ratio of bugs in production vs. number of releases
- number of test cases onshore vs. offshore
- percent of automated test cases
- number of defects not linked to a test case
- total number of test cases per feature
- number of bug reports per tester
- code coverage
- path coverage
- requirements coverage
- time to reproduce bugs found in the field
- number of people testing
- equipment usage
- percentage of pass/fail tests
- number of open bugs
- amount of money spent
- number of test steps
- number of hours testing
- number of test cases executed
- number of bugs found
- number of important bugs
- number of bugs found in the field
- number of showstoppers
- critical bugs per tester as proportion of time spent testing
“Counting test cases is stupid … in every context I have come across” – Paul Holland
Paul mentioned that per-tester or per-feature metrics create animosity among testers on the same team or within the same organization. When confronted with a metric, I ask myself, “What would I do to optimize this measure?” If the metric motivates behavior that is counter-productive (e.g. intra-team competition) or misleading (e.g. measuring something irrelevant), then it has no value because it does not contribute to delivering value to users. Bad metrics lead to people in positions of power saying, “That’s not the behavior I was looking for!” To be valid, a metric must improve the way you test.
In one salient example, exceeding the number of showstopper bugs permitted in a release invokes stopping or exit criteria, halting the release process. As Paul pointed out, that number is often an arbitrary selection made long ago, perhaps by someone no longer on staff, and yet it blocks the greater goal of shipping the product. Would one critical bug above the limit warrant arresting a rollout months in the making?
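To make the arbitrariness concrete, here is a minimal sketch in Python of that kind of exit-criteria gate. The limit and the bug counts are invented for illustration, not anything Paul cited; the point is that the go/no-go decision reduces to a bare comparison against a number chosen long before the release.

```python
# Hypothetical illustration of a hard "showstopper" exit criterion.
# The limit and counts are invented; the decision is nothing more than
# a comparison against a number picked long before the release.

SHOWSTOPPER_LIMIT = 3  # chosen long ago, perhaps by someone no longer on staff

def release_gate(showstopper_count: int, limit: int = SHOWSTOPPER_LIMIT) -> str:
    """Go/no-go decision based solely on the showstopper count."""
    if showstopper_count > limit:
        return "HALT: exit criteria exceeded"
    return "SHIP: exit criteria satisfied"

print(release_gate(3))  # SHIP: exit criteria satisfied
print(release_gate(4))  # HALT: one bug over the limit arrests the rollout
```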
Paul’s argument against these metrics resonated with my own experience and with the insight I gathered from attending Pat O’Toole’s Metrics that Motivate Behavior! [pdf] webinar back in June of this year:
“A good measurement system is not just a set of fancy tools that generate spiffy charts and reports. It should motivate a way of thinking and, more importantly, a way of behaving. It is also the basis of predicting and heightening the probability of achieving desired results, often by first predicting undesirable results thereby motivating actions to change predicted outcomes.”
Pat’s example of a metric that had no historical value and that instead focused completely on behavior modification introduced me to a different way of thinking about measurement. Do we care about the historical performance of a metric or do we care more about the behavior that metric motivates?
Another point where Pat departs from today’s discussion is his prioritizing of behavior over thinking. I think the context-driven people who spoke in the keynotes and in the Emerging Topics sessions would take issue with that.
Whoever spares the rod hates the child, / but whoever loves will apply discipline. – Proverbs 13:24, New American Bible, Revised Edition (NABRE)
My experience with metrics tells me that numbers accumulated over time are not necessarily evaluated at a high level but are more likely used as the basis for judging individual performance, becoming a rod of discipline rather than the protective rod of a shepherd defending his flock.
Paul did offer some suggestions for bringing metrics back to their productive role:
- a valid coverage metric that does not count test cases
- number of bugs found/open
- expected coverage vs. actual coverage as a measure of progress
He also reinforced the perspective that the metric “100% of test cases that should be automated are automated” is acceptable as long as the overall percentage automated is low.
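For concreteness, here is a minimal sketch, in Python with invented names and numbers (these are my assumptions, not Paul’s figures), of how two of these more useful measures might be computed: actual coverage as a share of expected coverage, and the proportion of should-be-automated test cases that are actually automated.

```python
# Hypothetical sketch: expected vs. actual coverage, and the share of the
# test cases that *should* be automated that actually are. All names and
# figures are invented for illustration, not taken from Paul's talk.

def coverage_progress(areas_expected: int, areas_covered: int) -> float:
    """Actual coverage as a fraction of the coverage expected by now."""
    return areas_covered / areas_expected if areas_expected else 0.0

def automation_share(should_automate: set[str], automated: set[str]) -> float:
    """Fraction of the should-be-automated cases that are actually automated."""
    return len(should_automate & automated) / len(should_automate) if should_automate else 1.0

should_automate = {"login smoke", "checkout smoke", "nightly API regression"}
automated = {"login smoke", "nightly API regression"}

print(f"coverage progress: {coverage_progress(40, 28):.0%}")                    # 70%
print(f"automation share: {automation_share(should_automate, automated):.0%}")  # 67%
```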
Metrics have recently become a particular interest of mine, but I have so much to learn about testing software that I do not expect to specialize in this topic. I welcome any suggestions for sources on the topic of helpful metrics in software testing.
John Stevenson said:
Hi Claire
I like the approach you have used in this blog post, and I also get conflicting advice and messages about how we report testing.
One idea I am toying with is the following:
http://taooftest.wordpress.com/2011/04/12/quantitative-vs-qualitative-reporting-part-2/
I have not yet managed to include expected vs. actual coverage within this dashboard, but that is something I am still working on.
I hope the link proves to be useful
claire said:
Thanks, John. That article’s system was something James Bach mentioned during his CAST 2011 keynote on New Cool Things. Nolan MacAfee actually linked to one of James’ docs in that article’s comments.
Michael Bolton had some suggestions around qualitative resources as well:
- Jerry Weinberg: Quality Software Management, Vol. 2: First-Order Measurement
- How To Observe Software Systems (eBook)
- Experimental and Quasi-Experimental Designs for Generalized Causal Inference [pdf]
- http://www.socialresearchmethods.net/kb/qual.php
Nolan MacAfee (@nmacafee) said:
Great blog post. I’m sharing this around, as I often find we get bogged down in metrics that in the end provide little value. Let’s measure what is important and adds value to our software.
claire said:
Not only are bad metrics a waste of time, but they steer the whole project off course. Keep fighting the good fight, Nolan!