Trap: Pass/Fail statistic as quality maturity metric
Submitted by Ainars Galvans on Thu, 24/08/2006 - 07:51.
functional testing | general software testing
I’ve seen a lot of posts about using Pass/Fail statistics as an indicator of quality progress, or of “how far we are from shippable quality”. I could somehow agree (though I am not happy with it) that it is OK to use “% failed” statistics as UAT (user acceptance testing) exit criteria (for example, if at least 95% of the tests pass, the contract or its phase is fulfilled).
It’s amazing, however, how people hasten to conclude that this statistic is a useful quality measure during the development life-cycle. Even if we know the number of test cases from the beginning and they are not going to change over time (which seems impossible to me), there are a lot of issues with this assumption. Let me first summarize the main problems and then describe them in detail below:
1) Fail %, whether overall or grouped in any way, does not (in general) map to the development effort required to fix the detected issues.
2) Defect severity/business value and the decision on defect fixing (or cancel/postpone/document as known bug) do not directly change the Test Case Pass/Fail status (not until we update the test case and re-run it).
3) If there is more than one defect in a single test case, it still counts as one and the same fail.
Context
I’m not talking about unit tests done by developers. Just like UAT, unit tests aim to prove that the planned work is complete, rather than to hunt bugs. For unit tests, especially in TDD, it could well be true that the fail rate indicates product maturity. Moreover, those unit tests fail only on rare occasions, especially if continuous integration patterns are followed and any regression-type failure is fixed immediately, maybe even before the code is checked in. So in those cases our maturity almost always stays 100% good, just as those continuous integration patterns argue: the product is shippable all the time.
However, once you have a professional tester assigned to such a team, or once you allow users to play with the product, you will discover a lot of issues there. You may argue that those are not bugs but rather feature requests, but what about a customer who says: we are not going to use that product until you fix the following list of issues, even if you call this implementing more features? And what if they have no more money for implementing those “features”? I believe this is the reason why we are unable to find many customers who agree to time & materials contracts: those agile contracts where we implement only as much as we can with the time (=money) they are willing to pay for, instead of fixed scope and fixed money contracts.
Fail % does not map to development effort
The issue I see is the following: there are both “horizontal” and “vertical” features, and there are also both “horizontal” and “vertical” bugs. A single incomplete feature or bug (one that is not critical in the way “application does not start” is) may have any of the following impacts on the test case statistics, no matter how significant this bug or feature is in terms of the development effort required:
- Only one/few out of any number of test cases fail
- All/most of the test cases within a group/module of test cases fail
- One/few test cases within all/most of the test case groups fail
- A few test cases in each of some test groups/modules fail
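To make the four impacts above concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the module names, the suite size of 5 modules with 20 test cases each, and the assumption that each of the four bugs costs the same effort to fix.

```python
# Hypothetical suite: 5 modules with 20 test cases each (numbers are invented).
MODULES = ["login", "search", "edit", "report", "admin"]
CASES_PER_MODULE = 20
TOTAL = len(MODULES) * CASES_PER_MODULE


def fail_percent(failed_cases):
    """Share of failed test cases in the whole suite, in percent."""
    return 100.0 * len(failed_cases) / TOTAL


# Four bugs, each ASSUMED to cost the same effort to fix.
# Each bug is described by the set of (module, case_number) pairs it makes fail.
bugs = {
    "one case only": {("edit", 1)},
    "whole module": {("edit", n) for n in range(CASES_PER_MODULE)},
    "one case in every module": {(m, 0) for m in MODULES},
    "few cases in some modules": {(m, n) for m in MODULES[:2] for n in range(3)},
}

fail_rates = {name: fail_percent(cases) for name, cases in bugs.items()}
for name, rate in fail_rates.items():
    print(f"{name:28s} {rate:5.1f}% failed")
```

Under these assumptions the same fixing effort shows up as anything from 1% to 20% failed, which is exactly why the percentage says little about the work remaining.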
Defect analysis results affect maturity status
I’ve seen different test case pass/fail criteria and different feature pass/fail criteria. Some say that if there is any defect, it is a fail. So what about a defect like the following: according to the specification this label should have the color RGB(60,104,168), but it is actually RGB(59,88,169). Or the placement of a label is off by one pixel. Do you really want to say this is a good reason to fail UAT? But even if you are able to define flexible pass/fail criteria, how about the following cases: a) The defect was caused by a wrong configuration or something similar, so it gets canceled; b) We found a workaround and agreed with the customer that it is OK to use the workaround, so this is no longer a defect.
OK, you may want to say that in both cases we should update the specification to reflect that this deviation is allowed, update the test case run-log status, etc. Well, we are not NASA, you know; our management thinks that if we have this great defect tracking system that reflects the decision, that is enough.
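A small Python sketch of how the run-log and the defect tracker drift apart in cases a) and b). The test case ids, defect ids, and status names are hypothetical; a real defect tracking system will have its own fields.

```python
# Hypothetical run-log: test case -> (status, linked defect id or None).
run_log = {
    "TC-01": ("passed", None),
    "TC-02": ("failed", "D-1"),
    "TC-03": ("failed", "D-2"),
    "TC-04": ("failed", "D-3"),
    "TC-05": ("passed", None),
}

# Hypothetical decisions that are recorded only in the defect tracker.
defect_decisions = {
    "D-1": "canceled",             # case a): wrong configuration
    "D-2": "workaround accepted",  # case b): customer agreed to the workaround
    "D-3": "open",                 # a real defect still to be fixed
}

NO_LONGER_DEFECTS = {"canceled", "workaround accepted"}


def raw_fail_percent(log):
    """Fail % as the run-log alone reports it."""
    failed = [tc for tc, (status, _) in log.items() if status == "failed"]
    return 100.0 * len(failed) / len(log)


def adjusted_fail_percent(log, decisions):
    """Fail % after applying the decisions taken in the defect tracker."""
    still_failing = [
        tc for tc, (status, defect) in log.items()
        if status == "failed" and decisions.get(defect) not in NO_LONGER_DEFECTS
    ]
    return 100.0 * len(still_failing) / len(log)


print(raw_fail_percent(run_log))                         # 60.0
print(adjusted_fail_percent(run_log, defect_decisions))  # 20.0
```

Until somebody updates and re-runs the test cases, the run-log keeps reporting the first number, although the project has already agreed on the second.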
Multiple defects within one failed test case
Our test cases typically consist of several execution steps. You could define that if you have to test the edit operation within search results, then your test case is the edit, while the login and search operations are prerequisites. Even so, your edit operation will consist of starting the edit, making some changes, pressing the save button, and answering yes to a question like “do you really want to save”. What if you observe different issues during the test execution that do not, however, block its progress: unexpectedly blinking items; absence of the “do you really want to save” question; wrong placement of a button; a syntax error in a label; etc.? If you are a good tester you will report all of them, won’t you?
On the other hand, you may discover a single problem: in each of many test cases the OK button has its “OK” text not justified (the same defect everywhere). You will fail 20 test cases, one for each tested window that has an OK button. So in the first case you have only 1 failed test case, but in the second you have 20 failed test cases. Do you really think this reflects maturity?
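The two situations above can be sketched in a few lines of Python. The test case names and defect labels are made up; the only point is to count failed test cases and distinct defects side by side.

```python
def summarize(run_log):
    """Return (failed test cases, distinct defects) for a run-log.

    run_log maps a test case name to the set of defect ids observed in it.
    """
    failed_cases = sum(1 for defects in run_log.values() if defects)
    distinct_defects = len(set().union(*run_log.values()))
    return failed_cases, distinct_defects


# First situation: 20 test cases, one of them shows four different defects.
many_defects_one_case = {f"TC-{n:02d}": set() for n in range(1, 21)}
many_defects_one_case["TC-07"] = {
    "blinking items",
    "missing save confirmation",
    "misplaced button",
    "error in label text",
}

# Second situation: the same cosmetic defect shows up in all 20 test cases.
one_defect_many_cases = {
    f"TC-{n:02d}": {"OK text not justified"} for n in range(1, 21)
}

print(summarize(many_defects_one_case))  # (1, 4)
print(summarize(one_defect_many_cases))  # (20, 1)
```

The fail count moves in the opposite direction to the number of defects that somebody actually has to fix.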
Well, I’ve seen a project that requires a “severity” to be entered for each failed run-log entry, i.e. whether it is a low, medium or high severity failure. Those severities are used as weights in quality maturity metrics. That somewhat addresses the issue, except that the severity entered by a tester may be subjective, and it doesn’t address the first two issues I described.
So what do I propose instead?
Well, I’ve already written about it in “Pass/Fail for run-logs. Is it useful info?”. In summary: run-logs during the software life-cycle (whatever it is: iterative, XP, SCRUM, etc.) only show the maturity of the testing itself. A failed test case should only mean that we failed to perform this test. Blocked means that we never even attempted it. This shows how much we have left to test and, as a result, how many more defects you could expect us to report.
More considerations and reading
In my last post, “Test Cases as knowledgebase”, I promised to write about the reasons for having test cases. I referred to a magazine column that talks, among other things, about how test case pass/fail statistics map to feature pass/fail. Note that in my “Pass/Fail for run-logs. Is it useful info?” I refer to IEEE, where pass/fail criteria are expected only at the test plan and test design specification level, and not at the test case level. I’m not sure whether the IEEE authors did this on purpose, but that is what I respect them for.
