Steve Rowe's Blog : Testing
Pass Rates Don’t Matter
Submitted by SteveRowe on Wed, 17/03/2010 - 13:50.
It seems obvious that test pass rates are important. The higher the pass rate, the better the quality of the product. The lower the pass rate, the more known issues there are and the worse the quality of the product. It then follows that teams should drive their pass rates to be high. I’ve shipped many products where the exit criteria included some specified pass rate, usually 95% passing or higher. For most of my career I agreed with that logic. I was wrong. I have come to understand that pass rates are irrelevant. Pass rates don’t tell you the true state of the product. It is important which bugs remain in the product, but pass rates don’t actually show this.
The typical argument for pass rates is that they represent the quality of the product. This makes the assumption that the tests represent the ideal product. If they all passed, the product would be error-free (or free enough). Each case is then an important aspect of this ideal state and any deviation from 100% pass is a failure to achieve the ideal. This isn’t true though. How many times have you shipped a product with 100% passing tests? Why? You probably rationalized that certain failures were not important. You were probably right. Not every case represents this ideal state. Consider a test that calls a COM API and checks the return result. Suppose you pass in a bad argument and the return result is E_FAIL. Is that a pass? Perhaps. A lot of testers would fail this because it wasn’t E_INVALIDARG. Fair point. It should be. Would you stop the product from shipping because of this though? Perhaps not. The reality is that not all cases are important. Not all cases represent whether the product is ready to ship or not.
Another argument is that 100% passing is a bright line that is easy to see. Anything less is hard to see. Did we have 871 or 872 passing tests yesterday? If it was 871 and today is 871, are they the same 129 failures? Determining this can be hard and it’s a good way to miss a bug. It is easy to remember that everything passed yesterday and no bugs are hiding in the 0 failures. I’ve made this argument. It is true as far as it goes, but it only matters if we use humans to interpret the results. Today we can use programs to analyze the failures automatically and to compare the results from today to those from yesterday.
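As a minimal sketch of what such a comparison program might look like, assume each run is recorded as a mapping from test name to pass/fail. The function name, the result format, and the test names below are all hypothetical and not taken from any particular test system.

```python
def compare_runs(yesterday, today):
    """Compare two runs, each a dict mapping test name -> passed (bool)."""
    failed_before = {name for name, passed in yesterday.items() if not passed}
    failed_now = {name for name, passed in today.items() if not passed}
    return {
        "new_failures": sorted(failed_now - failed_before),
        "fixed": sorted(failed_before - failed_now),
        "still_failing": sorted(failed_now & failed_before),
    }


# Hypothetical data: both runs have 2 failures out of 4, so the pass rate is
# identical, yet the set of failing tests has changed.
yesterday = {"test_open": True, "test_seek": False, "test_play": True, "test_stop": False}
today = {"test_open": True, "test_seek": True, "test_play": False, "test_stop": False}

report = compare_runs(yesterday, today)
for category, names in report.items():
    print(category, names)
```

The pass rate is the same on both days, but the comparison shows one new failure that a count alone would have hidden.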
As soon as the line is not 100% passing, rates do not matter. There is no inherent difference in the quality of a product with 99% passing tests and the quality of a product with 80% passing tests. “Really?” you say. “Isn’t there a difference of 19%? That’s a lot of test cases.” Yes, that is true. But how valuable are those cases? Imagine a test suite with 100 test cases, only one of which touches on some core functionality. If that case fails, you have a 99% passing rate. You also don’t have a product that should ship. On the other hand, imagine a test suite for the same software with 1000 cases. Imagine that the testers were much more zealous and coded 200 cases that intersected that one bug. Perhaps it was in some activation code. These two pass rates then represent the exact same situation. The pass rate does not correlate with quality. Likewise one could imagine a test suite of 1000 cases where 200 of the failures were bugs in the tests. That is an 80% pass rate and a shippable product.
The critical takeaway is that bugs matter, not tests. Failing tests represent bugs, but not equally. There is no way to determine, from a pass rate, how important the failures are. Are they the “wrong return result” sort or the “your API won’t activate” sort? You would hold the product for the second, but not the first. Test pass/fail rates do not provide the critical context about what is failing, and without that context it cannot be known whether the product should ship or not. Test cases are a means to an end. They are not the end in themselves. Test cases are merely a way to reveal the defects in a product. After they do so, their utility is gone. The defects (bugs) become the critical information. Rather than worrying about pass rates, it is better to worry about how many critical bugs are left. When all of the critical bugs are fixed, it is time to ship the product whether the pass rate is high or low.
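The three imagined suites above can be put side by side in a short sketch. Everything here is illustrative: the bug names, the severity labels, and the rule "hold if any critical bug is open" are assumptions made for the example, not a description of a real bug-tracking system.

```python
def make_suite(total, failing, bug_id, severity):
    """Build `total` results in which the first `failing` cases hit one bug."""
    results = []
    for i in range(total):
        if i < failing:
            results.append({"passed": False, "bug": bug_id, "severity": severity})
        else:
            results.append({"passed": True, "bug": None, "severity": None})
    return results


def pass_rate(results):
    """Percentage of cases that passed."""
    passed = sum(1 for r in results if r["passed"])
    return 100.0 * passed / len(results)


def critical_bugs(results):
    """Distinct critical bugs behind the failures, however many cases hit them."""
    return {r["bug"] for r in results if not r["passed"] and r["severity"] == "critical"}


# Hypothetical suites matching the scenarios in the text.
small = make_suite(100, 1, "activation-broken", "critical")
large = make_suite(1000, 200, "activation-broken", "critical")
noisy = make_suite(1000, 200, "bug-in-test-code", "test-only")

for name, suite in [("small", small), ("large", large), ("noisy", noisy)]:
    bugs = critical_bugs(suite)
    decision = "hold" if bugs else "ship"
    print(f"{name}: {pass_rate(suite):.0f}% passing, critical bugs: {sorted(bugs)}, {decision}")
```

The 99% suite and the first 80% suite reduce to the same single critical bug, while the second 80% suite has none, which is why the decision follows the bugs and not the rate.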
All that being said, there is some utility in driving up pass rates. Failing cases can mask real failures. Much like code coverage, the absolute pass rate does not matter, but the act of driving the pass rate up can yield benefits.
Five Books To Read If You Want My Job
Submitted by SteveRowe on Thu, 28/05/2009 - 01:03.
This came out of a conversation I had today with a few other test leads. The question was, “What are the top 5 books you should read if you want my job?” My job in this case is that of a test development lead. At Microsoft that means I lead a team (or teams) of people whose job it is to write software which automatically tests the product.
- Behind Closed Doors by Johanna Rothman – One of the best books on practical management that I’ve run across. 1:1’s, managing by walking around, etc.
- The Practice of Programming by Kernighan and Pike – Similar to Code Complete but a lot more succinct. How to be a good developer. Even if you don’t develop, you have to help your team do so.
- Design Patterns by Gamma et al – Understand how to construct well factored software.
- How to Break Software by James Whittaker – The best practical guide to software testing. No egg headed notions here. Only ideas that work. I’ve heard that How We Test Software at Microsoft is a good alternative but I haven’t read it yet.
- Smart, and Gets Things Done by Joel Spolsky – How great developers think and how to recruit them. Get and retain a great team.
This is not an exhaustive list. There is a lot more to learn than what is represented in these books, but these will touch on the essentials. If you have additional suggestions, please leave them in the comments.
Why We Conduct Bug Bashes
Submitted by SteveRowe on Fri, 13/02/2009 - 17:24.
My team recently finished what we call a “bug bash.” That is, a period of time where we tell all of the test developers to put down their compilers and simply play with the product. Usually a bug bash lasts a few days. This particular one was 2 days long. We often make a competition out of it and track the number of bugs opened across the team, with bragging rights or even prizes for those who come out at the top of the list.
Bug bashes are a time when everyone on the team is asked to spend all of their time conducting exploratory testing. Sometimes managers will influence the direction by assigning people end-user scenarios or features to look at. Other times the team is just let go and told to explore wherever they desire. Experience has shown me that some direction can be good. Assigning people to explore an area they don’t usually work on gets new eyes on the product and with new eyes come new use patterns and new bugs. Recently I’ve also discovered that it can be helpful to track where people have spent their time. During our last bug bash we created a list of areas that should be explored and had people sign off when they had investigated them. This gives us a much better sense of just what the coverage looked like and allows us to ensure all areas received attention.
Conducting a bug bash can be expensive. There is a lot of work to get done and putting everything else aside for 2 days adds up to a lot of other work getting pushed off. Why do we do this? What is the return on the investment? There are three primary reasons that come to mind:
First, we have found empirically that a bug bash flushes out a lot of bugs in a short period of time. Our most recent bug bash saw the number of bugs opened jump to 400% of the daily average. This is important because we frontload the finding of the bugs. The earlier we know about bugs, the more likely we are to be able to fix them. Knowing about more bugs also helps us make more informed triage decisions.
The second reason we conduct bug bashes is because they are likely to find bugs on the seams. Test automation can only find certain kinds of bugs. Exploratory testing is a much better way to find issues on the seams—where functional units join up. Sometimes these bugs are the most critical. Imagine if we could have found the Win7 MP3 bug or the interaction between playing audio and network throughput before shipping the respective products. These are the sort of issues highly unlikely to be found in test automation but which can be found through exploratory testing. We obviously don’t find all such issues through bug bashes, but we do find a lot.
The final reason we run bug bashes is to get a sense of the product. Most of the time we spend our days focused on one small part of the operating system or another. It’s hard to get a sense for the state of the forest while staring at individual trees. After spending several days conducting exploratory tests on the product, we can get a much better sense whether the overall product is doing well or if there are serious issues.
James Whittaker Netcast
Submitted by SteveRowe on Fri, 09/01/2009 - 17:27.
James Whittaker is the author of books like How To Break Software. He ran one of the few university-level testing programs at Florida Tech. He's now at Microsoft and helping Visual Studio become better at testing. The guys at .Net Rocks caught up with him for an interview. James explains what he thinks the future of testing is and what's right and wrong with testing at Microsoft. Put this on your Zune/iPod. It's worth the hour.
The Five Why's and Testing Software
Submitted by SteveRowe on Fri, 17/10/2008 - 16:09.
Toyota was able to eclipse the makers of American cars in part due to its production and development systems. The system has been popularized under the rubric of "Lean" techniques. Among the tenets of the Lean advocates is asking the "Five Why's." These are not the W's of journalism: Who, What, Why, Where, and When. They are not even specific questions. Asking five why's means asking why 5 times. Why was the production of cars down? Because there were missing screws. Why were there missing screws? Because the production robots were bumping them. Why were the robots bumping? Because the programming was faulty. Why was the programming faulty? Because the programmer didn't take into account a metric->English conversion. Why didn't the programmer consider conversions? Because...
There is no magic in the number 5. It could be 4 or 6. The importance is to keep asking why until the root cause is understood and fixed. Fixing anything else is just alleviating the symptoms of a deeper problem. Not solving the root problem means it will likely cause other problems later and more time will be wasted later.
How does this apply to testing? It goes to the core of the role of test in a product team. Think about what happens when your team finds a bug in your software. What do you do? Hopefully someone on the test team files a bug report and either the tester or the developer root cause the problem and fix it. This usually means determining the line of source code causing the issue and changing it. Problem solved. Or is it? Why was that line of source code incorrect in the first place? We rarely--if ever--ask.
What if we began to view our role as testers as trying to eliminate bugs from the system instead of from the source code? In that case we would be asking what coding techniques or early testing systems the team could employ to stop the bug from entering the source code at all, or at least to detect it while the code was still under development (better unit testing might be a solution in this category).
James Whittaker on Why MS Software "Sucks" Despite Our Testing
Submitted by SteveRowe on Thu, 02/10/2008 - 17:23.
A friend turned me on to this post by James Whittaker. I didn't know he had a blog so now I'm excited to read it. He has a lot of really interesting things to say on testing so I encourage you to read his blog (now linked on the left) if you are intrigued by testing.
Microsoft prides itself on the advanced state of its testing operations. This leads to the inevitable question, "If Microsoft is so good at testing, why does your software suck?" James Whittaker was once a person who asked this question and now that he's at Microsoft he is in a good position to try to answer it.
James gives basically three reasons:
- Microsoft's software is really complex. Windows, Exchange, Office, etc. are really, really big projects.
- Microsoft's software is used by a whole lot of people. Eric Raymond once made the comment that all bugs are shallow if you have enough eyes. This applies to closed source software just as much as open source. Within the first few days of a release of Microsoft software, millions of people are using it. Windows has an install base in the hundreds of millions. With so many people looking at it, all bugs are likely to be hit by someone.
- Microsoft testers are not involved early enough in the process. This varies throughout the company, but there is still a lot of room for improvement.
Test Suite Granularity Matters
Submitted by SteveRowe on Fri, 26/09/2008 - 15:20.
I just read a very interesting research paper entitled, "The Impact of Test Suite Granularity on the Cost-Effectiveness of Regression Testing" by Gregg Rothermel et al. In it the authors examine the impact of test suite granularity on several metrics. The two most interesting are the impact on running time and the impact on bug finding. In both cases, they found that larger test suites were better than small ones.
When writing tests, a decision must be made how to organize the tests. The paper makes a distinction between test suites and test cases. A test case is a sequence of inputs and the expected result. A test suite is a set of test cases. How much should each test suite accomplish? There is a continuum but the endpoints are creating each point of failure as a standalone suite or writing many points of failure into a single suite.
The argument for very granular test suites (1 case or point of failure per suite) is that they can be better tracked and analyzed. The paper examines the efficacy of different techniques for restricting the number of suites run in a given regression pass. They found that more granular cases were more effectively reduced. However, the time savings even from aggressive reductions in test suites did not offset the time taken to run all test cases in larger suites. Grouping test cases into larger suites makes them run faster. Without reduction the granular cases in the study ran almost 15 times slower. With reduction, this improved to running only 6 times slower.
Why is this? Mostly it is because of overhead. Depending on how the test system launches tests there is a cost to each test suite being launched. In a local test harness like nUnit, this cost is small but can add up over a lot of cases. In a network-based system, the cost is large. There is also the cost of setup. Consider an example from my days of DVD testing. To test a particular function of a DVD decoder requires spinning up a DVD and navigating to the right title and chapter. If this can be done once and many test cases executed, the overhead is amortized across all of the cases in the suite. On the other hand, if each case is a suite, the overhead is multiplied by each case.
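A rough cost model makes the amortization concrete. This is only a sketch: it assumes a fixed overhead per suite launched and a fixed time per case, and the numbers are made up for illustration and are not taken from the paper.

```python
import math


def total_run_time(num_cases, cases_per_suite, suite_overhead_s, case_time_s):
    """Estimate run time in seconds: one overhead charge per suite launched,
    plus the execution time of every case."""
    num_suites = math.ceil(num_cases / cases_per_suite)
    return num_suites * suite_overhead_s + num_cases * case_time_s


# Hypothetical numbers: 100 cases of 1 second each, with 5 seconds of launch
# and setup overhead every time a suite starts.
for cases_per_suite in (1, 10, 100):
    seconds = total_run_time(100, cases_per_suite, 5, 1)
    print(f"{cases_per_suite:>3} cases per suite: {seconds} s")
```

With one case per suite the overhead is paid 100 times; with all cases in one suite it is paid once, and the run time approaches the time spent actually executing cases.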
Perhaps more interesting, however, the study found that very granular test suites actually missed bugs. Sometimes as many as 50% of the bugs. Why? Because less granular suites cover more state than more granular ones and are thus more likely to find bugs.
It is important to note that there are diminishing returns on both fronts. It is not wise to write all of your test cases in one giant suite. Result tracking does become a problem. It can be hard to differentiate bugs which happen in the same suite. After a certain size, the overhead costs are sufficiently amortized and enough states traversed that the benefits of a bigger suite become negligible.
I have had first-hand experience writing tests of both sorts. I can confirm that we have found bugs in large test suites that were caused by an interaction between cases. These would have been missed by granular execution. I have also seen the immense waste of time that accompanies granular cases. Not mentioned in the study is also the fact that granular cases tend to require a lot more maintenance time.
My rule of thumb is to create differentiated test cases for most instances but then to utilize a test harness that allows them to be all run in one instance of that harness. This gets the benefits of a large test suite without many of the side effects of putting too much into one case. It amortizes program startup, device enumeration, etc. but still allows for more precise tracking and easier reproduction of bugs. If there is a lot of overhead, such as the DVD case mentioned above, test cases should be merged or otherwise structured so as not to pay the high overhead each time.
Test Code Must Be As Solid As Dev Code
Submitted by SteveRowe on Fri, 08/08/2008 - 06:59.
All good development projects follow certain basic practices to ensure code quality. They use source control, get code reviewed, build daily, etc. Unfortunately, sometimes even when the shipping product follows these practices, the test team doesn't. This is true even here at Microsoft. It shouldn't be the case, however. Test code needs to be just as good as dev code.
First, flaky test code can make determining the source of an error difficult. Tests that cannot be trusted make it hard to convince developers to fix issues. No one wants to believe there is a bug in their code so a flaky test becomes an easy scapegoat.
Second, spurious failures take time to triage. Test cases that fall over because they are unstable will take a lot of time to maintain. This is time that cannot be spent writing new test automation or testing new corners of the product.
Finally, poorly written test code will hide bugs in the product. If test code crashes, any bugs in the product after that point will be missed. Similarly, poor test code may not execute as expected. I've seen test code that returns with a pass too early and doesn't execute much of the intended test case.
To ensure that test code is high quality, it is important to follow similar procedures to what development follows (or should be following) when checking in their code. This includes getting all non-trivial changes code reviewed, putting all changes in source control, making test code part of the daily build (you do have a daily build for your product, don't you?), and using static verification tools like PC-lint, high warning levels, or the code analysis built into Visual Studio.
