Steve Rowe's Blog : Testing

Description:

Musings about the role of test, how to test effectively, etc.

Last update:

1 week 3 days ago


James Whittaker on Why MS Software "Sucks" Despite Our Testing

A friend turned me on to this post by James Whittaker.  I didn't know he had a blog so now I'm excited to read it.  He has a lot of really interesting things to say on testing so I encourage you to read his blog (now linked on the left) if you are intrigued by testing.

Microsoft prides itself on the advanced state of its testing operations.  This leads to the inevitable question, "If Microsoft is so good at testing, why does your software suck?"  James Whittaker was once a person who asked this question and now that he's at Microsoft he is in a good position to try to answer it.

James gives basically three reasons:

  • Microsoft's software is really complex.  Windows, Exchange, Office, etc. are really, really big projects.
  • Microsoft's software is used by a whole lot of people.  Eric Raymond once made the comment that all bugs are shallow if you have enough eyes.  This applies to closed source software just as much as open source.  Within the first few days of a release of Microsoft software, millions of people are using it.  Windows has an install base in the hundreds of millions.  With so many people looking at it, all bugs are likely to be hit by someone.
  • Microsoft testers are not involved early enough in the process.  This varies throughout the company, but there is still a lot of room for improvement.

Test Suite Granularity Matters

I just read a very interesting research paper entitled, "The Impact of Test Suite Granularity on the Cost-Effectiveness of Regression Testing" by Gregg Rothermel et al.  In it the authors examine the impact of test suite granularity on several metrics.  The two most interesting are the impact on running time and the impact on bug finding.  In both cases, they found that larger test suites were better than small ones.

When writing tests, a decision must be made how to organize the tests.  The paper makes a distinction between test suites and test cases.  A test case is a sequence of inputs and the expected result.  A test suite is a set of test cases.  How much should each test suite accomplish?  There is a continuum but the endpoints are creating each point of failure as a standalone suite or writing many points of failure into a single suite.

The argument for very granular test suites (1 case or point of failure per suite) is that they can be better tracked and analyzed.  The paper examines the efficacy of different techniques for restricting the number of suites run in a given regression pass.  They found that more granular cases were more effectively reduced.  However, the time savings even from aggressive reductions in test suites did not offset the time taken to run all test cases in larger suites.  Grouping test cases into larger suites makes them run faster.  Without reduction the granular cases in the study ran almost 15 times slower.  With reduction, this improved to running only 6 times slower.

Why is this?  Mostly it is because of overhead.  Depending on how the test system launches tests there is a cost to each test suite being launched.  In a local test harness like NUnit, this cost is small but can add up over a lot of cases.  In a network-based system, the cost is large.  There is also the cost of setup.  Consider an example from my days of DVD testing.  To test a particular function of a DVD decoder requires spinning up a DVD and navigating to the right title and chapter.  If this can be done once and many test cases executed, the overhead is amortized across all of the cases in the suite.  On the other hand, if each case is a suite, the overhead is paid once for every case.
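The overhead argument can be put into a rough cost model.  The sketch below is only an illustration: the function name and the timings are made-up assumptions, not figures from the paper.

```python
def total_run_time(num_cases, cases_per_suite, suite_overhead_s, case_time_s):
    """Estimate the time for a regression pass.

    Each suite pays a fixed launch/setup overhead once; each case pays
    its own run time.
    """
    # Ceiling division: a partially filled final suite still pays full overhead.
    num_suites = -(-num_cases // cases_per_suite)
    return num_suites * suite_overhead_s + num_cases * case_time_s


# Hypothetical numbers: 1000 cases, 30 s of setup per suite, 2 s per case.
granular = total_run_time(1000, 1, 30.0, 2.0)   # one case per suite
grouped = total_run_time(1000, 50, 30.0, 2.0)   # fifty cases per suite
```

With these assumed numbers the granular layout spends most of its time on setup, while the grouped layout spends most of its time actually testing.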

Perhaps more interesting, however, the study found that very granular test suites actually missed bugs.  Sometimes as many as 50% of the bugs.  Why?  Because less granular suites cover more state than more granular ones and are thus more likely to find bugs.

It is important to note that there are diminishing returns on both fronts.  It is not wise to write all of your test cases in one giant suite.  Result tracking does become a problem.  It can be hard to differentiate bugs which happen in the same suite.  After a certain size, the overhead costs are sufficiently amortized and enough states traversed that the benefits of a bigger suite become negligible.

I have had first-hand experience writing tests of both sorts.  I can confirm that we have found bugs in large test suites that were caused by an interaction between cases.  These would have been missed by granular execution.  I have also seen the immense waste of time that accompanies granular cases.  Not mentioned in the study is also the fact that granular cases tend to require a lot more maintenance time. 

My rule of thumb is to create differentiated test cases for most instances but then to utilize a test harness that allows them to be all run in one instance of that harness.  This gets the benefits of a large test suite without many of the side effects of putting too much into one case.  It amortizes program startup, device enumeration, etc. but still allows for more precise tracking and easier reproduction of bugs.  If there is a lot of overhead, such as the DVD case mentioned above, test cases should be merged or otherwise structured so as not to pay the high overhead each time.
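As one way to picture this rule of thumb, here is a minimal sketch using Python's unittest, where the expensive setup runs once per class while each case is still reported separately.  `FakeDisc` and the test names are hypothetical stand-ins, not a real DVD API.

```python
import unittest


class FakeDisc:
    """Hypothetical stand-in for an expensive resource such as a spun-up DVD."""

    spin_ups = 0  # counts how many times the expensive setup ran

    def __init__(self):
        FakeDisc.spin_ups += 1
        self.chapters = {1: "intro", 2: "feature"}


class DecoderTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Expensive setup runs once for the whole class, not once per case.
        cls.disc = FakeDisc()

    def test_intro_chapter(self):
        self.assertEqual(self.disc.chapters[1], "intro")

    def test_feature_chapter(self):
        self.assertEqual(self.disc.chapters[2], "feature")


def run_decoder_tests():
    """Run all cases in one harness instance and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DecoderTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Each case still passes or fails on its own, so tracking stays precise, but the disc is only "spun up" once.  The trade-off is that cases sharing state can interfere with each other, which is also how interaction bugs get found.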

Test Code Must Be As Solid As Dev Code

All good development projects follow certain basic practices to ensure code quality.  They use source control, get code reviewed, build daily, etc.  Unfortunately, sometimes even when the shipping product follows these practices, the test team doesn't.  This is true even here at Microsoft.  It shouldn't be the case, however.  Test code needs to be just as good as dev code.

First, flaky test code can make determining the source of an error difficult.  Tests that cannot be trusted make it hard to convince developers to fix issues.  No one wants to believe there is a bug in their code so a flaky test becomes an easy scapegoat.

Second, spurious failures take time to triage.  Test cases that fall over because they are unstable will take a lot of time to maintain.  This is time that cannot be spent writing new test automation or testing new corners of the product.

Finally, poorly written test code will hide bugs in the product.  If test code crashes, any bugs in the product after that point will be missed.  Similarly, poor test code may not execute as expected.  I've seen test code that returns with a pass too early and doesn't execute much of the intended test case.

To ensure that test code is high quality, it is important to follow similar procedures to what development follows (or should be following) when checking in their code.  This includes getting all non-trivial changes code reviewed, putting all changes in source control, making test code part of the daily build (you do have a daily build for your product, don't you?), and using static verification tools like PC-lint, high warning levels, or the code analysis built into Visual Studio.

Test For Failure, Not Success

We recently went through a round of test spec reviews on my team.  Having read a good number of test specs in a short period of time, I came to a realization.  It is imperative to know the failure condition in order to write a good test case.  This is at least as important if not more important than understanding what success looks like.

Too often I saw a test case described by calling out what it would do, but not listing or even implying what the failure would look like.  If a case cannot fail, passing has no meaning.  I might see a case such as (simplified): "call API to sort 1000 pictures by date."  Great.  How is the test going to determine whether the sort took place correctly?
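For the sort example, the failure conditions might be spelled out roughly as follows.  This is a sketch under assumptions: pictures are modeled as hypothetical (name, date) tuples and the real API is replaced by Python's built-in `sorted`.

```python
from collections import Counter
from datetime import date


def check_sorted_by_date(original, result):
    """Return a list of failure messages; an empty list means the case passed."""
    failures = []
    # Failure condition 1: items were dropped, duplicated, or invented.
    if Counter(original) != Counter(result):
        failures.append("result is not a permutation of the input")
    # Failure condition 2: some adjacent pair is out of date order.
    for earlier, later in zip(result, result[1:]):
        if earlier[1] > later[1]:
            failures.append(f"{earlier[0]} appears before {later[0]}")
    return failures


# Hypothetical pictures as (name, date_taken) tuples.
pictures = [
    ("c.jpg", date(2007, 3, 1)),
    ("a.jpg", date(2006, 1, 15)),
    ("b.jpg", date(2006, 7, 4)),
]
good = sorted(pictures, key=lambda p: p[1])
bad = list(pictures)  # an "API" that returned the input unsorted
```

Checking only that the call returned without error would pass both `good` and `bad`.  Naming the two failure conditions is what lets the test tell them apart.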

The problem is even more acute in stress or performance cases.  A case such as "push buttons on this UI for 3 days" isn't likely to fail.  Sure, the UI could fault, but what if it doesn't?  What sort of failure is the author intending to find?  Slow reaction time?  Resource leaks?  Drawing issues?  Without calling these out, the test case could be implemented in a manner where failure will never occur.  It won't be paying attention to the right state.  The UI could run slow and the automation not notice.  How slow is too slow anyway?  The tester would feel comfortable that she had covered the stress scenario but in reality, the test adds no new knowledge about the quality of the product. 

Another example:  "Measure the CPU usage when doing X."  This isn't a test case.  There is no failure condition.  Unless there is a threshold over which a failure is recorded, it is merely collecting data.  Data without context is of little value.
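A sketch of how a measurement becomes a test case once a threshold is attached.  The function name and the limit are assumptions for illustration; how the samples are collected is left out on purpose.

```python
def evaluate_cpu_usage(samples_percent, max_average_percent):
    """Turn raw CPU samples into a pass/fail verdict against a stated threshold."""
    if not samples_percent:
        # No data is itself a failure: the test did not observe anything.
        return False, "no samples collected"
    average = sum(samples_percent) / len(samples_percent)
    if average > max_average_percent:
        return False, f"average {average:.1f}% exceeds limit {max_average_percent:.1f}%"
    return True, f"average {average:.1f}% within limit {max_average_percent:.1f}%"
```

The hard part is not the code but agreeing on `max_average_percent` up front.  Without that number, the samples are just data.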

When coming up with test cases, whether writing them down in a test spec or immediately when writing or executing them, consider the failure condition.  Knowing what success looks like is insufficient.  It must also be possible to enumerate what failure looks like.  Only when testing for the failure condition and not finding it does a passing result gain value.

We Need A Better Way To Test

Testing started simply.  Developers would run their code after they wrote it to make sure it worked.  When teams became larger and code more complex, it became apparent that developers could spend more time coding if they left much of the testing to someone else.  People could specialize on developing or testing.  Most testers in the early stages of the profession were manual testers.  They played with the user interface and made sure the right things happened.

This works fine for the first release but after several releases, it becomes very expensive.  Each release has to test not only the new features in the product but also all of the features put into every other version of the product.  What took 5 testers for version 1 takes 20 testers for version 4.  The situation just gets worse as the product ages.  The solution is test automation.  Take the work people do over and over again and let the computer do that work.  There is a limit to the utility of this, but I've spoken of that elsewhere and it doesn't need to be repeated here.  With sufficiently skilled testers, most products can be tested in an automated fashion.  Once a test is automated, the cost of running it every week or even every day becomes negligible. 

As computer programs became more complex over time, the old testing paradigm didn't scale and a new paradigm--automated testing--had to be found.  There is, I think, a new paradigm shift coming.  Most test automation today is the work of skilled artisans.  Programmers examine the interfaces of the product they are testing and craft code to exercise it in interesting and meaningful ways.  Depending on the type of code being worked on, a workforce of 1:1 testers to devs can usually keep up.  This was true at one point.  Today it is only somewhat true and tomorrow it will be even less true.  Some day, it will be false.  What has changed?  Developers are leveraging better programming models such as object-oriented code, larger code libraries, greater code re-use, and more efficient languages to get more done with less code.  Unfortunately, this merely increases the surface area for testers to have to cover.  Imagine, if you will, a circle.  When a developer is able to create 1 unit of code (r=1), the area which a tester must cover is only 3.14.  When the developer uses tools to increase his work and the radius stretches to 2, the tester must now cover an area of 12.56.  The area needing to be tested increases much faster than the productivity increase.  Using the same programming models as the developers will not allow test to keep up.  In the circle example, a 2x boost in tester performance would only cover 1/2 of the circle.
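The figures in the circle analogy are those of the circle's area, which grows with the square of the radius.  Worked out, taking π ≈ 3.14:

```latex
A(r) = \pi r^{2}, \qquad
A(1) = \pi \approx 3.14, \qquad
A(2) = 4\pi \approx 4 \times 3.14 = 12.56

\frac{A(2)}{A(1)} = 4, \qquad
\frac{2 \cdot A(1)}{A(2)} = \frac{2\pi}{4\pi} = \frac{1}{2}
```

Doubling the radius quadruples the area, so a tester who only doubles output covers half of the new circle.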

Is test doomed?  Is there any way to keep up or are we destined to be outpaced by development and to need larger and larger teams of test developers just to keep pace?  The solution to the problem has the same roots as the solution to the manual testing problem.  That is, it is time to leverage the computer to do more work on behalf of the tester.  It will soon be too expensive to hand-craft test cases for each function call and the set of parameters it entails.  Writing code one test case at a time just doesn't scale--even with newer tools.  In the near future, it will be important to leverage the computer to write test cases by itself.  Can this be done?  Work is already beginning, but it is just in its infancy.  The tools and practices that will make this a widespread practice likely do not exist today.  Certainly not in a readily consumed form.
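To give a flavor of what letting the computer write the cases could look like, here is a small sketch of randomly generated inputs checked against stated properties.  It is one simple technique chosen for illustration, not a description of any particular tool.  The `clamp` function under test and all names are hypothetical.

```python
import random


def clamp(value, low, high):
    """Function under test (hypothetical): restrict value to [low, high]."""
    return max(low, min(value, high))


def generate_cases(count, seed):
    """Yield machine-generated (value, low, high) inputs with low <= high."""
    rng = random.Random(seed)  # seeded so any failure can be reproduced
    for _ in range(count):
        low = rng.randint(-100, 100)
        high = rng.randint(low, 100)
        value = rng.randint(-200, 200)
        yield value, low, high


def run_generated_cases(func, count=1000, seed=0):
    """Return the inputs for which func violates its stated properties."""
    failures = []
    for value, low, high in generate_cases(count, seed):
        result = func(value, low, high)
        in_range = low <= result <= high
        # If the input was already in range it must come back unchanged.
        unchanged = result == value if low <= value <= high else True
        if not (in_range and unchanged):
            failures.append((value, low, high, result))
    return failures
```

The tester writes the properties once and the machine supplies the thousand parameter combinations.  Note that the failure conditions still have to come from a person who understands the function.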

This coming paradigm shift makes testing a very interesting place to be working today.  On the one hand, it can be easy for testers to become overwhelmed with the amount of work asked of them.  On the other hand, the solutions to the problem of how to leverage the computer to test itself are just now being created.  Being in on the ground floor of a new paradigm means the ability to have a tremendous impact on how things will be done for years to come.

 

Update:  There are a lot of people responding to this post who are unfamiliar with my other writing.  Without some context, it may seem that I'm saying that test automation is the solution to all testing problems and that if we're smart we can automate all of the generation.  That's not what I'm saying.  What I advocate in this post is only a powerful tool to be used along with all of the others in our toolbox.  If you want some context for my views, check out:

 

Know That Which You Test

Someone recently related to me his experience using the new Microsoft Robotics Studio.  He loaded it up and proceeded through one of the tutorials.  To make sure he understood, he typed everything in instead of cutting and pasting the sample code.  After doing so, he compiled and ran the results.  It worked!  It did exactly what it was supposed to.  The only problem--he didn't understand anything he had typed.  He went through the process of typing in the lines of code, but didn't understand what they really meant.  Sometimes testers do the same thing.  It is easy to "test" something without actually understanding it.  Doing so is dangerous.  It lulls us into a false sense of security.  We think we've done a good job testing the product when in reality we've only scratched the surface.

Being a good tester requires understanding not just the language we're writing the tests in, but also what is going on under the covers.  Black-box testing can be useful, but without a sense of what is happening inside, testing can only be very naive.  Without breaking the surface, it is nearly impossible to understand what the equivalency classes are.  It is hard to find the corner cases or the places where errors are most likely to happen.  It's also very easy to miss a critical path because it wasn't apparent from the API.

There are three practices which help to remedy this.  First, program in the same language as whatever is being tested.  A person writing tests in C# against a COM interface will have a hard time beginning to understand the infrastructure beneath the interface.  It can also be difficult to understand the frailties of a language different from the one being coded in.  Each language has different weaknesses.  Thinking about the weaknesses of C++ will blind a person to the weaknesses of Perl.  Second, use code coverage data to help guide testing.  Examining code coverage reports can help uncover places that have been missed.  If possible, measure coverage against each test case.  Validate that each new case adds to the coverage.  If it doesn't, the case is probably covering the same equivalency class as another test.  Third, and perhaps most importantly, become familiar with the code being tested.  Read the code.  Read the specs.  Talk to the developers.
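The per-case coverage check can be sketched with plain sets.  The coverage data below is invented for illustration.  In practice it would come from whatever coverage tool the team uses.

```python
def redundant_cases(coverage_by_case):
    """Return names of cases that add no lines beyond the cases before them.

    coverage_by_case maps a test-case name to the set of covered lines,
    in the order the cases were added to the suite.
    """
    seen = set()
    redundant = []
    for name, lines in coverage_by_case.items():
        if lines <= seen:
            # Every line was already covered by an earlier case.
            redundant.append(name)
        seen |= lines
    return redundant


# Hypothetical per-case line coverage.
coverage = {
    "test_empty_input": {10, 11, 12},
    "test_single_item": {10, 11, 14, 15},
    "test_two_items": {10, 11, 14, 15},  # same lines as the previous case
    "test_bad_header": {10, 20, 21},
}
```

A case flagged here is only a candidate for review.  Two cases can touch the same lines with different data and still find different bugs, so line coverage is a hint about equivalency classes, not proof.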

What Tests Belong in the BVTs?

BVTs or Build Verification Tests are standard Microsoft parlance for the tests we run every day to ensure that we didn't break anything important with our checkins the day before.  I've previously written about the importance of keeping them clean.  Within the range of tests that consistently pass, which ones should be in the BVT?  BVT test failures should be something you're willing to act on immediately.  In other words, the failures must be important.  Based on that, here are some criteria:

  • Test major scenarios not minor ones.  If major features are failing, they will be fixed right away.  If a minor feature is failing, it should be noted, but may have to wait until later to be fixed.
  • Test majority use cases, not corner cases.  Tests for the interaction of 3 parts shouldn't be in the BVTs.  Tests outside most user scenarios shouldn't be in the BVTs.  While every book on testing says to test the boundary conditions, the BVTs may not be the place to do that.  Instead, pick the values and scenarios most likely to be used.
  • Run "positive" not "negative" tests.  By that I mean, don't send out-of-bounds conditions or invalid values.  These are valid tests and should definitely be run, but not in the BVTs.  An API faulting when sent a null pointer should be fixed, but the fix can wait until next week.

BVTs should be a carefully guarded set of tests.  They need to run quickly, consistently, and their results should matter.  If these rules are followed, the BVTs will be effective because failures will be respected.  Restricting the BVTs to the most important scenarios will ensure that the results are given the appropriate respect.
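If test metadata is tracked, the criteria above could be applied mechanically.  The sketch below assumes a hypothetical `CaseInfo` record and a 60-second runtime limit.  Neither comes from a real BVT system.

```python
from dataclasses import dataclass


@dataclass
class CaseInfo:
    """Hypothetical metadata a team might record for each automated test."""
    name: str
    major_scenario: bool   # exercises a major feature
    majority_use: bool     # common usage, not a corner case
    positive: bool         # valid inputs only, no out-of-bounds values
    consistent: bool       # passes reliably when the product is healthy
    runtime_s: float       # typical run time in seconds


def select_bvts(cases, max_runtime_s=60.0):
    """Keep only the cases that meet every BVT criterion."""
    return [
        c.name
        for c in cases
        if c.major_scenario
        and c.majority_use
        and c.positive
        and c.consistent
        and c.runtime_s <= max_runtime_s
    ]


cases = [
    CaseInfo("play_default_video", True, True, True, True, 12.0),
    CaseInfo("play_null_filename", True, False, False, True, 3.0),
    CaseInfo("subtitle_font_kerning", False, True, True, True, 8.0),
    CaseInfo("play_three_hour_video", True, True, True, True, 10800.0),
]
```

The rejected cases are still worth running, just in a later, broader pass rather than in the set that blocks the daily build.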

When to Test Manually and When to Automate

There's a balancing act in testing between automation and manual testing.  Over my time at Microsoft I've seen the pendulum swing back and forth between extensive manual testing and almost complete automation.  As I've written before, the best answer lies somewhere in the middle.  The question then becomes how to decide what to automate and what to test manually.  Before answering that question, a quick diversion into the advantages of each model will be useful. 

Manual testing is the most flexible.  Test case development is very cheap.  While skilled professionals will find more, a baseline of testing can be done with very little skill.  Verification of a bug is often instantaneous.  In the hands of a professional tester, a product will give up its bugs quickly.  It takes very little time to try dragging a window in a particular way or entering values into an input box.  This has the additional advantage of making the testing more timely.  There is little delay between code being ready to test and the tests being run.  The difficulty with manual testing is its cost over time.  Each iteration requires human time and humans are quite costly.  The cumulative costs over time can be very, very large.  If the test team is capable of testing version 1.0 in the development time but nothing is automated, it will take the test team all of the 1.0 testing time plus time for the new 2.0 features to release version 2.0.  Version 3.0 will cost 3x as much to test as the first version, and so on.  The cost increases are unsustainable.

On the opposite end of the spectrum is the automated test.  Development of automated tests is expensive.  It requires a skilled programmer some number of hours for each test case.  Verification of the bug can require substantial investment.  The up front costs are high.  The difficulty of development means that there is a measurable lag between code being ready for test and the tests being ready to run.  The advantage comes in the repeated nature of the testing.  With a good test system, running the tests becomes nearly free.

With those advantages and disadvantages in mind, a decision framework becomes obvious.  If testing only needs to happen a small number of times, it should be done manually.  If it needs to be run regularly--daily or even every milestone--it should be automated.  A rule of thumb might be that if there is a need for a test to be run more than twice during a product cycle, it should be automated.  Because of the delay in test development, most features should be tested manually once before writing the test automation for the feature.  This is for two reasons.  First, manual exploratory testing will almost always be more thorough.  The cost of test development ensures this.  Second, it is more timely.  Finding bugs right away while development still has the code in their minds is best.  Do thorough exploratory testing of each feature immediately.  Afterwards, automate the major cases.
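The decision framework amounts to a break-even comparison.  In the sketch below the costs are hypothetical and were picked so that the break-even lands on the "more than twice" rule of thumb.  Real numbers will vary by team and feature.

```python
def cheaper_to_automate(runs, manual_cost_per_run, automation_dev_cost,
                        automated_cost_per_run=0.0):
    """Return True when automating costs less than running by hand `runs` times."""
    manual_total = runs * manual_cost_per_run
    automated_total = automation_dev_cost + runs * automated_cost_per_run
    return automated_total < manual_total


# Hypothetical costs in hours: 1.0 per manual run, 2.5 to write the automation.
two_runs = cheaper_to_automate(2, 1.0, 2.5)     # manual still wins
three_runs = cheaper_to_automate(3, 1.0, 2.5)   # automation pays off
```

The model ignores maintenance of the automation and the value of finding bugs sooner, both of which would shift the break-even point.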

This means that some tests will be run up front and never again.  That is acceptable.  If the right automated tests are chosen, they will act as canaries and detect when things go wrong later.  It is also inevitable.  Automating everything is too costly.  The project won't wait for all that testing to be written.  Those who say they automate everything are likely fooling themselves.  There are a lot of cases that they never write and thus are never run.