Tuesday, August 14, 2012

On Testing

More than a decade ago, I spent a year doing quality assurance at a big and successful smart card company. It was one of the more intellectually stimulating jobs I've had in the software industry and I ended up developing a scripting language, sadly long lost, to do test automation. Doing testing right can be much harder than writing the software being tested, especially when you're after 100% quality, as is the case with software burned onto the chips of millions of smart cards that would have to be thrown away if a serious bug were discovered post-production. Companies aiming at this level of quality get ISO 900x certified to show off to clients. To get certified, they have to show a QA process that guarantees, to an acceptable degree of confidence, that the products delivered work, that the organization is solid, that knowledge is preserved, and so on. The interesting part that I'd like to share with you is the specific software QA approach. It did involve an obscene amount of both documentation and software artifacts that had to be produced in a very rigid formal setting, but the philosophy behind spec-ing out the tests was sound, practical and better than anything I've seen since.

Dijkstra famously said that testing can only prove the presence of errors, never their absence. True that. To have a 100% guarantee that a program works, one would need to produce a mathematical proof of its correctness. And perhaps even that is not sufficient: as Knuth no less famously noted when sending a program to a colleague, the program must be used with caution because he had only proved it correct, never tested it. In either case, ultimately the goal is to gain confidence in the quality of a piece of software. We do that by presenting a strong, very, very convincing argument that the program works.

When discussing testing methodology, people generally talk about automated vs. manual testing, or about test-driven development, where test cases are developed before the code, vs. "classic" testing, where they are done after development, but rarely do I see people mindful of how tests should be specified. The term test itself is used rather ambiguously to mean the action of testing, or the specification, or the process, or the development phase. And in some contexts a test means a test case, which refers either to an input data set or to an automated program (e.g. a JUnit test case). So let's agree that, whether you code it up or do it manually, a test case consists of a sequence of steps taken to interact with the software being tested and verify the output, with the goal of ensuring that said software behaves as expected. So how do you go about deciding what this sequence of steps should be? In other words, how do you gather test requirements?

Think of it in dialectical terms - you're trying to convince a skeptic that a program works correctly. First you'd have to agree on what it means for that program to work correctly. Well, they say, it must match the requirements. So you start by reading all the requirements and translating that last statement ("it must match the requirements") for each one of them into a corresponding set of test criteria. Naturally, the more detailed the requirements are, the easier that process is. In an agile setting, you might be translating user stories into test criteria. Let's have a simple running example:


Requirement: the login form should include captcha protection.

Test Criteria:

  • C1 - the login form should display a string as an image that's hard to recognize by a program.
  • C2 - the login form should include an input field that must match the string in the captcha image for login to succeed. 

Notice how the test criteria explicitly state under what conditions one can say that a program works. One can list more criteria with further detail, stating what happens if the captcha doesn't match, what happens after n repeated attempts, etc. Also, this should make it clear that test criteria are not actual tests. They are not something that can be executed (manually or automatically). In fact, they are to a test program what requirements are to the software being QA-ed. And as with conventional requirements, the clearer you are on your test criteria, the better chance you have of developing adequate tests.

The crucial point is that when you write a test case, you want to make sure that it has a well-defined purpose, that it serves as a demonstration that an actual test criterion has been met. And this is what's missing in 90% of testing efforts (well, to be sure, this is just anecdotal evidence). People write or perform tests simply for the sake of trying out things. Tests accumulate and if you have a lot of them, it makes it look like you're in good shape. But that's not necessarily the case because tests can only prove the presence of errors, not their absence. To convince your dialectical opponent of the absence of errors, given the agreed-upon list of criteria, you'd have to show how your tests prove that the criteria have been met. In other words, you want to ensure that all your test criteria have been covered by appropriate test cases - for each test criterion there is at least one test case that, when successful, shows that this criterion is satisfied. A convenient way to do that is to create a matrix where you list all your criteria in the rows and all your test cases in the columns and checkmark a given cell whenever the test case covers the corresponding criterion, where "covers" means that if the test case succeeds one can be confident that the criterion is met. This implies that the test case itself will have to include all necessary verification steps. Continuing with our simple example, suppose you've developed a few test cases:

  • T1 - test successful login
  • T2 - test failed login for each of the 3 fields: bad user, bad password, bad captcha
  • T3 - test captcha quality by running image recognition algos on captcha
The coverage matrix then looks like this:

        T1   T2   T3
  C1              X
  C2    X    X

A given test case may touch a certain aspect of the program, but you'd put a checkmark only if it actually verifies the criterion in question. For instance, T3 would be loading a login page, but it won't be testing actual login. Similarly, T1 and T2 can observe the captcha, but they won't evaluate its quality. It may appear a bit laborious as an approach. In the aforementioned company, this was all documented ad nauseam. Criteria were classified as "normal", "abnormal", "stress" and whatnot, reflecting different types of expected behaviors and possible execution contexts. Now, I did warn you - this was a QA process aimed at living up to ISO standards. And it did. But think about the information this matrix provides you. It is a full, detailed spec of your software. It is a full inventory of your test suite. It tells what part of the program is being tested by what test. It shows you immediately if some criteria are not being covered by a test, or not covered enough. It shows you immediately if some criteria are being covered too much, i.e. if some tests are superfluous. When tests fail, it tells you exactly what behaviors of the software are not functioning properly. Recall that one of the main problems with automated testing is the explosion of code that needs to be written to achieve decent coverage. This matrix can go a long way toward controlling that code explosion by keeping each test case with a relatively unique purpose. Most importantly, the matrix presents a pretty good argument for the program's correctness - you can see at a glance both how correctness has been defined (the list of criteria) and how it is demonstrated (the list of tests cross-referenced with criteria).

Reading about testing even from big industry names, I have been frequently disappointed at the lack of a systematic approach to test requirements. In practice it's even worse. Developers, testers, business people in general have no idea what they are doing when testing. This includes agile teams where tests are sometimes supposed to constitute the specification of the program. That's plain wrong, first because tests are code and code is way too low-level to be understood by all stakeholders, hence it can't be a specification that can be agreed upon by different parties. Second, because usually the same people write both the tests and the program tested, the same bugs sneak into both places and never get discovered, and the same possibly wrong understanding of the desired behavior is found in both places. So expressing the quality argument (i.e. the one with the imaginary dialectical adversary) simply in the form of test cases can't cut it.

That said, I wouldn't advocate following the approach outlined above verbatim and in full detail. But I would recommend having the mental picture of that Criteria x Tests matrix as a guide to what you're doing. And if you're building a regression test suite, and especially if some of the tests are manual, it might be worth your while spelling it out in the corporate wiki somewhere.
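For automated tests, one lightweight way to keep that picture around might be to tag each test with the criteria it is meant to demonstrate and generate the matrix from the tags. The sketch below is purely illustrative: the `covers` decorator, the registry and the test function names are all hypothetical, and the test bodies are left as placeholders.

```python
from collections import defaultdict

# Hypothetical registry: test name -> set of criterion IDs it claims to verify.
REGISTRY = defaultdict(set)


def covers(*criteria):
    """Decorator recording which criteria a test case is meant to demonstrate."""
    def decorate(test_func):
        REGISTRY[test_func.__name__].update(criteria)
        return test_func
    return decorate


@covers("C2")
def test_successful_login():
    pass  # placeholder: the steps and verifications of T1 would go here


@covers("C2")
def test_failed_login():
    pass  # placeholder for T2


@covers("C1")
def test_captcha_quality():
    pass  # placeholder for T3


def render_matrix(registry, criteria):
    """Return the Criteria x Tests matrix as tab-separated text, one row per criterion."""
    tests = sorted(registry)
    lines = ["\t".join([""] + tests)]
    for criterion in criteria:
        marks = ["X" if criterion in registry[t] else "" for t in tests]
        lines.append("\t".join([criterion] + marks))
    return "\n".join(lines)
```

Calling `render_matrix(REGISTRY, ["C1", "C2"])` yields text that can be pasted into a wiki page. The criteria themselves still have to be written down and agreed upon separately; the tags only record which test claims to demonstrate which criterion.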




  1. I can only speak from my limited perspective as an amateur programmer, but I have found that it is mutable, stateful code that renders testing difficult. I was doing tests of the redis-port using the automated testing framework ScalaCheck, an adaptation of a Haskell tool. Testing was really easy for code that was composed of referentially transparent functions and immutable data. As soon as I got to classes that depended on mutable state, it got really complicated to test.
    It seems to me that HyperGraphDB's complexity when debugging and testing is because it is generally written in an imperative and mutable manner. One thing that really disturbed me more than once is the dependency of the current JavaTypeMapper on Beans, which requires a no-argument constructor, getters and setters. This is problematic if you want to write immutable classes, which normally don't have any public setters. It also hurts compatibility with functional JVM languages such as Scala, at least "idiomatic Scala".
    I still don't know well myself how to avoid mutability in non-trivial projects with many interdependent components, so I cannot really suggest something concrete (yet), but I'd advocate considering reducing the dependency on mutable state.

  2. The pros and cons of mutable state are a debate of their own. But I don't see what it has to do with testing. If you test a program as a black box, how does it matter if it's written with mutable state or in a functional style? What's more important is how the program's observable behavior is specified. You are probably having trouble because you think of the behavior in terms of the implementation, so you start from the mutable state. And HGDB in particular is a database, so it's all about read and write.