A 16-Line Test Interface

and why you don’t really need it
Tags:  technical, jackdaw, testing, C, macros




It is now two decades since it was pointed out that program testing may convincingly demonstrate the presence of bugs, but can never demonstrate their absence. After quoting this well-publicized remark devoutly, the software engineer returns to the order of the day and continues to refine his testing strategies, just like the alchemist of yore, who continued to refine his chrysocosmic purifications.

– Edsger W. Dijkstra, "On the cruelty of really teaching computing science"


Contents

In the first part of this post, I describe the impetus for my tiny test interface and the interface itself. In the second part, I challenge the idea that software testing is important.

So it has come to this

Until a week or so ago, I had never really written a “test”. But while working on a jackdaw feature with which I had particular difficulty – DAW “automation” – I found myself writing and rewriting a bit of code that looked a lot like a test.

Context: DAW Automation

Each automation track, which is assigned to an audio track parameter like volume, pan, or filter cutoff frequency, is fundamentally an array of “keyframes.” Each keyframe in turn has a position, expressed as a count of sample frames offset from time zero, and a generic value. The “position” corresponds to the keyframe’s x coordinate, and the value to its y coordinate.

The highlighted keyframe is currently being edited with the mouse. A string representing its value is shown in the label.
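
For concreteness, here is a rough sketch of the data layout implied by the snippets later in this post. The field names and types are inferred from that code and are only an approximation of the real definitions in jackdaw.

typedef struct keyframe {
    int32_t pos; /* offset from time zero, in sample frames (the x coordinate) */
    /* ... the generic value (the y coordinate) and other fields ... */
} Keyframe;

typedef struct automation {
    Keyframe *keyframes;    /* keyframes in timeline order */
    uint16_t num_keyframes; /* type inferred from the loop counter below */
    /* ... */
} Automation;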

The order of these keyframes is determined by their order in the array that stores them, but they must also maintain order on the timeline; the position of keyframe n must be greater than that of keyframe n - 1. The array can be mutated in a wide variety of ways: keyframes or ranges of keyframes can be inserted or removed anywhere. In the midst of these various mutations, the keyframe order rule was sometimes violated. The error occurred unpredictably, sometimes in response to a rapid succession of user inputs; it did not immediately crash the program, and was hard to reproduce.

Keyframes are conspicuously out of order

A small test

I found myself copying something like the code below and pasting it here and there to test that the keyframe order was correct at any given point of execution. I expected to delete every instance of this code once I was done testing.

int32_t pos = automation->keyframes[0].pos;
for (uint16_t i=1; i<automation->num_keyframes; i++) {
    Keyframe *k = automation->keyframes + i;
    if (k->pos <= pos) {
        fprintf(stderr, 
            "Error! keyframe at index %d is left of previous keyframe.\n", i);
        exit(1);
    }
    pos = k->pos;
}

automation is of type Automation *, pointing to an automation track.

I do this sort of thing all the time in passing, but never have I so often reused such a piece of code. Because of either the difficulty of the automation feature or an unusual degree of sloppiness on my part in designing and implementing it, the program kept worming its way into some illegal state or other, and for every bug I squashed (or gently escorted outside), another appeared.

Acknowledging the stubborn persistence of issues I hoped would be transitory, I dropped that little test into its own function, and pasted “TEST” at the beginning of the function name so I’d remember to get rid of it later. (I think that’s a fine solution, and if you favor it, you needn’t feel any shame about it.)

But this automation feature continued to dog me, and I soon found myself writing new tests. I’d have to remember to delete all of these, but what if I came back to the feature later and the tests became useful again? Should I leave them in my code? Certainly I’d at least have to delete the calls to those functions, to avoid unnecessary execution, but then I’d have unused function definitions sitting around. Should I leave them in comments? /* DNE */ ?

Macros are beautiful

My Makefile allows me to build a “debug” version of the program in addition to a production version. In a debug build, TESTBUILD is defined. I wanted to design a test interface that would compile and call test functions in a debug build, but neither compile nor call them in production. The preprocessor makes this quite easy. I also wanted to avoid enclosing every test function definition, declaration, and call within #ifdef guards, which somewhat complicates the problem, but preprocessor macros bravely rise to the challenge. Here’s the sixteen-line test interface I came up with:

#ifdef TESTBUILD 
    #define TEST_FN_DEF(name, body, ...) \
        int name(__VA_ARGS__) body
    #define TEST_FN_DECL(name, ...) \
        int name(__VA_ARGS__)
    #define TEST_FN_CALL(name, ...) \
        {int code = name(__VA_ARGS__); \
        if (code != 0) { \
            fprintf(stderr, "\n%s:%d:\ttest \"%s\" failed with error code %d\n", __FILE__, __LINE__, #name, code); \
            exit(1); \
        }}
#else
    #define TEST_FN_DEF(name, body, ...)
    #define TEST_FN_DECL(name, ...)
    #define TEST_FN_CALL(name, ...)
#endif

This solution results in some very interesting syntax that looks foreign to C. Here is the test described previously, defined using the new interface:

TEST_FN_DEF(automation_keyframe_order, 
    {
        int32_t pos = automation->keyframes[0].pos;
        for (uint16_t i=1; i<automation->num_keyframes; i++) {
            Keyframe *k = automation->keyframes + i;
            if (k->pos <= pos) {
                fprintf(stderr, 
                    "Keyframe %d pos %d, prev pos: %d\n", 
                    i, k->pos, pos);
                return 1;
            }
            pos = k->pos;
        }
        return 0;
    }, Automation *automation);

The first argument to the macro is the function name, and the second is the full body of the test function. Remaining arguments to the macro describe the test function parameters. Because of the variadic nature of the macro, the function body must precede the variable-length list of parameters. Some might find this order reversal aesthetically displeasing, but I think it has the ugly beauty of a good hack. Allowing for a variable number of parameters in turn allows the design of test functions to be fairly unconstrained.
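
To demystify the foreign-looking syntax a little: in a debug build, the definition above expands to an ordinary function definition, roughly the following (body elided).

int automation_keyframe_order(Automation *automation)
{
    /* ...the body exactly as written above... */
}

In a production build, the same invocation expands to nothing at all, and only the trailing semicolon survives.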

A call to a test function looks less odd:

Automation *a = track->automations[0];
TEST_FN_CALL(automation_keyframe_order, a);

The extra set of curly braces in the definition of TEST_FN_CALL gives each call its own block scope, so the code variable (which merely captures the return value of the test function) can be declared anew at every call site, and a test can be called multiple times in a single function.
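
Concretely, in a debug build the call shown above expands to roughly this (reformatted for readability):

{
    int code = automation_keyframe_order(a);
    if (code != 0) {
        fprintf(stderr, "\n%s:%d:\ttest \"%s\" failed with error code %d\n",
            __FILE__, __LINE__, "automation_keyframe_order", code);
        exit(1);
    }
}

Each call site gets its own block, so neighboring calls never collide over the name code.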

When TESTBUILD is defined, the function definitions and calls are populated. A failing test returns a nonzero value, and the file and line of the failing call are printed before the program exits:

src/project_loop.c:645: test "automation_keyframe_order" failed with error code 1
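
The one macro not shown in action above is TEST_FN_DECL, which covers the case where a test function is defined in one file and called from another. A declaration for the example test could live in a header like this (a sketch, not taken verbatim from jackdaw):

TEST_FN_DECL(automation_keyframe_order, Automation *automation);

In a debug build it expands to an ordinary prototype; in production it, too, disappears, leaving only the semicolon behind.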

Now, if I find myself reusing some piece of test code frequently, I can easily memorialize it in a test function, call it where necessary, and know that in production, none of those tests will materialize. I don’t expect to use this interface very often, but I am glad to have it.

I am skeptical of software “tests”

As I said, until a week or so ago, I had never really written a “test.” Having never worked as a professional programmer, I was never forced to, and writing tests as such is not a natural part of my own development process most of the time. Of course, I test my code constantly, both as a user and with sanity-check print statements or the like, frequent but transient concessions to expedience over rigor that rarely live long enough to touch a git commit; but never had I committed a function whose sole purpose was to test my own code.

The testless approach has worked fine for me thus far, and I am finally satisfied enough with my own programs, one of which contains some thirty thousand lines of C code, to claim with some confidence that even in fairly large, complex pieces of software, tests are optional.

Still, I expect to work in the software industry someday soon, and to understand the kernels of truth undergirding its orthodoxies would behoove me. To that end, a few weeks ago I made a good-faith[1] effort to learn more about tests, what they’re for, and why they’re important.

add_two_numbers

I focused my research on unit tests, because that’s the type of test I have seen most commonly cited in the admonishments[2] of the professional community. What I hoped to find were explanations and examples of unit tests that would persuade me that they are valuable. What I actually found was the opposite: a range of beginner resources that repeat the common credos about unit tests but fail to convincingly justify them, and in some cases actually make a compelling argument for their futility.

A majority of these resources use essentially the same example of a “testable unit”: add_two_numbers, a function or class method that takes two integers as arguments and returns their sum. Sources vary in their choice of arithmetic operator, their test cases, and the profuseness of their apologies for using such an obviously bad example; but the same basic problems are present in every instance I found. Of course add_two_numbers is meant to be a toy example and not practically useful. Still, one would hope that such an example would model the thought processes required to write useful tests. It should be possible to draw a line from the toy example to some intuition for real-world usefulness, and here no such line can be drawn.

Here’s the full example provided by AWS, in python:

def add_two_numbers(x, y):
    return x + y
    
# Corresponding unit tests

def test_add_positives():
    result = add_two_numbers(5, 40)
    assert result == 45

def test_add_negatives():
    result = add_two_numbers(-4, -50)
    assert result == -54

def test_add_mixed():
    result = add_two_numbers(5, -5)
    assert result == 0

These three tests can offer us certainty that exactly three of the theoretically infinite[3] possible calls to the function give an expected result. But don’t be cheeky: we identified these test cases because we presume that they represent in microcosm the full range of possible inputs – that all instances of a specific case behave more or less the same.

I’m not a math person, but even I am struck by the arbitrariness of this selection of test cases. The author apparently feels squirrely about signs, but (for example) are they really satisfied that they’ve captured the mixed case adequately by testing one of the few instances where the two values sum to zero? That’s a vanishingly small[4] subcase. Even worse, addition with zero is not tested. In fact, zero was excluded from all of the comparable examples I found online.

Testing something as fundamental as addition does not strike me as completely absurd, not because I have ever had to implement an addition procedure myself, but because I program in C and it’s very easy to write bugs that result from unhandled type overflow. When adding two numbers, overflow is the likeliest source of problems. Maybe a comparable resource on unit tests tailored to the C programmer would give special attention to overflow? On a whim, I googled “C unit testing” and the third result delivered yet another implementation of add_two_numbers (here called my_sum) and accompanying tests that do not consider overflow.
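
For what it’s worth, the kind of test case I was hoping to see would exercise something like the following helper (a sketch of my own, not taken from any of the articles mentioned; signed overflow is undefined behavior in C, so the check has to happen before the addition):

#include <stdbool.h>
#include <stdint.h>

/* Add two 32-bit integers, refusing to produce a result that would overflow.
   Returns false and leaves *sum untouched if x + y is not representable. */
static bool add_i32_checked(int32_t x, int32_t y, int32_t *sum)
{
    if ((y > 0 && x > INT32_MAX - y) || (y < 0 && x < INT32_MIN - y))
        return false;
    *sum = x + y;
    return true;
}

A test suite that never pokes at INT32_MAX or INT32_MIN is silent about precisely the inputs that break the naive version.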

Again, the article does not pretend that its example is useful or complete, but the fact that a practically useful test case is so near at hand yet unmentioned is conspicuous. The point is that the overflow case is the test case that’s easiest to forget, and corresponds to exactly the bug we’re likeliest to write. In fact, the test case that’s easiest to forget always corresponds to the bug we’re likeliest to write. To write good tests, we need to identify a set of test cases that describe the behavior of the function as completely as possible. Once we’ve done that, we’ve already done the work required to anticipate and fix those bugs. So why bother with the tests?

Regressions

One commonly cited reason to bother with the tests is to catch regressions. If we modify the tested unit, we might create a new bug, and that new bug might be caught by one of our existing tests. The alluring and potentially harmful illusion is that a full suite of passing tests implies correct code. At best, the existence of unit tests helps us catch some trivial bugs early, but at worst, the false confidence they confer inhibits the scrutiny and care required to write fewer bugs, including those that will not be caught by our existing tests.

“Test coverage”

We can begin to glean from the add_two_numbers example the gross inadequacy of almost any contrivable “test coverage” metric. Even if our tests hit every single line of code, they may nevertheless cover only a meager fraction of possible program states and behaviors. We can thoroughly trawl every cubic inch of the ocean with our magnificent fishing net, but if our aim is to capture aquatic bacteria we will have achieved only the pretense of thoroughness. On the other hand, if we take the time to write good tests, we have already put in the careful thought required to simply improve our code and handle edge and error cases correctly. The quality of our tests, like that of our code, depends on the quality of our thought and attention. A viable method for measuring that quality has not been discovered.

The real value of tests

I’m writing critically about software tests not because I think they shouldn’t be written, but because of the unfortunate ubiquity of the view that they must be written – that writing tests is an integral part of software development, that tests imply reliability. In my view, the dogma surrounding tests merely obscures their actual value as tools for thought.

I am no Dijkstra.[5] I could try to thoroughly convince myself of the correctness of every line of code I write, but the cognitive cost of such rigor is often too great to be feasible, so I cheat, lie, and steal, engaging in practices more at home in physical engineering – where mathematical models are approximations, and never the whole truth – than the theoretical approach nominally warranted by a formal system. This is an irony inherent in all computer programming: that we humans wrangle theoretically perfect machines[6] not through direct transmission of pure theory, but by means of an elaborate, sometimes histrionic dance of half-baked thoughts, transient images, remembered patterns, and externalized mnemonics. The sophistication of human cognition is profound, yet our conscious information containers are of comparatively minute size and duration. It’s why we count on our fingers, why we do arithmetic on paper, why we diagram, why we litter our code with debug print statements, and why, if we so choose, we write tests. Writing tests is a cheat, and a valid one; but it is still a cheat.


  1. 😈

  2. “Testing is something that is not really fun, but needs to be considered inextricable from the development process. If you write something non-trivial and it doesn’t have a test, you failed.”

  3. I wanted to write 1.84e19, assuming fixed-width 32-bit integer arguments, but as far as I know python does not restrict integer width.

  4. n/(n^2) for integers in the range -n through n, which approaches zero as n approaches infinity, and is 4.66e-10 for 32-bit integers.

  5. I first read On the cruelty of really teaching computing science in 2021, when I had only a couple years of informal and sporadic programming experience under my belt. I read it quite differently now than I did then, but delight no less in Dijkstra’s wit and the strength of his opinions.

  6. True computer errors, in the sense of a failure of the hardware state to conform to the abstractions defined in our programs, are rare enough to be irrelevant in the vast majority of what is called computer programming. There’s probably a Claude Shannon citation to be made here, but I am not educated enough to make it.

published by charlie on 2024-10-30