Debug Strategy

Patricia Shanahan

So, you've written your code carefully, or copied it from a book, and compiled it without any error messages, but it doesn't run, or it does run and does something you didn't intend or expect. You have tried desk-checking the program in the light of its behavior, and you just can't see what is wrong with it. You need to debug it. Unfortunately, many courses on programming spend little or no time on teaching the art of debug. Debuggers often come with tutorials on how to use them in the sense of what to do if you have decided you need a break point at a particular line.

This article is about the other end of the problem, deciding what information to collect at what points in the program, and how to use that information to decide what to do next. It attempts to explain an approach to debug that I have found to be very effective.

While the example is a short, simple Java application, I have successfully applied the high level concepts to extremely difficult intermittent bugs in multithreaded, multiprocessor programs and in prototypes of multiprocessor computers. The approach is a completely general one, independent of the tools you are using. It may be difficult to follow the example if you don't read Java, but the rest of the article is independent of programming language.

You will also need some way of obtaining information about the states of variables as selected points in the program run. How you will do this depends on the language and tools you are using. For small programs that compile and run in seconds it may be a simple as inserting extra display statements in the source code. Debuggers can also be used for these cases, but are especially useful when making a change, compiling, and running to the point of failure takes several minutes or longer.

Terminology note: The word "theory" is used here for any proposed explanation of the causes and structure behind some observations, covering everything from a totally untested hypothesis through a mathematically proved theorem.

Main Flow

The essence of this technique is a simple sequence of steps that starts after a bug has been selected for debug, and ends when it has been fixed. In the middle there is a loop that improves information about the bug.

testing flowchart
  1. Make external observations How does the program behave? What does it do for different inputs? If run repeatedly with the same inputs, does it always do the same thing?
  2. Formulate theories Form a set of theories about what could be happening in the program to cause the observations.
  3. Design experiments to support or refute each theory: Often, these experiments will take the form of setting breakpoints and examining variables or expressions at those breakpoints.
  4. Run experiments Give the highest priority to those that could disprove at least one theory.
  5. Evaluate results If you now know what is wrong go on to step 6, fix and retest.. Otherwise, go back to step 2. Repeat steps 2 through 5 until you do know what is wrong.
  6. Fix and retest Correct the cause of the bug. Test the fix. Do not just run the test that failed. Also run any tests that used to work to verify that the fix did not break anything else. Do additional testing of other cases, where there is one bug there may be others.

General Comments

Programming is usually more successful if done in steps, with each step focussed on a limited readily achievable objective. In the same way, all except the simplest of debug tasks should be done in steps, with each step focussed on forming the next set of questions. It is not necessary, and often not possible, to get to the root cause in one step. Progress is being made as long as possibilities are eliminated or the theories become more specific, a little closer to identifying the root cause.

The "Aha" Approach

An alternative approach is to wait for a sudden insight in which you just realize where the problem lies. For example, many programmers have had the experience of going to sleep thinking about a bug and waking up knowing the solution. Insights are fine, and should always be checked out. They may save a lot of time. However, insights don't always appear when needed. While waiting for one to strike, it is a good idea to pursue an organized approach that will inevitably find the bug.

Code inspection

For relatively small pieces of code, bugs can often be found by simply reading the code in the belief that it contains errors. Even in larger programs, code inspection can be very effective, and save time, once the bug has been narrowed down to a manageable area. If the bug appeared while the program is being modified there is an increased chance that the problem is in the changed code or things with which it interacts. In that case a review of those areas may be appropriate.

Details of Each Step

Make external observations

This preparatory step can be surprisingly important.

It is important to know whether the program always produces the same results, given the same input. If this is not the case great care is needed in interpreting results. It may be necessary to run tests several times, and even apply statistical techniques to the results. Multithreaded programs are more likely to have varying behavior.

In many cases it is helpful to run several tests to see what, if anything, affects the program's behavior. It is usually best to base a debug effort on the shortest and simplest case that fails. Knowing cases that work may help constrain the initial theories.

Formulate theories

There are two slightly different cases. During the first iteration the theories have to be based on the external observations and simple conclusions from the source code. During the second and subsequent iterations, there is also a body of existing theories that have been tested to some extent.

In many cases it is possible to formulate a set of theories at least one of which MUST be true.

For example, consider the first iteration when the symptom is a NullPointerException. There will be a limited set of expressions in the line throwing the exception that are used to identify objects. At least one of those expressions must be null.

On second and subsequent iterations the main source of theories is going to be refinement of the theories from the previous iteration. However, unless you are absolutely certain that you have a case where at least one of those theories must be true, you should also consider adding new theories based on the current observations. This is unavoidable if you have succeeded in disproving all your previous theories.

Design experiments

A good theory makes testable predictions. Truth of all predictions made by a theory does not in itself prove the theory to be correct. Falsity of any prediction disproves the theory, at least in its current form. Of course, given a sufficiently precise, detailed theory about a computer program, it is often possible to prove its truth by examining the code.

If two theories make differing predictions for the outcome of the same experiment that experiment must disprove at least one of them. Such experiments are especially valuable and should be given high priority.

Sometimes there is a whole spectrum of related theories. Suppose you know that a variable had the correct value at point A in the program run, had a different value at point B, and according to the program design should not change between A and B. There is in a sense a separate theory available for every line of code reached between A and B, asserting that that particular line of code on that occasion changed the value.

There are several reasonable approaches; exhaustive, work forwards, work backwards, and binary search. If the number of iterations is small, use an exhaustive approach in which you trace the value on every iteration. Forwards experiments work forwards from point A, testing whether the value is still correct. Backwards experiments look shortly before point B, looking for the last point at which the variable was correct. Binary search tests the variable at a point half way between A and B. If the value is correct there, move to a point half way between there and B. If it is incorrect, look next half way between A and the tested point.

One good approach is to try a few tests close to each of A and B. If the value is consistently good close to A, and consistently bad close to B, then switch to a binary search.

In all cases, the objective to is bracket the change in the variable by narrowing the range between the latest point at which the variable is known to be good and the earliest point at which it is known to be bad.

Another useful experimental technique is removal of code. Obviously, before embarking on major source code surgery you should make sure you have back-up copies of the original and/or have it under revision control. If you have a theory that a portion of the code is irrelevant, delete that portion or replace it with a simple stub.

Run experiments

If the problem is a complicated one, keep written notes of experiments run, their results, and conclusions. Even if the program appears to have repeatable behavior, periodically do the same experiment twice to check that there are no differences.

This is the only part of the process where the choice of debugging tool makes much difference.

Evaluate results

Considering the results of all the experiments so far, it is possible that the problem is now sufficiently constrained that it can be found by examining the code. If so, the debug effort is effectively complete.

Sometimes, it may not be clear whether the debug effort is over or not. For example, suppose you have determined that a particular variable is null at a time when it is being used to reference an object. There are three cases:

Fix and retest

Fix the bug that has been identified. Also, consider its possible implications

for the rest of the program. Is this particular bug indicative of a general problem that may affect other places in the code? Were there omissions in unit testing that let an erroneous component get integrated into the program?

Several types of testing should be done:

Special Situations

Heisenbugs

Some programs will behave differently on repeated runs given identical inputs. These can be the hardest bugs to find and fix. They are sometimes known as "Heisenbugs" after Heisenberg's Uncertainty Principle. Attempts to observe them often change their behavior.

The simplest and easiest approach, at least in the short term, is denial. Refuse to debug unless the user can provide a test case that reliably reproduces the bug. Reject bug reports with "unable to reproduce". This approach is very popular, except with people who have to use the programs.

The second simplest approach is to add arbitrary synchronization until the bug goes away. This has some serious disadvantages; it may only reduce the frequency of the bug, it may cause deadlocks, and it may reduce performance. Reducing the frequency of the bug is often not a long term solution because any production system will clock a lot more time being used than being tested. A bug whose frequency is too low for it to have much chance of appearing during testing may still be a serious problem for customers.

The hardest approach, but the most effective in the long term, is to actually debug them. The same basic system works, with a few differences:

Suspected Language Bugs

Suppose a problem has been narrowed down to a few lines of code, and what the code is doing is not what was expected. One possibility is that there is a bug in the language implementation. However, most of the time the error will be in the programmer's interpretation. Here is a series of steps that can resolve this:

Example

I have written a short program to solve a common student exercise. There are bugs in it. Please do not attempt to find them by code inspection. Pretend that the program is too big for that.

/** Solve a common student exercise - display the primes
from 2 up to a specified limit. */
public class Example{
  public static void main(String[] args){
    //Find and print primes up to args[0].

    //Convert and validate the command line argument
    if(args.length < 1){
      System.err.println("Usage: java Example max");
      System.exit(3);
    }
    int max=0;
    try{
      max = Integer.parseInt(args[0]);
    }catch(NumberFormatException e){
      e.printStackTrace();
      System.exit(2);
    }
    if(max < 2){
      System.err.println(
          "Maximum prime to look for must be 2 or greater");
      System.exit(1);
    }

    // Get the required primes in an array
    PrimeFinder pf = new PrimeFinder();
    int[] primes = pf.getPrimes(max);

    // Print them.
    for(int i=0;i<primes.length;i++){
      System.out.println(primes[i]);
    }
  }
}

/** Prime number finder. */
class PrimeFinder{

  /** Get the primes.
  @return An array containing the primes from 2 through the
  largest prime less than or equal to max.
  @param max The upper bound of the range to search for
  primes. */
  int[] getPrimes(int max){
    /* Use sieve method to eliminate all numbers
    that are divisible by 2 through sqrt(count). */

    /* Sieve set - all composite numbers in the range 2
    through max will be represented in this set. */
    FixedSizedBitSet sieve = new FixedSizedBitSet(max+1);

    /* The primes, will contain the values in the range 2
    through max that have not been found to be composite. */
    FixedSizedBitSet results = new FixedSizedBitSet(max+1);

    /* All composite numbers less than or equal to max
    will have at least one factor less than or equal to
    stop. */
    int stop = (int)Math.sqrt(max);

    /* First do the sieving - add all composite numbers
    in the range to sieve set. */
    for(int i=2;i<stop;i++){
      if(!sieve.contains(i)){
        for(int j=2*i;j<=max;j+=i){
          sieve.add(j);
        }
      }
    }

    /* Now find the numbers in the range that are not in
    the sieve set. */
    for(int i=2;i<=max;i++){
      if(!sieve.contains(i)){
        results.add(i);
      }
    }
    return results.toIntArray();
  }
}

/** Simple bit set for pre-determined sizes. */
class FixedSizedBitSet{
  long[] bits;
  int elems;  // number of potential elements

  /** Create a set based on a specified number of
  potential elements.
  The elements will be indexed 0 through elems-1.
  @param elems Number of potential elements */
  FixedSizedBitSet(int elems){
    this.elems = elems;
    bits = new long[(elems+63)/64];
  }

  /** Add a specified element to the set
  @param elem Element number to add */
  void add(int elem) throws IllegalArgumentException{
    if(elem >= elems || elem < 0){
      throw new IllegalArgumentException();
    }
    bits[elem/64] |= mask(elem);
  }

  /** Remove a specified element from the set
  @param elem Element number to remove */
  void remove(int elem)throws IllegalArgumentException{
    if(elem >= elems || elem < 0){
      throw new IllegalArgumentException();
    }
    bits[elem/64] &= ~mask(elem);
  }

  /** Test whether a specified element is in the set.
  @param elem The element to set
  @return true if it is present, false if it is absent */
  boolean contains(int elem){
    boolean result=false;
    if(elem < 0 || elem >= elems){
      return false;
    }
    return (bits[elem/64] & mask(elem))!=0;
  }

  /** Construct a mask with only the bit corresponding
  to the element number on.
  @param elem The element number */
  private long mask(int elem){
    return 1 << (elem %64);
  }

  /** Return int[] representation.
  @return An int[] containing the element numbers that
  are currently in this set, in ascending order */
  int[] toIntArray(){
    // Count the elements in the set.
    int count=0;
    for(int i=0;i<elems;i++){
      if(contains(i)){
        count++;
      }
    }

    // Create an array to contain that number of elements
    int[] results = new int[count];

    // Fill in the ones that are in the set.
    int j = 0;
    for(int i=0;i<elems;i++){
      if(contains(i)){
        results[j] = i;
        j++;
      }
    }
    return results;
  }
}

This version works correctly for command line parameters 2 and 3, but for 4 it displays:

2
3
4

falsely reporting that 4 is a prime. This is the first bug that will be analyzed.

Make external observations

The known failing case is a small and simple one. However, it is still worth doing some more experiments to find out which values fail. 4, 5, 6, 7, and 8 all show similar symptoms, with all numbers from 2 through the limit being listed as primes. 9, on the other hand, reports:

2
3
5
7
9

The values are correct, except for the inclusion of 9. Behavior changed at parameter values 4 and 9. These are both perfect squares, raising the issue of what happens around 16, the next square. 15 gives correct answers except for listing 9 and 15 as squares. 16 got completely correct results.

Nothing obvious here, but keep in mind the possibility that squares have some significance. Meanwhile, 4 is the smallest and simplest failing case and will be debugged.

Formulate theories

There are two ways a wrong answer can be in the result set. Either it was added by the PrimeFinder or it was due to an error in FixedSizedBitSet.

Design experiments

Add code to PrimeFinder to display elements being added to the result set:

    for(int i=2;i<=max;i++){
      if(!sieve.contains(i)){
        System.out.println("Adding "+i);
        results.add(i);
      }
    }

Run experiments

Adding 2
Adding 3
Adding 4
2
3
4

Evaluate results

The theory that 4 is being added by PrimeFinder is supported, and the theory that it was appearing due to a bug in FixedSizedBitSet is disproved. There is insufficient information to fix the bug at this point, so continue debug effort.

Formulate theories

I now have a well supported theory, that PrimeFinder is adding 4 to the result set in the loop where it also adds the correct result elements. New theories should be based on refining that one. It appears that 4 may have been absent from the sieve, not marked as a composite in the sieve loop.

Design experiments

Move the display code to the sieve loop to see whether 4 is detected there as a composite number:

    for(int i=2;i<stop;i++){
      if(!sieve.contains(i)){
        for(int j=2*i;j<=max;j+=i){
          System.out.println("Adding "+j);
          sieve.add(j);
        }
      }
    }

Run experiments

The new output is:

2
3
4

Evaluate results

Because of the lack of any debug output at all, check the output statement, but it appears to be correct. In a more complicated case, I would test other values, to verify the output statement or breakpoint, before fully accepting an absence of output as valid data.

This is an interesting result. It matches the final result in showing that no numbers are detected as composite. On the other hand, I know the loop can find composites because of other tests. I would have expected 4 to be eliminated when i is 2 in the outer loop. However, i will not take the value 2. For max=4, stop should itself take the value 2, and the loop continuation condition is i<stop. This is the bug.

It is consistent with the other wrong answers under consideration. It will cause failure to sieve out composite numbers whose smallest prime factor is equal to stop, the largest int less that or equal to the square root of max. For 4 through 8, stop will be 2, and even numbers will be missed. For 9, stop is 3, and 9 will be missed. Perfect squares are significant.

Fix and retest

Change the loop continuation condition from i<stop to i<=stop.

Test that the current cases are fixed. It is. Parameter 4 now gets 2 and 3 as output. Also the other failing cases such as 9 now get correct results.

Next, regression test previously working cases. Cases 2 and 3 still get correct results.

Finally, extend testing to additional cases. Since the FixedSizedBitSet deals with block of 64 bits, 64 and values greater than 64 are obviously interesting. 64 gets:

5
11
17
29
37
43
49
61

This is obviously broken. There are many primes missing, although almost all reported values are prime. I wonder what happens for 32?

2
3
5
7
11
13
17
19
23
29
31

Looks good. How about 48?

5
11
17
19
23
29
31
37
43

Not so good. Try 40:

5
11
13
17
19
23
29
31
37

Also bad, and similar to 48.

36?

5
7
11
13
17
19
23
29
31

34?

3
5
7
11
13
17
19
23
29
31

Two is missing. 33?

2
3
5
7
11
13
17
19
23
29
31

All correct.

It is possible, though by no means certain, that 34 is the smallest failing case. Pick that as the one to debug. Note that the problem seems to get worse with increasing size of the parameter.

Formulate theories

The failure of 2 to appear in the result set could either be an error in the result set handling or a failure by PrimeFinder to put it in the result set.

Design experiments

Put back in the first debug statement, tracing additions to the result set.

Run experiments

Adding 3
Adding 5
Adding 7
Adding 11
Adding 13
Adding 17
Adding 19
Adding 23
Adding 29
Adding 31
3
5
7
11
13
17
19
23
29
31

Evaluate results

Looks as though 2 was never added to the result set.

Formulate theories

That could be because it had been falsely picked up by the sieve, so that the !sieve.contains(i) test failed, or because the loop doesn't test 2, or because of an error in FixedSizedBitSet.

Design experiments

The first two theories can be tested by adding display code to PrimeFinder:

    /* First do the sieving - add all composite numbers
    in the range to sieve set. */
    for(int i=2;i<stop;i++){
      if(!sieve.contains(i)){
        for(int j=2*i;j<=max;j+=i){
          System.out.println("Sieved out "+j);
          sieve.add(j);
        }
      }
    }

    /* Now find the numbers in the range that are not in
    the sieve set. */
    for(int i=2;i<=max;i++){
      System.out.println("Testing "+i);
      if(!sieve.contains(i)){
        results.add(i);
      }
    }

Run experiments

Sieved out 4
Sieved out 6
Sieved out 8
Sieved out 10
Sieved out 12
Sieved out 14
Sieved out 16
Sieved out 18
Sieved out 20
Sieved out 22
Sieved out 24
Sieved out 26
Sieved out 28
Sieved out 30
Sieved out 32
Sieved out 34
Sieved out 6
Sieved out 9
Sieved out 12
Sieved out 15
Sieved out 18
Sieved out 21
Sieved out 24
Sieved out 27
Sieved out 30
Sieved out 33
Sieved out 10
Sieved out 15
Sieved out 20
Sieved out 25
Sieved out 30
Testing 2
Testing 3
Testing 4
Testing 5
Testing 6
Testing 7
Testing 8
Testing 9
Testing 10
Testing 11
Testing 12
Testing 13
Testing 14
Testing 15
Testing 16
Testing 17
Testing 18
Testing 19
Testing 20
Testing 21
Testing 22
Testing 23
Testing 24
Testing 25
Testing 26
Testing 27
Testing 28
Testing 29
Testing 30
Testing 31
Testing 32
Testing 33
Testing 34
3
5
7
11
13
17
19
23
29
31

Evaluate results

2 was not added to the sieve set, was tested by the second loop, but at that time was reported as being in the sieve set.

Formulate theories

At some point 2 must have appeared in the set. This includes the possibility that the set was not initially empty. There are three families of theories, it was always there, it appeared during the first loop, and it appeared during the second loop.

Design experiments

Display the truth of the expression sieve.contains(2) before each of the two loops:

    System.out.println("Before first loop "+sieve.contains(2));

    /* First do the sieving - add all composite numbers
    in the range to sieve set. */
    for(int i=2;i<stop;i++){
      if(!sieve.contains(i)){
        for(int j=2*i;j<=max;j+=i){
          sieve.add(j);
        }
      }
    }

    System.out.println("Between loops "+sieve.contains(2));

Run experiments

Before first loop false
Between loops true

[From here on, the non-debug output is dropped to save space].

Evaluate results

2 somehow got into the sieve set during the first loop, but without being explicitly added.

Formulate theories

There is a whole family of theories, one for each iteration of the first loop.

Design experiments

If the loop did more iterations, I might do a binary search for when 2 got into the set. As it is, I'm just going to display the truth of the contains expression on each iteration:

    for(int i=2;i<stop;i++){
      System.out.println("Start i="+i+" "+sieve.contains(2));
      if(!sieve.contains(i)){
        for(int j=2*i;j<=max;j+=i){
          sieve.add(j);
        }
      }
      System.out.println("End i="+i+" "+sieve.contains(2));
    }

Run experiments

Start i=2 false
End i=2 true
Start i=3 true
End i=3 true
Start i=4 true
End i=4 true
Start i=5 true
End i=5 true

Evaluate results

2 got into the sieve set during the first iteration of the outer loop of the sieve.

Formulate theories

Again, it could have appeared before the inner loop, during any iteration of the inner loop, or after the inner loop was done.

Design experiments

Move the output statements into the inner loop, but modified to make them conditional on being on the i=2 iteration of the outer loop:

        for(int j=2*i;j<=max;j+=i){
          if(i==2){
            System.out.println("Start j="+j+" "+sieve.contains(2));
          }
          sieve.add(j);
          if(i==2){
            System.out.println("End j="+j+" "+sieve.contains(2));
          }
        }

Run experiments

Start j=4 false
End j=4 false
Start j=6 false
End j=6 false
Start j=8 false
End j=8 false
Start j=10 false
End j=10 false
Start j=12 false
End j=12 false
Start j=14 false
End j=14 false
Start j=16 false
End j=16 false
Start j=18 false
End j=18 false
Start j=20 false
End j=20 false
Start j=22 false
End j=22 false
Start j=24 false
End j=24 false
Start j=26 false
End j=26 false
Start j=28 false
End j=28 false
Start j=30 false
End j=30 false
Start j=32 false
End j=32 false
Start j=34 false
End j=34 true

Evaluate results

2 appeared in the sieve set during the j=34 iteration of the inner loop.

Formulate theories

The obvious theory is that somehow attempting to add 34 to the set added 2. It could be in addition to, or instead of, 34.

Design experiments

Modify the display code to focus on the j=34 iteration, and display the state of 34 as well as 2:

        for(int j=2*i;j<=max;j+=i){
          if(i==2 && j==34){
            System.out.println("Start 2 "+sieve.contains(2)+
              " 34 "+sieve.contains(34));
          }
          sieve.add(j);
          if(i==2 && j==34){
            System.out.println("End 2 "+sieve.contains(2)+
              " 34 "+sieve.contains(34));
          }
        }

Run experiments

Start 2 false 34 false
End 2 true 34 true

Evaluate results

Both 2 and 34 were added.

Formulate theories

At this point attention shifts to inside the FixedSizedBitSet implementation. A single "add" call should only add one element to the set, but appears to have caused the contains() result for both 2 and 34 to change to true.

Looking at the add code in FixedSizedBitSet, it appears that the problem is unlikely to be selection of the wrong array element since 34 and two should both be represented in element 0. The best theory to test next seems to be that the mask is wrong.

Design experiments

There are two approaches at this point. Normally, I would put a pure utility class like FixedSizedBitSet in its own file, and unit test it by itself. This would be an easier approach. However, in order to illustrate techniques, I'm going to continue to debug it in the whole program.

Delete the existing debug outputs, and modify FixedSizedBitSet's add method to display the mask when working on element 34:

  void add(int elem) throws IllegalArgumentException{
    if(elem >= elems || elem < 0){
      throw new IllegalArgumentException();
    }
    bits[elem/64] |= mask(elem);
    if(elem == 34){
      System.out.println("mask "+
        Long.toBinaryString(mask(elem)));
    }
  }

Run experiments

mask 100

Evaluate results

Bit 2 is on, and bit 34 is off. Clearly, the problem is in the mask calculation.

The mask calculation:

    return 1 << (elem %64);

is obviously incorrect. Because 1 is an int the shift will be done in 32 bit int arithmetic, and will wrap around. An attempted 34 bit shift will actually shift by 2 bits. More generally, each time the PrimeFinder tried to sieve out a number n in the range 32 through 63 it would actually sieve out n-32. This explains worse results from larger numbers.

Fix and retest

The fix is to change the 1 to 1L, the long literal with the same value.

The problem case 34, other cases that had been previously tested, and several additional tests, all now give correct results.

Omitting the argument, and invalid numeric values such as 1, 0, and -1, all gave reasonable error messages. However, testing "ssss" got:

java.lang.NumberFormatException: ssss
        at java.lang.Integer.parseInt(Integer.java:409)
        at java.lang.Integer.parseInt(Integer.java:458)
        at Example.main(Example.java:15)

Although the parameter value is incorrect, a raw exception is not a good way of handling a user input error. This requires little or no debug - it just needs a modification to the catch block in the main method.

Conclusion

Each bug was tracked down to the statement that caused it without any great leaps of insight, just steady cutting away at the scope of the problem.


© Patricia Shanahan, 1999-2005. All rights reserved.
Web-design by Andrew Thompson, of PhySci.codes 2004.