The Seven Steps of Successful Debugging

Developing software is hard; developing embedded software with realtime requirements is much harder. When you find yourself facing “The Problem”, it is better to proceed in a rational way.

The Seven Steps

There are several steps that should be followed when facing non-obvious anomalies:

1) Defect Detection: understanding that a problem is present

No amount of testing can prevent this: problems happen. First, we need to understand that there is a problem. This is often not so simple, because the nastiest problems are those that do not manifest themselves in an obvious way, for example increased response time jitter, randomly missed deadlines or mysterious random lockups.
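Problems such as jitter and missed deadlines are easier to notice when the system measures them itself. The following is a minimal sketch of a response time monitor, assuming a free-running 32-bit counter is available as a time source. All the names (jitter_monitor_t, jitter_init, jitter_sample, jitter_spread) are hypothetical and are not part of any ChibiOS API.

```c
#include <stdint.h>

/* Hypothetical response time monitor, names are illustrative only. */
typedef struct {
  uint32_t min;       /* Shortest response time seen so far, in ticks. */
  uint32_t max;       /* Longest response time seen so far, in ticks.  */
  uint32_t deadline;  /* Maximum allowed response time, in ticks.      */
  uint32_t missed;    /* Number of samples that exceeded the deadline. */
} jitter_monitor_t;

void jitter_init(jitter_monitor_t *jm, uint32_t deadline) {
  jm->min = UINT32_MAX;
  jm->max = 0;
  jm->deadline = deadline;
  jm->missed = 0;
}

/* Records one response. "start" and "end" are values of a free-running
   counter (assumed here) sampled at the event and at the response.
   Unsigned subtraction gives the right result across one counter wrap. */
void jitter_sample(jitter_monitor_t *jm, uint32_t start, uint32_t end) {
  uint32_t elapsed = end - start;

  if (elapsed < jm->min)
    jm->min = elapsed;
  if (elapsed > jm->max)
    jm->max = elapsed;
  if (elapsed > jm->deadline)
    jm->missed++;
}

/* Difference between worst and best case, zero if no samples yet. */
uint32_t jitter_spread(const jitter_monitor_t *jm) {
  return (jm->max >= jm->min) ? (jm->max - jm->min) : 0;
}
```

A non-zero missed counter, or a spread that grows over time, is a sign that a problem is present even when the application still appears to work.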

2) Defect Reproduction: being able to reproduce the problem

If a problem cannot be reliably reproduced then it is hard to proceed. Luckily, if a problem does not depend on external events (communication, sampling, reading of I/O lines, external interrupts), it will happen reliably, because an MCU system is deterministic in nature. The worst case is a problem that is truly random in nature.

3) Defect Reduction: creating a minimal test case that exhibits the problem

The first step toward a solution is to exclude all the code that is not necessary for the bug to happen. A small test case is easier to handle than a complex application. For random problems, the best approach is to design and implement a stress test for the software components involved, in order to trigger the problem more easily.
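As an illustration of a stress test for an isolated component, the sketch below hammers a small ring buffer with pseudo-random put/get operations and checks the results against a simple reference model. The ring buffer, its function names and the rb_stress() helper are all hypothetical stand-ins for the component under suspicion. The sequence is generated from a seed with the xorshift32 generator, so a failing run can be repeated exactly by reusing the same seed.

```c
#include <stdint.h>
#include <stdbool.h>

#define RB_SIZE 8u  /* Capacity of the hypothetical component (power of 2). */

typedef struct {
  uint8_t  data[RB_SIZE];
  uint32_t head;  /* Total number of bytes written. */
  uint32_t tail;  /* Total number of bytes read.    */
} ringbuf_t;

bool rb_put(ringbuf_t *rb, uint8_t b) {
  if (rb->head - rb->tail == RB_SIZE)
    return false;                       /* Buffer full. */
  rb->data[rb->head % RB_SIZE] = b;
  rb->head++;
  return true;
}

bool rb_get(ringbuf_t *rb, uint8_t *b) {
  if (rb->head == rb->tail)
    return false;                       /* Buffer empty. */
  *b = rb->data[rb->tail % RB_SIZE];
  rb->tail++;
  return true;
}

/* xorshift32 PRNG: the same non-zero seed always gives the same sequence. */
uint32_t xorshift32(uint32_t *state) {
  uint32_t x = *state;
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  *state = x;
  return x;
}

/* Performs "steps" random operations, checking FIFO order and the
   full/empty conditions. Returns -1 on success or the index of the
   first failing step. */
long rb_stress(uint32_t seed, long steps) {
  ringbuf_t rb = {0};
  uint32_t rng = seed;
  uint8_t next_in = 0;   /* Next value to be written.          */
  uint8_t next_out = 0;  /* Next value expected when reading.  */
  uint32_t count = 0;    /* Reference model: bytes in the queue. */

  for (long i = 0; i < steps; i++) {
    if (xorshift32(&rng) & 1u) {
      bool ok = rb_put(&rb, next_in);
      if (ok != (count < RB_SIZE))
        return i;
      if (ok) {
        next_in++;
        count++;
      }
    } else {
      uint8_t b;
      bool ok = rb_get(&rb, &b);
      if (ok != (count > 0u))
        return i;
      if (ok) {
        if (b != next_out)
          return i;
        next_out++;
        count--;
      }
    }
  }
  return -1;
}
```

A call such as rb_stress(1u, 100000) returns -1 when no inconsistency is found. When it returns a step index, the seed and the index together form a small, repeatable test case.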

4) Defect Triggers: isolating the triggering condition

Find what triggers the bug, for example an IRQ preempting another IRQ at a lower priority. Understanding when the problem happens is a necessary step toward the solution.
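One common way to isolate a trigger is to keep a short history of recent events and inspect it when the failure occurs. The sketch below is a minimal circular trace buffer, assuming events are identified by small integer codes and timestamped with a counter value. The names (trace_t, trace_record, trace_recent) are hypothetical. On a real target the recording function would typically be called from ISRs and would need protection suited to the system, which is not shown here.

```c
#include <stdint.h>

#define TRACE_DEPTH 16u  /* Number of most recent events kept (power of 2). */

typedef struct {
  uint32_t time;  /* Timestamp, e.g. a free-running counter value.     */
  uint16_t id;    /* Event code, e.g. one code per IRQ entry and exit. */
} trace_event_t;

typedef struct {
  trace_event_t events[TRACE_DEPTH];
  uint32_t count;  /* Total number of events recorded so far. */
} trace_t;

void trace_init(trace_t *t) {
  t->count = 0;
}

/* Records one event, overwriting the oldest one when the buffer is full.
   No locking is done here, a real system would have to add it. */
void trace_record(trace_t *t, uint16_t id, uint32_t time) {
  trace_event_t *e = &t->events[t->count % TRACE_DEPTH];

  e->id = id;
  e->time = time;
  t->count++;
}

/* Copies the n-th most recent event into "out" (n = 0 is the newest).
   Returns 1 on success, 0 if that event is not in the buffer. */
int trace_recent(const trace_t *t, uint32_t n, trace_event_t *out) {
  uint32_t stored = (t->count < TRACE_DEPTH) ? t->count : TRACE_DEPTH;

  if (n >= stored)
    return 0;
  *out = t->events[(t->count - 1u - n) % TRACE_DEPTH];
  return 1;
}
```

Reading the buffer backwards from the point of failure shows the order in which the events happened, for example whether one IRQ entry was recorded before the exit of another.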

5) Defect Understanding: understanding the root cause(s) of the problem

Before trying to fix the problem, it is necessary to understand exactly what is happening. Without a complete understanding, there is a risk of masking the effects of the problem rather than properly fixing it. This is very dangerous because a masked problem *will* return and bite you.

6) Defect Analysis: understanding the impact on the system

The solution could be a simple single-line fix, or it could impact the whole system design. The latter is, of course, more problematic because the defect would have its root in the system design itself.

7) Defect Elimination: fixing it

Now that the defect has been found, reproduced, understood and analyzed, it is possible to properly implement the obvious fix.

Fixing is not sufficient, there is more

Now that the problem has been fixed, there are several other things that should be considered before the incident can be closed:

  1. Can similar problems occur in other parts of the system?
  2. Has the problem been properly documented/tracked so it will not happen again in the future?
  3. Are there techniques or procedures that could have prevented the defect from occurring in the first place?
  4. What have I learned from the problem? Can I generalize what I learned?
chibios/articles/debugging7steps.txt · Last modified: 2011/12/22 14:46 by giovanni