Developing software is hard; developing embedded software with realtime requirements is much harder. When the time comes that you are facing “The Problem”, it is better to proceed in a rational way.
There are several steps that should be followed when facing non-obvious anomalies:
No amount of testing can prevent every problem; problems happen. First, we need to realize that there is a problem. This is often not so simple, because the nastiest problems are those that do not manifest themselves in an obvious way, for example increased response-time jitter, randomly missed deadlines, or mysterious random lockups.
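Symptoms such as jitter and missed deadlines are easier to notice when the code measures them instead of relying on observation. The following is a minimal sketch, not part of any particular RTOS: the type `jitter_monitor_t` and its functions are hypothetical names, and the timestamp is assumed to come from a free-running 32-bit counter of your choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical jitter monitor for one periodic activity (illustrative only). */
typedef struct {
    uint32_t expected_period; /* nominal period, in counter ticks          */
    uint32_t max_jitter;      /* tolerated deviation from the period       */
    uint32_t last_stamp;      /* counter value at the previous activation  */
    uint32_t worst_jitter;    /* largest deviation observed so far         */
    uint32_t violations;      /* activations that exceeded max_jitter      */
    bool     primed;          /* false until the first sample is recorded  */
} jitter_monitor_t;

void jitter_monitor_init(jitter_monitor_t *m, uint32_t period,
                         uint32_t max_jitter)
{
    assert(period > 0u);
    m->expected_period = period;
    m->max_jitter = max_jitter;
    m->last_stamp = 0u;
    m->worst_jitter = 0u;
    m->violations = 0u;
    m->primed = false;
}

/* Call at the start of every periodic activation, passing the current
 * value of a free-running counter (assumed to wrap at 32 bits). */
void jitter_monitor_sample(jitter_monitor_t *m, uint32_t now)
{
    if (!m->primed) {
        /* The first call only records a reference timestamp. */
        m->last_stamp = now;
        m->primed = true;
        return;
    }

    /* Unsigned subtraction stays correct across counter wrap-around. */
    uint32_t elapsed = now - m->last_stamp;
    m->last_stamp = now;

    uint32_t jitter = (elapsed > m->expected_period)
                          ? elapsed - m->expected_period
                          : m->expected_period - elapsed;

    if (jitter > m->worst_jitter) {
        m->worst_jitter = jitter;
    }
    if (jitter > m->max_jitter) {
        m->violations++;
    }
}
```

Inspecting `worst_jitter` and `violations` from a debugger or a diagnostic shell after a long run gives a first indication of whether a timing problem exists at all.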
If a problem cannot be reliably reproduced, it is hard to proceed. Luckily, if a problem does not depend on external events (communication, sampling, reading of I/O lines, external interrupts), it will happen reliably, because an MCU system is deterministic in nature. The worst case is a problem that is truly random in nature.
The first step toward a solution is to exclude all the code that is not necessary for the bug to happen. A small test case is easier to handle than a complex application. For random problems, the best approach is to design and implement a stress test for the software components involved, in order to trigger the problem more easily.
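As an illustration of a stress test on an isolated component, the sketch below hammers a small ring buffer with a pseudo-random sequence of operations and checks every result against a trivial reference model. Everything here is hypothetical (`ring_t`, `rb_put`, `rb_get`, `stress_ring`); the ring buffer merely stands in for whatever component is under suspicion. A seeded generator is used so that a failing run can be replayed exactly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 8u /* capacity of the component under test */
static_assert((RB_SIZE & (RB_SIZE - 1u)) == 0u, "RB_SIZE must be a power of two");

/* Hypothetical component under test: a byte ring buffer. */
typedef struct {
    uint8_t  data[RB_SIZE];
    uint32_t head; /* total bytes written */
    uint32_t tail; /* total bytes read    */
} ring_t;

static bool rb_put(ring_t *rb, uint8_t b)
{
    if (rb->head - rb->tail == RB_SIZE) {
        return false; /* full */
    }
    rb->data[rb->head++ % RB_SIZE] = b;
    return true;
}

static bool rb_get(ring_t *rb, uint8_t *b)
{
    if (rb->head == rb->tail) {
        return false; /* empty */
    }
    *b = rb->data[rb->tail++ % RB_SIZE];
    return true;
}

/* xorshift32: small deterministic generator, so a seed replays a run. */
static uint32_t prng_next(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Runs 'iterations' random put/get operations.
 * Returns -1 on success, or the index of the first operation whose
 * result disagrees with the reference model. */
long stress_ring(uint32_t seed, long iterations)
{
    ring_t   rb = {0};
    uint8_t  next_in = 0u;  /* next value the model expects to write */
    uint8_t  next_out = 0u; /* next value the model expects to read  */
    uint32_t count = 0u;    /* bytes the model believes are queued   */
    uint32_t state = (seed != 0u) ? seed : 1u; /* xorshift needs nonzero */

    for (long i = 0; i < iterations; i++) {
        if ((prng_next(&state) & 1u) != 0u) {
            bool ok = rb_put(&rb, next_in);
            if (ok != (count < RB_SIZE)) {
                return i;
            }
            if (ok) {
                next_in++;
                count++;
            }
        } else {
            uint8_t b = 0u;
            bool ok = rb_get(&rb, &b);
            if (ok != (count > 0u)) {
                return i;
            }
            if (ok) {
                if (b != next_out) {
                    return i;
                }
                next_out++;
                count--;
            }
        }
    }
    return -1;
}
```

When a run fails, the returned operation index together with the seed identifies the exact sequence that triggered the defect, which is already most of a minimal test case.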
Find the trigger of the bug, for example an IRQ preempting another IRQ of lower priority. Understanding when the problem happens is a necessary step toward the solution.
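One common way to discover what happened just before a failure is to keep a small in-memory trace of recent events and inspect it from the debugger once the problem shows up. The sketch below is only an illustration: `trace_event` and `trace_recent` are hypothetical names, the event codes are up to the application, and the code as written is not interrupt-safe. On a real target the update of `trace_count` would have to be protected by a critical section or an atomic operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_DEPTH 16u /* number of most recent events kept */
static_assert((TRACE_DEPTH & (TRACE_DEPTH - 1u)) == 0u,
              "TRACE_DEPTH must be a power of two");

/* One trace record; the meaning of 'event' and 'arg' is application-defined. */
typedef struct {
    uint32_t stamp; /* timestamp, e.g. a free-running counter value */
    uint16_t event; /* event code, e.g. "ISR entry", "ISR exit"     */
    uint16_t arg;   /* optional detail, e.g. an IRQ number          */
} trace_entry_t;

static trace_entry_t trace_buf[TRACE_DEPTH];
static uint32_t      trace_count; /* total events recorded so far */

/* Records one event, overwriting the oldest entry when the buffer is full.
 * NOT interrupt-safe as written (assumption: caller provides locking). */
void trace_event(uint32_t stamp, uint16_t event, uint16_t arg)
{
    trace_entry_t *e = &trace_buf[trace_count % TRACE_DEPTH];
    e->stamp = stamp;
    e->event = event;
    e->arg = arg;
    trace_count++;
}

/* Returns the entry recorded 'age' events ago (0 = most recent),
 * or NULL if that entry was never recorded or has been overwritten. */
const trace_entry_t *trace_recent(uint32_t age)
{
    uint32_t available = (trace_count < TRACE_DEPTH) ? trace_count : TRACE_DEPTH;
    if (age >= available) {
        return NULL;
    }
    return &trace_buf[(trace_count - 1u - age) % TRACE_DEPTH];
}
```

Calling `trace_event` at the entry and exit of each interrupt handler makes a preemption visible as an entry record for one IRQ sitting between the entry and exit records of another.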
Before trying to fix the problem, it is necessary to understand exactly what is happening. Without a complete understanding there is a risk of masking the effects of the problem rather than properly fixing it. This is very dangerous because a masked problem *will* return and bite you.
The solution to the defect could be a simple single-line fix, or it could impact the whole system design. The latter is, of course, more problematic, because it means the defect has its root in the system design itself.
Once the defect has been found, reproduced, understood, and analyzed, it is possible to properly implement the now obvious fix.
Now that the problem has been fixed there are several other things that should be considered before the incident can be closed: