Optimizing embedded code

This article explains some techniques for embedded-oriented code optimization.

The Compiler

A common misconception is that the embedded programmer should not try to hand optimize code because the compiler will probably optimize it even better and readability would suffer. This is only partially true: there are situations where knowing how the compiler works and how the underlying hardware architecture behaves makes it possible to write better C code.

False Optimizations

There are things that the compiler can easily optimize by itself. For example, writing a = a >> 1 should never replace a = a / 2 if your intention is to divide the variable “a” by 2 rather than to shift it. By doing so you are reducing the readability of your code without really improving anything; modern compilers are perfectly able to perform that optimization by themselves.
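As a concrete illustration (a minimal sketch, not part of the original article): for unsigned operands the two forms compile to the same shift anyway, and for signed operands they are not even equivalent, so the readable division is the right choice in both cases.

```c
/* Unsigned: a / 2 and a >> 1 are equivalent; the compiler emits a single
   logical shift for either form, so the readable division costs nothing. */
unsigned int half_u(unsigned int a) {
    return a / 2;
}

/* Signed: the two forms differ for negative values. C division truncates
   toward zero, so half_s(-3) == -1, while an arithmetic right shift
   rounds toward minus infinity: -3 >> 1 is -2 on typical two's
   complement targets. The compiler still compiles this division to a
   shift plus a small sign fix-up, with no divide instruction. */
int half_s(int a) {
    return a / 2;
}
```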

Real Optimizations

On the other hand, there are many optimizations that the compiler cannot perform by itself, either because it does not have all the required information or because of architectural constraints. Often, simply reorganizing existing code slightly makes it possible to improve both size and speed.

Tail Calls

A common situation that can be hand optimized is to write code that is “tail call” friendly. What exactly is a tail call? Consider the following code, a function invoking another function:

void bbb(void);
 
void aaa(void) {
 
  /* Code block 1.*/
  ...
  bbb();
  /* Code block 2.*/
  ...
}

It will result in the following pseudo-asm code:

        extern bbb
        public aaa
aaa:
        push    used-registers
        /* Asm code block 1.*/
        ...
        call    bbb
        /* Asm code block 2.*/
        ...
        pop     used-registers
        ret

Now imagine that the function could be rewritten as follows, moving the invocation of bbb() to the end of aaa():

void bbb(void);
 
void aaa(void) {
 
  /* Code block 1.*/
  ...
  /* Code block 2.*/
  ...
  bbb();
}

A non-optimizing compiler would generate:

        extern bbb
        public aaa
aaa:
        push    used-registers
        /* Asm code block 1.*/
        ...
        /* Asm code block 2.*/
        ...
        call    bbb
        pop     used-registers
        ret

But a good compiler would optimize the above code as:

        extern bbb
        public aaa
aaa:
        push    less-registers
        /* Asm code blocks 1 and 2.*/
        ...
        pop     less-registers
        jmp     bbb

Let's see exactly what happened:

  1. The compiler saved one instruction: by jumping directly to the bbb() address, it does not have to generate a final ret instruction, since bbb()'s own ret returns straight to aaa()'s caller.
  2. The two code blocks can be better optimized together because the barrier of the opaque function call is gone. A function can change global variables and, bbb() being external, the compiler cannot know exactly what it does; it has to make the safest assumptions, which prevents possible optimizations.
  3. The compiler will probably need to push/pop fewer registers because it does not have to “transport” automatic variables across the function call, thus reducing the overall register pressure.
  4. Less stack usage: the stack frame of the function aaa() is removed before invoking bbb(), which means precious RAM space saved, especially in a multithreaded environment where each thread has its own stack.

The above example assumes that the call can be moved to the end; both versions are perfectly good C code, but the latter is better and the compiler has no way to perform this transformation by itself.
Note that tail calls can also be performed on functions that take parameters (depending on the ABI) and that have return values.
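A sketch of that case (bbb()'s body here is an arbitrary stand-in so the example is self-contained; in the article it would be an external function): the argument lands in the same register that aaa() received its own parameter in, and bbb()'s return value becomes aaa()'s return value with no extra code.

```c
/* Stand-in body; the doubling is arbitrary, just to make it testable. */
static int bbb(int n) {
    return n * 2;
}

/* The call to bbb() is the last action and its result is returned
   unchanged, so a good compiler can emit it as a plain "jmp bbb": the
   argument i + 1 is computed into the argument register and bbb()'s
   own ret returns straight to aaa()'s caller. */
int aaa(int i) {
    return bbb(i + 1);
}
```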

Parameter Life Span

Let's consider this code:

void bbb(int n);
 
void aaa(int i) {
  bbb(4);
  use_parameter(i);
  /* Other code.*/
  ...
}

The above code requires the parameter “i” to survive until its use. Most modern architectures (ARM, Power etc.) pass parameters in processor registers, so in the above code the register holding the parameter “i” (let's say R0) is not available until the function use_parameter() is invoked. Invoking the function bbb() also requires R0, so the compiler has to move “i” into some other register:

        extern bbb
        extern use_parameter
        public aaa
aaa:
        push    R4
        mov     R4,R0
        mov     R0,#4
        call    bbb    // Function calls are assumed to trash R0...R3
        mov     R0,R4
        call    use_parameter
        /* Other asm code.*/
        pop     R4
        ret

Now, assuming that bbb() does not have to be invoked before use_parameter(), we can change the code to:

void bbb(int n);
 
void aaa(int i) {
  use_parameter(i);
  bbb(4);
  /* Other code.*/
  ...
}

And this would result in:

        extern bbb
        extern use_parameter
        public aaa
aaa:
        call    use_parameter
        mov     R0,#4
        call    bbb
        /* Other asm code.*/
        ret

This new code uses three fewer instructions and saves some stack space. In general, reducing the life span of parameters can be a huge advantage on architectures that pass parameters in registers. It would not affect architectures that pass parameters on the stack, however (usually those with few internal registers).

Parameter Reordering

Let's consider the following code:

void bbb(int c, int d);
 
void aaa(int a, int b) {
 
  bbb(b, a);
  /* Other code.*/
  ...
}

Assuming that the target architecture specifies that parameters are passed in registers R0…R3 (incidentally, ARM does exactly that), we will get:

        extern bbb
        public aaa
aaa:
        mov     R2,R0
        mov     R0,R1  // "b" in R0
        mov     R1,R2  // "a" in R1
        call    bbb
        /* Other asm code.*/
        ...
        ret

A register mess. Now, if we could rewrite bbb() as follows:

void bbb(int d, int c);
 
void aaa(int a, int b) {
 
  bbb(a, b);
  /* Other code.*/
  ...
}

The result would be:

        extern bbb
        public aaa
aaa:
        call    bbb
        /* Other asm code.*/
        ...
        ret

This saves the swapping of the parameters and potentially some extra push/pop instructions as well. In general, try to keep parameters in the same positions on architectures that pass parameters in registers. Even keeping just some of the parameters in the same position can help avoid register reordering.
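A sketch of that partial case (the names and bbb()'s body are illustrative stand-ins): only the first parameter keeps its position, yet that is already enough to leave “a” untouched in R0 across the call on an ARM-like ABI, with just one value to materialize in R1.

```c
/* Stand-in for an external function; note that its first parameter
   matches the position of the caller's first parameter. */
static int bbb(int a, int extra) {
    return a + extra;
}

int aaa(int a, int b) {
    /* "a" is already in R0 on a register-based ABI; only b + 1 has to
       be computed into R1, so no register swapping is needed. */
    return bbb(a, b + 1);
}
```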

Automatic Variables

Just like parameters, automatic variables can also be allocated into registers. It is a good idea to keep variable life spans (from first use to last use) as short as possible and, if possible, to avoid function calls inside them:

int get_integer(void);
void use_integer(int a);
void bbb(void);
 
int global_a = 1;
 
void aaa(void) {
  int i;
 
  i = get_integer();
  ...
  bbb();
  ...
  use_integer(i + global_a);
  ...
}

This would result in something like:

        extern get_integer
        extern use_integer
        public aaa
aaa:
        push    R4
        call    get_integer
        mov     R4,R0
        ...
        call    bbb    // Function calls are assumed to trash R0...R3
        ...
        mov     R0,R4
        add     R0,@global_a
        call    use_integer
        ...
        pop     R4
        ret

The function bbb() acts like a barrier that forces the use of an extra register, which must also be saved to and restored from the stack. Now, if we can reorganize the code as follows:

int get_integer(void);
void use_integer(int a);
void bbb(void);
 
int global_a = 1;
 
void aaa(void) {
  int i;
 
  i = get_integer();
  ...
  use_integer(i + global_a);
  bbb();
  ...
}

We have removed the barrier from the life span of the variable “i” and the code becomes:

        extern get_integer
        extern use_integer
        public aaa
aaa:
        call    get_integer
        ...
        add     R0,@global_a
        call    use_integer
        call    bbb
        ...
        ret

A much better result. In general, try to keep variable life spans short and to avoid function-call barriers; their cost is high.

Final Considerations

A few final random notes:

  • C code is usually quite portable but not necessarily optimal on all architectures.
  • Non-automatic variables have a higher access cost on some architectures, usually 32-bit RISCs, because of the relatively high cost of accessing an absolute address. On some other architectures it is automatic variables that have a higher access cost instead: some 8-bit architectures can be less efficient at handling variables allocated on the stack.
  • Knowledge of the target architecture is important.
  • Knowledge of the ABI is important.
  • Consider function calls as barriers; crossing barriers has a cost for automatic variables and, in general, increases stack usage.
  • Consider inlining small functions. An inlined function no longer represents a barrier in the invoking code; some compilers do this automatically at high optimization levels.
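For instance (a minimal sketch with illustrative names, not a real API), a small accessor declared static inline in a header stops being a call barrier: at the call site no registers are assumed trashed and no stack frame is created.

```c
#include <stdint.h>

/* Once inlined, this is no longer an opaque call: the compiler can
   keep "flags" in a register across its use and fold the shift and
   mask directly into the surrounding code. */
static inline uint32_t bit_is_set(uint32_t flags, unsigned int n) {
    return (flags >> n) & 1U;
}
```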
 
chibios/articles/code_optimization.txt · Last modified: 2011/10/03 20:43 by giovanni
 
Except where otherwise noted, content on this wiki is licensed under the following license: GNU Free Documentation License 1.3