### Self translation Fixed Point Arithmetic on the ARM

### introduce

This application note describes how to use ARM C compiler and arm or Thumb assembler to write efficient fixed-point arithmetic code. Since the ARM kernel is an integer processor, integer arithmetic must be used to simulate all floating-point operations. Using fixed-point algorithm instead of floating-point number will greatly improve the performance of many algorithms.

This document contains the following sections:

- Principle of fixed-point operation: describes the mathematical concepts required for fixed-point operation.
- Example: an example of writing fixed-point code for signal processing and graphics processing is given, which are the two most common uses.
- C programming: covers the actual details of implementing fixed-point code in C. The sample program is given with a set of macro definitions.
- Assembler: introduces an example of an assembler.

### Principle of fixed point algorithm

In computational arithmetic, a pair of integers (n, e) can be used to approximate fractions: mantissa and exponent, respectively. This pair of integers represents the fraction n2 − en2 − E. The exponent e can be considered to be the number of digits that must be moved before the binary decimal point is placed. such as

Mantissa (n) | Exponent (e) | Binary | Decimal |
---|---|---|---|

01100100 | -1 | 011001000. | 200 |

01100100 | 0 | 01100100. | 100 |

01100100 | 1 | 0110010.0 | 50 |

01100100 | 2 | 011001.00 | 25 |

01100100 | 3 | 01100.100 | 12.5 |

01100100 | 7 | 0.1100100 | 0.78125 |

If e is a variable, stored in a register, and unknown at compile time, (n, e) is called a floating-point number. If e is known in advance, at compile time, (n, e) is called a fixed number of points. By storing mantissa, fixed number of points can be stored in standard integer variable (or register).

For fixed-point numbers, the exponent e is usually represented by the letter q. The following section describes how to perform basic arithmetic operations on two fixed-point numbers, two of which are a=n2 − pa=n2 − P and b=m2 − qb=m2 − q, and the result is expressed as c=k2 − rc=k2 − r, where p, q and r are fixed constant exponents.

In the calculation, we only need two mantissa n m to get k. They will be converted using their index. Only the mantissa is stored when the number is actually stored. The exponent is stored by another corresponding number.

#### Index movement

The easiest operation to perform on a fixed number of points is to change the exponent. In order to change the exponent from p to r (perform operation c = a), the mantissa k c a n be calculated from n by a simple shift.

Because: n2 − P = n2 r – p * 2 − rn2 − p=n2r – p * 2 − r

k=n<<(r−p)k=n<<(r−p) if(r>=p)if(r>=p)

k=n>>(p−r)k=n>>(p−r) if(p>r)if(p>r)

#### Addition and subtraction

To perform the operation c = a + b, first convert a and b to the same exponent r as c, and then add the mantissa. The method consists of the following equation:

n2−r+m2−r=(n+m)2−rn2−r+m2−r=(n+m)2−r

Subtraction analogous

#### multiplication

Product c = ab can be performed using a simple integer multiplication. The equation is as follows:

ab=n2–p∗m2–q=(nm)2–(p+q)ab=n2–p∗m2–q=(nm)2–(p+q)

So the product n * m is the mantissa of the answer with exponent p + q. To convert the answer to have an exponent r, perform the shift as described above. such as

k=(n∗m)>>(p+q−r)k=(n∗m)>>(p+q−r) if(p+q>=r)if(p+q>=r)

#### division

Division c=a/b can also be performed by simple integer division. The formula is:

a/b=(n2−p)/(m2−q)=(n/m)2q−p=(n/m)2r+q−p2−ra/b=(n2−p)/(m2−q)=(n/m)2q−p=(n/m)2r+q−p2−r

In order not to lose precision, a multiplication of 2r+q − p2r+q − P must be performed before dividing by m. For example,

k=(n<<(r+q−p))/mk=(n<<(r+q−p))/m if(r+q>=p)if(r+q>=p)

#### square root

The formula of the square root is as follows:

a−−√=n2−p−−−−√=n22r−p−−−−−√∗2−ra=n2−p=n22r−p∗2−r

In other words, c=a − √ c=a performs k = isqr (n < < 2R − p) k = isqr (n < < 2R − p) where isqr is the square root function of an integer.

#### overflow

The above equation shows how to use only integer operations to perform basic operations of addition, subtraction, multiplication, division and root extraction for fixed-point numbers. However, in order for the answer to be accurately stored in the result represented by (± 2 − r ± 2 − r), you must ensure that no overflow occurs in the calculation. If the index is not carefully selected, every left shift, plus / minus or multiplication will overflow and lead to meaningless answers.

The following example of a "real world" scenario shows how to choose an index.

### Example

#### signal processing

In signal processing, data is usually composed of fractional values, ranging from - 1 to + 1. This example uses the exponent Q and 32-bit integer mantissa (so that each value can be saved in an ARM register). To be able to multiply two numbers without overflowing, you need 2q < 32 or Q < = 15. In practice, q = 14 is usually chosen because it allows for multiplication with several accumulations without spillover risk.

Fix q = 14 at compile time. Then 0xFFFFC000 represents - 1, 0x00004000 represents + 1 and 0x00000001 represents 2 − 142 − 14, the most accurate divisor.

Suppose x, y are two q = 14 mantissa, n, m are two integers. By applying the above formula, you can get the basic operation, where a is a normal integer:

Operation | Code to produce mantissa of answer in q=14 format |
---|---|

x + y | x + y |

x + a | x + (a<<14) |

xa | x*a |

xy | (x*y)>>14 |

x/a | x/a |

a/b | (a<<14)/b |

x/y | (x<<14)/y |

x−−√x | sqr(x<<14) |

#### image processing

In this example, the (x, y, z) coordinates are stored in the form of eight decimal places (q = 8). If x is stored in a 32-bit register, the lowest byte of X gives the decimal part, and the highest three parts are the integer part.

To calculate the distance d from the origin, in the form of q = 8: d=x2+y2+z2 − − − − − − − − − − − − − √ d=x2+y2+z2

If you apply the above formula directly and save all intermediate answers in the form of q = 8, you will get the following code:

`x = (x*x)>>8 square x y = (y*y)>>8 square y z = (z*z)>>8 square z s = x+y+z sum of squares d = sqrt(s<<8) the distance in q=8 form`

- 1
- 2
- 3
- 4
- 5

Or, if you keep the middle answer in q = 16 format, reduce the number of shifts and improve the accuracy:

`x = x*x square of x in q=16 form y = y*y square of y in q=16 form z = z*z square of z in q=16 form s = x+y+x sum of squares in q=16 form d = sqr(s) distance d in q=8 form`

- 1
- 2
- 3
- 4
- 5

#### summary

- If you add two numbers in Q, they will remain q-form.
- If you multiply two numbers by Q, the answer is 2q.
- If you take the square root of a number in q, the answer is q / 2.
- To convert from q-form to r-form, you need to move left (r-q) or right (q-r), depending on which of Q and r is larger
- To get the best precision results, select the largest q and ensure that the intermediate calculation does not overflow.

### C program

Fixed point arithmetic can be programmed in C using standard integer arithmetic operations, and shifting is used to change the q-form when needed (usually before or after operations to ensure that the answer is still in the q-form).

To make programming easier to read, a set of C macros has been defined. The following example program defines these macros and explains their use:

Note: compilation needs to add - lm option at the end, such as gcc fixed.c -o fixed -lm

`/* A Simple C program to illustrate the use of Fixed Point Operations */ #include <stdio.h> #include <stdlib.h> #include <math.h> /* DEFINE THE MACROS */ /* The basic operations perfomed on two numbers a and b of fixed point q format returning the answer in q format */ #define FADD(a,b) ((a)+(b)) #define FSUB(a,b) ((a)-(b)) #define FMUL(a,b,q) (((a)*(b))>>(q)) #define FDIV(a,b,q) (((a)<<(q))/(b)) /* The basic operations where a is of fixed point q format and b is an integer */ #define FADDI(a,b,q) ((a)+((b)<<(q))) #define FSUBI(a,b,q) ((a)-((b)<<(q))) #define FMULI(a,b) ((a)*(b)) #define FDIVI(a,b) ((a)/(b)) /* convert a from q1 format to q2 format */ #define FCONV(a, q1, q2) (((q2)>(q1)) ? (a)<<((q2)-(q1)) : (a)>>((q1)-(q2))) /* the general operation between a in q1 format and b in q2 format returning the result in q3 format */ #define FADDG(a,b,q1,q2,q3) (FCONV(a,q1,q3)+FCONV(b,q2,q3)) #define FSUBG(a,b,q1,q2,q3) (FCONV(a,q1,q3)-FCONV(b,q2,q3)) #define FMULG(a,b,q1,q2,q3) FCONV((a)*(b), (q1)+(q2), q3) #define FDIVG(a,b,q1,q2,q3) (FCONV(a, q1, (q2)+(q3))/(b)) /* convert to and from floating point */ #define TOFIX(d, q) ((int)( (d)*(double)(1<<(q)) )) #define TOFLT(a, q) ( (double)(a) / (double)(1<<(q)) ) #define TEST(FIX, FLT, STR) { \ a = a1 = randint(); \ b = bi = a2 = randint(); \ fa = TOFLT(a, q); \ fa1 = TOFLT(a1, q1); \ fa2 = TOFLT(a2, q2); \ fb = TOFLT(b, q); \ ans1 = FIX; \ ans2 = FLT; \ printf("Testing %s\n fixed point answer=%f\n floating point answer=%f\n", \ STR, TOFLT(ans1, q), ans2); \ } int randint(void) { int i; i=rand(); i = i & 32767; if (rand() & 1) i=-i; return i; } int main(void) { int q=14, q1=15, q2=7, q3=14; double fa, fb, fa1, fa2; int a,b,bi,a1,a2; int ans1; double ans2; /* test each of the MACRO's with some random data */ TEST(FADD(a,b), fa+fb, "FADD"); TEST(FSUB(a,b), fa-fb, "FSUB"); TEST(FMUL(a,b,q), fa*fb, "FMUL"); TEST(FDIV(a,b,q), fa/fb, "FDIV"); TEST(FADDI(a,bi,q), fa+bi, "FADDI"); TEST(FSUBI(a,bi,q), fa-bi, "FSUBI"); TEST(FMULI(a,bi), fa*bi, "FSUBI"); TEST(FDIVI(a,bi), fa/bi, "FSUBI"); TEST(FADDG(a1,a2,q1,q2,q3), fa1+fa2, "FADDG"); TEST(FSUBG(a1,a2,q1,q2,q3), fa1-fa2, "FSUBG"); TEST(FMULG(a1,a2,q1,q2,q3), fa1*fa2, "FMULG"); TEST(FDIVG(a1,a2,q1,q2,q3), fa1/fa2, "FDIVG"); printf("Finished standard test\n"); /* the following code will calculate exp(x) by summing the series (not efficient but useful as an example) and compare it with the actual value */ while (1) { printf("Please enter the number to be exp'ed (<1): "); scanf("%lf", &fa); printf(" x = %f\n", fa); printf(" exp(x) = %f\n", exp(fa)); q = 14; /* set the fixed point */ a = TOFIX(fa, q); /* get a in fixed point format */ a1 = FCONV(1, 0, q); /* a to power 0 */ a2 = 1; /* 0 factorial */ ans1 =0; /* series sum so far */ for (bi=0 ; bi<10; bi++) { /* do ten terms of the series */ int j; j = FDIVI(a1, a2); /* a^n/n! */ ans1 = FADD(ans1, j); a1 = FMUL(a1, a, q); /* increase power of a by 1 */ a2 = a2*(bi+1); /* next factorial */ printf("Stage %d answer = %f\n", bi, TOFLT(ans1, q)); } } return 0; }`