# Fixed point number programming

### introduce

This application note describes how to use ARM C compiler and arm or Thumb assembler to write efficient fixed-point arithmetic code. Since the ARM kernel is an integer processor, integer arithmetic must be used to simulate all floating-point operations. Using fixed-point algorithm instead of floating-point number will greatly improve the performance of many algorithms.

This document contains the following sections:

• Principle of fixed-point operation: describes the mathematical concepts required for fixed-point operation.
• Example: an example of writing fixed-point code for signal processing and graphics processing is given, which are the two most common uses.
• C programming: covers the actual details of implementing fixed-point code in C. The sample program is given with a set of macro definitions.
• Assembler: introduces an example of an assembler.

### Principle of fixed point algorithm

In computational arithmetic, a pair of integers (n, e) can be used to approximate fractions: mantissa and exponent, respectively. This pair of integers represents the fraction n2 − en2 − E. The exponent e can be considered to be the number of digits that must be moved before the binary decimal point is placed. such as

Mantissa (n) Exponent (e) Binary Decimal
01100100 -1 011001000. 200
01100100 0 01100100. 100
01100100 1 0110010.0 50
01100100 2 011001.00 25
01100100 3 01100.100 12.5
01100100 7 0.1100100 0.78125

If e is a variable, stored in a register, and unknown at compile time, (n, e) is called a floating-point number. If e is known in advance, at compile time, (n, e) is called a fixed number of points. By storing mantissa, fixed number of points can be stored in standard integer variable (or register).

For fixed-point numbers, the exponent e is usually represented by the letter q. The following section describes how to perform basic arithmetic operations on two fixed-point numbers, two of which are a=n2 − pa=n2 − P and b=m2 − qb=m2 − q, and the result is expressed as c=k2 − rc=k2 − r, where p, q and r are fixed constant exponents.

In the calculation, we only need two mantissa n m to get k. They will be converted using their index. Only the mantissa is stored when the number is actually stored. The exponent is stored by another corresponding number.

#### Index movement

The easiest operation to perform on a fixed number of points is to change the exponent. In order to change the exponent from p to r (perform operation c = a), the mantissa k c a n be calculated from n by a simple shift.

Because: n2 − P = n2 r – p * 2 − rn2 − p=n2r – p * 2 − r

k=n<<(r−p)k=n<<(r−p) if(r>=p)if(r>=p)

k=n>>(p−r)k=n>>(p−r) if(p>r)if(p>r)

To perform the operation c = a + b, first convert a and b to the same exponent r as c, and then add the mantissa. The method consists of the following equation:

n2−r+m2−r=(n+m)2−rn2−r+m2−r=(n+m)2−r

Subtraction analogous

#### multiplication

Product c = ab can be performed using a simple integer multiplication. The equation is as follows:

ab=n2–p∗m2–q=(nm)2–(p+q)ab=n2–p∗m2–q=(nm)2–(p+q)

So the product n * m is the mantissa of the answer with exponent p + q. To convert the answer to have an exponent r, perform the shift as described above. such as

k=(n∗m)>>(p+q−r)k=(n∗m)>>(p+q−r) if(p+q>=r)if(p+q>=r)

#### division

Division c=a/b can also be performed by simple integer division. The formula is:

a/b=(n2−p)/(m2−q)=(n/m)2q−p=(n/m)2r+q−p2−ra/b=(n2−p)/(m2−q)=(n/m)2q−p=(n/m)2r+q−p2−r

In order not to lose precision, a multiplication of 2r+q − p2r+q − P must be performed before dividing by m. For example,

k=(n<<(r+q−p))/mk=(n<<(r+q−p))/m if(r+q>=p)if(r+q>=p)

#### square root

The formula of the square root is as follows:

a−−√=n2−p−−−−√=n22r−p−−−−−√∗2−ra=n2−p=n22r−p∗2−r

In other words, c=a − √ c=a performs k = isqr (n < < 2R − p) k = isqr (n < < 2R − p) where isqr is the square root function of an integer.

#### overflow

The above equation shows how to use only integer operations to perform basic operations of addition, subtraction, multiplication, division and root extraction for fixed-point numbers. However, in order for the answer to be accurately stored in the result represented by (± 2 − r ± 2 − r), you must ensure that no overflow occurs in the calculation. If the index is not carefully selected, every left shift, plus / minus or multiplication will overflow and lead to meaningless answers.

The following example of a "real world" scenario shows how to choose an index.

### Example

#### signal processing

In signal processing, data is usually composed of fractional values, ranging from - 1 to + 1. This example uses the exponent Q and 32-bit integer mantissa (so that each value can be saved in an ARM register). To be able to multiply two numbers without overflowing, you need 2q < 32 or Q < = 15. In practice, q = 14 is usually chosen because it allows for multiplication with several accumulations without spillover risk.

Fix q = 14 at compile time. Then 0xFFFFC000 represents - 1, 0x00004000 represents + 1 and 0x00000001 represents 2 − 142 − 14, the most accurate divisor.

Suppose x, y are two q = 14 mantissa, n, m are two integers. By applying the above formula, you can get the basic operation, where a is a normal integer:

Operation Code to produce mantissa of answer in q=14 format
x + y x + y
x + a x + (a<<14)
xa x*a
xy (x*y)>>14
x/a x/a
a/b (a<<14)/b
x/y (x<<14)/y
x−−√x sqr(x<<14)

#### image processing

In this example, the (x, y, z) coordinates are stored in the form of eight decimal places (q = 8). If x is stored in a 32-bit register, the lowest byte of X gives the decimal part, and the highest three parts are the integer part.

To calculate the distance d from the origin, in the form of q = 8: d=x2+y2+z2 − − − − − − − − − − − − − √ d=x2+y2+z2

If you apply the above formula directly and save all intermediate answers in the form of q = 8, you will get the following code:

``````x = (x*x)>>8      square x
y = (y*y)>>8      square y
z = (z*z)>>8      square z
s = x+y+z         sum of squares
d = sqrt(s<<8)    the distance in q=8 form``````
• 1
• 2
• 3
• 4
• 5

Or, if you keep the middle answer in q = 16 format, reduce the number of shifts and improve the accuracy:

``````x = x*x      square of x in q=16 form
y = y*y      square of y in q=16 form
z = z*z      square of z in q=16 form
s = x+y+x    sum of squares in q=16 form
d = sqr(s)   distance d in q=8 form``````
• 1
• 2
• 3
• 4
• 5

#### summary

• If you add two numbers in Q, they will remain q-form.
• If you multiply two numbers by Q, the answer is 2q.
• If you take the square root of a number in q, the answer is q / 2.
• To convert from q-form to r-form, you need to move left (r-q) or right (q-r), depending on which of Q and r is larger
• To get the best precision results, select the largest q and ensure that the intermediate calculation does not overflow.

### C program

Fixed point arithmetic can be programmed in C using standard integer arithmetic operations, and shifting is used to change the q-form when needed (usually before or after operations to ensure that the answer is still in the q-form).

To make programming easier to read, a set of C macros has been defined. The following example program defines these macros and explains their use:

Note: compilation needs to add - lm option at the end, such as gcc fixed.c -o fixed -lm

``````/* A Simple C program to illustrate the use of Fixed Point Operations */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
/* DEFINE THE MACROS */
/* The basic operations perfomed on two numbers a and b of fixed
point q format returning the answer in q format */
#define FSUB(a,b) ((a)-(b))
#define FMUL(a,b,q) (((a)*(b))>>(q))
#define FDIV(a,b,q) (((a)<<(q))/(b))
/* The basic operations where a is of fixed point q format and b is
an integer */
#define FSUBI(a,b,q) ((a)-((b)<<(q)))
#define FMULI(a,b) ((a)*(b))
#define FDIVI(a,b) ((a)/(b))
/* convert a from q1 format to q2 format */
#define FCONV(a, q1, q2) (((q2)>(q1)) ? (a)<<((q2)-(q1)) : (a)>>((q1)-(q2)))
/* the general operation between a in q1 format and b in q2 format
returning the result in q3 format */
#define FSUBG(a,b,q1,q2,q3) (FCONV(a,q1,q3)-FCONV(b,q2,q3))
#define FMULG(a,b,q1,q2,q3) FCONV((a)*(b), (q1)+(q2), q3)
#define FDIVG(a,b,q1,q2,q3) (FCONV(a, q1, (q2)+(q3))/(b))
/* convert to and from floating point */
#define TOFIX(d, q) ((int)( (d)*(double)(1<<(q)) ))
#define TOFLT(a, q) ( (double)(a) / (double)(1<<(q)) )
#define TEST(FIX, FLT, STR) { \
a = a1 = randint(); \
b = bi = a2 = randint(); \
fa = TOFLT(a, q); \
fa1 = TOFLT(a1, q1); \
fa2 = TOFLT(a2, q2); \
fb = TOFLT(b, q); \
ans1 = FIX; \
ans2 = FLT; \
STR, TOFLT(ans1, q), ans2); \
}
int randint(void) {
int i;
i=rand();
i = i & 32767;
if (rand() & 1) i=-i;
return i;
}
int main(void) {
int q=14, q1=15, q2=7, q3=14;
double fa, fb, fa1, fa2;
int a,b,bi,a1,a2;
int ans1;
double ans2;
/* test each of the MACRO's with some random data */
TEST(FSUB(a,b), fa-fb, "FSUB");
TEST(FMUL(a,b,q), fa*fb, "FMUL");
TEST(FDIV(a,b,q), fa/fb, "FDIV");
TEST(FSUBI(a,bi,q), fa-bi, "FSUBI");
TEST(FMULI(a,bi), fa*bi, "FSUBI");
TEST(FDIVI(a,bi), fa/bi, "FSUBI");
TEST(FSUBG(a1,a2,q1,q2,q3), fa1-fa2, "FSUBG");
TEST(FMULG(a1,a2,q1,q2,q3), fa1*fa2, "FMULG");
TEST(FDIVG(a1,a2,q1,q2,q3), fa1/fa2, "FDIVG");
printf("Finished standard test\n");
/* the following code will calculate exp(x) by summing the
series (not efficient but useful as an example) and compare it
with the actual value */
while (1) {
printf("Please enter the number to be exp'ed (<1): ");
scanf("%lf", &fa);
printf(" x = %f\n", fa);
printf(" exp(x) = %f\n", exp(fa));
q = 14; /* set the fixed point */
a = TOFIX(fa, q); /* get a in fixed point format */
a1 = FCONV(1, 0, q); /* a to power 0 */
a2 = 1; /* 0 factorial */
ans1 =0; /* series sum so far */
for (bi=0 ; bi<10; bi++) { /* do ten terms of the series */
int j;
j = FDIVI(a1, a2); /* a^n/n! */