Fixed point number programming


Self translation Fixed Point Arithmetic on the ARM


This application note describes how to use ARM C compiler and arm or Thumb assembler to write efficient fixed-point arithmetic code. Since the ARM kernel is an integer processor, integer arithmetic must be used to simulate all floating-point operations. Using fixed-point algorithm instead of floating-point number will greatly improve the performance of many algorithms.

This document contains the following sections:

  • Principle of fixed-point operation: describes the mathematical concepts required for fixed-point operation.
  • Example: an example of writing fixed-point code for signal processing and graphics processing is given, which are the two most common uses.
  • C programming: covers the actual details of implementing fixed-point code in C. The sample program is given with a set of macro definitions.
  • Assembler: introduces an example of an assembler.

Principle of fixed point algorithm

In computational arithmetic, a pair of integers (n, e) can be used to approximate fractions: mantissa and exponent, respectively. This pair of integers represents the fraction n2 − en2 − E. The exponent e can be considered to be the number of digits that must be moved before the binary decimal point is placed. such as

Mantissa (n) Exponent (e) Binary Decimal
01100100 -1 011001000. 200
01100100 0 01100100. 100
01100100 1 0110010.0 50
01100100 2 011001.00 25
01100100 3 01100.100 12.5
01100100 7 0.1100100 0.78125

If e is a variable, stored in a register, and unknown at compile time, (n, e) is called a floating-point number. If e is known in advance, at compile time, (n, e) is called a fixed number of points. By storing mantissa, fixed number of points can be stored in standard integer variable (or register).

For fixed-point numbers, the exponent e is usually represented by the letter q. The following section describes how to perform basic arithmetic operations on two fixed-point numbers, two of which are a=n2 − pa=n2 − P and b=m2 − qb=m2 − q, and the result is expressed as c=k2 − rc=k2 − r, where p, q and r are fixed constant exponents.

In the calculation, we only need two mantissa n m to get k. They will be converted using their index. Only the mantissa is stored when the number is actually stored. The exponent is stored by another corresponding number.

Index movement

The easiest operation to perform on a fixed number of points is to change the exponent. In order to change the exponent from p to r (perform operation c = a), the mantissa k c a n be calculated from n by a simple shift.

Because: n2 − P = n2 r – p * 2 − rn2 − p=n2r – p * 2 − r

k=n<<(r−p)k=n<<(r−p) if(r>=p)if(r>=p)

k=n>>(p−r)k=n>>(p−r) if(p>r)if(p>r)

Addition and subtraction

To perform the operation c = a + b, first convert a and b to the same exponent r as c, and then add the mantissa. The method consists of the following equation:


Subtraction analogous


Product c = ab can be performed using a simple integer multiplication. The equation is as follows:


So the product n * m is the mantissa of the answer with exponent p + q. To convert the answer to have an exponent r, perform the shift as described above. such as

k=(n∗m)>>(p+q−r)k=(n∗m)>>(p+q−r) if(p+q>=r)if(p+q>=r)


Division c=a/b can also be performed by simple integer division. The formula is:


In order not to lose precision, a multiplication of 2r+q − p2r+q − P must be performed before dividing by m. For example,

k=(n<<(r+q−p))/mk=(n<<(r+q−p))/m if(r+q>=p)if(r+q>=p)

square root

The formula of the square root is as follows:


In other words, c=a − √ c=a performs k = isqr (n < < 2R − p) k = isqr (n < < 2R − p) where isqr is the square root function of an integer.


The above equation shows how to use only integer operations to perform basic operations of addition, subtraction, multiplication, division and root extraction for fixed-point numbers. However, in order for the answer to be accurately stored in the result represented by (± 2 − r ± 2 − r), you must ensure that no overflow occurs in the calculation. If the index is not carefully selected, every left shift, plus / minus or multiplication will overflow and lead to meaningless answers.

The following example of a "real world" scenario shows how to choose an index.


signal processing

In signal processing, data is usually composed of fractional values, ranging from - 1 to + 1. This example uses the exponent Q and 32-bit integer mantissa (so that each value can be saved in an ARM register). To be able to multiply two numbers without overflowing, you need 2q < 32 or Q < = 15. In practice, q = 14 is usually chosen because it allows for multiplication with several accumulations without spillover risk.

Fix q = 14 at compile time. Then 0xFFFFC000 represents - 1, 0x00004000 represents + 1 and 0x00000001 represents 2 − 142 − 14, the most accurate divisor.

Suppose x, y are two q = 14 mantissa, n, m are two integers. By applying the above formula, you can get the basic operation, where a is a normal integer:

Operation Code to produce mantissa of answer in q=14 format
x + y x + y
x + a x + (a<<14)
xa x*a
xy (x*y)>>14
x/a x/a
a/b (a<<14)/b
x/y (x<<14)/y
x−−√x sqr(x<<14)

image processing

In this example, the (x, y, z) coordinates are stored in the form of eight decimal places (q = 8). If x is stored in a 32-bit register, the lowest byte of X gives the decimal part, and the highest three parts are the integer part.

To calculate the distance d from the origin, in the form of q = 8: d=x2+y2+z2 − − − − − − − − − − − − − √ d=x2+y2+z2

If you apply the above formula directly and save all intermediate answers in the form of q = 8, you will get the following code:

x = (x*x)>>8      square x
y = (y*y)>>8      square y
z = (z*z)>>8      square z
s = x+y+z         sum of squares
d = sqrt(s<<8)    the distance in q=8 form
  • 1
  • 2
  • 3
  • 4
  • 5

Or, if you keep the middle answer in q = 16 format, reduce the number of shifts and improve the accuracy:

x = x*x      square of x in q=16 form
y = y*y      square of y in q=16 form
z = z*z      square of z in q=16 form
s = x+y+x    sum of squares in q=16 form
d = sqr(s)   distance d in q=8 form
  • 1
  • 2
  • 3
  • 4
  • 5


  • If you add two numbers in Q, they will remain q-form.
  • If you multiply two numbers by Q, the answer is 2q.
  • If you take the square root of a number in q, the answer is q / 2.
  • To convert from q-form to r-form, you need to move left (r-q) or right (q-r), depending on which of Q and r is larger
  • To get the best precision results, select the largest q and ensure that the intermediate calculation does not overflow.

C program

Fixed point arithmetic can be programmed in C using standard integer arithmetic operations, and shifting is used to change the q-form when needed (usually before or after operations to ensure that the answer is still in the q-form).

To make programming easier to read, a set of C macros has been defined. The following example program defines these macros and explains their use:

Note: compilation needs to add - lm option at the end, such as gcc fixed.c -o fixed -lm

/* A Simple C program to illustrate the use of Fixed Point Operations */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
/* The basic operations perfomed on two numbers a and b of fixed
point q format returning the answer in q format */
#define FADD(a,b) ((a)+(b))
#define FSUB(a,b) ((a)-(b))
#define FMUL(a,b,q) (((a)*(b))>>(q))
#define FDIV(a,b,q) (((a)<<(q))/(b))
/* The basic operations where a is of fixed point q format and b is
an integer */
#define FADDI(a,b,q) ((a)+((b)<<(q)))
#define FSUBI(a,b,q) ((a)-((b)<<(q)))
#define FMULI(a,b) ((a)*(b))
#define FDIVI(a,b) ((a)/(b))
/* convert a from q1 format to q2 format */
#define FCONV(a, q1, q2) (((q2)>(q1)) ? (a)<<((q2)-(q1)) : (a)>>((q1)-(q2)))
/* the general operation between a in q1 format and b in q2 format
returning the result in q3 format */
#define FADDG(a,b,q1,q2,q3) (FCONV(a,q1,q3)+FCONV(b,q2,q3))
#define FSUBG(a,b,q1,q2,q3) (FCONV(a,q1,q3)-FCONV(b,q2,q3))
#define FMULG(a,b,q1,q2,q3) FCONV((a)*(b), (q1)+(q2), q3)
#define FDIVG(a,b,q1,q2,q3) (FCONV(a, q1, (q2)+(q3))/(b))
/* convert to and from floating point */
#define TOFIX(d, q) ((int)( (d)*(double)(1<<(q)) ))
#define TOFLT(a, q) ( (double)(a) / (double)(1<<(q)) )
#define TEST(FIX, FLT, STR) { \
  a = a1 = randint(); \
  b = bi = a2 = randint(); \
  fa = TOFLT(a, q); \
  fa1 = TOFLT(a1, q1); \
  fa2 = TOFLT(a2, q2); \
  fb = TOFLT(b, q); \
  ans1 = FIX; \
  ans2 = FLT; \
  printf("Testing %s\n fixed point answer=%f\n floating point answer=%f\n", \
  STR, TOFLT(ans1, q), ans2); \
int randint(void) {
  int i;
  i = i & 32767;
  if (rand() & 1) i=-i;
  return i;
int main(void) {
  int q=14, q1=15, q2=7, q3=14;
  double fa, fb, fa1, fa2;
  int a,b,bi,a1,a2;
  int ans1;
  double ans2;
  /* test each of the MACRO's with some random data */
  TEST(FADD(a,b), fa+fb, "FADD");
  TEST(FSUB(a,b), fa-fb, "FSUB");
  TEST(FMUL(a,b,q), fa*fb, "FMUL");
  TEST(FDIV(a,b,q), fa/fb, "FDIV");
  TEST(FADDI(a,bi,q), fa+bi, "FADDI");
  TEST(FSUBI(a,bi,q), fa-bi, "FSUBI");
  TEST(FMULI(a,bi), fa*bi, "FSUBI");
  TEST(FDIVI(a,bi), fa/bi, "FSUBI");
  TEST(FADDG(a1,a2,q1,q2,q3), fa1+fa2, "FADDG");
  TEST(FSUBG(a1,a2,q1,q2,q3), fa1-fa2, "FSUBG");
  TEST(FMULG(a1,a2,q1,q2,q3), fa1*fa2, "FMULG");
  TEST(FDIVG(a1,a2,q1,q2,q3), fa1/fa2, "FDIVG");
  printf("Finished standard test\n");
  /* the following code will calculate exp(x) by summing the
  series (not efficient but useful as an example) and compare it
  with the actual value */
  while (1) {
    printf("Please enter the number to be exp'ed (<1): ");
    scanf("%lf", &fa);
    printf(" x = %f\n", fa);
    printf(" exp(x) = %f\n", exp(fa));
    q = 14; /* set the fixed point */
    a = TOFIX(fa, q); /* get a in fixed point format */
    a1 = FCONV(1, 0, q); /* a to power 0 */
    a2 = 1; /* 0 factorial */
    ans1 =0; /* series sum so far */
    for (bi=0 ; bi<10; bi++) { /* do ten terms of the series */
      int j;
      j = FDIVI(a1, a2); /* a^n/n! */
      ans1 = FADD(ans1, j);
      a1 = FMUL(a1, a, q); /* increase power of a by 1 */
      a2 = a2*(bi+1); /* next factorial */
      printf("Stage %d answer = %f\n", bi, TOFLT(ans1, q));
  return 0;

Tags: Programming

Posted on Thu, 26 Mar 2020 07:31:10 -0700 by _theworks