1. Introduction
The use of different number types in machine learning applications has been analyzed extensively in recent years, more specifically in deep learning neural networks [1,2]. These kinds of neural networks use the convolution as their basic function, have thousands of parameters, and must be trained first; that is, the network must be optimized by modifying all the parameters to obtain a local minimum of the goal function. The optimization step is called training and it can take hours on modern general purpose graphics processing unit (GPGPU) hardware. A special type of number, Brain Floating Point (bfloat16), which is a half-precision FP format of 16 bits with the same range as the usual single precision FP numbers (float in the C programming language, 32 bits long), has been proposed for training deep learning neural networks [2]. Other FP numbers of 16 bit length are the so-called FP16 numbers, an IEEE standard [1,2] for half-precision FP numbers that can be used on ARM processors.
The goal of using different, shorter numbers in machine learning applications is to improve speed, and as a consequence to reduce power consumption, since it takes less time to train a deep learning network, and also to reduce the memory or disk storage needed for the variables. In [1] it is mentioned that half precision is also attractive for accelerating general purpose scientific computing, such as weather forecasting, climate modeling, and the solution of linear systems of equations. The supercomputer Summit (it was in the Top 500 list, https://www.top500.org (accessed on 3 February 2021)) has a peak performance of 148.6 petaflops in the LINPACK benchmark, a benchmark that employs only double precision. For a genetics application that uses half precision, the same machine has a peak performance of 2.36 exaflops [1].
In this work it is proposed to analyze the well known heuristic for single objective optimization, the differential evolution (DE) algorithm, under FP16 numbers, and also under fixed point arithmetic that uses integer numbers of different lengths. This analysis is important if we think of optimization algorithms embedded within a chip [3] that performs a dedicated task; one constraint in these kinds of applications is that the power consumption must be as low as possible. It is also important if one designs a dedicated algorithm in hardware, as in FPGAs (Field Programmable Gate Arrays), to accelerate its execution. Another possible application is to execute a fast and small DE inside each core of a GPGPU. These three application scenarios justify the analysis of DE performed in this work.
The rest of this article is organized as follows: in Section 2 a very brief description of fixed point arithmetic and FP numbers is given. In Section 3 the DE algorithm is analyzed to identify which parts could be improved by using other number types. In Section 4 the experiments and their results are described. In Section 5 the results are discussed. Finally, in Section 6 some conclusions are presented.
2. Fixed Point Arithmetic and Floating Point Numbers
The notation $a.b$ will be used here to represent a set of integer numbers that uses $a$ bits in the integer part and $b$ bits in the fractional part. Each number is of size $a + b$ bits (plus the sign bit).
For a number $x$ in $a.b$ notation, the range of numbers that can be represented is:
$$-2^a \le x \le 2^a - 2^{-b}. \tag{1}$$
Summing up two $a.b$ numbers results in an $(a+1).b$ number [4]. The multiplication of two $a.b$ numbers results in a $(2a).(2b)$ number [4]. It is possible to verify these results by applying the respective operation to two extreme numbers in (1).
Microprocessors offer the sum and multiplication of two integer numbers where the result is stored in a number of the same size as the operands. In a hardware design for a given application, one must use a big enough $(a+1).b$ number to store the sum of two $a.b$ numbers, and the $(2a).(2b)$ result of multiplying two $a.b$ numbers must be returned to an $a.b$ number. The easiest way to perform this is by truncating the result: the result is shifted $b$ bits to the right, and again the destination number must be big enough to store the resulting value. On a PC, if one uses 32 bit integer numbers, the first bit is the sign bit, and then one can multiply two $a.b$ values only if $2(a+b) \le 31$, to keep the result within the available 31 bits. In a typical application, one normally does not check whether the used numbers can hold the results of the operations applied to them, and one trusts that the numbers are big enough to store the results.
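As an illustration of the preceding paragraphs, the following minimal C sketch implements the $a.b$ sum and the truncated multiplication with 32 bit words and a 64 bit intermediate product; the 3.28 format and the test values are illustrative choices only.

#include <stdint.h>
#include <stdio.h>

#define B 28  /* b: bits in the fractional part (3.28 format) */

typedef int32_t fixed_t;

/* Conversions to and from a.b format, for testing purposes only. */
static fixed_t to_fixed(double x)   { return (fixed_t)(x * (1 << B)); }
static double  to_double(fixed_t x) { return (double)x / (1 << B); }

/* The sum is the plain integer sum; the caller must guarantee that
 * the (a+1).b result still fits in the 32 bit word. */
static fixed_t fx_add(fixed_t x, fixed_t y) { return x + y; }

/* The (2a).(2b) product is computed in a 64 bit intermediate and
 * truncated: shifted b bits to the right (this assumes an arithmetic
 * right shift for negative values, as in common compilers). */
static fixed_t fx_mul(fixed_t x, fixed_t y)
{
    return (fixed_t)(((int64_t)x * (int64_t)y) >> B);
}

int main(void)
{
    fixed_t p = to_fixed(1.5), q = to_fixed(-0.25);
    printf("%f %f\n", to_double(fx_add(p, q)),   /* prints 1.250000 */
                      to_double(fx_mul(p, q)));  /* prints -0.375000 */
    return 0;
}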
The sum and multiplication of two integer numbers are the fastest operations because each one is built into the hardware and takes a single clock cycle.
The sum and multiplication of two FP numbers are totally different. An FP number is composed as $s \times 2^e$, where $s$ is the significand and $e$ the exponent. If $p$ bits are used for the significand, it is an integer that can take values from 0 to $2^p - 1$. The exponent $e$ is an integer number too. The sum of two FP numbers is carried out by first expressing both numbers with the same exponent, and then summing both significands. The greater exponent of the two numbers is used to express them with the same exponent. The result must be rounded to fit the number of bits used in the significands. Also, the result may need to be normalized, which means the significand is shifted (adjusting the exponent accordingly) so that its most significant bit is a 1.
The multiplication takes more steps because two numbers $s_1 \times 2^{e_1}$ and $s_2 \times 2^{e_2}$ are multiplied as $s_1 s_2 \times 2^{e_1 + e_2}$: the significands are multiplied and the exponents are summed; the product of the significands is rounded and the final result is normalized.
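The following toy C sketch (not IEEE compliant: it truncates instead of rounding) illustrates these multiplication steps on the $s \times 2^e$ representation; the 10 bit significand size only echoes FP16 and is otherwise an arbitrary choice.

#include <stdint.h>

#define P 10  /* p: bits in the significand */

typedef struct { uint32_t s; int e; } toyfp_t;  /* value = s * 2^e */

static toyfp_t toy_mul(toyfp_t x, toyfp_t y)
{
    uint64_t prod = (uint64_t)x.s * y.s;  /* up to 2p bits long      */
    int e = x.e + y.e;                    /* exponents are summed    */
    while (prod >= (1u << P)) {           /* renormalize to p bits,  */
        prod >>= 1;                       /* truncating the low bits */
        e += 1;                           /* instead of IEEE rounding */
    }
    return (toyfp_t){ (uint32_t)prod, e };
}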
In the IEEE 754 standard [5], an FP number has a sign bit $i$, and the represented (normalized) number is equal to $(-1)^i \times 1.s \times 2^{e - \mathrm{bias}}$, where $s$ is the stored fraction and $e$ the stored exponent. The parameter values used in common FP numbers are shown in Table 1.
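As a worked example, assuming the usual FP16 layout from Table 1 (1 sign bit, 5 exponent bits with a bias of 15, and 10 fraction bits), the bit pattern 0x3C00 has $i = 0$, exponent field $01111_2 = 15$, and an all-zero fraction, so it represents
$$(-1)^0 \times 1.0000000000_2 \times 2^{15 - 15} = 1.0.$$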
Floating point operations take more than a clock cycle within a microprocessor.
The IEEE 754 standard [5] covers many more aspects that are necessary to work with FP numbers, such as rounding methods, Not a Number (NaN) values, infinities, and how to handle exceptions. In [6] all these details about FP numbers are explained.
3. DE Analysis
DE is a heuristic used for global optimization over continuous spaces. DE solves problems of the form:
$$\min_{\vec{x}} f(\vec{x}), \tag{2}$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is the function to optimize; $\vec{x} \in \mathbb{R}^n$, that is, the problem has $n$ variables; and also we could have $g_i(\vec{x}) \le 0$, $i = 1, 2, \ldots, m$, inequality constraints, and $h_j(\vec{x}) = 0$, $j = 1, 2, \ldots, q$, equality constraints. The solution to the problem, $\vec{x}^*$, lies in a subset $S$ of the whole search space where the constraints are satisfied; this space $S$ is called the feasible space.
Also, the search space contains the feasible space and is defined by the box constraints:
$$x_k^{\min} \le x_k \le x_k^{\max}, \quad k = 1, 2, \ldots, n. \tag{3}$$
That is, each variable is searched in the interval defined by the lower bound value $x_k^{\min}$ and the upper bound value $x_k^{\max}$, for $k = 1, 2, \ldots, n$.
Constraints can be incorporated into the problem (2) by modifying the objective function as:
$$f'(\vec{x}) = f(\vec{x}) + w_g \sum_{i=1}^{m} \max\left(0, g_i(\vec{x})\right) + w_h \sum_{j=1}^{q} \left| h_j(\vec{x}) \right|. \tag{4}$$
Now $f'$ will be optimized instead of $f$ in (2). The coefficients $w_g$ and $w_h$ in (4) represent the penalty coefficients that weigh the relative importance of each kind of constraint.
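A minimal C sketch of the penalized objective in (4) follows; the tiny two-variable problem, its constraints, and the weight values are hypothetical placeholders, not a problem from this paper.

#include <math.h>   /* fabs */
#include <stdio.h>

/* A tiny example problem, purely illustrative. */
static double f(const double *x)  { return x[0] * x[0] + x[1] * x[1]; }
static double g0(const double *x) { return 1.0 - x[0]; }        /* g(x) <= 0 */
static double h0(const double *x) { return x[0] + x[1] - 2.0; } /* h(x) = 0  */

/* Penalized objective f' of (4): violated inequalities and any
 * deviation from the equalities add a penalty weighted by wg and wh. */
static double f_prime(const double *x, double wg, double wh)
{
    double v = f(x);
    double gi = g0(x);
    if (gi > 0.0) v += wg * gi;   /* max(0, g_i(x)) term */
    v += wh * fabs(h0(x));        /* |h_j(x)| term */
    return v;
}

int main(void)
{
    double x[2] = { 0.5, 0.5 };
    printf("%f\n", f_prime(x, 100.0, 100.0)); /* 0.5 + 50 + 100 = 150.5 */
    return 0;
}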
One important point about DE is that the heuristic only needs to evaluate the problem to solve it. Classical mathematical optimization methods use the first, and perhaps also the second, derivative of the given problem. These derivatives are easy to obtain if one has the mathematical expression of the given problem at hand. It is possible to approximate the derivatives numerically, but at a very high computational cost [7].
According to the tests in the CEC 2005 conference [8], DE is the second best heuristic for solving real parameter optimization problems when the number of parameters is around 10.
The DE pseudocode is shown in Algorithm 1.
Algorithm 1 Differential evolution algorithm (rand/1/bin version)

Require: The search space and the value $v$ for the stop condition. The values for population size, $N_p$; maximum number of generations, $g$; difference and recombination constants, $F$ and $R$, respectively.
Ensure: A solution $\vec{x}_b$ of the minimization problem
1: Initialize each individual $\vec{x}_i$, $i = 1, 2, \ldots, N_p$, with random values with uniform distribution within the search space
2: Evaluate $f(\vec{x}_i)$, for $i = 1, 2, \ldots, N_p$
3: $t \leftarrow 0$
4: repeat
5:   for $i \leftarrow 1$ to $N_p$ do
6:     Let $r_1$, $r_2$, and $r_3$ be three random integers in $[1, N_p]$, such that $r_1 \ne r_2 \ne r_3 \ne i$
7:     Let $l$ be a random integer in $[1, n]$
8:     for $j \leftarrow 1$ to $n$ do
9:       if $\mathrm{rand}() < R$ or $j = l$ then $u_j \leftarrow x_{r_1,j} + F(x_{r_2,j} - x_{r_3,j})$ else $u_j \leftarrow x_{i,j}$
10:      if $u_j < x_j^{\min}$ or $u_j > x_j^{\max}$ then ▹ Check bounds
11:        $u_j \leftarrow$ a random value with uniform distribution in $[x_j^{\min}, x_j^{\max}]$
12:      end if
13:    end for
14:    if $f(\vec{u}) < f(\vec{x}_i)$ then
15:      $\vec{x}_i \leftarrow \vec{u}$
16:    end if
17:  end for
18:  $b \leftarrow 1$, $w \leftarrow 1$
19:  for $i \leftarrow 2$ to $N_p$ do
20:    if $f(\vec{x}_i) < f(\vec{x}_b)$ then
21:      $b \leftarrow i$
22:    end if
23:    if $f(\vec{x}_i) > f(\vec{x}_w)$ then
24:      $w \leftarrow i$
25:    end if
26:  end for
27:  $t \leftarrow t + 1$
28: until $t \ge g$ or $f(\vec{x}_w) - f(\vec{x}_b) < v$
DE works with a population composed of a set of individuals, or vectors, of real numbers. All vectors are initialized with random numbers with a uniform distribution within the search bounds of each parameter (line 1 in Algorithm 1). For a certain number of iterations (line 4) the population is modified, and the modified individuals can replace the original ones. The core of DE is in the loop on lines 8–13: a new individual is generated from three different individuals chosen randomly; each value of the new vector (which represents a new individual) is calculated from the first father, plus the difference of the other two fathers multiplied by $F$, the difference constant; the new vector value is calculated only if a random real number (between zero and one) is less than $R$, the DE's recombination constant. To prevent the case where the new individual could be equal to the current father $i$, at least one vector component (a variable value) is forced to be calculated from the random fathers' values: this happens in line 9 of the pseudocode when $j = l$, where $l$ is an integer random number between 1 and $n$. In lines 10–12 it is checked whether each combined variable value is within the search space. Then the new individual is evaluated, and if it is better than the father (lines 14–16), the child replaces its father. The stop condition used here is: the number of iterations is greater than a maximum number of iterations, or the difference between the objective function values of the worst and best individuals is less than $v$. This stop condition is called the diff criterion in [9], and is recommended for a global optimization task.
A general form to set the parameter values for DE is: if $d$ is the number of variables, the population size is set as a multiple of $d$, and recommended fixed values are used for the constants $F$ and $R$ [9].
The DE in Algorithm 1 can be improved by using a random integer number generator such as the one described in [10], which does not use divisions or FP numbers. This idea could improve the algorithm in line 6 (to generate three numbers in the interval $[1, N_p]$), and in line 7, where another random integer is generated in the interval $[1, n]$. Also, the values for $F$ and $R$ are within the interval $[0, 1]$, and usually no more than one or two decimal digits are used for these constants, thus these values are not affected by using half precision numbers (see Table 1). Even more, totally integer arithmetic could be used in the comparison $\mathrm{rand}() < R$ (in line 9 in Algorithm 1) if instead a random integer is compared against an integer threshold $R_I$, with $R_I$ equal to the constant $R$ scaled to the integer range of the generator.
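As a sketch of this integer-only comparison, the constant $R$ can be scaled once to a 32 bit threshold and compared directly against the raw output of an integer generator; xorshift32 is used here only as a convenient stand-in for the generator of [10].

#include <stdint.h>

/* xorshift32: a stand-in integer generator (state must be nonzero). */
static uint32_t rng_state = 2463534242u;
static uint32_t xorshift32(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/* R = 0.9 scaled once to the 32 bit integer range. */
#define R_I ((uint32_t)(0.9 * 4294967296.0))

/* Integer-only replacement for rand() < R in line 9 of Algorithm 1. */
static int recombine(void) { return xorshift32() < R_I; }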
Two implementations of DE were used in this work: one with fixed point arithmetic, and another using FP16 numbers. The implementation with fixed point arithmetic uses integer numbers (of 32 bits) for all the variables. The implementation using FP16 numbers uses half precision floats (FP16, 16 bits) for all the variables. In this paper a computer with a 64 bit architecture was used, so the multiplication of two integers was stored in a long type variable of 64 bits, then shifted and truncated to an integer of 32 bits. The core part of DE (lines 8–13 in Algorithm 1) calculates the selected and mutated vector $\vec{u}$ as:
$$u_j = x_{r_1,j} + F \left( x_{r_2,j} - x_{r_3,j} \right), \tag{5}$$
for $j = 1, 2, \ldots, n$, that is, for each variable of the given problem. Thus, one subtraction ($x_{r_2,j} - x_{r_3,j}$), followed by one multiplication (by the constant $F$) and one summation (with $x_{r_1,j}$), are needed to calculate the new vector $\vec{u}$. The greatest value for $F$ could be 1; if the search interval is the same for all variables, the result in (5) could be up to double the bound value of the search space.
Then, the maximum possible values in the search space could be double the bound values of the search space. Another problem is finding the maximum possible value in the function space. Also, it is not clear how many bits are necessary in the fractional part for the fixed point arithmetic. These items are resolved in the following section.
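Under the description above, the DE core (5) in fixed point arithmetic reduces to the following C sketch; the 11 bit fractional part and the value of $F$ are illustrative choices.

#include <stdint.h>

#define B 11                              /* b: bits in the fractional part */
#define F_FX ((int32_t)(0.9 * (1 << B)))  /* F = 0.9 in a.b format */

/* One component of (5): u_j = x_{r1,j} + F (x_{r2,j} - x_{r3,j}).
 * The product is kept in 64 bits, then shifted and truncated back
 * to 32 bits (assuming an arithmetic right shift for negatives). */
static int32_t de_core(int32_t xr1, int32_t xr2, int32_t xr3)
{
    int32_t diff = xr2 - xr3;             /* one subtraction     */
    int64_t prod = (int64_t)F_FX * diff;  /* one multiplication  */
    return xr1 + (int32_t)(prod >> B);    /* one summation       */
}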
4. Experiments with Three Multimodal Functions in 10 Dimensions
Three very well known benchmark functions were used: shifted versions of the Rosenbrock, Rastrigin, and Ackley functions in 10 dimensions. All these functions are multimodal, which justifies solving them with the DE heuristic. The Rosenbrock function used here is a shifted version of the classical one, with its minimum value of 0.39 located at the shift point. The Rastrigin and Ackley functions used are also shifted versions of the classical ones, with their minima located at the corresponding shift points. These three functions are scaled with respect to the three defined in [11] in order to keep their amplitudes within the range of half precision FP numbers (see Table 1). A summary of these three functions is given in Table 2.
All functions were programmed in single precision FP (float in C) arithmetic.
The numbers of bits used for the integer and fractional parts in the fixed point simulations are shown in Table 3. The number of bits in the integer part is set according to Table 2, because the maximum number in the third column of Table 3 must be greater than the maximum extreme value shown in Table 2.
The resulting statistics for the simulations, using 100 runs per number of bits in the fractional part and for FP16 arithmetic, are shown in Table 4, Table 5 and Table 6 for the Rosenbrock, Rastrigin, and Ackley functions, respectively. Those tables show the statistics for the number of generations and the obtained function values. The numbers of bits used in the integer part are shown in Table 3. These numbers of bits were calculated from the data in Table 2; for example, for the Rosenbrock function in Table 2 the maximum obtained function value is 10891.29, thus the integer part must be able to represent a number greater than this value, therefore 14 bits were selected because $2^{14} = 16384 > 10891.29$. The corresponding variable values at the minimum of each function for the FP16 simulations are shown in Table 7. The obtained mean value in the FP16 simulation for the Rosenbrock function is 0.391538 (see the end of the sixth column in Table 4). The equivalent mean for the fixed point arithmetic is 0.391079 with 11 bits in the fractional part; the associated variable values for this simulation with 11 bits are also shown in Table 7. The same procedure was repeated for the results of the Rastrigin and Ackley functions, which are also shown in Table 7.
5. Discussion
With the simulation results shown in Table 4, Table 5, Table 6 and Table 7, it is confirmed that the DE heuristic can be executed in fixed point arithmetic or with half precision FP numbers.
As one can see in Table 4, Table 5 and Table 6, not all the bits in the fractional part are necessary for a given application. From Table 7, the same results as with FP16 numbers can be obtained with the fixed point formats 14.11, 7.12, and 5.11 for the scaled Rosenbrock, Rastrigin, and Ackley functions, respectively.
Now consider the precision obtained in the solution using FP16 or integer arithmetic. The machine epsilon is defined as the smallest value $\epsilon$ such that $1 + \epsilon > 1$ in the given arithmetic. For most modern microprocessors (which use two's complement arithmetic), the machine epsilon value for each data type is shown in Table 8.
The number of precision bits is one bit more than the magnitude of the exponent of epsilon for the floating point types, and equal to the number of bits used in the fractional part for integer arithmetic.
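The floating point entries of Table 8 can be reproduced with a small loop like the following C sketch (shown for float; an FP16 variant would need compiler support such as _Float16, which is not universally available).

#include <stdio.h>

int main(void)
{
    volatile float one_plus;  /* volatile guards against excess precision */
    float eps = 1.0f;
    for (;;) {
        one_plus = 1.0f + eps / 2.0f;
        if (one_plus == 1.0f) break;  /* eps/2 no longer changes 1 */
        eps /= 2.0f;
    }
    printf("float machine epsilon = %g\n", eps);  /* 2^-23, about 1.19e-7 */
    return 0;
}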
Roughly, one cannot expect a result in an optimization problem beyond the precision of the machine epsilon. Thus, using FP16 numbers will give a precision in the result of at most $2^{-10} \approx 9.8 \times 10^{-4}$. Using an integer number in $a.b$ format, the result will have at most a precision of $2^{-b}$. This also means that using FP16 numbers the heuristic, DE in this case, will finish earlier compared to using single or double precision floating point numbers. In the experiments in this work the DE's stop condition was set equal to 0.0001. It is expected that using a smaller stop condition the heuristic will finish in more generations, but then it is necessary to change to other number types.
One possible application of using FP16 numbers or integer arithmetic could be to first obtain a low precision result, within the precision given by the used number types (see Table 8). If greater precision is required, a traditional mathematical algorithm, such as Newton's method, could then be used, with the previously obtained low precision solution as its starting point. Of course, if FP16 numbers or integer arithmetic are used alone, the application has to work at the precision given by those number types. In any case, this behavior must be analyzed in advance for the given application.
For all the simulations, the DE's stop condition was set equal to 0.0001. This number in 3.28 notation is equal to 0x000068db (a hexadecimal number of 32 bits), which can conveniently be written with the binary point as 0x0.00068db. The 13 most significant bits after the binary point are all zeros; thus the stop condition is equal to zero when 13 or fewer bits are used in the fractional part, as one can confirm in Table 4 and Table 6, where the simulations run for the maximum number of iterations and the stop condition never triggers for 13 or fewer fractional bits.
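This conversion can be verified with a one-line computation; the following C sketch prints the 3.28 representation of the stop condition.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* 0.0001 scaled by 2^28 and truncated: the 3.28 representation */
    int32_t v = (int32_t)(0.0001 * (1 << 28));
    printf("0x%08x\n", (uint32_t)v);  /* prints 0x000068db */
    return 0;
}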
For the use of fixed point arithmetic in DE, it is critical to know in advance the range of values of the function to optimize. Here, the extreme values of the search space were used to determine those quantities. In a practical task, one could evaluate the function to optimize at the extremes, and perhaps at other points on a very coarse grid. The same procedure should be applied in order to use FP16 numbers.
The DE core (in Algorithm 1) uses one difference and one multiplication, thus there is no numerical problem in using it with fixed point arithmetic or FP16 numbers.
A naive implementation of fixed point arithmetic with a word length of 32 bits is not required in general. As one can see in Table 4, the same results are obtained using 14–17 bits in the fractional part for the Rosenbrock function. The same applies to the results in Table 5 for the Rastrigin function with 11–24 bits, and in Table 6 for the Ackley function with 13–26 bits in the fractional part.
A future work will be the hardware design of DE, which should include a random number generator that can be optimized to use the generated bits directly, without FP divisions, as is suggested in [10]. This design idea could also be used in software within each core of a GPGPU. Another interesting idea is to incorporate a random number generator based on chaos [12], which is easy to implement.