
1 Introduction

The sequence of Fibonacci integers \({\left\langle {f_k}\right\rangle }\) appears often in a broad spectrum of engineering applications including coding theory, cryptography, simulation, and software management. Additionally, Fibonacci numbers are very closely tied to the golden ratio \(\varphi \), which is frequently encountered in nature, for instance in population growth models and in botany. Moreover, in architecture \(\varphi \) is considered almost synonymous with harmony. Thus, the Fibonacci integers are arguably among the most significant integer sequences. Although their defining linear recurrence is simple, serially generating a long sequence of consecutive Fibonacci integers is by no means a trivial task.

However, with the advent of GPU computing, the efficient parallel generation of \({\left\langle {f_k}\right\rangle }\) has become feasible. Indeed, by exploiting known properties of the Fibonacci integers it is possible to build parallel generators which utilize the underlying hardware to a great extent, achieving low response times. This requires specialized scientific software such as TensorFlow, which not only contains very efficient libraries but also facilitates the development of high quality custom source code.

The primary research contribution of this work is twofold. First, it lays the groundwork for two parallel Fibonacci integer generators, one based on a closed form for each number in the sequence and one based on certain linear algebraic properties of Fibonacci integer pairs. The performance of the two proposed generators, developed in TensorFlow for Python, is evaluated in terms of both total turnaround time and FLOPS against a serial implementation built with the same software tools. Second, it explores whether it is worth substituting the computation of known quantities with a lookup table.

This conference paper is structured as follows. Section 2 reviews the current scientific literature regarding the computational aspects of Fibonacci numbers. Their fundamental properties are described in Sect. 3. The TensorFlow implementation is presented in Sect. 4, whereas future research directions are outlined in Sect. 5. Table 1 summarizes the notation of this work: matrices and vectors are denoted by boldface uppercase and boldface lowercase letters respectively, whereas roman lowercase is reserved for scalars.

Table 1. Notation of this conference paper.

2 Previous Work

The Fibonacci sequence of integers, examined among others in [6] and [30], has perhaps the most applications not only in computer science and in engineering but in science as a whole. Closed forms for the spectral norms of circulant matrices whose entries are either Fibonacci or Lucas integers are derived in [21]. In data structure analysis the Fibonacci heap [19] and the associated pairing heap [18] have efficient search and insertion operations with numerous applications such as network optimization. A pair of successive Fibonacci numbers is known to be the worst case input for the Euclidean integer division algorithm, as shown in [9] as well as in [29]. Fibonacci numbers play a central role in estimating task duration and, consequently, task difficulty in scrum-based software engineering methodologies [27, 28], including inaccurate estimation discovery [25] and using agile methodologies to predict student progress [26]. Moreover, Fibonacci numbers are very closely linked to the golden ratio \(\varphi \) as well as to the symmetry of many geometric shapes, the latter having important implications in group theory [16]. Many combinatorial identities regarding Fibonacci numbers can be found in [31] as well as in the more recent works [5] and [23]. Finally, the Lucas sequence \({\left\langle {\ell _k}\right\rangle }\) is closely associated with the Fibonacci sequence \({\left\langle {f_k}\right\rangle }\), since the two integer sequences constitute a Lucas complementary pair and share similar properties such as growth rate and generation mechanism [3, 4].

TensorFlow, originally developed by Google for massive brain circuit simulation, is an open source computational framework whose algorithmic cornerstone is the dataflow paradigm, as described in [1, 2], or [20]. A library for generating Gaussian processes in TensorFlow is presented in [24]. For a TensorFlow genetic algorithm implementation which heuristically discovers community structure, a problem examined in [10], in large multilayer graphs, such as those presented in [12] and in [11], see [15]. In [14] the ways insurance and digital health markets can benefit from blockchain and GPU computing are explored. For a path and triangle based graph resilience metric in TensorFlow see [13]. A very popular front end for the low level TensorFlow is keras, which allows the easy manipulation of neural network layers, including connectivity patterns and activation functions [8]. Model training and prediction generation are also done easily in keras in four stages [22]. Convolutional kernels whose lengths depend on their relative location inside the neural network architecture for computational vision purposes, implemented in keras, are introduced in [7].

3 Fibonacci Numbers

The n-th integer in the Fibonacci sequence \({\left\langle {f_n}\right\rangle }\) is defined as:

$$\begin{aligned} f_n \,=\, f_{n-1}+ f_{n-2}, \quad f_0 \,=\, 0, f_1 \,=\, 1 \end{aligned}$$
(1)
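
For concreteness, a direct serial rendering of the recurrence (1) in Python, with a function name of our choosing, is the following minimal sketch:

```python
def fib_serial(n):
    """Return the first n Fibonacci numbers f_0, ..., f_{n-1} via (1)."""
    seq = []
    a, b = 0, 1          # initial conditions f_0 = 0, f_1 = 1
    for _ in range(n):
        seq.append(a)
        a, b = b, a + b  # f_k = f_{k-1} + f_{k-2}
    return seq
```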

Theorem 1

The n-th Fibonacci number \(f_n\) has the closed form:

$$\begin{aligned} f_n \,=\, \frac{1}{\sqrt{5}}{\left( {\frac{1+\sqrt{5}}{2}}\right) }^{\!n} - \frac{1}{\sqrt{5}}{\left( {\frac{1-\sqrt{5}}{2}}\right) }^{\!n} \,=\, \frac{\varphi ^n - {\left( {1-\varphi }\right) }^{\!n}}{\sqrt{5}} \end{aligned}$$
(2)

Proof

The characteristic polynomial of (1) is:

$$\begin{aligned} z^2 \,=\, z + 1 \,\Leftrightarrow \, z^2 - z -1 \,=\, 0 \end{aligned}$$
(3)

Its two real and distinct roots are:

$$\begin{aligned} r_{0,1} \,=\, \frac{1 \pm \sqrt{5}}{2} \end{aligned}$$
(4)

Therefore, it follows that:

$$\begin{aligned} f_n \,=\, \alpha _0 r_0^n + \alpha _1 r_1^n \,=\, \frac{1}{\sqrt{5}}{\left( {\frac{1+\sqrt{5}}{2}}\right) }^{\!n} - \frac{1}{\sqrt{5}}{\left( {\frac{1-\sqrt{5}}{2}}\right) }^{\!n} \end{aligned}$$
(5)

The constants \(\alpha _0\) and \(\alpha _1\) are computed using the initial conditions derived by the first two Fibonacci numbers as follows:

$$\begin{aligned} \alpha _0 + \alpha _1&\,=\, f_0 \,=\, 0 \\\alpha _0 r_0 + \alpha _1 r_1&\,=\, f_1 \,=\, 1 \end{aligned}$$
(6)

The above conditions yield:

$$\begin{aligned} \alpha _0&\,=\, \frac{1}{r_0 - r_1} \,=\, \frac{1}{\sqrt{5}} \\\alpha _1&\,=\, -\alpha _0 \,=\, \frac{1}{r_1 - r_0} \,=\, -\frac{1}{\sqrt{5}} \end{aligned}$$
(7)

   \(\square \)
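
A minimal numerical sanity check of the closed form (2), written in Python with a function name of our choosing, is shown below; the rounding step is reliable only while \(f_n\) fits comfortably in a double, roughly up to \(n = 70\):

```python
from math import sqrt

PHI = (1 + sqrt(5)) / 2

def fib_closed(n):
    # Binet's formula (2); round() absorbs the floating point error
    return round((PHI ** n - (1 - PHI) ** n) / sqrt(5))

assert [fib_closed(k) for k in range(10)] == [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```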

Another way to prove Theorem 1 is the following:

Proof

Another way to directly compute the n-th Fibonacci number \(f_n\) is to rewrite the Fibonacci definition of Eq. (1), combined with the trivial identity \(f_{n-1} \,=\, f_{n-1}\), in matrix-vector format as follows:

$$\begin{aligned} \begin{bmatrix} f_n \\f_{n-1} \end{bmatrix} \,=\, \begin{bmatrix} 1&1 \\1&0 \end{bmatrix} \begin{bmatrix} f_{n-1} \\f_{n-2} \end{bmatrix} \,=\, {\mathbf {A}}\, \begin{bmatrix} f_{n-1} \\f_{n-2} \end{bmatrix} \,=\, {\mathbf {A}}{\mathbf {f}}_{n-2} \end{aligned}$$
(8)

The eigenvalues \(\lambda _1\) and \(\lambda _2\) (recall that \({{\,\mathrm{det}\,}}\left( {{\mathbf {A}}}\right) \,=\, \lambda _1\lambda _2\) and that \({{\,\mathrm{tr}\,}}\left( {{\mathbf {A}}}\right) \,=\, \lambda _1 + \lambda _2\)) and the corresponding eigenvectors \({\mathbf {e}}_1\) and \({\mathbf {e}}_2\) are:

$$\begin{aligned} \lambda _{1,2} \,=\, \frac{{{\,\mathrm{tr}\,}}\left( {{\mathbf {A}}}\right) \pm \sqrt{{{\,\mathrm{tr}\,}}^2\left( {{\mathbf {A}}}\right) - 4\,{{\,\mathrm{det}\,}}\left( {{\mathbf {A}}}\right) }}{2} \,=\, \frac{1 \pm \sqrt{5}}{2} \end{aligned}$$
(9)
$$\begin{aligned} {\mathbf {e}}_1 \,=\, \frac{1}{\sqrt{1 + \lambda _1^2}} \begin{bmatrix} \lambda _1 \\1 \end{bmatrix} \qquad {\mathbf {e}}_2 \,=\, \frac{1}{\sqrt{1 + \lambda _2^2}} \begin{bmatrix} \lambda _2 \\1 \end{bmatrix} \end{aligned}$$
(10)

Notice that \(\left\| {{\mathbf {e}}_1}\right\| _2 \,=\, 1\) and \(\left\| {{\mathbf {e}}_2}\right\| _2 \,=\, 1\) and, additionally, \({\mathbf {e}}^T_1{\mathbf {e}}_2 \,=\, 0\). From the spectral decomposition of \({\mathbf {A}}\) it follows that:

$$\begin{aligned} {\mathbf {A}}&\,=\, \lambda _1 {\mathbf {e}}_1 {\mathbf {e}}_1^T+ \lambda _2 {\mathbf {e}}_2 {\mathbf {e}}_2^T\\{\mathbf {A}}^{\!n}&\,=\, \lambda _1^n {\mathbf {e}}_1 {\mathbf {e}}_1^T+ \lambda _2^n {\mathbf {e}}_2 {\mathbf {e}}_2^T\\{\mathbf {A}}^{\!n-1} {\mathbf {f}}_0&\,=\, {\mathbf {f}}_{n-1} \,=\, \lambda _1^{n-1} {\left( {{\mathbf {e}}_1^T{\mathbf {f}}_0}\right) } {\mathbf {e}}_1 + \lambda _2^{n-1} {\left( {{\mathbf {e}}_2^T{\mathbf {f}}_0}\right) } {\mathbf {e}}_2 \end{aligned}$$
(11)

Observe that the first element of the vector \({\mathbf {f}}_{n-1}\) is \(f_n\). Since \({\mathbf {f}}_0 \,=\, \begin{bmatrix} f_1&f_0 \end{bmatrix}^T \,=\, \begin{bmatrix} 1&0 \end{bmatrix}^T\), so that \({\mathbf {e}}_i^T{\mathbf {f}}_0 \,=\, \lambda _i / \sqrt{1 + \lambda _i^2}\), it is equal to:

$$\begin{aligned} f_n \,=\, \frac{\lambda _1^{n+1}}{1 + \lambda _1^2} + \frac{\lambda _2^{n+1}}{1 + \lambda _2^2} \end{aligned}$$
(12)

Finally, the structure of \({\mathbf {A}}^{\!n}\) can be shown by induction to be:

$$\begin{aligned} {\mathbf {A}}^{\!n} \,=\, \begin{bmatrix} f_{n+1}&f_n \\f_n&f_{n-1} \end{bmatrix} \end{aligned}$$
(13)

   \(\square \)
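
Identity (13) also suggests an exact integer scheme: computing \({\mathbf {A}}^{\!n}\) by repeated squaring needs only \(O(\log n)\) matrix products. The following Python sketch, with helper names of our choosing, is one way to realize it:

```python
def mat_mult(x, y):
    """Exact 2x2 integer matrix product (Python integers never overflow)."""
    return [[x[0][0] * y[0][0] + x[0][1] * y[1][0],
             x[0][0] * y[0][1] + x[0][1] * y[1][1]],
            [x[1][0] * y[0][0] + x[1][1] * y[1][0],
             x[1][0] * y[0][1] + x[1][1] * y[1][1]]]

def fib_matrix(n):
    """f_n as the off-diagonal entry of A^n, cf. (13), via repeated squaring."""
    result = [[1, 0], [0, 1]]   # identity matrix
    base = [[1, 1], [1, 0]]     # the matrix A of (8)
    while n:
        if n & 1:
            result = mat_mult(result, base)
        base = mat_mult(base, base)
        n >>= 1
    return result[0][1]
```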

Despite form (5), Fibonacci numbers are integers, as is evident from their integer initial conditions and generation mechanism. Another way to see this is the following theorem:

Theorem 2

Fibonacci numbers are integers.

Proof

Applying the Newton binomial theorem to (5) yields:

$$\begin{aligned} f_n \,=\, \frac{1}{\sqrt{5}}\,\sum _{k=0}^n \left( {\begin{array}{c}n\\ k\end{array}}\right) 2^{-n} \underbrace{{\left( {\sqrt{5}^k - {\left( {-\sqrt{5}}\right) }^{\!k}}\right) }}_{\gamma _k} \end{aligned}$$
(14)

When \(k \,\equiv \, 0 \pmod {2}\) then \(\gamma _k\) is zero, whereas if \(k \,\equiv \, 1 \pmod {2}\) then \(\gamma _k\) equals \(2 \cdot 5^{\frac{k}{2}} \,=\, 2\sqrt{5} \cdot 5^{\frac{k-1}{2}}\). Therefore:

$$\begin{aligned} f_n \,=\, \sum _{k} \left( {\begin{array}{c}n\\ k\end{array}}\right) 2^{1-n}\, 5^{\frac{k-1}{2}}, \qquad k \,\equiv \, 1 \pmod {2} \end{aligned}$$
(15)

   \(\square \)
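
Expression (15) can be checked numerically with exact integer arithmetic. A minimal Python sketch follows; note that math.comb requires Python 3.8 or later, whereas the experimental setup below uses Python 3.6, so this is purely illustrative:

```python
from math import comb

def fib_binomial(n):
    """Evaluate (15): f_n = 2^(1-n) * sum over odd k of C(n, k) * 5^((k-1)/2)."""
    total = sum(comb(n, k) * 5 ** ((k - 1) // 2) for k in range(1, n + 1, 2))
    return total // 2 ** (n - 1)   # the division is exact

assert [fib_binomial(n) for n in range(1, 12)] == [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
```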

4 TensorFlow Implementation

As stated earlier, TensorFlow relies on the dataflow algorithmic paradigm, which essentially utilizes a potentially large operations graph in order to break down the desired computational task into smaller, manageable components. Dataflow graphs have the following properties, illustrated by the minimal sketch after the list:

  • Vertices represent a wide array of mathematical operations including advanced ones such as eigenvector computation, singular value decomposition, and least squares fitting.

  • Edges describe the directed data flow between operation results.

  • The operands between the various graph operations are tensors of arbitrary dimensions, as long as their shapes are compatible.
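
A minimal graph in the TensorFlow 1.x style used here, with constant vertices feeding a matrix multiplication vertex, might look as follows; the names are ours:

```python
import tensorflow as tf  # TensorFlow 1.x graph mode, as in the setup below

# Two edges carry the 2x2 tensor A into a matmul vertex
a = tf.constant([[1.0, 1.0], [1.0, 0.0]], name="A")
a_squared = tf.matmul(a, a, name="A_squared")

with tf.Session() as sess:       # executing the graph triggers the dataflow
    print(sess.run(a_squared))   # [[2. 1.], [1. 1.]]
```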

TensorFlow r1.12 was installed on Ubuntu 18.04 LTS for Python 3.6 using the pip package installer. An NVIDIA Titan Xp GPU based on the Pascal architecture was available in the system and was successfully discovered by TensorFlow as gpu0.

Three Fibonacci generators were implemented in total. Each such generator yields a batch consisting of the first n consecutive Fibonacci integers. In the experiments, n ranged from 2 to 1024 with an exponentially increasing distance between two successive batch sizes. Since the particular GPU has 3840 CUDA cores, the parallelism potential is high. The generators are:

  • A serial implementation consisting of a single loop which adds one new Fibonacci number with each pass.

  • A parallel implementation which directly computes the k-th element of the sequence \({\left\langle {f_k}\right\rangle }\) based on Eq. (11).

  • A second parallel implementation which relies on the slightly simpler closed expression of Eq. (2); a sketch of this generator is given after the list.
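
The paper's source code is not reproduced here, so the following TensorFlow 1.x sketch of the closed-form generator is our own, under the assumption that a whole batch is produced by vectorized exponentiation:

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API, matching the r1.12 setup

SQRT5 = np.sqrt(5.0)

def fib_batch(n):
    """Graph yielding the first n Fibonacci numbers via the closed form (2).
    Each exponentiation is independent, so the batch maps onto CUDA cores."""
    k = tf.range(n, dtype=tf.float64)                        # 0, 1, ..., n-1
    phi = tf.constant((1.0 + SQRT5) / 2.0, dtype=tf.float64)
    psi = tf.constant((1.0 - SQRT5) / 2.0, dtype=tf.float64)
    return tf.round((tf.pow(phi, k) - tf.pow(psi, k)) / SQRT5)

with tf.Session() as sess:
    print(sess.run(fib_batch(10)))   # [0. 1. 1. 2. 3. 5. 8. 13. 21. 34.]
```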

Figure 1 shows the total number of floating point operations which were required for each batch size. The values are the arithmetic mean of ten executions for each batch size. Preceding each such execution there was a trial run, not taken into consideration in the final results, which served the single purpose of loading the data into the system cache. Since the serial implementation is a loop, the number of additions is linear in n. On the contrary, the parallel implementations require a number of auxiliary floating point operations, most notably the exponentiation of certain parameters. Thus, they require more operations, whose number is a polynomial, approximately quadratic, function of the batch size, with the second generator clearly always being more expensive in terms of operations.

Fig. 1. Number of operations vs batch size of Fibonacci numbers.

However, the fact that a generator requires more floating point operations does not necessarily make it slower in terms of total execution time. Instead, the results shown in Fig. 2 indicate that both parallel generators achieve considerably lower wallclock execution times in milliseconds. Since the computation of \({\left\langle {f_k}\right\rangle }\) is a GPU-bound process and the design of both parallel generators entails very little communication across the memory hierarchy, ordinary wallclock time in this case consists almost entirely of time spent on actual computation.

Notice that, unlike the previous figure, no parallel implementation appears to be ideal for every batch size. Specifically, the second implementation is better for smaller batches, whereas the first one becomes preferable as the batch size grows, despite its more complex formula. This can be attributed to the fact that the second generator achieves more locality, as certain parameters are common across each batch.

Fig. 2. Wall clock time (msec) vs batch size of Fibonacci numbers.

This difference in performance can also be seen by dividing the number of floating point operations by the wallclock time, yielding an approximation of the FLOPS for each generator, as seen in Fig. 3. Although by no means a single absolute benchmark, especially within a parallel computation context, FLOPS are in this case indicative since the algorithmic core is purely computational. The difference from the serial implementation is now obvious, as is the fact that the second generator performs better for large batches.

Fig. 3. Approximate FLOPS vs batch size of Fibonacci numbers.

Notice that both parallel implementations involve many consecutive powers of known quantities. In order to save floating point operations, a lookup table could have been used, according to the design principles found for instance in [17]. In order to evaluate the impact of relying on a lookup table on the total FLOPS of the two parallel generators, two variants of each were also implemented. The first variant stored locally half of the required known quantities, whereas the second stored only one quarter of them. In both cases spline interpolation was used to reconstruct the missing values, so that result accuracy remained comparable for fairness reasons. As can be seen from Fig. 4, the introduction of a lookup table lowered the FLOPS of both generators. This can be explained by the facts that TensorFlow has an efficient multiplication algorithm, especially for large numbers, and that frequently accessing the shared GPU memory eventually slowed computations down.
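
The lookup code itself is not listed in the paper; a rough sketch of the half-table variant, under our own assumptions and with scipy's CubicSpline standing in for whatever spline routine was actually used, could be:

```python
import numpy as np
from scipy.interpolate import CubicSpline  # assumed stand-in spline routine

PHI = (1.0 + np.sqrt(5.0)) / 2.0
stored_k = np.arange(0, 64, 2, dtype=np.float64)  # keep every other exponent
table = PHI ** stored_k                           # the half-size lookup table

# Interpolate log(phi^k) = k*log(phi), which is linear in k, so the spline
# reproduces the missing odd exponents accurately
spline = CubicSpline(stored_k, np.log(table))

def phi_power(k):
    """phi^k, read from the table when stored and interpolated otherwise."""
    if k % 2 == 0 and 0 <= k < 64:
        return float(table[int(k) // 2])
    return float(np.exp(spline(k)))
```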

Fig. 4. Approximate FLOPS for two lookup table options.

5 Conclusions and Future Work

This conference paper presented two parallel Fibonacci integer generators for TensorFlow running over Python 3.6 and an NVIDIA Titan Xp GPU and discussed certain implementation aspects. Both generators yield batches of n integers, with n ranging from 2 to 1024. The maximum batch size is smaller than the number of cores in the GPU, thus increasing the parallelism potential. The baseline was a serial TensorFlow implementation consisting of a single loop generating one new Fibonacci integer in each pass.

The primary finding of the experiments is the superior performance of parallel generators. Although requiring more floating point operations in total, both parallel implementations outperform the baseline in terms of wallclock time and of FLOPS. This is attributed to the efficient use of parallelism. A secondary finding was that replacing actual computations with frequent accesses to a shared lookup table led to lower FLOPS values. This can be explained by the latency caused by a large number of threads asking for the same information.

Concerning future research directions, there are several options which can be followed. Experiments with larger batch sizes should be conducted, especially with sizes which exceed the number of GPU cores. Additionally, more algorithmic schemes should be tested, such as those constructing Fibonacci integers bitwise, as they may lead to generators with higher parallelism. Finally, the performance of the proposed algorithms on other parallel architectures can be evaluated, in order to understand whether a given hardware architecture is more appropriate for particular algorithmic principles.