CentraleSupélecDépartement informatique
Plateau de Moulon
3 rue Joliot-Curie
F-91192 Gif-sur-Yvette cedex
1CC1000 - Information Systems and Programming - Lab: Introduction to Matplotlib and Numpy

Table des matières

The objective of this lab is to give an introduction to visualization with Matplotlib and scientific computing with Numpy. We'll implement a simple function with different programming methods, measure the performances of each and visualize the results.

Classic programming methods

Recursion

Develop a function fibonacci_rec(n) that computes the value of the n-th term of the Fibonacci sequence. This function must be recursive. The expected results for the first values are: {$ 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181... $}
Remark for advanced students : do not use memoization.

A recursive function is a function that calls itself to solve a problem. It consists of two main parts:

1. Base case. This is the condition that stops the recursion. Without a base case, the function would call itself indefinitely. The base case typically handles the simplest instance of the problem, returning a direct answer. For example, in Fibonacci, F(0) = 0 and F(1) = 1 are base cases.

2. Recursive case, where the function calls itself on a smaller or simpler version of the problem, gradually reducing the problem until the base case is reached.

Memoization is a technique to store previously computed results, which avoids redundant calculations and makes the recursion much more efficient.


ANSWER ELEMENTS

Recall the definition of recursive functions.

def fibonacci_rec(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    else:
        return fibonacci_rec(n-1)+fibonacci_rec(n-2)


Time analysis

The timeit function of the package bearing the same name is used to measure the execution time of Python code. In the following example, timeit returns the total time of 100 executions of the Python code to smooth the results.

Execution times may vary from machine to machine; you should expect an execution time of a few thousandths of a second. Try it.

from timeit import timeit
timeit(lambda : fibonacci_rec(10), number=100)

We defined a function measure() that determines the execution time of a function for a sequence of values.
When we measure the performances of fibonacci() with different values of {$n$}, we should observe a high execution time for values beyond {$n = 20$}.

def measure(func, values, number=100):
    m = []
    for i in values :
        m.append(timeit(lambda : func(i), number=number))
    return m

measure(fibonacci_rec, [1, 2, 3, 5, 10, 20, 25])

When a function calls itself (recursion), the computer needs to remember where to come back after each call. This is done using the call stack, which is an area of the computer's memory. In fact, the call stack is used when any function, including non-recursive, is called.

Here is how recursion is implemented under the hood:

1. Each recursive call creates a small "box" in memory called a stack frame. The stack frame stores the function's parameters, local variables, and the place to return to (the return memory address) after the call.

2. Every time the function calls itself, a new stack frame is added on top of the stack. The stack works last-in, first-out: the most recent call is completed first.

3. When the base case is reached, the function starts returning values, popping each stack frame off the stack in reverse order.

Since each process is given a limited portion of memory by the operating system, if there are too many recursive calls, the stack can overflow.

Iteration

Develop a function fibonacci_iter(n) that computes the value of the n-th term of the Fibonacci sequence. This function must be iterative.
Measure the execution time of fibonacci_iter(n). It should be significantly faster than the recursive version.


ANSWER ELEMENTS

def fibonacci_iter(n):
    if n == 0:
        return 0
    else:
        f = [0, 1]
        for i in range(2, n + 1):
            f.append(f[-1] + f[-2])
        return f[n]

Visualization

We'll now plot these values. We use the pyplot module from the matplotlib library. The following instructions plot a series of points defined by two lists of coordinates x and y.

import matplotlib.pyplot as plt
plt.plot(xvalues, yvalues, label="name for legend")
plt.legend()
plt.show()

The graph should always contain a title and a legend, have named axes, the scale of the y-axis must be logarithmic, the scale of the x-axis must start at 0. Here are some useful functions.

plt.yscale("log")
plt.xlabel("Name")
plt.ylabel("Name")
plt.title("Name")

Be inclusive, don't forget color-blind people (about 1 in 20 people) and choose line styles that can be told apart without using colors. Line styles can be set with the following codes: ":", "-", "-." and "--".

plt.plot(xvalues, yvalues, "--", label="Dashed line")

Put it all together

Add a main section to your Python script as follows:

if __name__ == "__main__":
    # measure execution times
    # plot performance curves

and add the instructions to measure that execution time of fibonacci_rec() and fibonacci_iter(), and plots the performance curves.

if __name__ == "__main__" is a special Python construct that checks whether a script is being run directly or imported as a module.

If the script is run directly, the code inside this block is executed.

If the script is imported in another module, the code inside the block is not executed.

This allows you to write code that can be tested or executed when run as a program, without running automatically when imported.


ANSWER ELEMENTS

import matplotlib.pyplot as plt

if __name__ == "__main__":
    l = list(range(1, 20, 1))
    frec = measure(fibonacci_rec, l)
    fiter = measure(fibonacci_iter, l)
    plt.yscale("log")
    plt.title("Execution time of fibonacci functions")
    plt.xlabel("Fibonacci number")
    plt.ylabel("Time in second")

    plt.plot(l, frec, "-.", label="fibonacci_rec")
    plt.plot(l, fiter, "--", label="fibonacci_iter")

    plt.legend()
    plt.show()


Alternative method

An alternative method to compute the Fibonacci sequence consists in using matrices. A matricial form of the Fibonacci sequence is given below:

{$$ \left( \begin{matrix} F_1 \\ F_2 \end{matrix} \right) = \left( \begin{matrix} 0 & 1 \\ 1 & 1 \end{matrix} \right) * \left( \begin{matrix} F_0 \\ F_1 \end{matrix} \right) $$}

and in general:

{$$ \left( \begin{matrix} F_n \\ F_{n+1} \end{matrix} \right) = \left( \begin{matrix} 0 & 1 \\ 1 & 1 \end{matrix} \right)^n * \left( \begin{matrix} F_0 \\ F_1 \end{matrix} \right) $$}

We'll use numpy to compute operations on matrices.

Numpy is an open source Python library that provides functions to manipulate multidimensional data structures. The main data structure is the np.array, which represents a multidimensional vector. Multidimensional vectors can also be represented by using standard Python lists.

However, two important differences exist between Python lists and Numpy arrays:

1. The size of a NumPy array is decided at the creation and cannot change. As opposed to that, Python lists can grow dynamically. Changing the size of a Numpy array will create a new array and delete the original.

2. All elements in a NumPy array must have the same data type.

As a result, the elements of a Numpy array are all stored in contiguous locations in memory, which is not the case of Python lists. This is why iterating over Numpy arrays is much more efficient than iterating over Python lists.

NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences.

Here are some simple steps to help you.

Creation of matrices and vectors.

import numpy as np
mat1 = np.array([[5,6],[10,3]])
vec = np.array([[9],[2]])

Access to one element.

mat1[1,0]     # => 10

Some term-to-term operations.

mat2 = mat1 + mat1  # => mat2[i,j] = mat1[i, j] + mat1[i,j]
mat3 = mat1 * 2     # => mat3[i,j] = mat1[i,j] * 2
 ...

Some matrix operations.

mat4 = np.matmul(mat1, mat1)      # matrix product
mat5 = mat1 @ mat1            # abbreviated product syntax
mat6 = np.linalg.matrix_power(mat1, 2)   # mat1**2

Take a moment to familiarize yourself with numpy matrix operations in a Python shell.

WARNING: you may already have used np.matrix instead of np.array. They work slightly differently. In particular, operations like M ** 2 or M * vec are available for np.matrix; it all seems very practical. HOWEVER, the numpy documentation is very clear on this point, the np.matrix object will be removed from the library soon, so you have to stop using them: "It is no longer recommended to use this class, even for linear algebra. Instead use regular arrays. The class may be removed in the future".

Matrix

Develop a function fibonacci_mat(n) that computes the value of the n-th term of the Fibonacci sequence. This function must use numpy functions.

Include this function in your performance analysis.


ANSWER ELEMENTS

import numpy as np
def fibonacci_mat(n):
    matrix = np.array([[0,1],[1,1]])
    vec = np.array([[0],[1]])
    return (np.linalg.matrix_power(matrix, n-1) @ vec)[1, 0] 

if __name__ == "__main__":
    l = list(range(1, 20, 1))
    frec = measure(fibonacci_rec, l)
    fiter = measure(fibonacci_iter, l)
    fmat = measure(fibonacci_mat, l)
    plt.yscale("log")
    plt.title("Execution time of fibonacci functions")
    plt.xlabel("Fibonacci number")
    plt.ylabel("Time in second")

    plt.plot(l, frec, "-.", label="fibonacci_rec")
    plt.plot(l, fiter, "--", label="fibonacci_iter")
    plt.plot(l, fmat, "-", label="fibonacci_mat")

    plt.legend()
    plt.show()


Large values of {$n$}

Your recursive function is probably much slower than the other two variants. The execution time seems to increase exponentially with the value of {$n$}. And it does. Test your functions with larger values of {$n$}. Start with range(1, 20, 2), then range(1, 30, 3) and so on until the execution time of fibonacci_rec becomes too long.

Test the three functions.


ANSWER ELEMENTS

if __name__ == "__main__":
    l = list(range(0, 20, 1))
    l2 = list(range(0, 20, 1)) + list(range(20, 1000, 10))
    frec = measure(fibonacci_rec, l)
    fiter = measure(fibonacci_iter, l2)
    fmat = measure(fibonacci_mat, l2)
    plt.yscale("log")
    plt.title("Execution time of fibonacci functions")
    plt.xlabel("Fibbnacci number")
    plt.ylabel("Time in second")

    plt.plot(l, frec, "-.", label="fibonacci_rec")
    plt.plot(l2, fiter, "--", label="fibonacci_iter")
    plt.plot(l2, fmat, "-", label="fibonacci_mat")

    plt.legend()
    plt.show()



The devil is in the detail

Compare the results of fibonacci_iter(93) and fibonacci_mat(93). Where does the problem come from?


ANSWER ELEMENTS

In Python integers haven't a fixed size, the only limit is the available memory. The interpreter manages the overflows and, if necessary, a greater number of bits is used.

Integer values in Numpy have a fixed size. By default an integer is a signed 64 bits value. Overflows are not detected. Conclusion: Numpy is a precious library but you need to be careful when you use it.


Well done

You have completed the first part of this lab. You studied the basics of visualization with matplotlib and, for the first time, touched with hand the simplicity of matrix calculations with numpy. If you still have time, you may want to check the next following questions.

Optional questions

The problem with the recursive function comes from the exponentially large number of recursive calls. The function may be called many times with the same input value. It is possible to work around this problem through memoization. The rationale of memoization is to store the intermediate results in a dictionary and make a recursive call only if the result of the call is not in the dictionary yet.

Memoization

Modify the implementation of your recursive function to use memoization and compare its performances with the two other functions.

You can find out more on memoization here.


ANSWER ELEMENTS

Warning: when the default value of a parameter is a mutable object (dictionary, list, etc.), this object is preserved between calls. Hence, in the following example, the memoization dictionary is completed only once. This biases the comparisons, as the timeit module averages the execution time over several calls.

def fibonacci_rec(n, memo={}):
    if n == 0:
        return 0
    if n == 1:
        return 1
    else:
        if n in memo:
            return memo[n]
        else :
            f = fibonacci_rec(n-1, memo=memo) + fibonacci_rec(n-2, memo=memo)
            memo[n] = f
            return f

Two solutions are possible. The first is to use an auxiliary function to explicitly initialize the dictionary. The second is to set the default value to None.

def fibonacci_rec(n, memo=None):
    if n == 0:
        return 0
    if n == 1:
        return 1
    else:
        if memo is None:
            memo = {}
        if n in memo:
            return memo[n]
        else :
            f = fibonacci_rec(n-1, memo=memo) + fibonacci_rec(n-2, memo=memo)
            memo[n] = f
            return f

Very large values of {$n$}

Up to what value of {$n$} can you compute results? It depends on your machine. Increase the measurement range, ex. from {$0$} to {$20,000$} with a step of {$1,000$}. One of the functions should generate an error message for too large values of {$n$}.

In your opinion

  1. Why does fibonacci_mat(n) have an execution time which increases for small values of {$n$} and then seems to be constant?
  2. What does fibonacci_rec(n) error message means for large values of {$n$}?

ANSWER ELEMENTS

  1. To compute the power of a matrix {$M^n$}, numpy uses a "exponentiation by squaring" algorithm. The number of operations performed is proportional to {$\log_2(n)$}, to observe this increase we would need values ​​of {$n$} much more important. The notion associated with this question is the complexity of algorithms, we will come back to this during the ST2 algorithm course.
  2. Recursive calls are accumulated on the stack of the processor. The stack has a limited memory size. When the stack size is exceeded the processor can no longer continue its normal execution, an exception is raised. You'll take a closer look at this notion if you'll follow the third year major "Computer systems architecture".

Try the matrix computation your way

Since matrix computation with Numpy generates overflows, you can implement your own method using lists of native Python integer numbers. You'll need a function to compute the product of matrices (matrices will be represented as lists of lists), and a fast exponentiation function.


ANSWER ELEMENTS

def prod(m1, m2):
    result = [[sum(a * b for a, b in zip(X_row, Y_col)) for Y_col in zip(*m2)] for X_row in m1]
    return result

def fast_expo(m, e):
    result = [[1, 0], [0, 1]]
    while e > 0:
        if e % 2 == 1:
            result = prod(result, m)
        m = prod(m, m)
        e = e // 2
    return result

def fibonacci_mat(n):
    matrix = [[0, 1], [1, 1]]
    vec = [[0], [1]]
    e = fast_expo(matrix, n-1)
    return prod(e, vec)[1][0]