CentraleSupélec
Département informatique
Plateau de Moulon
3 rue Joliot-Curie
F-91192 Gif-sur-Yvette cedex
1CC2000 - Algorithmics and complexity - lab session 1


Objectives of the first lab session

The work carried out during this 1.5-hour lab session aims to:

  1. Allow students to program algorithms to:
    • find duplicate items in a list using lists
    • find duplicate items in a list using dictionaries
    • compare the complexity of lists and dictionaries
    • calculate the transitive closure of a binary relation with adjacency lists
    • calculate the transitive closure of a binary relation with an adjacency matrix
    • understand the difference in their complexity
  2. Use an environment and working tools to solve the problem:
    • generation of random inputs
    • definition of metrics and creation of “benchmarks”
    • display of inputs and outputs

Lists and dictionaries

This is a classic and relatively simple problem, but it helps illustrate the difference in complexity between lists and dictionaries.

Find duplicate items in a list

  • Input: a list of items
  • Output: a list of items that appear more than once in the input list

The associated decision problem is as follows:

  • Input: a list of items
  • Output: a boolean indicating whether the list contains duplicate elements

We will consider different solutions and analyze their time and space complexity. To do this, you may refer to the documentation on the complexity of operations on Python data structures.
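As a quick illustration (a micro-check of our own, not part of the exercise), the cost of a single membership test already shows the gap between the two structures:


import timeit

values = list(range(100000))
lookup = dict.fromkeys(values)  # a dictionary used as a set

# Membership test: O(n) on the list, O(1) on average on the dictionary
print(timeit.timeit(lambda: 99999 in values, number=100))
print(timeit.timeit(lambda: 99999 in lookup, number=100))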


Write the following three functions:

  • find_duplicates_list(L), using lists: it takes a list L as a parameter and returns a list of the elements that appear more than once in L;
  • find_duplicates_sort(L), using lists after ordering the elements: sort the list, then iterate through the sorted list to find the duplicate elements. Once the list is sorted, duplicate elements are consecutive, which speeds up their identification;
  • find_duplicates_dict(L), using dictionaries to test whether an element is present in (average) constant time.
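A possible sketch of the three versions follows (a sketch of our own, not the official solution; it assumes the elements are hashable for the dictionary version and orderable for the sorting one):


def find_duplicates_list(L):
    # Quadratic version: membership tests on plain lists are linear
    seen, dups = [], []
    for x in L:
        if x in seen:
            if x not in dups:
                dups.append(x)
        else:
            seen.append(x)
    return dups

def find_duplicates_sort(L):
    # Sort first: duplicates become consecutive and are found in one pass
    s = sorted(L)
    dups = []
    for i in range(1, len(s)):
        if s[i] == s[i - 1] and (not dups or dups[-1] != s[i]):
            dups.append(s[i])
    return dups

def find_duplicates_dict(L):
    # Count occurrences with a dictionary: average-constant-time lookups
    counts = {}
    for x in L:
        counts[x] = counts.get(x, 0) + 1
    return [x for x, c in counts.items() if c > 1]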

Test your functions with the following code to verify that the three versions return the same result.


import random

n = 20000
# Generate n random integers between 0 and n//10
data = [ random.randint(0, n//10) for _ in range(n) ]
dup1 = find_duplicates_list(data)
dup2 = find_duplicates_sort(data)
dup3 = find_duplicates_dict(data)
assert (sorted(dup1) == dup2 == sorted(dup3))

Notes:

  • the comparison may require sorting the lists of duplicate elements returned by the different versions;
  • we cannot use the trick of converting these lists to set() before comparing them, because this would silently remove any duplicates inside a result, hiding a possible error in one of the versions.
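For instance, a buggy version that returned [1, 1, 2] instead of [1, 2] would go unnoticed after conversion:


# set() silently merges the spurious duplicate, so the buggy result passes:
assert set([1, 1, 2]) == set([1, 2])
# whereas comparing sorted lists catches it:
assert sorted([1, 1, 2]) != sorted([1, 2])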


  • Analyze the complexity of these three versions in order to compare them. Write your answer in a comment in your source file.
  • For a finer comparison, we can measure the execution time of each version on more realistic data: randomly generated strings. What do you observe in the benchmark? Discuss it with your teacher.

Benchmark

The Python module timeit measures the execution time of a Python function passed as a parameter and run number times (its number parameter). Using the following code, compare the execution times of the three versions.


import itertools

def generate_all_strings(n):
    # All strings of length n over the lowercase alphabet a-z
    alphabet = [chr(c) for c in range(97, 123)]
    res = list(itertools.product(alphabet, repeat=n))
    res = ["".join(i) for i in res]
    return res

import random, timeit
import matplotlib.pyplot as plt
import copy

def Q2_benchmark():

    # There are 26 letters in the alphabet, hence 26^5 = 11881376 possibilities
    all_strings = generate_all_strings(5)

    n_values = [100, 200, 500, 1000, 2000, 5000, 10000, 20000
                # , 50000, 100000, 200000, 500000, 1000000
                ]
    timings_list = [0 for _ in n_values]
    timings_sort = [0 for _ in n_values]
    timings_dict = [0 for _ in n_values]
    for i, n in enumerate(n_values):
        print(f'i={i}, n={n}')
        # Each of the n//2 distinct strings appears exactly twice in data
        data1 = all_strings[:n//2]
        data2 = copy.copy(data1)
        random.shuffle(data1)
        random.shuffle(data2)
        data = data1 + data2
        timings_list[i] = timeit.timeit(lambda: find_duplicates_list(data), number=1)
        timings_sort[i] = timeit.timeit(lambda: find_duplicates_sort(data), number=1)
        timings_dict[i] = timeit.timeit(lambda: find_duplicates_dict(data), number=1)
    plt.plot(n_values, timings_list, color='red', marker='o', label='list')
    plt.plot(n_values, timings_sort, color='blue', marker='x', label='sort')
    plt.plot(n_values, timings_dict, color='green', marker='+', label='dict')
    # Plot with a log/log scale
    plt.xscale('log')
    plt.yscale('log')
    plt.legend()
    plt.title('Timings')
    plt.show()
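The benchmark is then launched with a simple parameterless call:


Q2_benchmark()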

Adjacency lists versus adjacency matrices

We represent a binary relation as a subset of the Cartesian product of the set of integers from 0 to 𝑛−1 with itself. This binary relation is also a directed graph whose vertices are the integers from 0 to 𝑛−1 and whose edges are the pairs of the relation.

First, we generate a binary relation using the Erdős–Rényi (ER) model.

Principle: randomly draw a subset of fixed cardinality from the set of all possible pairs.


import random

def genere_relation_binaire(n, m):
    return random.sample([(x, y) for x in range(n) for y in range(n)], m)
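For example, on a small instance (the sampled pairs vary from run to run):


# 4 edges drawn uniformly among the 9 possible pairs over {0, 1, 2}
print(genere_relation_binaire(3, 4))
# e.g. [(2, 0), (0, 1), (1, 1), (2, 2)]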


Representation

We can consider different representations of a binary relation:

  • adjacency list: a list of lists (or of sets/dictionaries)
  • adjacency matrix: a list of lists

We write a conversion function for each of these representations. We also add a function returning the neighbourhood of a vertex.


def convertit_liste_adjacence(relation, n):
    """
    Converts a binary relation into an adjacency list
    """
    liste_adjacence = [[] for _ in range(n)]
    for x, y in relation:
        liste_adjacence[x].append(y)
    return liste_adjacence

def voisinage_liste_adjacence(liste_adjacence, u):
    """
    Returns the neighbourhood of u in the binary relation
    """
    return liste_adjacence[u]

Write two similar functions using an adjacency matrix:

  • convertit_matrice_adjacence(relation, n)
  • voisinage_matrice_adjacence(matrice_adjacence, u)

Analyze the complexity of each.
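A possible sketch of these two functions (our own version, mirroring the adjacency-list code above; the exercise is to write and analyze your own):


def convertit_matrice_adjacence(relation, n):
    """
    Converts a binary relation into an adjacency matrix (list of lists of 0/1)
    """
    matrice_adjacence = [[0] * n for _ in range(n)]
    for x, y in relation:
        matrice_adjacence[x][y] = 1
    return matrice_adjacence

def voisinage_matrice_adjacence(matrice_adjacence, u):
    """
    Returns the neighbourhood of u: every v whose entry (u, v) is 1
    """
    return [v for v in range(len(matrice_adjacence)) if matrice_adjacence[u][v]]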

Depth-First Search of a graph

We use the iterative method, which relies on an explicit stack.

All nodes are covered: when the algorithm has exhausted a component, it moves on to the next node not yet visited.

The algorithm receives a graph in one of the following representations along with the appropriate neighbour function:

  • adjacency matrix
  • adjacency list

Note: the depth-first search from node 𝑢 visits all nodes reachable from 𝑢.

  • Write a function depth_first_search(graph, n, voisinage) which takes as parameters a graph (in either of the two representations), the number of vertices (see the function genere_relation_binaire(n, m) above) and a voisinage function. It returns the set of all visited nodes. A possible sketch is given after this list.
  • To test these functions, we consider two cases: a dense graph and a sparse graph. In the sparse graph, the number of edges is barely greater than the number of vertices; in the dense graph, the number of edges is close to 𝑛².
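One possible iterative sketch (a version of our own, using an explicit stack and a set of visited nodes):


def depth_first_search(graph, n, voisinage):
    # Iterative DFS with an explicit stack; covers every vertex,
    # restarting from the next unvisited node when a component is exhausted
    visited = set()
    for start in range(n):
        if start in visited:
            continue
        stack = [start]
        while stack:
            u = stack.pop()
            if u in visited:
                continue
            visited.add(u)
            for v in voisinage(graph, u):
                if v not in visited:
                    stack.append(v)
    return visited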

def sparse_factor(n):
    # Sparse graph: barely more edges than vertices
    return int(1.05 * n)

def dense_factor(n):
    # Dense graph: close to n*n edges
    return int(0.95 * n * n)

def validation_dfs(n):
    r_dense = genere_relation_binaire(n, dense_factor(n))
    print(len(r_dense))
    r_sparse = genere_relation_binaire(n, sparse_factor(n))

    # Both representations must yield exactly the same set of visited nodes
    for r in [r_dense, r_sparse]:
        assert (depth_first_search(convertit_liste_adjacence(r, n), n, voisinage_liste_adjacence)
                == depth_first_search(convertit_matrice_adjacence(r, n), n, voisinage_matrice_adjacence))

validation_dfs(1000)


  • Analyze the complexity with these two different representations.
  • What do you observe in the following benchmark? Discuss it with your teacher.

Benchmark


# These variables must be passed to the timeit function via globals()

def Q5_benchmark():
    global liste_adjacence, matrice_adjacence

    timings = {}
    timings['listes_adjacence'] = {}
    timings['matrice_adjacence'] = {}
    # Number of times each measurement is repeated
    number = 1
    for n in [10, 20, 50, 100, 200, 500, 1000
              , 2000, 5000
              ]:

        timings['listes_adjacence'][n] = {}
        timings['matrice_adjacence'][n] = {}
        r_dense = genere_relation_binaire(n, dense_factor(n))
        r_sparse = genere_relation_binaire(n, sparse_factor(n))
        for i, r in enumerate([r_sparse, r_dense]):
            liste_adjacence = convertit_liste_adjacence(r, n)
            matrice_adjacence = convertit_matrice_adjacence(r, n)
            timings['listes_adjacence'][n][i] = timeit.timeit(f"depth_first_search(liste_adjacence, {n}, voisinage_liste_adjacence)",number=number,globals=globals())

            timings['matrice_adjacence'][n][i] = timeit.timeit(f"depth_first_search(matrice_adjacence, {n}, voisinage_matrice_adjacence)",number=number,globals=globals())

    # Define colors for representations and shapes for graph types: 
    # red for listes_adjacence and blue for matrice_adjacence
    # o for sparse graphs and * for dense graphs
    colors = ['red', 'blue']
    marks=['o', '*']
    for i, representation in enumerate(['listes_adjacence', 'matrice_adjacence']):
        for n in timings[representation]:
            for p in timings[representation][n]:
                plt.plot(n, timings[representation][n][p], color=colors[i], marker=marks[p])
    plt.xscale('log')
    plt.yscale('log')
    plt.title('Timings')
    plt.show()
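As before, the benchmark is launched with:


Q5_benchmark()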