CentraleSupélecDépartement informatique
Plateau de Moulon
3 rue Joliot-Curie
F-91192 Gif-sur-Yvette cedex
1CC1000 - Information Systems and Programming - Lab: ETL processes with Python

Activities of this lab:

  • Learn how to manipulate semi-structured data with Pandas.
  • Learn what ETL processes are.
  • PistusResa ETL module: clean and import data into a relational database.
  • PistusResa Student module: read data from a relational database using Python code.

Introduction to Pandas

The data used in PistusResa is stored in two CSV files. We'll use Pandas to read the data from the CSV files, clean them and load them into the Pistus database.

The goal of this section is to learn how to use Pandas.


Download this notebook and import it into Visual Studio Code. Read the notebook and do the exercises before proceeding to the next section.

⏰ Expected duration of this part: at most 1.5 hours.


The PistusResa ETL module

The goal of this section is to extract the data from the input CSV files, transform and load them into the Pistus database.


Download this notebook and import it into Visual Studio Code. Read the notebook and do the exercises before proceeding to the next section.

⏰ Expected duration of this part: at most 30 minutes.


ANSWER ELEMENTS

You can download here the notebook with the solutions.



Implementation of PistusResa

You're now ready to work on the implementation of PistusResa.


  • Download the skeleton of the application PistusResa and import it into Visual Studio Code.
  • Copy the database file pistus.db to the folder data.

⏰ Time to spend on this part: at most 1 hour.


The Student module

The Student module contains the functions necessary to add and edit student information in the database.


The code of the Student module is in file mstudent.py.


In file mstudent.py, the following functions are defined but not implemented.

  • get_student, returns a student's data from the database.
  • get_associations, returns all associations stored in the database.
  • get_roles, returns the student roles in an association.
  • get_memberships, returns the associations of a specific student.
  • add_email_address, adds an email address to a student.
  • add_student, adds a student to the database.
  • add_membership, adds a student to an association in the database.
  • delete_email_address, removes the email address of a student.
  • delete_membership, deletes a student's membership in an association.
  • update_first_name, updates student's first name.
  • update_last_name, updates a student's last name.
  • update_gender, updates a student's gender.
  • update_email_address, updates the email address of a student.
  • update_membership, updates an existing membership of a student.

In file mstudent_test.py we defined and implemented a test function for each of the functions listed above. These test functions are called in the main section of file mstudent.py: see the code after the line 487.


Here's what you have to do:


  • Open file mstudent.py.
  • Implement the functions in the order in which they are defined: the order is important as the tests depend on that.
  • Each function is commented in file mstudent.py. Read carefully all the comments to understand how to implement each function.
  • You'll have to use SQL queries in each function. Refer to the SQL cheat sheet if you have doubts about the SQL syntax.
  • Whenever you finish the implementation of a function, run file mstudent.py as a Python script. This will trigger the execution of the tests.
  • If a test fails, you'll need to understand the error messages and fix the problem.

👉 Remember to remove the instruction raise NotImplementedError when you implement a function. Otherwise, your function would always be considered as not implemented.

👉 The Registration module (file mregistration.py) is already implemented. Look at the code to understand how the Student module should be implemented.


ANSWER ELEMENTS

You can download the skeleton with the solutions here.


def get_student(stud_number, cursor):
    stud_data = None
    # Gets the data from the table Student
    try:
        cursor.execute('SELECT * FROM student WHERE stud_number=?', \
            (stud_number,))
        stud_data = cursor.fetchone()
    except sqlite3.Error as error:
        print(error) 
        return None 
    # If no student can be found, return an empty tuple
    if stud_data is None:
        return () 

    email_addresses = []
    try:
        # Gets the email addresses of the student.
        cursor.execute("SELECT email FROM emailAddress WHERE stud_number=?", (stud_number,))
        result = cursor.fetchall()
        for row in result:
            email_addresses.append(row[0])
    except sqlite3.Error as error:
        print(error) 
        return None
    return (stud_data[0], stud_data[1], stud_data[2], stud_data[3], email_addresses)

#######

def get_associations(cursor):
    associations = []
    try:
        # Gets the data from the table Association.
        cursor.execute('SELECT * FROM association')
        result = cursor.fetchall()
        for row in result:
            associations.append((row[0], row[1]))
    except sqlite3.Error as error:
        print(error) 
        return None 
    return associations

#######

def get_roles(cursor):
    roles = []
    try:
        # Get the student roles in the associations.
        cursor.execute('SELECT DISTINCT stud_role FROM membership')
        result = cursor.fetchall()
        for row in result:
            roles.append(row[0])
    except sqlite3.Error as error:
        print(error) 
        return None
    return roles

#######

def get_memberships(stud_number, cursor):
    memberships = []
    try: 
        cursor.execute('SELECT asso_name, stud_role FROM membership WHERE stud_number=?', \
            (stud_number,))
        result = cursor.fetchall()
        for row in result:
            memberships.append((row[0], row[1]))
    except sqlite3.Error as error:
        print(error) 
        return None
    return memberships

#######

def add_email_address(stud_number, email_address, cursor):
    try:
       cursor.execute('INSERT INTO emailAddress VALUES (?, ?)',\
           (email_address, stud_number))
    except sqlite3.IntegrityError as error:
        return (False, DUPLICATE_EMAIL_ADDRESS, email_address)
    except sqlite3.Error as error:
        return (False, UNEXPECTED_ERROR, error)
    return (True, None, None)

#######

def add_student(stud_number, first_name, last_name, gender, email_addresses, cursor):
    try:
        # Insert the values in the table Student
        cursor.execute('INSERT INTO student VALUES (?, ?, ?, ?)', \
            (stud_number, first_name, last_name, gender))
    except sqlite3.IntegrityError as error: 
        return (False, DUPLICATE_STUD_NUMBER, None)
    except sqlite3.Error as error:
        return (False, UNEXPECTED_ERROR, error)

    # Add the email addresses of the student
    for email in email_addresses:
        res = add_email_address(stud_number, email, cursor)
        if res[0] == False:
            return res
    return (True, None, None)

#######

def add_membership(stud_number, membership, cursor):
    try:
        cursor.execute('INSERT INTO membership VALUES (?, ?, ?)', \
                (stud_number, membership[0], membership[1]))
    except sqlite3.IntegrityError as error:
        return (False, DUPLICATE_MEMBERSHIP, membership[0])
    except sqlite3.Error as error:
        return (False, UNEXPECTED_ERROR, error)
    return (True, None, None)

#######

def delete_email_address(stud_number, email_address, cursor):
    try:
       cursor.execute('DELETE FROM emailAddress WHERE stud_number=? AND email=?',\
           (stud_number, email_address))
    except sqlite3.Error as error:
        return (False, UNEXPECTED_ERROR, error)

    return (True, None, None)

#######

def delete_membership(stud_number, asso_name, cursor):
    try:
        cursor.execute('DELETE FROM membership WHERE stud_number=? and asso_name=?', \
            (stud_number, asso_name))
    except sqlite3.Error as error:
        return (False, UNEXPECTED_ERROR, error)
    return (True, None, None)

#######    

def update_first_name(stud_number, first_name, cursor):
    try: 
        cursor.execute('UPDATE student SET first_name=? WHERE stud_number=?',\
            (first_name, stud_number))
    except sqlite3.Error as error:
        print(error) 
        return (False, UNEXPECTED_ERROR, error)
    return (True, None, None)

#######

def update_last_name(stud_number, last_name, cursor):
    try:
        cursor.execute('UPDATE student SET last_name=? WHERE stud_number=?',\
            (last_name, stud_number))
    except sqlite3.Error as error:
        print(error) 
        return (False, UNEXPECTED_ERROR, error)
    return (True, None, None)

#######

def update_gender(stud_number, gender, cursor):
    try:
        cursor.execute('UPDATE student SET gender=? WHERE stud_number=?',\
            (gender, stud_number))
    except sqlite3.Error as error:
        print(error) 
        return (False, UNEXPECTED_ERROR, error)
    return (True, None, None)

#######

def update_email_address(stud_number, old_email_address, new_email_address, cursor):
    try:
        cursor.execute('UPDATE emailAddress SET email=? WHERE stud_number=? AND email=?',\
            (new_email_address, stud_number, old_email_address))
    except sqlite3.IntegrityError as error:
        return (False, DUPLICATE_EMAIL_ADDRESS, new_email_address)
    except sqlite3.Error as error:
        return (False, UNEXPECTED_ERROR, error)
    return (True, None, None)

#######    

def update_membership(stud_number, old_association, new_association, role, cursor):
    try:
        cursor.execute('UPDATE membership SET asso_name=?, stud_role=? WHERE stud_number=?\
            AND asso_name=?', (new_association, role, stud_number, old_association))
    except sqlite3.IntegrityError as error:
        return (False, DUPLICATE_MEMBERSHIP, new_association)
    except sqlite3.Error as error:
        return (False, UNEXPECTED_ERROR, error)
    return (True, None, None)


Next lab

In the next lab we will:

  • code a login window for PistusResa.
  • code the deadline and authentication modules of PistusResa (if time allows it).

👉 If you reached the end of this lab, and you still have time, you can start the next lab.