Part 1: Relational database modeling

Objective. In the first part of this tutorial we do not code; instead we design a relational database for the management of French real estate transaction data, using the same dataset introduced in the previous tutorial.

In the previous tutorial, we considered only the Extract and Transform phases of the ETL pipeline; the Load phase was intentionally omitted. Consequently, the transformed data were simply persisted as a Parquet file.

The transformed dataset therefore holds transaction data, administrative data about departments, regions, and cities, and spatial information, all in a single table. This organization introduces a high degree of redundancy: because many transactions refer to the same city, that city's spatial geometry is repeated for every one of them. The same holds for department-level attributes, such as the code and name, and for regional attributes. Such a design has two main drawbacks:

Risk of inconsistencies: any modification to a department, region, or city attribute must be propagated to every affected record.
Inefficient use of storage: repeating the same information across many records leads to unnecessary duplication.

A well-structured relational database eliminates these issues by decomposing the data into related entities, so that each fact is stored only once.
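To make the benefit of decomposition concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and data values are invented for illustration only; they are not the expected solution to the modeling exercise, and the DBMS used in the tutorial may differ. The city name is stored once, so a single UPDATE is enough to correct it for every transaction that refers to it.

```python
import sqlite3

# In-memory database; all names and values below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city (
        insee_code TEXT PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE sale (
        id         INTEGER PRIMARY KEY,
        sale_date  TEXT NOT NULL,
        value      REAL NOT NULL,
        insee_code TEXT NOT NULL REFERENCES city(insee_code)
    );
""")

# One city row, two sales that reference it by its code.
conn.execute("INSERT INTO city VALUES ('00001', 'Old name')")
conn.executemany(
    "INSERT INTO sale (sale_date, value, insee_code) VALUES (?, ?, ?)",
    [("2023-01-10", 250000.0, "00001"), ("2023-02-15", 180000.0, "00001")],
)

# The city attribute is changed in exactly one place.
conn.execute("UPDATE city SET name = 'New name' WHERE insee_code = '00001'")

# Every sale sees the new name through the join.
rows = conn.execute("""
    SELECT sale.sale_date, city.name
    FROM sale JOIN city ON sale.insee_code = city.insee_code
    ORDER BY sale.sale_date
""").fetchall()
print(rows)
```

In the single-table layout, the same correction would have to touch every transaction row of that city, which is where inconsistencies can creep in.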

Structure of the database

The database must preserve all information that was present in the transformed Parquet file produced in the previous tutorial.

For each transaction, we store the transaction date, the transaction type (for instance, sale or expropriation), and the value of the property involved.

Each transaction concerns a specific property, for which we record the property type (for example, house or apartment), the surface area, and, when applicable, the surface area of any annexed land, such as a garden. We also store the number of main rooms and the number of units composing the property.

In addition, each property is associated with an address composed of the street number, street type, and street name, together with the postal code and the city name and INSEE code. Each specific address at the street number level is identified in the cadastre by two values: the cadastral section and the land parcel. Each city belongs to a department, which is identified by a code and characterized by a name; each department, in turn, belongs to a region, which is also identified by a code and characterized by a name. Finally, for each city, spatial information is stored in the form of polygon geometries.

Propose a conceptual model of the database using an entity-relationship diagram. Database modeling is covered in Chapter 5 of the course handbook; a short introduction is given here.
Derive a logical model of the database from the conceptual model. Logical models are covered in Chapter 5 of the course handbook; a short introduction is given here.
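As a hint for the second task, one common way to translate a one-to-many relationship into a logical model is to place a foreign key on the "many" side. The sketch below applies this to the statement that each department belongs to a region. It is only an illustration under assumptions: the table and column names and the sample codes are hypothetical, and SQLite is used merely because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE region (
        code TEXT PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE department (
        code        TEXT PRIMARY KEY,
        name        TEXT NOT NULL,
        -- "belongs to" becomes a foreign key on the department side
        region_code TEXT NOT NULL REFERENCES region(code)
    );
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO region VALUES ('R1', 'Region one')")
conn.execute("INSERT INTO department VALUES ('D1', 'Department one', 'R1')")

# A department pointing at a region that does not exist is refused.
try:
    conn.execute("INSERT INTO department VALUES ('D2', 'Department two', 'R9')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The foreign key lets the database itself guarantee that every department refers to an existing region, instead of relying on the loading code to check it.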

Part 2: Relational databases with Python

Objective. In this part you'll learn how to query a relational database from a Python program.
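As a preview of what querying from Python looks like, here is a minimal sketch using the standard-library sqlite3 module. The notebook may rely on a different DBMS and driver, so treat this as an assumption; the table, its columns, the helper function cities_in_department, and the sample rows are all hypothetical. The pattern itself (open a connection, run a parameterized query, iterate over the rows) is shared by most Python database drivers.

```python
import sqlite3


def cities_in_department(conn, department_code):
    """Return the names of the cities of one department, sorted by name."""
    # The "?" placeholder lets the driver bind the value safely,
    # instead of building the SQL string by concatenation.
    cursor = conn.execute(
        "SELECT name FROM city WHERE department_code = ? ORDER BY name",
        (department_code,),
    )
    return [row["name"] for row in cursor]


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be indexed by column name

# Hypothetical table and rows, created here only to make the sketch runnable.
conn.execute(
    "CREATE TABLE city (insee_code TEXT PRIMARY KEY, name TEXT, department_code TEXT)"
)
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [("00002", "Town B", "D1"), ("00001", "Town A", "D1"), ("00003", "Town C", "D2")],
)

result = cities_in_department(conn, "D1")
print(result)
```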

Preparation. Follow these instructions to prepare the working environment:

Clone this project from GitLab.
Open the project with Visual Studio Code.
Open a terminal and create a Python virtual environment: python3 -m venv .venv
Activate the virtual environment: source .venv/bin/activate
Upgrade pip: pip install --upgrade pip
Install the required libraries in the virtual environment: pip install -r requirements.txt

Work to do. Open notebook notebooks/td14-sql-en.ipynb and follow the instructions.

ANSWER ELEMENTS

Solutions available on GitLab: git clone git@gitlab-research.centralesupelec.fr:sip/teachers/tutorials/en/td14-sql-en.git