This tutorial consists of two parts:
- Modeling relational databases.
- Using relational databases with Python.
Relational database modeling
Objective. In the first part of this tutorial we do not code; instead we design a relational database for the management of French real estate transaction data, using the same dataset introduced in the previous tutorial.
In the previous tutorial, we considered only the Extract and Transform phases of the ETL pipeline; the Load phase was intentionally omitted. Consequently, the transformed data were simply persisted as a Parquet file.
As a result, the transformed dataset contains transaction data, administrative data about departments, regions, and cities, as well as spatial information, all in a single table. This organization introduces a high degree of redundancy. Indeed, because multiple transactions may refer to the same city, the corresponding spatial geometry of that city is repeated for every transaction. The same phenomenon occurs for department-level attributes such as the code and name, and for regional attributes. Such a design has two main drawbacks:
- Risk of inconsistencies: any modification to a department, region, or city attribute must be propagated to every affected record.
- Inefficient use of storage: repeating the same information across many records leads to unnecessary duplication.
A well-structured relational database allows these issues to be eliminated through a proper decomposition of the data into related entities.
Structure of the database
The database must preserve all information that was present in the transformed Parquet file produced in the previous tutorial.
For each transaction, we store the transaction date, the transaction type (for instance, sale or expropriation), and the value of the property involved.
Each transaction concerns a specific property, for which we record the property type (for example, house or apartment), the surface area, and, when applicable, the surface area of any annexed land, such as a garden. We also store the number of main rooms and the number of units composing the property.
In addition, each property is associated with an address composed of the street number, street type, and street name, together with the postal code and the city name and INSEE code. Each specific address at the street number level is identified in the cadastre by two values: the cadastral section and the land parcel. Each city belongs to a department, which is identified by a code and characterized by a name; each department, in turn, belongs to a region, which is also identified by a code and characterized by a name. Finally, for each city, spatial information is stored in the form of polygon geometries.
- Propose a conceptual model of the database using a entity-relationship diagram. Database modelling is covered in Chapter 5 of the course handbook; a short introduction is given here.
ANSWER ELEMENTS
The conceptual model is given in figure.
Important points to remind students.
- At the conceptual level, the terms entity and attribute (or property) must be used instead of table and column. Similarly, instead of key, we use entity identifier (or simply identifier).
- Foreign keys do not belong to the conceptual model. References between entities are represented through explicit relationships.
- The attribute
property_valuemay also be associated with the entityTransaction. In the proposed diagram, it is attached to the relationshiprelates_toin order to illustrate that relationships may themselves carry attributes. - It is important to decouple street information from property information. Because a street name may change over time, this information should not be duplicated for each historical transaction. It is preferable to store it in a single place. In addition, a street may include extra attributes, such as traditional names (in french, *lieu-dit*).
- Names and codes of cities, departments, and regions may change over time. Centralizing these data justifies the creation of separate entities.
- Surrogate identifiers are used for
Property,Street, andTransaction(that is, an artificial attribute is introduced as identifier), because no natural combination of available attributes provides a robust and practical identifier. Although the trivial identifier based on all attributes is possible, it is generally not useful in practice. In general, simple identifiers (a single attribute) are preferred. - For
City, a surrogate identifiercity_idis required, as a single city may be associated with multiple names sharing the same INSEE code. A city may have changed name over the course of the years. - For
Department, andRegion, existing administrative codes are used as identifiers, although this choice can be debated. Since these codes may evolve, many practitioners discourage mutable identifiers, even if SQL does not prohibit them to be used in relational databases. If a department or a region name changes, and we want to keep the old names, we would need to use surrogate identifiers. Here we don't consider this case. - The schema does not include an explicit many-to-many relationship. One can illustrate this case by adding an entity
Personlinked toPropertythrough a relationshipowns: a person may own several properties, and a property may have several owners. Students should be reminded that owner data are not part of the dataset and that this extension is only a conceptual example. - Other cardinalities are possible. For instance, the
(1, n)cardinality onrelates_tocould be replaced by(0, n)if the database is intended to allow registration of properties that have not yet been involved in any transaction.
- Derive a logical model of the database from the conceptual model. Logical models are covered in Chapter 5 of the course handbook; a short introduction is is given here.
ANSWER ELEMENTS
In the logical model, each table corresponds to an entity. The table has a column for each property of the corresponding entity; a table may receive additional columns (in red in the model below) through their relationships with other tables.
transaction (transaction_id, transaction_date, transaction_type, property_id, property_value) property (property_id, property_type, built_area, land_area, num_main_rooms, number_of_units, street_id) street (street_id, street_number, street_type, street_name, zip_code, cadastral_section, land_parcel, city_id) city (city_id, insee_code, city_name, geometry, dept_code) department (dept_code, dept_name, reg_code) region (reg_code, reg_name)
Foreign keys
transaction(property_id)referencesproperty(property_id)property(street_id)referencesstreet(street_id)street(city_id)referencescity(city_id)city(dept_code)referencesdepartment(dept_code)department(reg_code)referencesregion(reg_code)
If we added to the conceptual model an entity Person linked to entity Property via a many-to-many relationship owns, we would add to the conceptual model two additional tables:
person (person_id, person_first_name, person_last_name ...) ownership (property_id, person_id)
and two additional foreign keys:
ownership(property_id)referencesproperty(property_id)ownership(person_id)referencesperson(person_id)

