1CC1000 - Information Systems and Programming - Lab: Comparing machines

Consider the three Dell machines described in the following table.

Model	Latitude 3440	Precision Mobile 7780	Precision 7960
Type	Laptop	Laptop/Workstation	Workstation
Processor	Intel Core i5-1335U	Intel Core i9-13950HX	Intel Xeon w9-3495X
Memory (RAM)	8 GB - 3 200 MT/s - DDR4 (1 × 8 GB)	32 GB - 5 600 MT/s - DDR5 (1 × 32 GB)	512 GB - 4 800 MT/s - DDR5 (8 × 64 GB)
GPU	Intel Iris Xe Graphics G7 80EUs (integrated)	NVIDIA RTX 3500	NVIDIA RTX A6000
Storage	SSD 256 GB	SSD 1 TB	RAID 5 : 4 × SSD 4 TB
Dimensions	14" - 1920 × 1080	15.6" - 3840 × 2160	--
Weight	1.5 kg	2.5 kg	--
Price	900 €	4 600 €	32 100 €
Specs	click here	click here	click here

The first computer is a light laptop suitable for personal and/or office use. The second is a laptop workstation that can support applications requiring intensive computing and graphics. The third is a powerful workstation capable of handling massive computations, such as deep learning.

Discuss the fundamental concepts related to the main components of a computer: processor, memory (RAM), and storage.

ANSWER ELEMENTS

The purpose of this question is to recall the main notions discussed during the lecture.

Here are the key takeaways:

The processor is responsible for executing programs that are stored in main memory.
The processor's main components are:
- Control unit: responsible for fetching program instructions from main memory and decoding their type using the instruction code.
- Arithmetic logic unit (ALU): performs arithmetic and logic operations.
- Registers: store temporary results and control information (e.g., the program counter stores the memory address of the next instruction to fetch).
The main memory (RAM) is used to store temporary data and the code of running programs.
- RAM is a volatile memory: its content is lost when the computer is powered off.
- RAM is composed of a set of equal-sized words; each word is a sequence of bits. Each word is identified by an address (used to read/write the content of the word).
- A word can hold either a data value or a program instruction (von Neumann architecture).
Storage refers to the non-volatile memory. Two main types of storage:
- Hard disk drives (HDD): store data on platters made of magnetic material. They are slow because they contain mechanical parts.
- Solid-state drives (SSD): store data in semiconductor cells. They do not contain any moving parts, so they are faster.
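The fetch, decode and execute roles described above can be illustrated with a toy machine. This is only a minimal sketch: the instruction set (LOAD, ADD, STORE, HALT), the single accumulator register and the memory layout are hypothetical and were chosen for illustration, not taken from a real processor.

```python
# Toy von Neumann machine: instructions and data share one memory.
# The instruction set (LOAD/ADD/STORE/HALT) is hypothetical.

def run(memory):
    pc = 0    # program counter: address of the next instruction to fetch
    acc = 0   # accumulator register: holds temporary results
    while True:
        opcode, operand = memory[pc]   # control unit: fetch the instruction
        pc += 1                        # point to the next instruction
        if opcode == "LOAD":           # decode the type, then execute
            acc = memory[operand]
        elif opcode == "ADD":          # arithmetic done by the ALU
            acc = acc + memory[operand]
        elif opcode == "STORE":
            memory[operand] = acc
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

# Addresses 0-3 hold instructions, addresses 4-6 hold data values.
memory = [
    ("LOAD", 4),
    ("ADD", 5),
    ("STORE", 6),
    ("HALT", None),
    20,
    22,
    0,
]

result = run(memory)
print(result[6])  # 42
```

Note that the same list holds both the program and its data, and that every word is reached through its address (here, the list index).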

Also, remember the main reason why computers use multiple levels of memory.

Disks (HDD or SSD) are used for long-term storage of data and programs.
Before execution, the CPU loads program instructions and data from disk into RAM, since accessing RAM is much faster than reading directly from disk.
However, RAM capacity is smaller than that of disks, because RAM is significantly more expensive.
In addition, the CPU includes its own registers (the smallest memories in the computer) and its own cache memory to provide even faster access to data and instructions. We will return to this point in a later question.

Look at the specification of the first two computers (link in the last row of the table). How many screens can be simultaneously used with these laptops, regardless of the characteristics of the GPU?

ANSWER ELEMENTS

Latitude 3440 : 3
- Laptop screen
- HDMI 1.4
- USB 3.2 Gen 2 Type-C port with DisplayPort
Precision Mobile 7780 : 7
- Laptop screen
- HDMI 2.0a or 2.1
- USB 3.2 Gen 2 Type-C port with DisplayPort
- 2 x Thunderbolt 4 ports with USB Type-C (each supports two 4K displays)

What are the data transfer rates on the available connections of the two laptops?

ANSWER ELEMENTS

Latitude 3440
- USB 3.2 Gen 1 : 5 Gbit/s
- USB 3.2 Gen 2 Type-C : 10 Gbit/s
- RJ45 (Ethernet) : 10/100/1000 Mbit/s
- Wifi : Up to 2400 Mbit/s
- Bluetooth 5.3 : Up to 2 Mbit/s
- WWAN module : Up to 1 Gbit/s DL - 150 Mbit/s UL
Precision Mobile 7780 : same as above, with the following additions/changes
- Thunderbolt 4 ports with USB Type-C : 40 Gbit/s
- WWAN module : Up to 3 Gbit/s DL - 250 Mbit/s UL
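To get a feeling for these rates, the sketch below estimates the ideal time needed to transfer a file over some of the connections. The 1 GB file size is a hypothetical example, and the calculation assumes the peak rate is sustained with no protocol overhead, which real transfers do not achieve. Remember that rates are given in bits per second while file sizes are given in bytes.

```python
# Peak rates from the lists above, in bit/s.
RATES_BIT_PER_S = {
    "USB 3.2 Gen 1": 5e9,
    "USB 3.2 Gen 2 Type-C": 10e9,
    "Ethernet (1000 Mbit/s)": 1e9,
    "Wifi": 2400e6,
    "Bluetooth 5.3": 2e6,
    "Thunderbolt 4": 40e9,
}

def transfer_time_s(size_bytes, rate_bit_per_s):
    """Ideal transfer time: 8 bits per byte, no protocol overhead."""
    return size_bytes * 8 / rate_bit_per_s

FILE_SIZE_BYTES = 1e9  # hypothetical 1 GB file

times = {
    name: transfer_time_s(FILE_SIZE_BYTES, rate)
    for name, rate in RATES_BIT_PER_S.items()
}
for name, seconds in times.items():
    print(f"{name}: {seconds:.2f} s")
```

Under these assumptions the file takes 0.2 s over Thunderbolt 4, 8 s over Gigabit Ethernet, and more than an hour (4 000 s) over Bluetooth.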

We now compare the processors of the three computers. The following table summarizes the key points. An official and exhaustive comparison is available here.

	Intel Core i5-1335U	Intel Core i9-13950HX	Intel Xeon w9-3495X
Number of cores	2 + 8	8 + 16	56
Base frequency	1.7 GHz	2.20 GHz	1.90 GHz
Turbo frequency	4.60 GHz	5.50 GHz	4.80 GHz
Cache	12 MB	36 MB	105 MB
Power	15W	55W	350W
Max. memory size	64 GB	128 GB	4 TB
Max. memory channels	2	2	8
Recommended price	$340.00	$590.00	$5889.00
Geekbench 5 single-core score (cpu-monkey)	1628	2108	1734
Geekbench 5 multi-core score (cpu-monkey)	7240	19759	56911

Discuss the differences between the three processors. Can you guess what the different features listed in the table mean?
Look at the CPU limitations with memory. Are these limitations respected in the hardware configuration of the three computers? Can we still add memory to the three configurations?

ANSWER ELEMENTS

Here is a description of the features:

Number of cores. A core is a single processing unit of a CPU. A core can read and execute program instructions. A multi-core CPU can execute multiple instructions at the same time. You might have noticed that the processors of the first two computers have two types of cores (written as 2 + 8 and 8 + 16 in the table). You can see it in the detailed specifications. Here are the major takeaways:
- Recent Intel Core CPUs are generally equipped with P-cores (Performance cores) and E-cores (Efficient cores).
- P-cores handle heavy workloads; they deliver higher performance, but they cost more and consume more power (and therefore generate more heat).
- E-cores are used for background tasks that do not require high computing power. They deliver lower performance, but they cost less and generate less heat.
- The goal of having two types of cores is to strike a balance between the performance, the cost and the power consumption of the CPU.
Base frequency. The frequency of the CPU clock under typical utilization.
Turbo frequency. The frequency of the CPU clock under heavy utilization. The Turbo Boost technology dynamically increases the clock speed to handle heavy workloads. The clock frequency is set based on the system temperature and the number of cores in use. The clock speed cannot exceed the one specified by the turbo frequency feature.
Cache. Indicates the amount of cache memory integrated in the CPU chip. Be careful on this point: the number here refers to the size of the L3 cache. To find out the size of the other cache levels, we need to read the detailed CPU datasheets. Take the Intel Core i5-1335U as an example. This processor belongs to the 13th generation; the information about its cache sizes is available on this page. Here are the major takeaways:
- Each core has its own L1 cache. The L1 cache is divided into two cache memories: one stores data (DCU) and the other stores instructions (IFU). The DCU is 48 KB for P-cores and 32 KB for E-cores. The IFU is 32 KB for P-cores and 64 KB for E-cores.
- Each P-core has its own L2 cache of 1.25 MB.
- There is a shared L2 cache for every 4 E-cores; the size of each is 2 MB.
- All cores share the same L3 cache memory (12 MB, the value given in the Cache row of the processor table).
- The cache size increases at each level: L1 caches are smaller than L2 caches, which are in turn smaller than L3 caches. This arrangement balances speed and storage capacity close to the processor.
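As a rough check of the L1 and L2 figures listed for the Intel Core i5-1335U, the sketch below adds up the per-core sizes over the whole chip. It assumes the core counts from the processor table (2 P-cores and 8 E-cores) and that the 8 E-cores form two groups of 4; the variable names are illustrative only.

```python
# Core counts from the processor table (2 P-cores + 8 E-cores);
# per-core cache sizes from the list above.
P_CORES, E_CORES = 2, 8
L1_KB = {"P": 48 + 32, "E": 32 + 64}  # data (DCU) + instruction (IFU) per core
L2_P_MB = 1.25                         # private L2 of each P-core
L2_E_GROUP_MB = 2                      # L2 shared by each group of 4 E-cores

l1_total_kb = P_CORES * L1_KB["P"] + E_CORES * L1_KB["E"]
l2_total_mb = P_CORES * L2_P_MB + (E_CORES // 4) * L2_E_GROUP_MB

print(l1_total_kb)  # 928
print(l2_total_mb)  # 6.5
```

Even summed over all cores, the L1 caches (928 KB) remain much smaller than the L2 caches (6.5 MB), in line with the last takeaway.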
Power. The average power (in watts) that the CPU dissipates as heat under heavy utilization while running at its base frequency.
Max. memory size. The maximum amount of memory that the CPU can address.
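Max. memory channels. The maximum number of independent channels between the CPU's memory controller and the memory modules. Modules placed on different channels can be accessed in parallel, so using more channels increases the memory bandwidth.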
Geekbench. A benchmark, that is, a set of performance tests that assigns a performance score to a CPU. The scores in the table are taken from the website cpu-monkey.

In general, a higher clock speed means a faster CPU. However, many other factors affect performance. The CPU has several ways to optimize the execution of program instructions. Today's technology is able to distribute the execution of instructions among the cores in an intelligent way. It is therefore possible for an older CPU with a higher clock frequency to perform worse than a newer CPU with a lower clock speed.

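The cache size is also important: a larger cache keeps more data and instructions close to the cores, which reduces the number of slower accesses to RAM.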

Benchmarks are used to compare CPU performance. From the table we learn that the Intel Core i9 has a higher score than the Intel Xeon (which costs more) when only a single core is used. This is true for the selected benchmark; another one may give a different result. Performance scores should therefore be considered with caution.
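The scores in the processor table can also be related to price and power. The following sketch computes a few ratios from the table values; the ratios themselves (multi-core score per dollar and per watt) are illustrative choices and not part of the benchmark.

```python
# Values copied from the processor table above.
CPUS = {
    "Intel Core i5-1335U": {"single": 1628, "multi": 7240, "price_usd": 340.0, "power_w": 15},
    "Intel Core i9-13950HX": {"single": 2108, "multi": 19759, "price_usd": 590.0, "power_w": 55},
    "Intel Xeon w9-3495X": {"single": 1734, "multi": 56911, "price_usd": 5889.0, "power_w": 350},
}

def ratios(cpu):
    """Return (multi/single scaling, multi score per dollar, multi score per watt)."""
    return (
        cpu["multi"] / cpu["single"],
        cpu["multi"] / cpu["price_usd"],
        cpu["multi"] / cpu["power_w"],
    )

for name, cpu in CPUS.items():
    scaling, per_dollar, per_watt = ratios(cpu)
    print(f"{name}: x{scaling:.1f} scaling, {per_dollar:.1f} pts/$, {per_watt:.1f} pts/W")

best_single = max(CPUS, key=lambda n: CPUS[n]["single"])
best_multi = max(CPUS, key=lambda n: CPUS[n]["multi"])
best_per_dollar = max(CPUS, key=lambda n: ratios(CPUS[n])[1])
best_per_watt = max(CPUS, key=lambda n: ratios(CPUS[n])[2])
```

With these numbers, the Core i9 leads on single-core score and on score per dollar, the Xeon leads on multi-core score, and the Core i5 leads on score per watt: which processor is "best" depends on the criterion.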

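As for the second question, we see that the three hardware configurations respect the memory limits of their CPUs. The first two computers use only one memory channel, so we can add one more module, as the respective CPUs support two channels. The third computer already uses all eight channels; its 512 GB is far below the 4 TB limit, so its memory could only be increased with larger modules or with additional slots, if the motherboard provides them.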
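The check can be written down explicitly. This is a minimal sketch using the values from the two tables above; it assumes one module per channel, counts 4 TB as 4 096 GB, and ignores the number of physical slots on each motherboard, which the tables do not give.

```python
# Installed memory (first table) versus CPU limits (processor table).
CONFIGS = {
    "Latitude 3440": {"installed_gb": 8, "modules": 1, "max_gb": 64, "max_channels": 2},
    "Precision Mobile 7780": {"installed_gb": 32, "modules": 1, "max_gb": 128, "max_channels": 2},
    "Precision 7960": {"installed_gb": 512, "modules": 8, "max_gb": 4096, "max_channels": 8},
}

def check(cfg):
    """Return (limits respected?, number of unused memory channels)."""
    within_limits = (
        cfg["installed_gb"] <= cfg["max_gb"]
        and cfg["modules"] <= cfg["max_channels"]
    )
    free_channels = cfg["max_channels"] - cfg["modules"]
    return within_limits, free_channels

report = {name: check(cfg) for name, cfg in CONFIGS.items()}
for name, (ok, free) in report.items():
    print(f"{name}: within limits = {ok}, free channels = {free}")
```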

The size of a memory block of the Intel Core i5-1335U is 64 bytes. How many blocks can the L3 cache store?

ANSWER ELEMENTS

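The L3 cache of the Intel Core i5-1335U is 12 MB (Cache row of the processor table), so:

12 MB / 64 B = (12 × 2^20 B) / (2^6 B) = 12 × 2^14 = 196 608 blocks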
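The same calculation as a short sketch, taking the 12 MB listed in the Cache row of the processor table and assuming 1 MB = 2^20 bytes; the function name is illustrative.

```python
MB = 2**20  # bytes in one megabyte (binary convention)

def cache_blocks(cache_bytes, block_bytes):
    """Number of fixed-size blocks that fit in a cache."""
    return cache_bytes // block_bytes

l3_blocks = cache_blocks(12 * MB, 64)
print(l3_blocks)  # 196608
```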

We now look at the graphics processing unit (GPU). The following table describes the features of the GPUs in the three computers.

	Intel Iris Xe Graphics G7 80EUs (integrated)	NVIDIA RTX 3500	NVIDIA RTX A6000
Number of cores (CUDA)	640	5 120	10 752
Clock speed	1 100 MHz	1 545 MHz	1 800 MHz
Memory	8 GB (shared)	12 GB	48 GB
FP16 performance	2.816 TFLOPS	15.82 TFLOPS	38.71 TFLOPS
FP32 performance	1.408 TFLOPS	15.82 TFLOPS	38.71 TFLOPS
FP64 performance	352 GFLOPS	247.2 GFLOPS	604.8 GFLOPS
Tensor cores (Deep learning)	--	160	336
Max power consumption	< 15W	100W	300 W

Compare the features of the three GPUs. Can you understand the meaning of each feature listed in the table?

ANSWER ELEMENTS

1. Context

For 30 years, one of the most important methods to increase the performance of computers has been to increase the clock frequency of the CPU.
The first personal computers had a clock frequency of about 1 MHz, while modern CPUs reach several GHz (up to 5.5 GHz in turbo mode for the processors compared above).
However, this approach has reached its limits due to power consumption and heat dissipation issues.
As a result, the focus has shifted towards parallel computing, where multiple compute units work together to perform tasks more efficiently.
CPUs started to include multiple cores in the early 2000s.
A CPU core includes a control unit, which provides the functionality needed to fetch, decode and execute instructions, an arithmetic logic unit (ALU), and a set of registers.

2. GPUs

Graphics Processing Units (GPUs) were originally built for graphics rendering.
The CPU would take care of general-purpose tasks, while offloading graphics-related tasks to the GPU.
Researchers then realized that a GPU could be tricked into performing non-graphics (general-purpose) computations, provided that the problem can be expressed as a series of graphics-related operations.
Initially, programmers had to use graphics APIs like OpenGL or DirectX to perform general-purpose tasks, which was cumbersome and limited in scope.
Programming was made easier with CUDA (NVIDIA), an extension of the C language.

3. GPU anatomy

A GPU consists of many compute units (called Streaming Multiprocessors by NVIDIA, Execution Units by Intel, Compute Units by AMD).
Each compute unit has one control unit and multiple cores (ALU + registers), as shown in the figure (NVIDIA refers to cores as CUDA cores).
A compute unit executes the same instruction on different data in parallel.
Example: add two vectors of 128 floating-point numbers. In a compute unit with, for instance, 8 cores, each core handles 16 of the 128 additions, and all the cores work in parallel.
This is referred to as Single Instruction, Multiple Data (SIMD) or Single Instruction Multiple Threads (SIMT).
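The vector addition example can be mimicked in plain Python. This is only a sequential simulation of the idea, not real parallel code: it assumes a compute unit with 8 cores (128 elements / 16 per core), and the function and variable names are hypothetical.

```python
# Simulate one compute unit: a single control unit issues one instruction,
# and every core applies it to its own slice of the data.
NUM_CORES = 8   # assumption: 128 elements / 16 elements per core
N = 128

def simd_add(a, b, num_cores=NUM_CORES):
    chunk = len(a) // num_cores        # elements handled by each core
    out = [0.0] * len(a)
    for step in range(chunk):          # one "add" instruction issued per step
        for core in range(num_cores):  # on a GPU, all cores do this at once
            i = core * chunk + step    # element assigned to this core
            out[i] = a[i] + b[i]
    return out

a = [float(i) for i in range(N)]
b = [2.0 * i for i in range(N)]
c = simd_add(a, b)
print(c[:4])  # [0.0, 3.0, 6.0, 9.0]
```

The outer loop runs 16 times instead of 128: each step corresponds to one instruction executed by all 8 cores on different data.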

4. Our example

In the first computer, the GPU is integrated into the CPU. As a result, the CPU and the GPU share the same memory. While integrating the GPU into the CPU results in lower performance, for most users this is a convenient and cheap solution. Gamers and developers of graphics applications need a configuration with a dedicated GPU.
The second and third computers are equipped with dedicated NVIDIA GPUs. These GPUs have a dedicated memory.
A CUDA core can execute 2 floating-point operations per clock cycle (one fused multiply-add). The number of cores and the clock frequency can be used to estimate a theoretical number of floating-point operations per second (FLOPS).

FLOPS = Cores × Clock speed × 2
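The formula can be applied to the core counts and clock speeds of the GPU table. The sketch below does so for the three GPUs; the function name is illustrative, and applying the factor of 2 operations per cycle to the Intel GPU is an assumption made here for comparison.

```python
# (number of cores, clock speed in Hz), from the GPU table above.
GPUS = {
    "Intel Iris Xe Graphics G7 80EUs": (640, 1100e6),
    "NVIDIA RTX 3500": (5120, 1545e6),
    "NVIDIA RTX A6000": (10752, 1800e6),
}

def theoretical_flops(cores, clock_hz, ops_per_cycle=2):
    """FLOPS = Cores x Clock speed x operations per cycle."""
    return cores * clock_hz * ops_per_cycle

tflops = {
    name: theoretical_flops(cores, clock_hz) / 1e12
    for name, (cores, clock_hz) in GPUS.items()
}
for name, value in tflops.items():
    print(f"{name}: {value:.3f} TFLOPS")
```

The results (about 1.408, 15.82 and 38.71 TFLOPS) match the FP32 row of the table, which suggests that those figures are theoretical peaks computed in this way.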

The value obtained with the formula is only theoretical. For a concrete comparison, it is better to look at benchmarks such as the ones here.
Since CUDA cores are limited to two operations per clock cycle, NVIDIA developed more specialized cores, called tensor cores. A tensor core performs a multiply-accumulate operation on small matrices in a single clock cycle, which greatly accelerates deep learning applications on GPUs.

5. CPU vs GPU

CPUs are optimized for control-intensive tasks (branching, complex logic).
GPUs are optimized for data-parallel tasks (same operation on many data).