CSC3060 Introduction to Computer Systems

  • Textbook: Computer Systems: A Programmer’s Perspective, 3e, International Edition, by Randal E. Bryant and David R. O'Hallaron.

  • Reference book: Computer Organization and Design: The Hardware/Software Interface (RISC-V Edition), by David A. Patterson and John L. Hennessy.

Chapter 1 A Tour of Computer Systems

Compilation System

Take the C language as an example: linux> gcc hello.c -o hello.

  • Pre-processor (cpp) Handles #include/#define, strips comments, and resolves conditional compilation (#ifdef)
  • Compiler (cc1) Scan/parse/semantic check/code gen/opt
  • Assembler (as) From assembly to machine code
  • Linker (ld) Relocation & reference resolution

Hardware Organization of a System

DMA (Direct Memory Access) is usually used for high-speed, bulk data transfers, controlled by system programmers.

  • System Bus

    Transfer data in fixed-size blocks called words (4/8 bytes on a 32/64 bit system)

Memory Hierarchy

The localities of cache: Temporal Locality & Spatial Locality.

Abstractions in Computer Systems (Virtualization)

Virtualization is often related to multiplicity, fake versions, and sharing.

  • Process & Thread

    Multiple processes can run concurrently on the same system, and each process appears to have exclusive use of the hardware.

    Context switching with OS Kernel.

    A process can actually consist of multiple execution units, called threads. Threads share the same code and global data.

  • Virtual Memory

    Program Code & Data, Shared Libraries, Heap, Stack, Kernel Virtual Memory from bottom to top.

    • Virtual address space can be greater than the physical memory.

    • Main memory serves as a cache for virtual memory (whose contents ultimately reside on disk).

    • Support multiprogramming.

    • Allow multiple processes to share data.

  • File

Important Themes

Amdahl's Law (Quantifying the performance improvement ceiling): if a fraction α of a task is sped up by a factor k, the overall speedup is S = 1 / ((1 − α) + α/k), bounded above by 1 / (1 − α).

Concurrency and Parallelism

  • Concurrency It refers to the general concept of a system having multiple, simultaneous activities, which do not necessarily execute at the same time, but may interleave in time to create the logical impression of simultaneity.

  • Parallelism It refers to the use of concurrency to make a system run faster.

ILP and DLP

  • Instruction-Level Parallelism e.g. pipelining, superscalar, out-of-order execution.

  • Data-Level Parallelism e.g. Single Instruction Multiple Data (SIMD), vector instructions.

Computer Architecture

Computer Architecture (e.g. Intel x86, IBM 360, ARM, RISC-V) is the interface between hardware and software; it includes the ISA and microarchitectures (e.g. different implementations of the same ISA: single-cycle, multi-cycle, pipelined, superscalar, out-of-order, speculative execution, cache hierarchies, and various predictors).

Chapter 2 Representing and Manipulating Information

Information Storage

  • Words

    The word size determines the maximum size of the virtual address space (a w-bit word size gives a virtual address space of at most 2^w bytes).

  • Addressing and Byte Ordering

    Big endian (e.g. IBM Mainframes) & Little endian (e.g. Intel x86, ARM, RISC-V).

    Big-endian means most significant byte first, and Little-endian means least significant byte first.

Integer Representations

Sizes of Data Type in C/C++

Signed | Unsigned | 32-bit (Bytes) | 64-bit (Bytes)
char | unsigned char | 1 | 1
short | unsigned short | 2 | 2
int | unsigned | 4 | 4
long | unsigned long | 4 | 8
long long | unsigned long long | 8 | 8
float | - | 4 | 4
double | - | 8 | 8

Unsigned Encodings

Suppose a bit vector x = [x_{w−1}, x_{w−2}, …, x_0], then B2U_w(x) = Σ_{i=0}^{w−1} x_i · 2^i.

Two's Complement Encodings

Inverse form (1's Complement) For a negative number, keep the sign bit the same and invert the rest.
Problems: two zeros exist, and there is an end-around carry issue (the end carry-out bit needs to be added back to the LSB).

2's Complement For a negative number, keep the sign the same, invert the rest and add 1.

Suppose a bit vector x = [x_{w−1}, x_{w−2}, …, x_0], then B2T_w(x) = −x_{w−1} · 2^{w−1} + Σ_{i=0}^{w−2} x_i · 2^i.

Conversions between Signed and Unsigned

For a w-bit 2's complement signed binary numeral system:

  • Minimum TMin_w = −2^{w−1}, corresponding to binary 100…0 (a special case that does not satisfy the "invert the bits and add 1" rule used for other negative numbers).
  • Maximum TMax_w = 2^{w−1} − 1, corresponding to binary 011…1.

Sign Extension

Small to Big

  • Zero extension of unsigned numbers

  • Sign extension of two's complement numbers

    Since −2^w + 2^{w−1} = −2^{w−1}, prepending one copy of the sign bit leaves the value unchanged; by induction, sign extension of any width preserves the value.

Big to Small

Integer Arithmetic

Addition

Additive Inverse

  • For two numbers A, B, the overflow occurs when (NOT (sign_A XOR sign_B)) AND (sign_A XOR new_sign) == 1. (Two operands have the same sign but the result has a different sign.)

  • One Bit Full Adder

  • Carry Lookahead Adder

Multiplication

Division (by a power of 2)

Unsigned numbers use logical shift, while two's complement numbers use arithmetic shift to achieve sign-preserving extension.

Right shift performs integer division by powers of two: x >> k computes ⌊x / 2^k⌋. Arithmetic shift rounds toward −∞, while C's signed division rounds toward zero, so negative dividends need a bias: (x + (1 << k) − 1) >> k.

Floating Point

Floating-Point Representation

s (sign) : The number is positive (s = 0) or negative (s = 1).

M (fraction) : A binary fraction (the significand).

E (exponent) : Weights the value by a (possibly negative) power of 2.

V = (−1)^s × M × 2^E    (Bias = 2^{k−1} − 1 for a k-bit exponent)

Category | Exponent | Fraction | Value Formula
Normalized | 0 < e < all 1s | any | (−1)^s × 1.f × 2^{e − Bias}
Denormalized | all 0s | nonzero | (−1)^s × 0.f × 2^{1 − Bias}
Zero | all 0s | all 0s | ±0
Infinity | all 1s | all 0s | ±∞
NaN (Not a Number) | all 1s | nonzero | NaN

When interpreted as signed integers, the bit representation of IEEE 754 floating-point numbers (excluding NaN) preserves the same sorting order.

Comparison (positive numbers)

Format | Minimum | Maximum
Single Precision Normalized | 2^−126 ≈ 1.2 × 10^−38 | (2 − 2^−23) × 2^127 ≈ 3.4 × 10^38
Single Precision Denormalized | 2^−149 ≈ 1.4 × 10^−45 | (1 − 2^−23) × 2^−126

Rounding

Four modes: Round-down, Round-up, Round-toward-zero, and Round-to-even (round-to-nearest, the default).

Round-to-even:

  • Non-midpoint: round to the nearest representable value.
  • Midpoint: choose the representable value whose least significant fraction bit is 0 (even).

Floating Point Operations

  • Lack of Associativity & Lack of Distributivity

    Example Questions

    Let x, f, d be of type int, float, double respectively (their values are arbitrary, except that neither d nor f equals +∞, −∞, or NaN). Then:

    • x == (int)(double) x True (a double can represent any 32-bit int exactly)
    • x == (int)(float) x False (e.g. x = 2^31 − 1 cannot be represented in float's 24-bit significand)
    • d == (double)(float) d False (e.g. d = 1e40 overflows to +∞ in float)
    • f == (float)(double) f True
    • d*d >= 0.0 True
  • Floating-Point Addition

    • Align decimal points (small to big)
    • Add significands
    • Normalize
    • Round and renormalize if needed
    Example

Precision (IEEE 754-2008 Standard Formats)

Format | Sign Bits | Exponent Bits | Mantissa Bits
Quad precision | 1 | 15 | 112
Double precision | 1 | 11 | 52
Single precision (FP32) | 1 | 8 | 23
Half precision (FP16) | 1 | 5 | 10
bfloat16 | 1 | 8 | 7
  • The format is 1 sign bit, 8 exponent bits, and 7 mantissa bits, giving the same exponent range as FP32.
  • AI/LLM workloads are based on prediction and approximation, which do not require high precision, so reduced precision is acceptable.
  • Lower-precision formats enable higher memory bandwidth and computational bandwidth.

FMA/FMAC

Fused Multiply and Add (FMA) or Fused Multiply and Accumulate (FMAC) combines multiply and add in one instruction (d = a × b + c).

  • Multiplication and addition are parallel.
  • Normalization and rounding are combined at the end.

Use -mfma -ffp-contract=fast to enable FMA.

Chapter 3 Machine-Level Representation of Programs

Machine-Level Representation of Programs

RISC [optimized for processor speed]

Principles

  1. Simplicity favors regularity
  2. Smaller is faster
  3. Make common cases fast
  4. Good design demands good compromises

[An ISA should be scalable, flexible, and extensible.]

Variants of RV

  • RV32, RV64, RV128 Different data widths (addressing capability)
Extension | Description
I | Base integer instructions
E | Base for embedded systems (e.g., only 16 registers)
M | Integer multiplication and division
A | Atomic memory instructions
C | Compressed extension (16-bit instructions)
F | Single-precision floating point
D | Double-precision floating point
V | Vector extension

Instruction Types in RV32I

Types | Instructions
ALU | add, sub, and, or, xor, slt, sltu, sll, srl, sra, and all with immediate (No subi in RV32I)
Control Instructions | beq, bne, blt, bge, bltu, bgeu, jal, jalr
Memory Instructions | lw (No lwu in RV32I), lh, lb, sw, sh, sb, lbu, lhu
Privileged Instructions | Interrupt, Memory Management, System Calls, Control and Status Registers (CSR), Mode Change
Format | Name | Instructions
R-type | Register | add, sub, sll, srl, sra, xor, or, and
I-type | Immediate | addi, slli, srli, srai, xori, ori, andi
I-type | Load | lb, lh, lw, lbu, lhu
I-type | Jump and Link Register | jalr
S-type | Store | sb, sh, sw
B-type | Branch | beq, bne, blt, bge, bltu, bgeu
U-type | Upper Immediate | lui, auipc
UJ-type | Unconditional Jump | jal
  • Three operand instruction format.

  • X0 is always zero.

  • The 32 in RV32 means the address capability / integer register length.

  • Immediate

    • R-type No immediate.
    • I-type 12-bit immediate. Shift instructions are I-type but repurpose the immediate field as a 5-bit shift amount for RV32I (6-bit for RV64I). inst[30] distinguishes arithmetic from logical right shift, while the remaining upper immediate bits are fixed to zero (shift amount range: 0–31). The immediate in a branch (B-type) instruction is an offset relative to the PC.
    • S-type The 12-bit immediate is split into high 7 bits (imm[11:5]) and low 5 bits (imm[4:0]) to maintain regularity of the register fields. The immediate in a store instruction is an offset relative to rs1.
    • U/UJ-type 20-bit immediate.
  • Jump

    • jal

      • rd = PC + 4, PC = PC + sign_ext(imm × 2) (jump targets are 2-byte aligned, so the last bit is an implicit zero).
      • Range: ±1 MiB (20-bit immediate in 2-byte units).
      • Beyond ±1 MiB: Use AUIPC + JALR.
    • jalr

      • rd = PC + 4, PC = (rs1 + sign_ext(imm)) & ~1 (clearing the LSB ensures 2-byte alignment).
  • Load and Store

    lw rd, offset (rs1) means load data from memory and write into rd.

    sw rs2, offset (rs1) means store data from register rs2 into memory.

  • LUI loads a 20‑bit immediate into the upper bits of a register; AUIPC adds a 20‑bit immediate to the current PC for position‑independent addressing.

    How to load a 32-bit constant into a register?
    1. Use a LW instruction (costs an extra memory access)
    2. lui and addi (When bit 11 of the 32-bit constant is 1, addi sign-extends the lower 12 bits, causing an incorrect result; the upper 20 bits must be incremented by 1 to compensate. The assembler adjusts this automatically when using the li pseudo-instruction.) (Range: the full 32 bits)
Instruction Encoding List
Regularity
Field | Bit Position | Note
Opcode | inst[6:0] | Always opcode
rd | inst[11:7] | Destination reg
rs1 | inst[19:15] | Source reg 1
rs2 | inst[24:20] | Source reg 2
Length | - | Fixed 16/32 bit
Immediate | High bits | Branch/store reuse the rd field position for immediate bits
Pseudo Instructions
Pseudoinstruction | Actual Instruction Sequence | Operation
nop | addi x0, x0, 0 | No operation
li rd, imm | lui rd, imm[31:12] + imm[11]; addi rd, rd, imm[11:0] | Load 32-bit immediate
mv rd, rs | addi rd, rs, 0 | Copy register
not rd, rs | xori rd, rs, -1 | Bitwise NOT
neg rd, rs | sub rd, x0, rs | Two's complement negation
seqz rd, rs | sltiu rd, rs, 1 | Set if rs == 0
snez rd, rs | sltu rd, x0, rs | Set if rs != 0
sltz rd, rs | slt rd, rs, x0 | Set if rs < 0
sgtz rd, rs | slt rd, x0, rs | Set if rs > 0
beqz rs, offset | beq rs, x0, offset | Branch if rs == 0
bnez rs, offset | bne rs, x0, offset | Branch if rs != 0
blez rs, offset | bge x0, rs, offset | Branch if rs <= 0
bgez rs, offset | bge rs, x0, offset | Branch if rs >= 0
bltz rs, offset | blt rs, x0, offset | Branch if rs < 0
bgtz rs, offset | blt x0, rs, offset | Branch if rs > 0
bgt rs, rt, offset | blt rt, rs, offset | Branch if rs > rt
ble rs, rt, offset | bge rt, rs, offset | Branch if rs <= rt
bgtu rs, rt, offset | bltu rt, rs, offset | Branch if rs > rt unsigned
bleu rs, rt, offset | bgeu rt, rs, offset | Branch if rs <= rt unsigned
j offset | jal x0, offset | Unconditional jump
jal offset | jal x1, offset | Jump and link (return address in x1)
jr rs | jalr x0, 0(rs) | Jump to address in rs
jalr rs | jalr x1, 0(rs) | Jump and link to address in rs
ret | jalr x0, 0(x1) | Return from subroutine
call offset | auipc x1, offset[31:12] + offset[11]; jalr x1, offset[11:0](x1) | Far call to subroutine
Why do RISC-V loads/stores use base+immediate instead of base+index?
  • Simpler hardware Adding scaling to load instructions requires a shifter and complicates the address calculation datapath, increasing cost and cycle time.
  • ISA regularity violation Three operands (rs1, rs2, rd) would force an R‑type format, but R‑type is reserved for ALU operations only. Mixing in memory access would break the clean encoding scheme.
  • Compiler can optimize it away In loops, the compiler just increments the base register by the element size each iteration, making scaled indexing unnecessary.
Why do RISC-V instructions place immediate bits in seemingly "random" positions (e.g., B-Type and J-Type)?
  • Fixed sign bit at position 31 across all formats.
  • Decoder reuse (B-Type reuses S-Type logic, J-Type reuses U-Type logic).
  • Fixed register fields (rs1, rs2, rd) across formats.

Application Binary Interface (ABI)

It is based on three key components : the computer ISA, the OS, and the calling convention. ABI incompatibility will occur due to differences between operating systems.

Data Alignment

  • No cache-line crossing; no double TLB (Translation Lookaside Buffer) misses; no double page faults from a single memory instruction execution.

  • Better cache line utilization.

RISC-I through RISC-V do not explicitly enforce data alignment, but their compilers often enforce it.

Crossing a boundary makes the memory reference difficult to handle. The hardware needs to go through two exceptions instead of one.

Padding Rule

  • Internal Padding Every member must start at an address that is a multiple of its own alignment requirement.
  • Trailing Padding The total size of the struct must be a multiple of the largest alignment requirement among its members.
Exercise
struct S1 {
    int white;
    long long count;
    char c;
    int red;
    int blue;
} A[];

Suppose A is an array of structure S1, and the starting address of A is stored in register X1. What instruction should be used to access A[100].red?

Member | Size | Start | Padding Before | End | Reason
white | 4 | 0 | 0 | 4 | int aligns at 4
count | 8 | 8 | 4 | 16 | needs 8-byte align, pad bytes 4-8
c | 1 | 16 | 0 | 17 | char aligns at 1
red | 4 | 20 | 3 | 24 | needs 4-byte align, pad bytes 17-20
blue | 4 | 24 | 0 | 28 | int aligns at 4

Current size is 28. The largest alignment in the struct is 8, and 28 is not a multiple of 8, so pad 4 bytes at the end. Total struct size is 32 bytes.

A[100].red : 100 × 32 + 20 = 3220, so the answer is LW X2, 3220(X1).

Stack / Frame Alignments

  • Defined by ABI (e.g., RISC‑V psABI: 16‑byte)
  • The C runtime (CRT) aligns sp before calling main()

Compiler Reordering

  • Independent Variables (Rearrangement Allowed)

  • Struct Fields (Rearrangement Forbidden)

    • Binary compatibility
    • Pointer arithmetic
    • Interoperability

PC-relative Addressing

Branch instructions use PC‑relative addressing with a 12‑bit signed offset (in 2‑byte units). Since branch targets are always multiples of 2, encoding in 2‑byte units doubles the reachable range without losing information.

CSR

Name | Abbreviation | Description
Machine Trap-Vector Base-Address Register | mtvec | Exception handler entry address
Machine Status Register | mstatus | Global interrupt enable (MIE) and status
Machine Cause Register | mcause | Reason of last exception/interrupt
Machine Exception Program Counter | mepc | Saves PC when exception occurs
Machine Interrupt Enable Register | mie | Enables specific interrupt sources
Machine Interrupt Pending Register | mip | Shows pending interrupts

Exception Handling

  1. Save context Save current PC value to mepc register
  2. Update status Update mstatus, disable further interrupts (clear MIE bit) to prevent nesting
  3. Set cause Write exception/interrupt reason to mcause
  4. Jump to handler Jump to exception handler entry point based on mtvec configuration

Frame Pointer (fp)/Stack Pointer (sp)

  • sp points to the top of stack.
  • fp points to a fixed location within the current stack frame, typically the address where the old fp is stored.
  • When the function returns, the saved old fp is restored to the fp register, making fp point back to the caller's stack frame.

Calling Convention

RV Registers | ABI Name | Caller/Callee | Purpose
x0 | zero | - | Always zero
x1 | ra | Caller | Return address
x2 | sp | Callee | Stack pointer
x3 | gp | - | Global pointer
x4 | tp | - | Thread pointer
x5–x7 | t0–t2 | Caller | Temporary registers
x8 | s0 / fp | Callee | Saved register / Frame pointer
x9 | s1 | Callee | Saved register
x10–x11 | a0–a1 | Caller | Arguments / Return values
x12–x17 | a2–a7 | Caller | Arguments
x18–x27 | s2–s11 | Callee | Saved registers
x28–x31 | t3–t6 | Caller | Temporary registers
  • Caller-Saved Caller decides whether to save based on whether the value will be used after the call.
  • Callee-Saved Callee must always save these registers before using them and restore them before returning (absolutely safe).
  • Array variables are typically allocated in memory (stack or static data section), not in registers.
Aspect | Caller-Saved | Callee-Saved
Decision maker | Caller | Callee
Mandatory | On demand (only if value is needed after the call) | Yes
Save timing | Before calling a subroutine | At function entry
Restore timing | After subroutine returns | Before function returns
Why are caller-saved registers allocated to temporaries?

Temporaries are short-lived and do not need to survive across function calls. Placing them in caller‑save registers avoids unnecessary save/restore code.

Why are callee-saved registers allocated to local variables?

Local variables live across function calls. Using callee registers ensures they are preserved automatically by the callee, avoiding repeated saves at each call site.

Why are callee-saved registers allocated to CSEs (Common Sub-Expressions)?

CSEs, like local variables, tend to live longer (possibly as long as the procedure invocation).

CISC [optimized for compact code size]

Chapter 4 Processor Architecture & Logic Design

Logic Designs

Major components

  • Combinational element
  • State (Sequential) elements, e.g. registers and memory (note that B-type and S-type instructions do not write to a register)
  • Clock signals
What is the difference between combinational logic and sequential logic?

The former is stateless: its output depends purely on the inputs (e.g. ALU). The latter has state: its output depends on both the inputs and the stored state (e.g. register).

The propagation delay of a combinational circuit is determined by the delay of the critical path, which is the longest path of logic gates from any input to any output.

Processor

Core

Data Path (Data Flow)

e.g. ALU, Register, Memory interface, Buses.

Control (Instruction Flow)

e.g. PC, Instruction fetch, and Control signal generation.

RV32I Data Path and Control

  • Memory reference instructions lw, sw
  • Arithmetic/logical instructions add, sub, and, or
  • Control flow instructions beq, jal

Instruction Execution Cycle

Cycle

  • Three Stages

    • Fetch Use the PC to supply the instruction address and fetch the instruction from memory.
    • Decode Decode the instruction (and read registers).
    • Execute
      • Use ALU to calculate (Arithmetic result & Memory address for load/store & branch target address)
      • Access data memory for load/store
      • Update PC (target address or PC + 4 (next word) or PC + 2 (if using compressed instructions RV32IC))
  • Multi-cycle Instructions are broken down into multiple steps, each taking one clock cycle.

Time

CPU time is determined by the following:

CPU Time = Instruction Count × CPI × Clock Cycle Time

Program execution time is determined by the following:

Execution Time = (Instructions / Program) × (Clock Cycles / Instruction) × (Seconds / Clock Cycle)

  • CISC machines Lower instruction count, higher CPI, longer cycle time
  • RISC machines Higher instruction count, lower CPI, shorter cycle time

Abstract View of RV32I Subset

How to select between PC+4 and PC+immediate?

If a branch is taken or the instruction is jal, select PC + imm; otherwise select PC + 4.

Note that jal and jalr use pc + imm as the target address, while writing pc + 4 back to the register. In contrast, auipc writes pc + imm directly back to the register.

How to select data from the memory or the ALU?

If load select Mem. If R-type select ALU.

How to select data from Immediate or reg[rs2]?

If R-type select rs2. If I-type select immed.

Processor Designs

  1. Analyze the requirements

  2. Data Path Requirements selections

    • Combinational Components Adder & MUX & ALU (An adder only does addition (e.g., PC+4). An ALU includes an adder but also performs many other arithmetic/logic operations.)

    • Sequential Components

      Register n-bit storage with Write Enable (WE) control. Updates only at the clock tick if WE = 1.

      Register File consists of 32 registers with two read ports (rs1/rs2) and one write port (rd).

      Memory. Read (MemRd = 1): Address → Data Out. Write (MemWr = 1): at the next clock tick, Data In → Mem[Address].

    • Assemble

      • Instruction Fetch Unit Fetch the instruction and Update the program counter.

      • Branch Operations Using ALU subtraction for branches risks overflow corrupting the sign, so RISC-V processors internally use flags (overflow and carry) or dedicated comparators for correct branch decisions without hardware traps.

      • Add and Subtract R[rd] <- R[rs1] op R[rs2]

      • Load/Store Operations A single ALU and register file need two multiplexers: one for ALU's second input (e.g. add rd, rs1, rs2 and addi rd, rs1, imm), another for register file's write data source (e.g. add x3, x1, x2 from ALU and lw x3, 8(x1) from memory).

    • Control Points and Signals

      Control Signal | Description | Expression
      PCsrc | Select the next PC value: PC + 4 [0] or branch/jump target address [1] | (Branch or jal/jalr) ? 1 : 0
      RegWr | Write enable for register file | (Branch or Store) ? 0 : 1
      MemRd | Enable signal for reading data memory | Load ? 1 : 0
      MemWr | Enable signal for writing data memory | Store ? 1 : 0
      ALUsrc | Select the second ALU input: register [0] or immediate value [1] | R-type ? 0 : 1
      ALUctr | Determines the ALU operation | decoded from opcode/funct fields
      WBsel | Selects the data written to the register file: memory data [0], ALU result [1], PC + 4 [2] | Load : 0; R-type : 1; jal/jalr : 2

On each clock cycle, the single‑cycle processor executes one instruction. State elements update at the rising edge using combinational logic outputs computed during the cycle.

Control

  • Asel rs1 or PC (when using jal or a branch, the target is PC + offset.)
  • Bsel rs2 or immed.
Why does RV32I still need a dedicated Branch Comparator despite having an ALU?
  • A branch comparator avoids subtraction overflow (and is also faster than subtraction).
  • RV32I lacks architectural flags; it requires a dedicated comparator to enable single-cycle compare-and-branch operations by combining comparison and jump logic.
  • The ALU at the EXE stage is needed for R-type instructions and load/store address calculation.

Pipelining

Five Stages

  1. Instruction fetch from (instruction) memory
  2. Instruction decode & register read
  3. Execute operation or calculate address
  4. Access (data) memory operand
  5. Write the result back to register
Instructions | Detailed Stages
R-Type, I-Type, jal, jalr, lui, auipc | IF, ID, EX, WB (no MEM)
Store, Branch | IF, ID, EX, MEM (no WB)
Load | IF, ID, EX, MEM, WB
  • Throughput increases as more instructions complete per unit time, but single instruction latency (The time it takes for each instruction to be executed.) does not decrease and may even increase.
  • Pipeline rate is limited by the slowest stage.
  • Potential/ideal speedup = Number of pipeline stages (number of pipeline steps).
  • Need to reduce pipeline fill and drain overhead (e.g., from I-cache misses and branch mispredictions)

Pipeline-oriented ISA Design

  • All instructions are fixed length.
  • Few and regular instruction formats.
  • Only load/store instructions accessing memory.
  • Instructions are simple.

Some principles of designing a pipelined datapath:

  • Multi-stage partitioning
  • Overlapped execution
  • Increased throughput
  • Improved resource utilization

Single Cycle & Multi Cycle & Pipeline

Implementation | Clock Cycle Time | Instruction Latency | CPI
Single-Cycle | Sum of all stage latencies | 1 × Clock Cycle Time | 1
Multi-Cycle | Longest stage latency | CPI × Clock Cycle Time | > 1
Pipelined | Longest stage latency | Number of Stages × Clock Cycle Time | ≈ 1 (plus hazard stalls)
  • Latency The total time required to complete one single instruction from start to finish. (A single-cycle datapath actually has shorter instruction latency, because pipeline registers add setup time and propagation delay.)
  • Throughput The total number of instructions completed per unit of time.
  • For a superscalar implementation, CPI < 1.
  • Single-cycle has lower throughput but shorter instruction latency.
  • Having no arithmetic exceptions allows out-of-order retirement, avoiding stalls for long-latency instructions.

Pipeline Registers

Without pipeline registers, stages would overwrite each other’s data, causing instruction mix‑ups and loss of control information.

Pipeline Register | Stored Information
IF/ID | PC_ID, inst_ID (instruction code, e.g. opcode)
ID/EX | PC_EX, inst_EX, imm_EX, rs1_EX, rs2_EX
EX/MEM | PC_MEM, inst_MEM, rs2_MEM, alu_MEM
MEM/WB | PC+4_WB (rd = PC + 4), inst_WB, alu_WB, mem_WB
  • Control signals are derived from instruction bits, that is, after the ID stage.

  • Control information for later stages are also stored in the pipeline registers.

  • Architecture States (visible to the programmer and compiler)

    • Registers (general purpose, fp, vector, flags)
    • PC
    • Memory

    It was saved during a context switch:

    • user-level switch (e.g. procedure call) saved on the runtime program stack (caller and callee convention).
    • system-level switch (e.g. process switch) saved on the Process Control Block (PCB) or the kernel stack.
  • Micro-architecture states (invisible to the programmer)

    • Pipelined registers hold transient data between stages for a few cycles.
    • Branch predictors
    • Caches
    • Buffers and Queues
    • Counters
Stage | Data Path | Control
ID | rs1, rs2 designators | immed_sel
EX | rs1, rs2 contents, PC, immed_value | Asel, Bsel, ALUsel
MEM | ALUout (Address), rs2 contents | MemRW (read/write), BrLt, Breq
WB | ALUout, PC+4 (for link reg), MEMout | WBsel, rd designator, WRen

Hazards

Structure Hazards

A conflict arising due to hardware resource limitations within the pipeline.

  • Pipeline stalls

  • Multiple resources

  • Instruction Reordering

    Static scheduling (by compiler) reorders instructions at compile time to avoid hazards, while dynamic scheduling (by hardware) reorders them at runtime based on actual data and resource availability.

  • ISA design

Split Instruction/Data caches can avoid structural hazards in the pipeline and increase the cache access bandwidth.

Data Hazards

A conflict arising because the current instruction depends on the result of a previous instruction that has not yet been computed or written back.

Scenario | Data Ready Stage | Data Used Stage | Example
Register Access Issue | EX | MEM | add t0, t1, t2; sw t0, 4(t3) (the store data bypasses the ALU: read in ID and held until MEM for the memory write)
ALU Result Access Issue | EX | EX | add s0, t0, t1; sub t2, s0, t0
Load Hazard | MEM | EX | lw s1, 4(s0); add t0, s1, t1

Solutions

Flow dependence (RAW) is a true data dependence, while anti-dependence (WAR) and output dependence (WAW) are name dependences that can be eliminated by register renaming.

  • Stall pipeline (Interlocking)

  • Data Forwarding (Bypassing)

    Add forwarding control logic to make extra connections in the datapath.

    Hazard detection compares the EX/MEM and MEM/WB destination registers with the current instruction’s source registers. (EXE-EXE: EX/MEM.RegRd = ID/EX.RegRs1/2; MEM-EXE: MEM/WB.RegRd = ID/EX.RegRs1/2)

    Forwarding is skipped when RegWr == 0 or when the destination register is x0.

  • Compiler Code Transformations

    Scheduling (reordering) scope is often limited by branches, indirect branches, and call/ret. Memory aliasing (e.g., two pointers may point to the same address) and unknown latencies (e.g., unpredictable cache hits/misses) further restrict reordering. The compiler must conservatively preserve dependencies.

Control Hazards

Control hazards are due to branch instructions (conditional jumps) and JAL/JALR instructions (unconditional jumps).

The branch condition and target address are ready at the EX stage.

Static Branch Prediction (Compile Time)

  • Predict a backward branch as taken (loop back branches) and predict a forward branch as fall-through (the compiler always puts the then part in the fall-through path).
  • Conditional branch's (e.g. blt) behavior is entirely program‑dependent and cannot be accurately predicted using simple static rules. Static branch prediction does not read registers at runtime. It only considers whether the branch target is forward or backward.
  • Unconditional branches (e.g. jal) always jump, static prediction should always predict taken.
  • The ISA may reserve a bit in branch instructions as a prediction bit.
  • When branch prediction is wrong, pipeline flushing is performed.

Dynamic Branch Prediction (Run Time)

  • Branch History Table (BHT) (Branch prediction should occur at the instruction fetch (IF) stage.)

    A branch history table stores past outcomes of branches, indexed by address, to predict future behavior and flush on misprediction.

    Tag bits identify which address occupies a cache slot. The BHT omits tags because the prediction bits (1–2 per entry) are tiny, while tags would be huge (30–46 bits).

    • 1-bit Predictor Flips on every misprediction, fails on nested loops.
    • 2-bit Predictor Four-state FSM, changes only after two consecutive mispredictions, handles loops better.
  • Correlated Predictor

    Global history register selects a 2-bit counter entry in the pattern table. Current PC selects which table to use, separating histories for different branches.

  • The Location Prediction

    Branch Type | Prediction Mechanism | Example
    Direct branch | Branch Target Buffer (BTB), fixed target | beq, jal
    Indirect branch | Hard to predict | jalr x0, 4(x1) (return branches, switch statements, and function pointers)
    Function return | Return Address Stack (RAS) | ret

Chapter 5 Optimizing Program Performance

Compiler Optimization

Compiler Optimization Levels

Level | Description
-O0 | No optimizations, fast compile, for debugging.
-O1 | Basic optimizations.
-O2 | Recommended default: safe, stable, efficient.
-O3 | Aggressive (loop unrolling, SIMD). May increase code size, compile time, and even hurt performance.
-O4 | -O3 + Link-Time Optimization (LTO).

General Goals

Minimize the Number of Instructions

  • Common Subexpression Elimination (CSE) Calculate the same expression once and reuse the result.
  • Dead Code Elimination (DCE) Don't calculate values that are never used.
  • Strength Reduction (SR) Avoid slow instructions (multiplication/division).

Minimize the Execution Cycles

  • Register Allocation (RA) Keep frequently used variables in registers.
  • Code Scheduling Reorder instructions to avoid stalls.
  • Locality Improvement
  • Preload & Redundant Load Elimination Load a value once and reuse it, instead of reloading from memory.

Avoid Branching

  • Avoid Branching Conditional move in x86 (cmov, e.g. a = (b > c) ? b : c), conditional execution in ARM (e.g. addgt r0, r1, r2).
  • Loop Unrolling Reduce the number of branch instructions. Loop unrolling improves performance by reducing the number of branch evaluations and branch mispredictions, also creates opportunities for CSE, code motion, and scheduling.
  • Procedure Inlining Reduce the overhead of the call. However, inlining can cause register or I‑cache spilling when code grows too large. For recursive functions, inlining may lead to infinite expansion unless the compiler enforces a depth limit.
  • Unswitching Move a condition outside the loop to avoid branching inside the loop. (e.g. if (cond) { for (...) { ... } } else { for (...) { ... } })

Limitations

  1. Compilers cannot change the algorithm.

  2. Compilers must obey the rules and semantics of the programming language.

    • Memory Aliasing Two pointers may point to the same memory location. Use a local variable for the intermediate value to avoid redundant loads or use the restrict qualifier to promise no aliasing.
    • FP Associativity Floating-point addition is not associative.
    • Function Side Effects A function may modify global state or have observable effects beyond its return value.
    • Volatile Variables A variable declared as volatile can be changed by external factors.
    • Memory Consistency In multi-threaded programs, compilers must respect memory ordering constraints.
  3. Lack of runtime and domain knowledge.

  4. Many specific optimization problems are NP-hard.

  5. Boundary crossing issues (separate compilation).

Several Optimization Options

Flag | Description
-O3 -flto | Link-Time Optimization: enables whole-program optimization.
-march=native | Generate code optimized for the current CPU architecture (enables all supported instruction sets).
-mtune=native | Optimize instruction scheduling for the current CPU without breaking compatibility with older CPUs.
-fprofile-generate | Compile instrumented code to generate runtime profiles.
-fprofile-use | Use profile data to guide optimizations.
-Ofast | Alias for -O3 -ffast-math; allows aggressive FP reordering (may break IEEE compliance).
-fopt-info-missed | Report which optimizations were missed and why.

Chapter 6 The Memory Hierarchy

Storage Technologies

Random Access Memory (RAM): access time is the same for all locations.

Enhanced DRAM variants offer much higher bandwidth, but the latency remains about the same (the first word still takes the same time to arrive).

| Feature | SRAM | DRAM |
| --- | --- | --- |
| Transistors per bit | 6 | 1 |
| Access time | 1× | 10× |
| Refresh | No | Yes |
| Error detection and correction | Optional | Yes |
| Cost | High | Low |
| Main applications | Cache memories | Main memory, frame buffers |

Static Random Access Memory (SRAM)

Dynamic Random Access Memory (DRAM)

  • Reading DRAM Supercell Select row via RAS, load into buffer. Select column via CAS, output data, then rewrite row to refresh.
  • Memory Modules A 64-bit word is stored across eight DRAM chips in parallel, with each chip providing one byte (8 bits) at the same row and column address.

Memory Wall

The gap between the speed of processors and the main memory. The bottleneck has shifted from how fast memory can respond (latency) to how much data it can deliver per second (bandwidth).

Latency

  • Reduction local memory, NUMA, PIM (Reduce the waiting time for each access.)
  • Hiding multi-threading/hyper-threading, warp interleaving, chip multithreading (Keep computation units busy.)

Bandwidth

  • Memory bandwidth multi-banks and interleaved memory, SDRAM, HBM
  • Communication bandwidth wider bus, interconnection network

Locality

Principle

Many programs tend to use data and instructions with addresses near or equal to those they have used recently.

Temporal locality

Recently referenced items are likely to be referenced again in the near future.

Spatial locality

Items with nearby addresses tend to be referenced close together in time.
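
Both localities can be seen in a simple array traversal. In the sketch below (4×4 size and names are illustrative), the row-major loop touches consecutive addresses (spatial locality), while the loop instructions and the variable s are reused on every iteration (temporal locality); the column-major variant computes the same sum but strides a full row per access.

```c
/* Illustrative 4x4 array traversal in two orders. */
enum { ROWS = 4, COLS = 4 };

long sum_rowmajor(int a[ROWS][COLS]) {
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];   /* stride-1: consecutive addresses, good spatial locality */
    return s;
}

long sum_colmajor(int a[ROWS][COLS]) {
    long s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += a[i][j];   /* stride of COLS ints per access: poor spatial locality */
    return s;
}
```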

What locality is exhibited by the following C loop?
while (A != NULL) A = A->next;

A. Data temporal locality   B. Data spatial locality   C. Structural locality   D. Instruction temporal locality   E. Instruction spatial locality

Answer

D and E. The loop repeatedly executes the same small set of contiguous instructions (instruction spatial locality) and reuses those instruction addresses across iterations (instruction temporal locality); data locality is absent because each list node is accessed only once and nodes may not be contiguous in memory.

Memory Hierarchy

| Level Transfer | Staging Unit | Typical Size | Controlled By |
| --- | --- | --- | --- |
| Registers ↔ Memory | Instruction operands | Bits / words (e.g., 32/64 bits) | Compiler (programmer) |
| Cache / Local Memory ↔ Memory | Blocks / lines | 64 B (cache line) | Hardware (cache controller) / compiler or programmer (local memory) |
| Memory ↔ Disks | Pages | 4 KB (page) | Hardware & OS (virtual memory) / programmer (files) |
| Disks ↔ Tapes | Files | Variable (e.g., 64 KB–1 MB) | Hardware / operator or programmer |

This gives you large, cheap memory with fast access.

Caches provide automatic (transparent) data movement, while local memory requires explicit programmer-controlled data management.

Cache Management

General Cache Memory Organization

Direct Mapped Caches

  • Valid Bit

    A cache line is invalid (valid bit = 0) when:

    • The cache line's data has not come back from memory yet
    • The cache line is currently being replaced
    • The cache is flushed
  • Three Steps

    • Set Selection Use the set index as an unsigned binary number to locate the cache set.
    • Line Matching Compare the tag to determine if the desired block is present in the set.
    • Word Extraction Use the block offset as a binary number to select the appropriate word within the cache line.
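
The three steps amount to slicing the address into fields. A minimal sketch, assuming a cache with 64 sets and 64-byte lines (6 index bits, 6 offset bits; the function names are illustrative):

```c
#include <stdint.h>

/* Address fields for an assumed cache: 64 sets, 64-byte lines. */
enum { OFFSET_BITS = 6, INDEX_BITS = 6 };

uint64_t block_offset(uint64_t addr) {           /* word extraction */
    return addr & ((1u << OFFSET_BITS) - 1);
}

uint64_t set_index(uint64_t addr) {              /* set selection */
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

uint64_t addr_tag(uint64_t addr) {               /* line matching compares this */
    return addr >> (OFFSET_BITS + INDEX_BITS);
}
```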
Why is the set index typically taken from the middle bits of the address rather than the high bits?

Using high bits as the set index causes consecutive memory blocks to map to the same set, leading to more conflict misses, whereas middle bits distribute blocks across sets to better exploit spatial locality.

Set Associative Caches

An E-way set-associative cache has E lines per set; a direct-mapped cache is the one-way case (E = 1), and a fully associative cache has a single set containing all lines.

Given a cache of total size C bytes, associativity E, and block size B bytes, the number of sets is S = C / (E × B).
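
A worked example of the set-count and address-bit formulas, assuming a 32 KB, 8-way cache with 64-byte blocks and 64-bit addresses (all parameter values are illustrative):

```c
/* Cache geometry from total size, associativity, and block size. */
typedef struct { unsigned sets, offset_bits, index_bits, tag_bits; } cache_geom;

static unsigned log2u(unsigned x) {  /* x is assumed to be a power of two */
    unsigned n = 0;
    while (x >>= 1) n++;
    return n;
}

cache_geom cache_geometry(unsigned size_bytes, unsigned assoc,
                          unsigned block_bytes, unsigned addr_bits) {
    cache_geom g;
    g.sets = size_bytes / (assoc * block_bytes);  /* S = C / (E * B) */
    g.offset_bits = log2u(block_bytes);           /* b = log2(B) */
    g.index_bits = log2u(g.sets);                 /* s = log2(S) */
    g.tag_bits = addr_bits - g.index_bits - g.offset_bits;
    return g;
}
```

For 32 KB / (8 × 64 B) this gives 64 sets, hence 6 index bits, 6 offset bits, and 52 tag bits.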

Performance Impact of Cache Parameters

| Feature | L1 Cache | L2 Cache |
| --- | --- | --- |
| Primary goal | Minimize hit time | Minimize miss rate |
| Locality preference | Spatial locality | Temporal locality |
| Typical write policy | Write-through | Write-back |
| Reason | L1 hit time directly affects the pipeline | An L2 miss incurs a high memory penalty |
| Associativity | Low | High |
| Size | Small | Large |
  • When the block size is too small, the cache cannot fetch enough contiguous data in a single miss (lower spatial locality -- most accesses to nearby addresses still result in cache misses).
  • When the block size is too large, the fixed-size cache holds fewer blocks, leading to more conflicts, frequent replacements and higher miss penalty (lower temporal locality -- the recently used data is quickly evicted before it can be reused).
  • Number of cache lines is determined by the cache size and the line size.
| Parameter / Change | Block Offset Bits | Set Index Bits | Tag Bits |
| --- | --- | --- | --- |
| Formula (64-bit address) | b = log2(B) | s = log2(S) | t = 64 − s − b |
| Increase block size (fixed cache size) | Increases | Decreases | Unchanged |
| Increase cache size | Unchanged | Increases | Decreases |
| Increase associativity (fixed cache size) | Unchanged | Decreases | Increases |

Cache Hit and Miss

Cache Read

  • Read Hit

  • Read Miss When a read miss occurs, the CPU stalls the pipeline, fetches the missing block from the next memory level, and then resumes execution.

Cache Write

  • Write Hit

    • Write Through

      Data is written to the cache and to memory simultaneously. This ensures data isn't lost if the cache is disrupted, but it increases memory traffic and write latency.

      Solution: write buffer. A write buffer hides write latency, but if the store frequency exceeds the DRAM write cycle rate, the buffer saturates and stalls the CPU. (A write buffer also allows write coalescing/combining to reduce memory traffic.)

    • Write Back

      The CPU writes data only to the cache initially, and main memory is not updated until the cache line is eventually replaced (need Dirty Bit). This reduces memory traffic and write latency, but creates cache-memory inconsistency, requiring cache coherence protocols (e.g., MESI) in multi-core systems. Data loss risk on cache failure is mitigated by Error Correcting Code (ECC) protection.

  • Write Miss

    • Write Allocate The missing block is first read from main memory and loaded into the cache, and the write is then performed on the cached copy.
    • No Write Allocate It writes the data directly to main memory without loading the missing block into the cache (Better for streaming writes, e.g. writing database logs).
| Aspect | Write Through + No Write Allocate | Write Back + Write Allocate |
| --- | --- | --- |
| Write miss behavior | Bypass cache, write directly to main memory (optionally via write buffer + write-combining) | Load missing block from main memory into cache first, then write to cache |
| Typical partner | Write-through cache | Copy-back (write-back) cache |
| Initial cost | Low (no main memory read) | High (one read miss) |
| Reuse cost | High (data not in cache, still miss on reuse) | Low (subsequent reads/writes hit in cache, only set dirty bit) |
| Best for | Write-once, never-reused data (e.g., writing logs) | Repeated reads/writes to same data (temporal locality) |
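
The write-back behavior above can be sketched with a dirty bit; the struct and function names are illustrative, not a real controller:

```c
#include <stdbool.h>
#include <string.h>

enum { LINE_BYTES = 8 };

/* Illustrative write-back cache line with a dirty bit. */
typedef struct {
    bool valid, dirty;
    unsigned tag;
    unsigned char data[LINE_BYTES];
} cache_line;

/* Write hit: update only the cached copy and mark the line dirty.
   Main memory is deliberately left stale. */
void write_hit(cache_line *line, unsigned offset, unsigned char byte) {
    line->data[offset] = byte;
    line->dirty = true;
}

/* Eviction: copy the line back to memory only if it was modified. */
void evict(cache_line *line, unsigned char *memory_block) {
    if (line->valid && line->dirty)
        memcpy(memory_block, line->data, LINE_BYTES);
    line->valid = line->dirty = false;
}
```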

Instruction Cache Miss Handling

On an instruction cache miss, the CPU sends the PC to memory, waits for the read to complete, writes the fetched data into cache with its tag and valid bit set, then restarts the instruction fetch. (Instruction cache prefetching is a commonly used hardware optimization to reduce instruction cache misses.)

Can you prefetch for instruction cache in software?

Software cannot directly prefetch instructions (the fetch stage is PC-driven), unlike data prefetching. It can only indirectly affect I-cache hit rate via code layout or execution order.

If we prefetch on every miss, why not just use a larger cache line size?
  • Branch-sensitive Prefetching follows the predicted path, fetching only useful instructions. Large lines load all adjacent instructions regardless of branches, wasting bandwidth and cache space on dead code.
  • False sharing Large lines reduce the effective number of lines in a fixed-size cache, making conflicts (and false sharing between cores) more likely.
  • Bandwidth-aware Large line increases the cost of every miss. Prefetch is more flexible.
Why does the GPU prefetch for the next 10-12 lines, but the CPU only prefetches for 1-2 lines?

GPU prefetches 10–12 lines because it is throughput‑oriented with highly linear instruction flow, making deep prefetch safe, while CPU, being latency‑sensitive with frequent branches, only prefetches 1–2 lines to avoid polluting the cache and wasting bandwidth on wrong paths.

Types of Cache Misses

  • Compulsory (or Cold) Miss The first access to a block that has never been loaded into the cache before. (Hardware prefetching & larger cache line size to reduce compulsory misses.)
  • Conflict (or Collision) Miss Multiple references are mapping to the same set, and the set is not large enough to hold them.
  • Capacity Miss Cache is not large enough to hold needed blocks.
  • Coherence Miss Caused by invalidation from other processors in a multiprocessor system to maintain cache coherence.

Ways to Reduce Cache Miss Penalty

TechniqueCore IdeaHow It Hides Penalty
Critical Word First and Early RestartFetch requested word first; resume execution immediatelyReduces wait time; overlaps load with execution
Non-Blocking CacheContinue on miss; track multiple misses with MSHRsOverlaps multiple misses
Cache PrefetchingFetch data before it is neededAvoids miss entirely
Write BufferHold dirty blocks temporarily; write back laterCPU doesn't wait for write
Cache-Aware Code SchedulingCompiler reorders instructionsOverlaps miss with computation

Non-Blocking Cache and Blocking Cache

Blocking Cache

A cache miss stalls the CPU pipeline immediately (stall on miss).

Non-Blocking Cache (Lockup-Free Cache)

The CPU continues executing other instructions on a miss and stalls only when the data is actually needed, using Miss Status Holding Registers (MSHRs) to track outstanding misses (stall on use). Non-blocking caches support MLP (Memory-Level Parallelism).

  • Hit under Miss During one miss, the cache can still handle subsequent hits.
  • Miss under Miss (better) During one miss, the cache can also handle another miss, allowing multiple outstanding misses.

Cache Block Replacement

Direct Mapped

Each set has only one line, so the new block always replaces the existing block in that set.

Set Associative or Fully Associative

  • Random

  • Least Recently Used (LRU) [stack algorithm]

    Hardware keeps track of the access history and replaces the block that has not been used for the longest time.

    Each block has a counter. On a hit, reset its counter to 0 and increment all other counters in the set. On a miss, replace the block with the highest counter value.

  • First In First Out (FIFO) The block that has been in the cache the longest is replaced, regardless of access history.

  • Belady’s Algorithm [upper bound]

    Replace the block that will not be used for the longest time in the future. It is optimal but requires perfect knowledge of future accesses, making it impractical for real hardware.
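
The counter-based LRU scheme described above can be sketched for a single set (the 4-way size and function names are illustrative):

```c
/* Counter-based LRU for one cache set: on a hit, reset the touched way's
   counter and increment the others; the victim is the largest counter. */
enum { WAYS = 4 };

void lru_touch(unsigned counters[WAYS], int way) {
    for (int i = 0; i < WAYS; i++) {
        if (i == way)
            counters[i] = 0;   /* most recently used */
        else
            counters[i]++;     /* everyone else ages */
    }
}

int lru_victim(const unsigned counters[WAYS]) {
    int victim = 0;
    for (int i = 1; i < WAYS; i++)
        if (counters[i] > counters[victim])
            victim = i;        /* not used for the longest time */
    return victim;
}
```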

Victim Cache

A small fully associative cache placed between the main cache and memory to hold recently evicted blocks, reducing conflict misses by providing a second chance for recently replaced data. Since such conflicts occur in only a few sets, just 4–8 entries suffice for fast access, combining the speed of direct-mapped L1 with the benefit of full associativity.

Snoopy Cache Coherence Schemes

Snoopy cache maintains coherence in bus-connected multi-processors by broadcasting all coherence operations over a shared bus. Every cache controller snoops the bus and reacts based on its local cache state. Each controller is a bidirectional state machine that processes CPU requests and bus snoop events, updating states according to a transition diagram.

Multiple controllers form a distributed algorithm operating at cache block granularity. The MESI protocol, with its four states (Modified, Exclusive, Shared, Invalid), tracks each block to enable coherence while minimizing bus traffic and memory accesses.

Exclusive/Inclusive Caches

Inclusive L2 caches simplify coherence invalidation in multiprocessor systems, since coherence checks can be done at the L2 level alone. Although inclusion wastes some capacity, modern processors tolerate this for easier consistency management, making inclusive designs more common than exclusive ones today.

Inclusive caches

Every line in L1 must also be present in L2 (L1 ⊆ L2). When L2 evicts a line, it must invalidate the corresponding line in L1; L2 can have a larger line size than L1.

Exclusive caches

If a line A is in L1, it must NOT be present in L2 (AMD has adopted this for L2/L3). When L1 evicts a line, it moves to L2. On an L1 miss that hits in L2, the line migrates from L2 to L1. L1 and L2 must have the same line size.

NINE (Non-Inclusive and Non-Exclusive)

More suitable for L2/L3.

Average Memory Access Time (AMAT)

Usually, the fraction of instruction accesses (%instr) is higher than that of data accesses (%data), and the I-cache also hits more often because instruction accesses are highly sequential with strong spatial locality, have no write misses (read-only), and exhibit simpler access patterns than data references, which often involve random pointers or strided array accesses.

AMAT was accurate for blocking caches and in-order CPUs, but it lost relevance as out-of-order execution, non-blocking caches, and prefetching began hiding miss latency, and as I- and D-caches were split.
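
The usual formula is AMAT = hit time + miss rate × miss penalty. A minimal sketch, with assumed illustrative numbers (1-cycle L1 hit, 5% miss rate, 100-cycle penalty gives 6 cycles):

```c
/* AMAT = hit time + miss rate * miss penalty, all in cycles. */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

For multi-level caches the miss penalty itself expands recursively: the L1 miss penalty is the AMAT of L2, and so on.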

Low Hit Latency

  • Small and Simple
  • Direct-mapped or low associativity The number of comparators needed equals the total number of cache lines for a fully associative cache, but only E for an E-way set-associative cache.
  • Prediction Predict which way to compare first.
  • Virtual address cache or virtual index and physical tagged cache Avoid or overlap with address translation delay.
  • Cache layout closer to CPU to minimize signal delay

Chapter 7 Linking

Why Linking Matters

Modularity

Separate compilation allows developers to work on different modules independently, improving productivity and enabling code reuse.

Efficiency

  • Time Separate compilation & Parallel compilation.
  • Space Static linking (Executable files and running memory images contain only the library code they actually use.) & Dynamic linking (Executable files contain no library code. During execution, a single copy of the library code is shared among all executing processes.)

Static Linking

Symbol Resolution

Symbol tables in object files record symbol definitions (name, size, location), and the linker matches each symbol reference to exactly one definition during symbol resolution.

| Type | Description | Example |
| --- | --- | --- |
| Global symbols | Defined by module m, can be referenced by other modules. | `int global_var = 10;` `void func() { }` |
| External symbols | Referenced by module m but defined by some other module. | `extern int global_var;` `extern void func();` |
| Local symbols | Defined and referenced exclusively by module m. | `static int local_var = 10;` `static void helper() { }` |

Strong symbols are procedures and initialized globals, while weak symbols are uninitialized globals. Linker's Symbol Rules: Multiple strong symbols are forbidden; one strong wins over weak; multiple weak are arbitrary, unless -fno-common forces an error.

Example
  • file1.c

    int x = 10; // strong symbol   
    int y = 5;  // strong symbol
    void p1() {printf("x = %d, y = %d\n", x, y);}
    
  • file2.c

    double x; // weak symbol
    void p2() {printf("x = %f\n", x);}
    

Writes to x in p2 might overwrite y, because the linker resolves x to the strong definition in file1.c (a 4-byte int), while p2 writes it as an 8-byte double.

Avoid global symbols if possible, otherwise use static to limit scope, initialize globals to make them strong, and reference external globals with extern to avoid linker errors.

Static Libraries

Concatenate related relocatable object files into a single file with an index (called an archive). For example, ar rs libfoo.a a.o b.o c.o creates a static library libfoo.a containing a.o, b.o, and c.o.

Linkers scan files left to right, maintain an unresolved reference list, and only extract archive members that resolve current entries, so libraries must come at the end of the command line.

The solution to order sensitivity and duplicate symbol definitions is shared libraries: object files whose code and data are loaded and linked into an application dynamically, either at load time or at run time.

Relocation

It first merges all separate code and data sections into single sections. Then, it relocates symbols from their relative offsets in .o files to final absolute addresses in the executable. Finally, it updates every reference to these symbols to reflect their new positions.

Object Files

Classification

| Type | Extension (Linux/Unix) | Description |
| --- | --- | --- |
| Relocatable object file | .o | Contains code and data in a form that can be combined with other relocatable object files to form an executable object file. Each .o file is produced from exactly one source (.c) file. |
| Executable object file | a.out | Contains code and data in a form that can be copied directly into memory and then executed. |
| Shared object file | .so | A special type of relocatable object file that can be loaded into memory and linked dynamically, at either load time or run time. Called Dynamic Link Libraries (DLLs) on Windows. |

ELF Object File Format

| Component | Description |
| --- | --- |
| ELF header | Word size, byte ordering, file type (.o / executable / .so), machine type, etc. |
| Segment header table (for OS) | Page size, virtual addresses of memory segments (sections), segment sizes. |
| .text | Code |
| .rodata | Read-only data: jump tables, string constants, etc. |
| .data | Initialized global variables and initialized local static variables. |
| .bss | Uninitialized global variables and uninitialized local static variables. Has a section header but occupies no space. |
| .symtab | Symbol table: procedure names, global variable names, static variable names, external symbols, section names, and their locations. |
| .rel.text | Relocation info for the .text section: addresses of instructions that will need to be modified in the executable, plus instructions for modifying them. |
| .rel.data | Relocation info for the .data section: addresses of pointer data that will need to be modified in the merged executable. |
| .debug | Info for symbolic debugging (generated with gcc -g). |
| Section header table (for execution and debugging) | Offsets and sizes of each section. |
ELF Structure Diagram
  • Use the command readelf -s <file> to view the symbol table of an Executable and Linkable Format (ELF) file.

  • Note that Local non-static variables are stored on the stack.

  • Local static variables with the same name get distinct local symbols in the symbol table with unique names (e.g., x.1, x.2).

Executable Object Files

Load Executable Object Files

Dynamic Linking with Shared Libraries

Load-Time Linking

It occurs when the executable is first loaded and run. The standard C library (libc.so) is typically dynamically linked this way.

LD_PRELOAD forces the dynamic linker to load a user-specified shared library first, allowing interception of standard functions, but cannot intercept statically linked _start.

Run-Time Linking

It occurs after the program has begun, using calls to the dlopen() interface on Linux.
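
A minimal run-time linking sketch using dlopen/dlsym. The library name "libm.so.6" is an assumed path for a typical glibc Linux system, and the program must be linked with -ldl on older glibc versions:

```c
#include <dlfcn.h>

/* Open the math library at run time, look up cos, call it, and clean up.
   Returns -2.0 (an illustrative sentinel) if loading or lookup fails. */
double call_cos_dynamically(double x) {
    void *handle = dlopen("libm.so.6", RTLD_LAZY);  /* assumed library name */
    if (!handle)
        return -2.0;
    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    double result = cosine ? cosine(x) : -2.0;
    dlclose(handle);
    return result;
}
```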

Position-Independent Code (PIC)

To exploit the fact that the distance between any instruction in the code segment and any variable in the data segment is a runtime constant, the compiler creates a table called the Global Offset Table (GOT) that holds the absolute addresses of global variables.

Global Offset Table (GOT)

Resides in the data segment and stores the actual absolute addresses of external functions and global variables. These addresses are filled in by the dynamic linker at load time or on the first call.

Procedure Linkage Table (PLT)

A memory section containing small executable stubs for each external function the program calls. It lives in a read-only code section (.text or .plt), providing a trampoline that jumps through the GOT to the actual function.

.got stores addresses of global variables for direct access, while .got.plt stores addresses of external functions specifically for PLT stubs.

Whole Process

| Step | Stage | Action |
| --- | --- | --- |
| 1 | Compile (static time) | Compiler generates pseudo instruction: call printf |
| 2 | Compile (static time) | Assembler expands it to auipc + jalr and adds relocation entries |
| 3 | Link (link time) | Linker creates the .plt stub (auipc + ld + jalr) |
| 4 | Link (link time) | Linker allocates a GOT entry (4 B on 32-bit, 8 B on 64-bit) |
| 5 | Dynamic linking (run time) | Dynamic linker fills the GOT with the real printf address |