Catalogue

Course Notes

CSC3060

CSC3060 Notes — Core lecture notes
CSC3060 CSAPP Reference — Supplementary notes based on Computer Systems: A Programmer's Perspective

CSC3060 Introduction to Computer Systems

Textbook Computer Systems A Programmer’s Perspective, 3e, INTERNATIONAL EDITION written by Randal Bryant and David R. O'Hallaron.
Reference book Computer Organization and Design: The Hardware/Software Interface (RISC-V Edition) written by David A. Patterson and John L. Hennessy.

Chapter 1 A Tour of Computer System

Compilation System

Take C language as an example linux > gcc hello.c -o hello.

Pre-processor (cpp) $.c \to .i$ Handles include/define, strips off comments and conditional compilation #ifdef
Compiler (cc1) $.i \to .s$ Scan/parse/semantic check/code gen/opt
Assembler (as) $.s \to .o$ From assembly to machine code
Linker (ld) $.o \to Executable$ Relocation & reference resolution

Hardware Organization of a System

DMA (Direct Memory Access) is usually used for high-spped and bulk data transfers, controlled by system programmers.

System Bus

Transfer data in fixed-size blocks called words (4/8 bytes on a 32/64 bit system)

Memory Hierarchy

The localities of cache: Temporal Locality & Spatial Locality.

Abstractions in Computer Systems (Virtualization)

Virtualization is often related to multiplicity, fake versions, and sharing.

Process & Thread

Multiple processes can run concurrently on the same system, and each process appears to have exclusive use of the hardware.

Context switching with OS Kernel.

A process can actually consist of multiple execution units, called threads. Threads shares the same code and global data.
Virtual Memory

Program Code & Data, Shared Libraries, Heap, Stack, Kernel Virtual Memory from bottom to top.
- Virtual address space canve greater than the physical memory.
- Memory serves as a cache for virtaul memory.
- Support multiprogramming.
- Allow multiple processes to share data.
File

Import Themes

Amdahl's Law (Quantifying the performance improvement ceiling)

$T_{n e w} = (1 - α) T_{o l d} + \frac{α T _{o l d}}{k} = T_{o l d} (1 - α + \frac{α}{k})$

$S = \frac{T _{o l d}}{T _{n e w}} = \frac{1}{1 - α + \frac{α}{k}}$

Concurrency and Parallelism

Concurrency It refers to the general concept of a system having multiple, simultaneous activities, which do not necessarily execute at the same time, but may interleave in time to create the logical impression of simultaneity.
Parallelism It refers to the use of concureency to make a system run faster.

ILP and DLP

Instruction-Level e.g. Parallelism pipelining, superscalar, out-of-order execution.
Data-Level Parallelism e.g. Single Insruction Multiple Data (SIMD), Vector instructions.

Computer Architecture (e.g. Intel x86, IBM 360, ARM, RISC-V) is the interface between hardware and software, it includes ISA and Microarchitectures (e.g. Different implementations of the same ISA: Single cycle, Multi-cycle, Pipelined, Superscalar, Out-Of-Order, Speculative Execution, Cache Hierarchies, and Various Predictions).

Chapter 2 Representing and Manipulating Information

Information Storage

Words

$w$ word size $⟺$ $[0, 2^{w})$ virtual address space
Addressing and Byte Ordering

Big endian (e.g. IBM Mainframes) & Little endian (e.g. Intel x86, ARM, RISC-V).

Big-endian means most significant byte first, and Little-endian means least significant byte first.

Integer Representaions

Sizes of Data Type in C/C++

Signed	Unsigned	32-bit (Bytes)	64-bit (Bytes)
$[signed] char$	$unsigned char$	1	1
$short$	$unsigned short$	2	2
$int$	$unsigned$	4	4
$long$	$unsigned long$	$4$	$8$
$int32_t$	$uint32_t$	4	4
$int64_t$	$uint64_t$	8	8
$char *$	—	$4$	$8$
$float$	—	4	4
$double$	—	8	8

Unsigned Encodings

Suppose a vector $x = [x_{w - 1}, x_{w - 2}, \dots, x_{0}]$ , then $B2 U_{w} (x) = i = 0 \sum w - 1 x_{i} \cdot 2^{i}$

Two's Complement Encodings

Inverse form (1's Complement) For a negative number, keep the signed the same and invert the rest.
two zero exits & end-round carry out issues (hte end carry-out bit needs to add back to the LSB)

2's Complement For a negative number, keep the sign the same, invert the rest and add 1.

Suppose a vector $x = [x_{w - 1}, x_{w - 2}, \dots, x_{0}]$ , then $B2 T_{w} (x) = - x_{w - 1} \cdot 2^{w - 1} + i = 0 \sum w - 2 x_{i} \cdot 2^{i}$

Conversions between Signed and Unsigned

$T2 U_{w} (x) = {x + 2^{w}, x < 0 x, x \geq 0 U2 T_{w} (u) = {u, x \leq T ma x_{w} u - w^{w}, x > T ma x_{w}$

For a $n$ -bit 2's complemetn signed binary numeral system:

Minimum $- 2^{n - 1}$ , corresponding to binary $100 \dots 0$ (A special case that does not satisfy the invert bits and add 1 rule used for other negative numbers).
Maximum $2^{n - 1} - 1$ , corresponding to binary $011 \dots 1$ .

Sign Extension

Small to Big

Zero extension of unsigned numbers
Sign extension of two's complement numbers

$B2 T_{w} ([x_{w - 1}, x_{w - 2}, \dots, w_{0}]) = B2 T_{w + k} ([x_{w - 1}, x_{w - 1}, \dots, x_{w - 1}, x_{w - 1}, x_{w - 2}, \dots, x_{0}])$

Since $B2 T_{w + 1} - B2 T_{w} = (- x_{w - 1} \cdot 2^{w} + x_{w - 1} \cdot 2^{w - 1}) - x_{w - 1} \cdot 2^{w - 1} = 0$ , by induction, we can proof it.

Big to Small

$B2 U_{k} (x) = B2 U_{w} (x) mod 2^{k}$
$B2 T_{k} (x) = U2 T_{w} (B2 U_{w} (x) mod 2^{k})$

Integer Arithmetic

Addition

$x + y_{w}^{u} = {x + y, x + y < 2^{w} x + y - 2^{w}, 2^{w} \leq x + y < 2^{w + 1}$

$x + y_{w}^{t} = ⎩ ⎨ ⎧ x + y - 2^{w}, 2^{w - 1} \leq x + y x + y, - 2^{w - 1} \leq x + y < 2^{w - 1} x + y + 2^{w}, x + y < - 2^{w - 1}$

Additive Inverse

$x + x^{'} = 0 or 2^{w}$

$- x_{w}^{u} = {x, x = 0 2^{w} - x, x > 0 - x_{w}^{t} = {x, x > T mi n_{w} T mi n_{w}, x = T mi n_{w}$

For two numbers A, B, the overflow occurs when (NOT (sign_A XOR sign_B)) AND (sign_A XOR new_sign) == 1. (Two operands have the same sign but the result has a different sign.)
One Bit Full Adder

$S_{i} = A_{i} + B_{i} + carry_in carry_out = (A_{i} & B_{i}) ∣ (carry_in & (A_{i} \oplus B_{i}))$
Carry Lookahead Adder

$⎩ ⎨ ⎧ g_{i} = (A_{i} \cdot B_{i}) p_{i} = (A_{i} \oplus B_{i}) C_{1} = g_{0} + (p_{0} \cdot C_{0}) C_{2} = g_{1} + (p_{1} \cdot g_{0}) + (p_{1} \cdot p_{0} \cdot C_{0}) C_{3} = g_{2} + (p_{2} \cdot g_{1}) + (p_{2} \cdot p_{1} \cdot g_{0}) + (p_{2} \cdot p_{1} \cdot p_{0} \cdot C_{0}) ⋮$

Multipilication

$x \times y_{w}^{u} = (x \cdot y) mod 2^{k} x \times y_{w}^{t} = U2 T_{w} ((x \cdot y) mod 2^{k})$

Division (by a power of 2)

Unsigned numbers use logical shift, while two's complement numbers use arithmetic shift to achieve sign-preserving extension.

Right shift performs integer division by powers of two :

$x \leq 0 x >> k = ⌊ \frac{x}{2 ^{k}} ⌋$
$x < 0 (x + (1 << k) - 1) >> k = ⌈ \frac{x}{2 ^{k}} ⌉$

Floating Point

Floating-Point Representation

$V = (- 1)^{s} \times M \times 2^{E}$

s (sign) : The number is positive ( $s = 0$ ) or negative ( $s = 1$ ).

M (fraction) : A binary fraction.

E (exponent) : $2^{E}$ weight.

$bias (float) = 127 bias(double) = 1023$ ( $bias = 2^{k - 1} - 1$ )

Category	Exponent $e$	Fraction $f$	Value Formula
Normalized	$1 \leq e \leq 254$	any	$V = (- 1)^{s} \times (f + 1) \times 2^{e - 127}$
Denormalized	$e = 0$	$f \neq = 0$	$V = (- 1)^{s} \times f \times 2^{- 126}$
Zero	$e = 0$	$f = 0$	$V = \pm 0.0$
Infinity	$e = 255$	$f = 0$	$V = \pm \infty$
NaN (Not a Number)	$e = 255$	$f \neq = 0$	`NaN`

When interpreted as signed integers, the bit representation of IEEE 754 floating-point numbers (excluding NaN) preserves the same sorting order.

Comparison (postive number)

Format	Minimum	Maximum
Single Precision Normalized $V = (- 1)^{s} \times \overline{1. f} \times 2^{e - 127}$	$e = 00000001$ $E_{m i n} = - 126$ $f = 0$ $V = 1.0 \times 2^{- 126}$	$e = 11111110$ $E_{m a x} = 127$ $f = 0. 23 ones 11 \dots 1$ $M = 1 + f = 1 + (1 - 2^{- 23})$ $V = 1.0 \times 2^{127} \times (2 - 2^{- 23}) \approx 3.4 \times 1 0^{38} (0x7F7FFFFF)$
Single Precision Denormalized $V = (- 1)^{s} \times \overline{0. f} \times 2^{- 126}$	$e = 00000000$ $f = 2^{- 23}$ $V = 1.0 \times 2^{- 149} (0x00800000)$	$e = 00000000$ $f = 0. 23 ones 11 \dots 1$ $V = 1.0 \times 2^{- 126} \times (1 - 2^{- 23})$

Rounding

Round-down	Round-up	Round-toward-zero	Round-to-even
$1.40 \to 1$ $- 1.5 \to - 2$	$1.40 \to 2$ $- 1.5 \to - 1$	$1.40 \to 1$ $- 1.5 \to - 1$	$1.40 \to 1$ $1.6 \to 2$ $1.5 \to 2$ $2.5 \to 2$ Non-midpoint round to the nearest representable value Midpoint choose the $even$ one

Floating Point Operations

Lack of Associativity & Lack of Distributivity
Example Questions

Let $x, f, d$ are of type int, float, double (Their values are arbitrary, except that neither $f$ nor $d$ equals $+ \infty$ , $- \infty$ , or $NaN$ .). Then:
- x == (int)(double) x True
- x == (int)(float) x False $2^{24} - 1 \to (1 + 0. 23 ones 11 \dots 1) \times 2^{23}$ (e.g. $16777217$ )
- d == (double)(float) d False (e.g. $1.234$ )
- f == (float)(double) f True
- d*d >= 0.0 True
Float Point Addition
- Align decimal points (small to big)
- Add significands
- Normalize
- Round and renormalize if needed
Example

$9.999 \times 1 0^{1} + 1.610 \times 1 0^{-1} = 9.999 \times 1 0^{1} + 0.01610 \times 1 0^{1} = 10.015 \times 1 0^{1} = 1.0015 \times 1 0^{2} = 1.002 \times 1 0^{2}$

Precision (IEEE 754-2008 Standard Formats)

Format	Sign Bits	Exponent Bits	Mantissa Bits
Quad precision	1	15	112
Double precision	1	11	52
Single precision (FP32)	1	8	23
Half precision (FP16)	1	5	10

bfloat16

The format is 1 sign, 8 exponent, 7 mantissa bits, Same exponent range as FP32.
AI/LLM is based on predictions, approximation does not require high precision.Therefore reduced precision is acceptable for AI.
Lower precision format enables higher memory bandwidth and computational. bandwidth.

FMA/FMAC

Fused Multiply and Add (FMA) or Fused Multiply and Accumulate (FMAC) combines multiply and add in one instruction ( $A \times B + C$ ).

Multiplication and addition are parallel.
Normalization and rounding are combined at the end.

Use -mfma -ffp-contract=fast to enable FMA.

Chapter 3 Machine-Level Representation of Programs

Machine-Level Representation of Programs

RISC [optimized for processor speed]

Principles

Simplicity favors regularity
Smaller is faster
Make common cases fast
Good design demands good compromises

[RSA should be scalable, flexible, and extensible.]

Variants of RV

RV32, RV64, RV128 Different data widths (addressing capability)

Extension	Description
I	Base integer instructions
E	Base for embedded systems (e.g., only 16 registers)
M	Integer multiplication and division
A	Atomic memory instructions
C	Compressed extension (16-bit instructions)
F	Single-precision floating point
D	Double-precision floating point
V	Vector extension

Instruction Types in RV32I

Types	Instructions
ALU	`add`, `sub`, `and`, `or`, `xor`, `slt`, `sltu`, `sll`, `srl`, `sra`, and all with immediate (No `subi` in RV32I)
Control Instructions	`beq`, `bne`, `blt`, `bge`, `bltu`, `bgeu`, `jal`, `jalr`
Memory Instructions	`lw` (No `lwu` in RV32I), `lh`, `lb`, `sw`, `sh`, `sb`, `lbu`, `lhu`
Privileged Instructions	Interrupt, Memory Management, System Calls, Control and Status Registers (CSR), Mode Change

Format	Name	Instructions
R-type	Register	`add`, `sub`, `sll`, `srl`, `sra`, `xor`, `or`, `and`
I-type	Immediate	`addi`, `slli`, `srli`, `srai`, `xori`, `ori`, `andi`
I-type	Load	`lb`, `lh`, `lw`, `lbu`, `lhu`
I-type	Jump and Link Register	`jalr`
S-type	Store	`sb`, `sh`, `sw`
B-type	Branch	`beq`, `bne`, `blt`, `bge`, `bltu`, `bgeu`
U-type	Upper Immediate	`lui`, `auipc`
UJ-type	Unconditional Jump	`jal`

Three operand instruction format.
X0 is always zero.
32 means Address cability/Integer register length.
Immediate
- $R$ type No immediate.
- $B,I$ type 12 bits immediate. Shift instructions are I-type but repurpose the immediate field as a 5-bit shift amount for RV32I (6-bit for RV64I). inst[30] distinguishes arithmetic from logical right shift, while inst[31:26] are fixed to zero except inst[30] (Range: $\pm 4 KB$ ). The immediate in the branch instruction is an offset relative to PC.
- $S$ type The 12‑bit immediate is split into high 7 bits (immediate) and low 5 bits (immed) (maintain regularity). The immediate in the store instruction is an offset relative to rs1.
- $J$ type 20 bits immediate.
Jump
- jal
  - $r d \leftarrow PC + 4$ , $PC \leftarrow PC + sign-extend ({inst [31], inst [19 : 12], inst [20], inst [30 : 21]}) \times 2$ (Jumps are 2-byte aligned, so last bit is implicit zero).
  - Range: $\pm 1 MB$ .
  - Beyond 1MB: Use AUIPC + JALR.
- jalr
  - $r d \leftarrow PC + 4$ , $PC \leftarrow (rs 1 + sign-extend (inst [31 : 20])) & \sim 1$ ( $\sim 1$ clears LSB to ensure 2‑byte alignment).
Load and Store

lw rd, offset (rs1) means load data from memory and write into rd.

sw rs2, offset (rs1) means store data from register rs2 into memory.
LUI loads a 20‑bit immediate into the upper bits of a register; AUIPC adds a 20‑bit immediate to the current PC for position‑independent addressing.
How to load a 32b const into a register?
1. Use a LW instruction (cost more)
2. lui and addi (When the lower 12 bits of a 32‑bit constant are $\geq 0x800$ , addi sign‑extends them, causing an incorrect result. The assembler automatically adjusts the upper 20 bits when using the li pseudo‑instruction.) (Range: $4 GB$ )

Instruction Encoding List

Regularity

Field	Bit Position	Note
Opcode	$[0, 6]$	Always opcode
rd	$[7, 11]$	Destination reg
rs1	$[15, 19]$	Source reg 1
rs2	$[20, 24]$	Source reg 2
Length	-	Fixed 16/32 bit
Immediate	High bits	Uses rd field for branch/store

Pseudo Instrcution

Pseudoinstruction	Actual Instruction Sequence	Operation
`nop`	`addi x0, x0, 0`	No operation
`li rd, imm`	`lui rd, imm[31:12] + imm[11]` `addi rd, rd, imm[11:0]`	Load 32-bit immediate
`mv rd, rs`	`addi rd, rs, 0`	Copy register
`not rd, rs`	`xori rd, rs, -1`	Bitwise NOT
`neg rd, rs`	`sub rd, x0, rs`	Two's complement negation
`seqz rd, rs`	`sltiu rd, rs, 1`	Set if $= 0$
`snez rd, rs`	`sltu rd, x0, rs`	Set if $\neq = 0$
`sltz rd, rs`	`slt rd, rs, x0`	Set if $< 0$
`sgtz rd, rs`	`slt rd, x0, rs`	Set if $> 0$
`beqz rs, offset`	`beq rs, x0, offset`	Branch if $= 0$
`bnez rs, offset`	`bne rs, x0, offset`	Branch if $\neq = 0$
`blez rs, offset`	`bge x0, rs, offset`	Branch if $\leq 0$
`bgez rs, offset`	`bge rs, x0, offset`	Branch if $\geq 0$
`bltz rs, offset`	`blt rs, x0, offset`	Branch if $< 0$
`bgtz rs, offset`	`blt x0, rs, offset`	Branch if $> 0$
`bgt rs, rt, offset`	`blt rt, rs, offset`	Branch if $rs > r t$
`ble rs, rt, offset`	`bge rt, rs, offset`	Branch if $rs \leq r t$
`bgtu rs, rt, offset`	`bltu rt, rs, offset`	Branch if unsigned $rs > r t$
`bleu rs, rt, offset`	`bgeu rt, rs, offset`	Branch if unsigned $rs \leq r t$
`j offset`	`jal x0, offset`	Unconditional jump
`jal offset`	`jal x1, offset`	Jump and link (return address in `x1`)
`jr rs`	`jalr x0, 0(rs)`	Jump to address in `rs`
`jalr rs`	`jalr x1, 0(rs)`	Jump and link to address in `rs`
`ret`	`jalr x0, 0(x1)`	Return from subroutine
`call offset`	`auipc x1, offset[31:12] + offset[11]` `jalr x1, offset[11:0](x1)`	Far call to subroutine

Why do RISC-V loads/stores use base+immediate instead of base+index?

Simpler hardware Adding scaling to load instructions requires a shifter and complicates the address calculation datapath, increasing cost and cycle time.
ISA regularity violation Three operands (rs1, rs2, rd) would force an R‑type format, but R‑type is reserved for ALU operations only. Mixing in memory access would break the clean encoding scheme.
Compiler can optimize it away In loops, the compiler just increments the base register by the element size each iteration, making scaled indexing unnecessary.

Why do RISC-V instructions place immediate bits in seemingly "random" positions (e.g., B-Type and J-Type)?

Fixed sign bit at position 31 across all formats.
Decoder reuse (B-Type reuses S-Type logic, J-Type reuses U-Type logic).
Fixed register fields (rs1, rs2, rd) across formats.

Application Binary Interface (ABI)

It is based on three key components : the computer ISA, the OS, and the calling convention. ABI incompatibility will occur due to differences between operating systems.

Data Alignment

No cache line crossing; no double TLB (Translation Lookaside Buffer) misses; no double page faults come from a single memory instruction execution.
Better cache line utilization.

RISC-I to RISC-V do not explicitly enforce data alignments. But their compilers often enforce data alignments.

Crossing a boundary makes the memory reference difficult to handle. The hardware needs to go through two exceptions instead of one.

Padding Rule

Internal Padding Every member must start at an address that is a multiple of its own alignment requirement.
Trailing Padding The total size of the struct must be a multiple of the largest alignment requirement among its members.

Exercise

struct S1 {
    int white;
    long long count;
    char c;
    int red;
    int blue;
} A[];

Suppose $A$ is an array of structure $S 1$ , and the starting address of $A$ is stored in register $X 1$ . what instruction should be used to access A[100].red?

Member	Size	Start	Padding	End	Reason
`white`	$4$	$0$	$0$	$3$	`int` aligns at $4$
`count`	$8$	$8$	$+ 4$	$15$	needs 8-byte align, $4$ pad $4$ to $8$
`c`	$1$	$16$	$0$	$16$	`char` aligns at $1$
`red`	$4$	$20$	$+ 3$	$23$	needs 4-byte align, $17$ pad $3$ to $20$
`blue`	$4$	$24$	$0$	$27$	`int` aligns at $4$

Current size is $28$ . Largest alignment in struct is $8$ , $28$ is not a multiple of $8$ , so pad $4$ bytes at the end. Total struct size is $32$ bytes.

A[100].red : $X 1 + 100 \times 32 + 20 = X 1 + 3220$ , so the answer is LW X2, 3220(X1).

Stack / Frame Alignments

Defined by ABI (e.g., RISC‑V psABI: 16‑byte)
CRT aligns sp before main ()

Compiler Reordering

Independent Variables (Rearrangement Allowed)
Struct Fields (Rearrangement Forbidden)
- Binary compatibility
- Pointer arithmetic
- Interoperability

PC-relative Addressing

Branch instructions use PC‑relative addressing with a 12‑bit signed offset (in 2‑byte units). Since branch targets are always multiples of 2, encoding in 2‑byte units doubles the reachable range without losing information.

$Target Address = PC + sign_extend (12 -bit immediate) \times 2$

CSR

Name	Abbreviation	Description
Machine Trap-Vector Base-Address Register	`mtvec`	Exception handler entry address
Machine Status Register	`mstatus`	Global interrupt enable (MIE) and status
Machine Cause Register	`mcause`	Reason of last exception/interrupt
Machine Exception Program Counter	`mepc`	Saves PC when exception occurs
Machine Interrupt Enable Register	`mie`	Enables specific interrupt sources
Machine Interrupt Pending Register	`mip`	Shows pending interrupts

Exception Handling

Save context Save current PC value to mepc register
Update status Update mstatus, disable further interrupts (clear MIE bit) to prevent nesting
Set cause Write exception/interrupt reason to mcause
Jump to handler Jump to exception handler entry point based on mtvec configuration

Frame Pointer (fp)/Stack Pointer (sp)

sp points to the top of stack.
fp points to a fixed location within the current stack frame, typically the address where the old fp is stored.
When the function returns, the saved old fp is restored to the fp register, making fp point back to the caller's stack frame.

Calling Convention

RV Registers	ABI Name	Caller/Callee	Purpose
x0	`zero`	—	Always zero
x1	`ra`	Caller	Return address
x2	`sp`	Callee	Stack pointer
x3	`gp`	—	Global pointer
x4	`tp`	—	Thread pointer
x5–x7	`t0`–`t2`	Caller	Temporary registers
x8	`s0` / `fp`	Callee	Saved register / Frame pointer
x9	`s1`	Callee	Saved register
x10–x11	`a0`–`a1`	Caller	Arguments / Return values
x12–x17	`a2`–`a7`	Caller	Arguments
x18–x27	`s2`–`s11`	Callee	Saved registers
x28–x31	`t3`–`t6`	Caller	Temporary registers

Caller-Saved Caller decides whether to save based on whether the value will be used after the call.
Callee-Saved Callee must always save these registers before using them and restore them before returning (absolutely safe).
Array variables are typically allocated in memory (stack or static data section), not in registers.

Aspect	Caller-Saved	Callee-Saved
Decision maker	Caller	Callee
Mandatory	On demand (only if value is needed after the call)	Yes
Save timing	Before calling a subroutine	At function entry
Restore timing	After subroutine returns	Before function returns

Why caller registers are allocated to temporaries?

Temporaries are short-lived and do not need to survive across function calls. Placing them in caller‑save registers avoids unnecessary save/restore code.

Why callee registers are allocated to local variables?

Local variables live across function calls. Using callee registers ensures they are preserved automatically by the callee, avoiding repeated saves at each call site.

Why callee registers are allocated to CSEs (Common Sub-Expressions)?

Local variables and CSEs are tend to live longer (may be as long as the procedure invocation).

CISC [optimized for compact code size]

Chapter 4 Processor Architecture & Logic Design

Logic Designs

Major components

Combinational element
State (Sequential) elements $Write = 0$ , cannot write to register (B-type and S-type).
Clock signals

What is the difference between combinational logic and sequential logic?

The former one is stateless, output purely depends on the inputs (e.g. ALU). The latter one has states, output depends on the inputs and the states (e.g. register).

The propagation delay of a combinational circuit is determined by the delay of the critical path, which is the longest path of logic gates from any input to any output.

Processor

Core

Data Path (Data Flow)

e.g. ALU, Register, Memory interface, Buses.

Control (Instruction Flow)

e.g. PC, Instruction fetch, and Control signal generation.

RV32I Data Path and Control

Memory reference instrutions lw, sw
Arithmetic/logical instrutions add, sub, and, or
Control flow instruction beq, jal

Instruction Execution Cycle

Cycle

Three Stages
- Fetch Use the PC to supply the instruction address and fetch the instruction from memory.
- Decode Decode the instruction (and read registers).
- Execute
  - Use ALU to calculate (Arithmetic result & Memory address for load/store & branch target address)
  - Access data memory for load/store
  - Update PC (target address or PC + 4 (next word) or PC + 2 (if using compressed instructions RV32IC))
Multi-cycle Instructions are broken down into multiple steps, each taking one clock cycle.

Time

CPU time is determined by the following:

$CPU Time = Instructions count \times Cycles per instruction (CPI) \times Clock cycle time$

Program execution time is determined by the following:

$\frac{time}{program} = \frac{instructions}{program} \times \frac{cycles}{instructions} \times \frac{time}{cycle}$

CISC machines Lower instruction count, higher CPI, longer cycle time
RISC machines Higher instruction count, lower CPI, shorter cycle time

Abstract View of RV32I Subset

How to select between PC+4 and PC+immediate?

If a branch is taken or a jal instruction, select pc+immed. Otherwise select PC+4.

Note that jal and jalr use pc + imm as the target address, while writing pc + 4 back to the register. In contrast, auipc writes pc + imm directly back to the register.

How to select data from the memory or the ALU?

If load select Mem. If R-type select ALU.

How to select data from Immediate or reg[rs2]?

If R-type select rs2. If I-type select immed.

Processor Designs

Analyze the requirements

Data Path Requirements selections

Combinational Components Adder & MUX & ALU (An adder only does addition (e.g., PC+4). An ALU includes an adder but also performs many other arithmetic/logic operations.)
Sequential Components

Register $N$ -bit storage with Write Enable control. Updates only at clock tick if $Write = 1$ .

Register File consists of 32 registers with two read ports (rs1/rs2) and one write port (rd).

Memory. Read ( $WE = 0$ ): Address → Data Out. Write ( $WE = 1$ ): Next clock tick, Data In → Address.
Assemble
- Instruction Fetch Unit Fetch the instruction and Update the program counter.
- Branch Operations Using ALU subtraction for branches risks overflow corrupting the sign, so RISC-V processors internally use flags (overflow and carry) or dedicated comparators for correct branch decisions without hardware traps.
- Add and Subtract R[rd] <- R[rs1] op R[rs2]
- Load/Store Operations A single ALU and register file need two multiplexers: one for ALU's second input (e.g. add rd, rs1, rs2 and addi rd, rs1, imm), another for register file's write data source (e.g. add x3, x1, x2 from ALU and lw x3, 8(x1) from memory).

Control Points and Signals

Control Signal	Description	Expression
`PCsrc`	Select the next PC value PC + 4 [ $0$ ] Branch / jump target address [ $1$ ]	`(Branch or jal/jalr) ? 1 : 0`
`RegWr`	Write enable for register file	`(Branch or Store) ? 0 : 1`
`MemRd`	Enable signal for reading data memory	`Load ? 1 : 0`
`MemWr`	Enable signal for writing data memory	`Store ? 1 : 0`
`ALUsrc`	Select the second ALU input register [ $0$ ] immediate value [ $1$ ]	`R-type ? 0 : 1`
`ALUctr`	Determines the ALU operation	/
`WBsel`	Selects the data written to the register file memory data [ $0$ ] ALU result [ $1$ ] PC + 4 [ $2$ ]	`Load: 0; R-type : 1; jal/jalr : 2`

On each clock cycle, the single‑cycle processor executes one instruction. State elements update at the rising edge using combinational logic outputs computed during the cycle.

Control

Asel rs1 or pc (When use jal, the target is PC + offset.)
Bsel rs2 or immed.

Why does RV32I still need a dedicated Branch Comparator despite having an ALU?

Use Branch Comparision to avoid substruction overflow (also faster than subtraction).
RV32I lacks architectural flags, it requires a dedicated comparator to enable single-cycle compare-and-branch operations by combining comparison and jump logic.
The ALU at the EXE stage is needed for R-type instructions and Load/Store instructions.

Pipelining

Five Stages

$IF$ Instrcution fetch from (instruction) memory
$ID$ Instrcution decode & register read
$EX$ Execute operation or calculate address
$MEM$ Access (data) memory operand
$WB$ Write the result back to register

Instructions	Detailed Stages
R-Type, I-Type, `jal`, `jarl`, `lui`, `auipc`	$IF \to ID \to EX \to WB$ (no $MEM$ )
Store, Branch	$IF \to ID \to EX \to MEM$ (no $WB$ )
Load	$IF \to ID \to EX \to MEM \to WB$

Throughput increases as more instructions complete per unit time, but single instruction latency (The time it takes for each instruction to be executed.) does not decrease and may even increase.
Pipeline rate is limited by the slowest stage.
Potential/ideal speedup = Number of pipeline stages (number of pipeline steps).
Need to reduce possible fill and drain (e.g., I-cache misses and branch misprediction)

Pipeline-oriented ISA Design

All instructions are fixed length.
Few and regular instruction formats.
Only load/store instructions accessing memory.
Instructions are simple.

Some principles of designing a pipelined datapath:

Multi-stage partitioning
Overlapped execution
Increased throughput
Improved resource utilization

Single Cycle & Multi Cycle & Pipeline

Implementation	Clock Cycle Time	Instruction Latency	CPI
Single-Cycle	Sum of all stage latencies	1 $\times$ Clock Cycle Time	$1$
Multi-Cycle	Longest stage latency	CPI $\times$ Clock Cycle Time	$> 1$
Pipelined	Longest stage latency	Number of Stages $\times$ Clock Cycle Time	$> 1$ (hazards)

Latency The total time required to complete one single instruction from start to finish. (Single-cycle data path actually has shorter instruction latency because of the setup time and propagation delay in pipeline.)
Throughput The total number of instructions completed per unit of time.
For superscalar implementation, CPI $< 1$ .
Single cycle as lower throughput but shorter instruction latency.
No arithmetic exceptions allow out‑of‑order retirement, avoiding stalls for long‑latency instructions.

Pipline Registers

Without pipeline registers, stages would overwrite each other’s data, causing instruction mix‑ups and loss of control information.

Pipeline Register	Stored Information
IF/ID	`PC_ID`, `inst_ID` (instrcution code like opcode)
ID/EX	`PC_EX`, `inst_EX`, `imm_EX`, `rs1_EX`, `rs2_EX`
EX/MEM	`PC_MEM`, `inst_MEM`, `rs2_MEM`, `alu_MEM`
MEM/WB	`PC+4_WB` (`rd = PC + 4`), `inst_WB`, `alu_WB`, `mem_WB`

Control signals are derived from instruction bits, that is, after the ID stage.
Control information for later stages are also stored in the pipeline registers.
Architecture States (visible to the programmer and compiler)
- Registers (general purpose, fp, vector, flags)
- PC
- Memory
It was saved during a context switch:
- user-level switch (e.g. procedure call) saved on the runtime program stack (caller and callee convention).
- system-level switch (e.g. process switch) saved on the Process Control Block (PCB) or the kernel stack.
Micro-architecture states (invisible to the programmer)
- $Piplined registers$ Pipelined registers hold transient data between stages for a few cycles.
- Branch predictors
- Caches
- Buffers and Quenes
- Counters

Stage	Data Path	Control
$ID$	`rs1`, `rs2` designators	`immed_sel`
$EX$	`rs1`, `rs2` contents, `PC`, `immed_value`	`Asel`, `Bsel`, `ALUsel`
$MEM$	`ALUout` (Address), `rs2` contents	`MemRW` (read/write), `BrLt`, `Breq`
$WB$	`ALUout`, `PC+4` (for link reg), `MEMout`	`Wbsel`, `rd` designator, `WRen`

Hazards

Structure Hazards

A conflict arising due to hardware resourece limitations within the pipeline.

Pipeline stalls
Multiple resources
Instruction Reordering

Static scheduling (by compiler) reorders instructions at compile time to avoid hazards, while dynamic scheduling (by hardware) reorders them at runtime based on actual data and resource availability.
ISA design

Split Instruction/Data caches can avoid structural hazards in the pipeline and increase the cache access bandwidth.

Data Hazards

A conflict arising because the current instruction depends on the result of a previous instruction that has not yet been computed or written back.

Scenario	Data Ready Stage	Data Used Stage	Example
Register Access Issues	$EX$	$ID$ It bypasses the ALU, read in ID and held until MEM for memory write.	`add t0, t1, t2` `sw t0, 4(t3)`
ALU Result Access Issue	$EX$	$EX$	`add s0, t0, t1` `sub t2, s0, t0`
Load Hazard	$MEM$	$EX$	`lw s1, 4(s0)` `add t0, s1, t1`

Solutions

Flow dependence (RAW) is a true data dependence, while anti-dependence (WAR) and output dependence (WAW) are name dependences that can be eliminated by register renaming.

Stall pipeline (Interlocking)
Data Forwarding (Bypassing)

Add forwarding control logic to make extra connections in the datapath.

Hazard detection compares $EX/MEM$ and $MEM/WB$ destination registers with current instruction’s source registers. (EXE-EXE EX/MEM.RegRd = ID/EX.RegRs1/2 and MEM-EXE MEM/WB.RegRd = ID/EX.RegRs1/2)

Forwarding is skipped when RegWr == 0 or when the destination register is x0.
Compiler Code Transformations

Scheduling (reordering) scope is often limited by branches, indirect branches, and call/ret. Memory aliasing (e.g., $p$ and $q$ may point to same address) and unknown latencies (e.g. cache hit/miss unpredictable) further restrict reordering. Compiler must conservatively preserve dependencies.

Control Hazards

Control hazards are due to branch instructions (conditional jumps) and JAL/JALR instructions (unconditional jumps).

Condition and Target Address are Ready at the $EXE$ Stage.

Static Branch Prediction (Compile Time)

Predict a backward branch as taken (loop back branches) and predict a forward branch as fall-through (the compiler always puts the then part in the fall-through path).
Conditional branch's (e.g. blt) behavior is entirely program‑dependent and cannot be accurately predicted using simple static rules. Static branch prediction does not read registers at runtime. It only considers whether the branch target is forward or backward.
Unconditional branches (e.g. jal) always jump, static prediction should always predict taken.
The ISA may reserve a bit in branch instructions as a prediction bit.
When branch prediction is wrong, pipeline flushing is performed.

Dynamic Branch Prediction (Run Time)

Branch History Table (BHT) (Branch prediction should occur at the instruction fetch (IF) stage.)

A branch history table stores past outcomes of branches, indexed by address, to predict future behavior and flush on misprediction.

Tag identifies which address occupies a cache slot. BHT omits tags because prediction bits (1–2) are tiny, tags huge (30–46 bits).
- 1-bit Predictor Flips on every misprediction, fails on nested loops.
- 2-bit Predictor Four-state FSM, changes only after two consecutive mispredictions, handles loops better.
Correlated Predictor

Global history register selects a 2-bit counter entry in the pattern table. Current PC selects which table to use, separating histories for different branches.

The Location Prediction

Branch Type	Prediction Mechanism	Example
Direct branch	Branch Target Buffer (BTB), fixed target	`beq`, `jal`
Indirect branch	Hard to predict	`jalr x0, 4(x1)` (Return branch, Switch statements and Function pointers)
Function return	Return Address Stack (RAS)	`ret`

Chapter 5 Optimizing Program Performance

Compiler Optimization

Compiler Optimization Levels

Level	Description
`-O0`	No optimizations, fast compile, for debugging.
`-O1`	Basic optimizations.
`-O2`	Recommended default: safe, stable, efficient.
`-O3`	Aggressive (loop unroll, SIMD). May increase code size, compile time, and even hurt performance.
`-O4`	`-O3` + Link-Time Optimization (LTO).

General Goals

Minimize the Number of Instructions

Common Subexpression Elimination (CSE) Calculate the same expression once and reuse the result.
Dead Code Elimination (DCE) Don't calculate values that are never used.
Strength Reduction (SR) Avoid slow instructions (multiplication/division).

Minimize the Execution Cycles

Register Allocation (RA) Keep frequently used variables in registers.
Code Scheduling Reorder instructions to avoid stalls.
Locality Improvement
Preload & Redundant Load Elimination Load a value once and reuse it, instead of reloading from memory.

Avoid Branching

Avoid Branching Conditional move in x86 (cmov, e.g. a = (b > c) ? b : c), conditional execution in ARM (e.g. addgt r0, r1, r2).
Loop Unrolling Reduce the number of branch instructions. Loop unrolling improves performance by reducing the number of branch evaluations and branch mispredictions, also creates opportunities for CSE, code motion, and scheduling.
Procedure Inlining Reduce the overhead of the call. However, inlining can cause register or I‑cache spilling when code grows too large. For recursive functions, inlining may lead to infinite expansion unless the compiler enforces a depth limit.
Unswitching Move a condition outside the loop to avoid branching inside the loop. (e.g. if (cond) { for (...) { ... } } else { for (...) { ... } })

Limitations

Compilers cannot change the algorithm.
Compilers must obey the rules and semantics of the programming language.
- Memory Aliasing Two pointers may point to the same memory location. Use a local variable for the intermediate value to avoid redundant loads or use the restrict qualifier to promise no aliasing.
- FP Associativity Floating-point addition is not associative.
- Function Side Effects A function may modify global state or have observable effects beyond its return value.
- Volatile Variables A variable declared as volatile can be changed by external factors.
- Memory Consistency In multi-threaded programs, compilers must respect memory ordering constraints.
Lack of runtime and domain knowledge.
Many specific optimization problems are NP-hard.
Boundary crossing issues (separate compilation).

Several Optimization Options

Flag	Description
`-O3 -flto`	Link-Time Optimization: enables whole-program optimization.
`-march=native`	Generate code optimized for the current CPU architecture (enables all supported instruction sets).
`-mtune=native`	Optimize instruction scheduling for the current CPU without breaking compatibility with older CPUs.
`-fprofile-generate`	Compile instrumented code to generate runtime profiles.
`-fprofile-use`	Use profile data to guide optimizations.
`-Ofast`	Alias for `-O3 -ffast-math`; allows aggressive FP reordering (may break IEEE compliance).
`-fopt-info-missed`	Report which optimizations were missed and why.

Chapter 6 The Memory Hierarchy

Storage Technologies

Access time is the same for all locations.

All unconventional DRAM chips offer much higher bandwidth, but the latency remains the same (The first data still takes the same time to arrive).

Feature	SRAM	DRAM
Transistors per bit	$[4, 8]$	$1$
Access time	1×	10×
Refresh	No	Yes
Error Detection and Correction	Optional	Yes
Cost	High	Low
Main applications	cache memories	main memory, frame buffers

Static Random Access Memory (SRAM)

Dynamic Random Access Memory (DRAM)

Reading DRAM Supercell Select row $i$ via RAS, load into buffer. Select column $j$ via CAS, output data, then rewrite row to refresh.
Memory Modules A 64-bit word is stored across eight $8 M \times 8$ DRAM chips in parallel, with each chip providing one byte (8 bits) at the same row and column address.

Memory Wall

The gap between the speed of processors and the main memory. The bottleneck has shifted from how fast memory can respond (latency) to how much data it can deliver per second (bandwidth).

Latency

Reduction local memory, NUMA, PIM (Reduce the waiting time for each visit.)
Hiding multi-threading/Hyper-threading, WARP interleaving, chip-multithreading (Keep computation units busy.)

Bandwidth

Memory bandwidth multi-banks and interleaved memory, SDRAM, HBM
Communication bandwith wider bus, interconnection network

Locality

Principle

Many Programs tend to use data and instructions with addresses near or equal to those they have used recently.

emporal locality

Recently referenced items are likely to be referenced again in the near future.

Spatial locality

Items with nearby addresses tend to be referenced close together in time.

What locality is exhibited by the following C loop?

while (A != NULL) A = A->next;

A. Data temporal locality B. Data spatial locality C. Structural locality D. Instruction temporal locality E. Instruction spatial locality

Answer

D and E. The loop repeatedly executes the same small set of contiguous instructions (instruction spatial locality) and reuses those instruction addresses across iterations (instruction temporal locality); data locality is absent because each list node is accessed only once and nodes may not be contiguous in memory.

Memory Hierarchy

Level Transfer	Staging Unit	Typical Size	Controlled By
Registers ↔ Memory	Instruction Operands	Bits / words (e.g., 32/64 bits)	Compiler (Programmer)
Cache / Local Memory ↔ Memory	Blocks / Lines	64 B (cache line)	Hardware (Cache Controller) / Compiler or Programmer (Local Memory)
Memory ↔ Disks	Pages	4 KB (page)	Hardware & OS (Virtual Memory) / Programmer (Files)
Disks ↔ Tapes	Files	Variable (e.g., 64 KB–1 MB)	Hardware / Operator or Programmer

This gives you Large, Cheap memory, but Fast access.

Caches provide automatic (transparent) data movement, while local memory requires explicit programmer-controlled data management.

Cache Management

General Cache Memory Organization

Direct Mapped Caches

Valid Bit

A cache line is invalid (valid bit equals to $0$ ) when:
- When a cache line of data has not come back from memory yet
- Cache line is currently being replaced
- Cache is flushed
Three Steps
- Set Selection Use the set index as an unsigned binary number to locate the cache set.
- Line Matching Compare the tag to determine if the desired block is present in the set.
- Word Extraction Use the block offset as a binary number to select the appropriate word within the cache line.

Why is the set index typically taken from the middle bits of the address rather than the high bits?

Using high bits as the set index causes consecutive memory blocks to map to the same set, leading to more conflict misses, whereas middle bits distribute blocks across sets to better exploit spatial locality.

Set Associative Caches

$N$ -way set-associative has $N$ lines per set while direct-mapped is one way and fully associative is one set.

Given a cache with total size $C$ measured in kilobytes, associativity $E$ , and block size $B$ measured in bytes. The number of sets $S$ is calculated as $S = \frac{1024 C}{BE}$ .

Performance Impact of Cache Parameters

Feature	L1 Cache	L2 Cache
Primary goal	Minimize hit time	Minimize miss rate
Locality preference	Spatial locality	Temporal locality
Typical write policy	Write-through	Write-back
Reason	L1 hit directly affects pipeline	L2 miss leads to high memory penalty
Associativity	Low	High
Size	Small	Large

When the block size is too small, the cache cannot fetch enough contiguous data in a single miss (lower spatial locality -- most accesses to nearby addresses still result in cache misses).
When the block size is too large, the fixed-size cache holds fewer blocks, leading to more conflicts, frequent replacements and higher miss penalty (lower temporal locality -- the recently used data is quickly evicted before it can be reused).
Number of cache lines is determined by the cache size and the line size.

Parameter / Change	Block Offset Bits	Set Index Bits	Tag Bits
Formula (64‑bit address)	$lo g_{2} B$	$lo g_{2} S$	$64 - lo g_{2} B - lo g_{2} S$
Increase block size $B$	$↑$	$↓$	—
Increase cache size $C$	—	$↑$	$↓$
Increase associativity $E$	—	$↓$	$↑$

Cache Hit and Miss

Cache Read

Read Hit
Read Miss When a read miss occurs, the CPU stalls the pipeline, fetches the missing block from the next memory level, and then resumes execution.

Cache Write

Write Hit
- Write Through
  
  It simultaneously updated to cache and memory. It ensures data isn't lost if the cache is disrupted, but increases memory traffic and write latency.
  
  Solution: write buffer. A write buffer hides write latency, but if store frequency exceeds the DRAM write cycle, the buffer saturates and stalls the CPU (Write buffer allows write-coalescing/combining to reduce traffic to memory.).
- Write Back
  
  The CPU writes data only to the cache initially, and main memory is not updated until the cache line is eventually replaced (need Dirty Bit). This reduces memory traffic and write latency, but creates cache-memory inconsistency, requiring cache coherence protocols (e.g., MESI) in multi-core systems. Data loss risk on cache failure is mitigated by Error Correcting Code (ECC) protection.
Write Miss
- Write Allocate First read the data from main memory and loaded into the cache, and then the write operation is performed on the cached copy.
- No Write Allocate It writes the data directly to main memory without loading the missing block into the cache (Better for streaming writes, e.g. writing database logs).

Aspect	Write Through + No Write Allocate	Write Back + Write Allocate
Write Miss Behavior	Bypass cache, write directly to main memory (optionally via write buffer + write-combining)	Load missing block from main memory into cache first, then write to cache
Typical Partner	Write Through Cache	Copy Back (Write Back) Cache
Initial Cost	Low (no main memory read)	High (one read miss)
Reuse Cost	High (data not in cache, still miss on reuse)	Low (subsequent reads/writes hit in cache, only set dirty bit)
Best For	Write once, never reused data (e.g., writing logs)	Repeated reads/writes to same data (temporal locality)

Instruction Cache Miss Handling

On an instruction cache miss, the CPU sends the PC to memory, waits for the read to complete, writes the fetched data into cache with its tag and valid bit set, then restarts the instruction fetch. (Instruction cache prefetching is a commonly used hardware optimization to reduce instruction cache misses.)

Can you prefetch for instruction cache in software?

Software cannot directly prefetch instructions (the fetch stage is PC-driven), unlike data prefetching. It can only indirectly affect I-cache hit rate via code layout or execution order.

If we prefetch on every miss, why not just use a larger cache line size?

Branch-sensitive Prefetching follows the predicted path, fetching only useful instructions. Large lines load all adjacent instructions regardless of branches, wasting bandwidth and cache space on dead code.
False sharing Large lines reduce effective cache capacity, more likely to cause conflict.
Bandwidth-aware Large line increases the cost of every miss. Prefetch is more flexible.

Why does the GPU prefetch for the next 10-12 lines, but the CPU only prefetches for 1-2 lines?

GPU prefetches 10–12 lines because it is throughput‑oriented with highly linear instruction flow, making deep prefetch safe, while CPU, being latency‑sensitive with frequent branches, only prefetches 1–2 lines to avoid polluting the cache and wasting bandwidth on wrong paths.

Types of Cache Misses

Compulsory (or Cold) Miss The first access to a block that has never been loaded into the cache before. (Hardware prefetching & larger cache line size to reduce compulsory misses.)
Conflict (or Collision) Miss Multiple references are mapping to the same set, and the set is not large enough to hold them.
Capacity Miss Cache is not large enough to hold needed blocks.
Coherence Miss Caused by invalidation from other processors in a multiprocessor system to maintain cache coherence.

Ways to Reduce Cache Miss Penalty

Technique	Core Idea	How It Hides Penalty
Critical Word First and Early Restart	Fetch requested word first; resume execution immediately	Reduces wait time; overlaps load with execution
Non-Blocking Cache	Continue on miss; track multiple misses with MSHRs	Overlaps multiple misses
Cache Prefetching	Fetch data before it is needed	Avoids miss entirely
Write Buffer	Hold dirty blocks temporarily; write back later	CPU doesn't wait for write
Cache-Aware Code Scheduling	Compiler reorders instructions	Overlaps miss with computation

Non-Blocking Cache and Blocking Cache

Blocking Cache

A cache miss stalls the CPU pipeline immediately (stall on miss).

Non-Blocking Cache (Lockup-Free Cache)

The CPU continues executing other instructions on a miss and stalls only when the data is needed, using Miss Status Holding Register (MSHRs) to track outstanding misses (stall on use). Non-blocking caches support MLP (Memory Level Parallelism).

Hit under Miss During one miss, the cache can still handle subsequent hits.
Miss under Miss (better) During one miss, the cache can also handle another miss, allowing multiple outstanding misses.

Cache Block Replacement

Direct Mapped

Each set has only one line, so the new block always replaces the existing block in that set.

Set ssociative or Fully Associative

Random
Least Recently Used (LRU) [stack algorithm]

Hardware keeps track of the access history and replaces the block that has not been used for the longest time.

Each block has a counter. On a hit, reset its counter to 0 and increment all other counters in the set. On a miss, replace the block with the highest counter value.
First In First Out (FIFO) The block that has been in the cache the longest is replaced, regardless of access history.
Belady’s Algorithm [upper bound]

Replace the block that will not be used for the longest time in the future. It is optimal but requires perfect knowledge of future accesses, making it impractical for real hardware.

Victim Cache

A small fully associative cache placed between the main cache and memory to hold recently evicted blocks, reducing conflict misses by providing a second chance for recently replaced data. Since such conflicts occur in only a few sets, just 4–8 entries suffice for fast access, combining the speed of direct-mapped L1 with the benefit of full associativity.

Snoopy Cache Coherence Schemes

Snoopy cache maintains coherence in bus-connected multi-processors by broadcasting all coherence operations over a shared bus. Every cache controller snoops the bus and reacts based on its local cache state. Each controller is a bidirectional state machine that processes CPU requests and bus snoop events, updating states according to a transition diagram.

Multiple controllers form a distributed algorithm operating at cache block granularity. The MESI protocol, with its four states (Modified, Exclusive, Shared, Invalid), tracks each block to enable coherence while minimizing bus traffic and memory accesses.

Exclusive/Inclusive Caches

Inclusive L2 caches simplify coherence invalidation in multiprocessor systems by checking at the L2 level, despite wasting capacity, which modern processors tolerate for easier consistency management, making inclusive more common than exclusive today.

Inclusive caches

Every line in L1 must be presented in L2 ( $L1 \subseteq L2$ ). When L2 evicts a line, it must invalidate the corresponding line in L1, and L2 can have a larger line size than L1.

Exclusive caches

If a line, A, is in L1, it must NOT be presented in L2. AMD had adopted this for L2/L3. When L1 evicts a line, it moves to L2. On an L1 miss that hits in L2, the line migrates from L2 to L1. And L1 and L2 must have the same line size.

NINE (Non-Inclusive and Non-Exclusive)

More suitable for L2/L3.

Average Memory Access Time (AMAT)

Usually, %instr is higher than %data, but I-cache hits better because instruction accesses are highly sequential with strong spatial locality, have no write misses (read-only), and exhibit simpler access patterns than data references, which often involve random pointers or strided array accesses.

AMAT was accurate for blocking caches and in-order CPUs, but lost relevance as out-of-order execution, non-blocking caches, prefetching hid miss latency and spliting I/D caches.

Low Hit Latency

Small and Simple
Direct-map or low associativity Number of comparators needed equals total cache entries for fully associative, but only $n$ for $n$ -way set associative.
Prediction Predict which way to compare first.
Virtual address cache or virtual index and physical tagged cache Avoid or overlap with address translation delay.
Cache layout closer to CPU to minimize signal delay

Chapter 7 Linking

Why Linking Matters

Modularity

Separate compilation allows developers to work on different modules independently, improving productivity and enabling code reuse.

Efficiency

Time Separate compilation & Parallel compilation.
Space Static linking (Executable files and running memory images contain only the library code they actually use.) & Dynamic linking (Executable files contain no library code. During execution, a single copy of the library code is shared among all executing processes.)

Static Linking

Symbol Resolution

Symbol tables in object files record symbol definitions (name, size, location), and the linker matches each symbol reference to exactly one definition during symbol resolution.

Link Symbols

Type	Description	Example
Global symbols	Defined by module m, can be referenced by other modules.	`int global_var = 10;` `void func() { }`
External symbols	Referenced by module m but defined by some other module.	`extern int global_var;` `extern void func();`
Local symbols	Defined and referenced exclusively by module m.	`static int local_var = 10;` `static void helper() { }`

Strong symbols are procedures and initialized globals, while weak symbols are uninitialized globals. Linker's Symbol Rules: Multiple strong symbols are forbidden; one strong wins over weak; multiple weak are arbitrary, unless -fno-common forces an error.

Example

file1.c

int x = 10; // strong symbol   
int y = 5;  // strong symbol
void p1() {printf("x = %d, y = %d\n", x, y);}

file2.c

double x; // weak symbol
void p2() {printf("x = %f\n", x);}

Writes to $x$ in p2 might overwrite $y$ .

Avoid global symbols if possible, otherwise use static to limit scope, initialize globals to make them strong, and reference external globals with extern to avoid linker errors.

Static Libraries

Concatenate related relocatable object files into a single file with an index (called an archive). For example, ar rs libfoo.a a.o b.o c.o creates a static library libfoo.a containing a.o, b.o, and c.o.

Linkers scan files left to right, maintain an unresolved reference list, and only extract archive members that resolve current entries, so libraries must come at the end of the command line.

The solution to deal with the order sensitivity and duplicate symbol definitions is to introduce Shared libraries are object files whose code and data are loaded and linked into an application dynamically, either at load time or at run time.

Relocation

It first merges all separate code and data sections into single sections. Then, it relocates symbols from their relative offsets in .o files to final absolute addresses in the executable. Finally, it updates every reference to these symbols to reflect their new positions.

Object Files

Classification

Type	Extension (Linux/Unix)	Description
Relocatable Object File	`.o`	Contains code and data in a form that can be combined with other relocatable object files to form an executable object file. Each `.o` file is produced from exactly one source (`.c`) file.
Executable Object File	`a.out`	Contains code and data in a form that can be copied directly into memory and then executed.
Shared Object File	`.so`	Special type of relocatable object file that can be loaded into memory and linked dynamically, at either load time or run-time. It also called Dynamic Link Libraries (DLLs) by Windows.

ELF Object File Format

Component	Description
ELF Header	Word size, byte ordering, file type (`.o` / executable / `.so`), machine type, etc.
Segment Header Table (for OS)	Page size, virtual address memory segments (sections), segment sizes.
.text	Code
.rodata	Read-only data: jump tables, string constants, etc.
.data	Initialized global variables and initialized local static variables.
.bss	Uninitialized global variables and uninitialized local static variables. Has section header but occupies no space.
.symtab	Symbol table: procedure names, global variable names, static variable names, external symbols, section names, and their locations.
.rel.text	Relocation info for `.text` section: addresses of instructions that will need to be modified in the executable, plus instructions for modifying.
.rel.data	Relocation info for `.data` section: addresses of pointer data that will need to be modified in the merged executable.
.debug	Info for symbolic debugging (generated with `gcc -g`).
Section Header Table (for execution and debugging)	Offsets and sizes of each section.

Using the command readelf -s <file> to view the symbol table of an Executable and Linkable Format (ELF) file.
Note that Local non-static variables are stored on the stack.
It will create local symbols in the symbol table with unique names.

Executable Object Files

Load Executable Object Files

Dynamic Linking with Shared Libraries

Load-Time Linking

It occurs when the executable is first loaded and run. The standard C library (libc.so) is typically dynamically linked this way.

LD_PRELOAD forces the dynamic linker to load a user-specified shared library first, allowing interception of standard functions, but cannot intercept statically linked _start.

Run-Time Linking

It occurs after the program has begun, using calls to the dlopen() interface on Linux.

Position-Independent Code (PIC)

To exploit the fact that the distance between any instruction in the code segment and any variable in the data segment is a runtime constant, the compiler creates a table called the Global Offset Table (GOT) that holds the absolute addresses of global variables.

Global Offset Table (GOT)

Resides in the data segment and stores the actual absolute addresses of external functions and global variables. These addresses are filled in by the dynamic linker at load time or on the first call.

Procedure Linkage Table (PLT)

A memory section containing small executable stubs for each external function the program calls. It lives in a read-only code section (.text or .plt), providing a trampoline that jumps through the GOT to the actual function.

.got stores addresses of global variables for direct access, while .got.plt stores addresses of external functions specifically for PLT stubs

Whole Process

Step	Stage	Action
1	Compile (static time)	Compiler generates pseudo instruction: call printf
2	Compile (static time)	Assembler expands to `auipc+jalr`, adds reloc entries
3	Link (walk time)	Linker creates .plt stub (`auipc+ld+jalr`)
4	Link (walk time)	Linker allocates GOT entry (44B for 32b, 8B for 64b)
5	Dynamic Linking (run time)	Dynamic linker fills GOT with real printf address

Chapter3

CISC [optimized for compact code size]

Size of Data Type in IA32

C declaration	Intel data type	Assembly-code suffix	Size (bytes)
$char$	Byte	$b$	$1$
$short$	Word	$w$	$2$
$int$	Double word	$l$	$4$
$long$	Quad word	$q$	$8$
$char *$	Quad word	$q$	$8$
$float$	Single precision	$s$	$4$
$double$	Double precision	$l$	$8$

Register

Operands

Immediate
Register
Memory Reference Immediate & Base Register & Index Register & Scale Factor (1,2,4,8)

Type	Form	Operator Value	Name
Immediate	$$ I mm$	$I mm$	Immediate
Register	$r_{a}$	$R [r_{a}]$	Register
Memory	$I mm (r_{b}, r_{i}, s)$	$M [I mm + R [r_{b}] + R [r_{i}] \cdot s]$	Scaled Indexed
Memory	$I mm$	$M [I mm]$	Absolute
Memory	$(r_{a})$	$M [R [r_{a}]]$	Indirect
Memory	$I mm (r_{b})$	$M [I mm + R [r_{b}]]$	Base + Displacement
Memory	$(r_{b}, r_{i})$	$M [R [r_{b}] + R [r_{i}]]$	Indexed
Memory	$I mm (r_{b}, r_{i})$	$M [I mm + R [r_{b}] + R [r_{i}]]$	Indexed
Memory	$(, r_{i}, s)$	$M [R [r_{i}] \cdot s]$	Scale Indexed
Memory	$I mm (, r_{i}, s)$	$M [I mm + R [r_{i}] \cdot s]$	Scale Indexed
Memory	$(r_{b}, r_{i}, s)$	$M [R [r_{b}] + R [r_{i}] \cdot s]$	Scale Indexed

Operation code

Movement
- $mov S,D$ [movb, movw, movl, movq, movabsq]
  
  Source operand (Immedita, Register, Memory); Destination operand (Register, Memory)
  
  Two operands of a move instruction cannot both be memory references. (It is necessary to first move the source memory reference into a register, and then move it to the destination memory reference.)
  
  movl writes 32 bits of data to the lower half and then automatically zero-extends the value into the upper 32 bits. (In x86-64, any instruction that generates a 32-bit result for a register will set the upper 32 bits of that register to zero.)
  
  movq can only be 32 bits (sign-extended to 64 bits), whereas movabsq can be 64-bit but the destination must be a register.
- $movz S, R$ [movzbw, movzbl, movzwl, movzbq, movzwq]
  
  movzlq does not exist because of movl instruction.
- $movs S, R$ [movsbw, movsbl, movswl, movsbq, movswq, movslq, cltq]
  
  cltq same as movslq %eax, %rax

Stack

let a quad number be the example

$pushq S$
```
subq $8 %rsp
movq %rbq,(%rsp)
```
$popq S$
```
movq (%rsq),%rax
addq $8, %rsp
```

Load Effective Address

$leaq S, D$

Computes effective address and stores it in destination register

Unary Operations

Instructions	Effect
`inc D`	$D \leftarrow D + 1$
`dec D`	$D \leftarrow D - 1$
`neg D`	$D \leftarrow - D$
`not D`	$D \leftarrow\sim D$

Binary Operations

Instructions	Effect
`add S D`	$D \leftarrow D + S$
`sub S D`	$D \leftarrow D - S$
`imul S D`	$D \leftarrow D \times S$
`xor S D`	$D \leftarrow D \oplus S$
`or S D`	$D \leftarrow D ∣ S$
`and S D`	$D \leftarrow D & S$

Shift Operations

Instructions	Effect
`sal D`	$D \leftarrow D << k$
`shl D`	$D \leftarrow D << k$
`sar D`	$D \leftarrow D > >_{A} k$ Arithmetic Right Shift
`shr D`	$D \leftarrow D > >_{L} k$ Logical Right Shift

Shift instructions can shift by an immediate value, or by a value placed in the single-byte register %cl.

Condition Code

$CF$ Carry Flag [check for overflow in unsigned operations]
$ZF$ Zero Flag
$SF$ Sign Flag
$OF$ Overflow Flag
$cmp S D$

Similar to sub, but it only sets the condition codes without changing the value of the destination register.

When S == D, $ZF = 1$ .

It will set the condition codes based on the result of $D - S$ .
$test S D$

Similar to and, but it only sets the condition codes without changing the value of the destination register.

When S & D == 0, $ZF = 1$ .

Set Instructions

Instruction	Effect	Description
`sete D`	$D \leftarrow ZF$	$=$
`setne D`	$D \leftarrow\sim ZF$	$\neq =$
`sets D`	$D \leftarrow SF$	negative
`setns D`	$D \leftarrow\sim SF$	nonnegative
`setg D`	$D \leftarrow\sim (SF \oplus OF) & \sim ZF$	Signed $>$
`setge D`	$D \leftarrow\sim (SF \oplus OF)$	Signed $\geq$
`setl D`	$D \leftarrow SF \oplus OF$	Signed $<$
`setle D`	$D \leftarrow (SF \oplus OF) ∣ ZF$	Signed $\leq$
`seta D`	$D \leftarrow\sim CF & \sim ZF$	Unsigned $>$
`setae D`	$D \leftarrow\sim CF$	Unsigned $\geq$
`setb D`	$D \leftarrow CF$	Unsigned $<$
`setbe D`	$D \leftarrow CF ∣ ZF$	Unsigned $\leq$

Jump Instructions

Instruction	Jump Condition	Description
`jmp Label`	$true$	Direct Jump
`jmp *Operand`	$true$	Indirect Jump
`je Label`	$ZF$	$=$
`jne Label`	$\sim ZF$	$\neq =$
`js Label`	$SF$	Negative
`jns Label`	$\sim SF$	Nonnegative
`jg Label`	$\sim (SF \oplus OF) & \sim ZF$	Signed $>$
`jge Label`	$\sim (SF \oplus OF)$	Signed $\geq$
`jl Label`	$SF \oplus OF$	Signed $<$
`jle Label`	$(SF \oplus OF) ∣ ZF$	Signed $\leq$
`ja Label`	$\sim CF & \sim ZF$	Unsigned $>$
`jae Label`	$\sim CF$	Unsigned $\geq$
`jb Label`	$CF$	Unsigned $<$
`jbe Label`	$CF ∣ ZF$	Unsigned $\leq$

Conditional Move Instructions

Instruction	Move Condition	Description
`cmove S, R`	$ZF$	$=$
`cmovne S, R`	$\sim ZF$	$\neq =$
`cmovs S, R`	$SF$	Negative
`cmovns S, R`	$\sim SF$	Nonnegative
`cmovg S, R`	$\sim (SF \oplus OF) & \sim ZF$	Signed $>$
`cmovge S, R`	$\sim (SF \oplus OF)$	Signed $\geq$
`cmovl S, R`	$SF \oplus OF$	Signed $<$
`cmovle S, R`	$(SF \oplus OF) ∣ ZF$	Signed $\leq$
`cmova S, R`	$\sim CF & \sim ZF$	Unsigned $>$
`cmovae S, R`	$\sim CF$	Unsigned $\geq$
`cmovb S, R`	$CF$	Unsigned $<$
`cmovbe S, R`	$CF ∣ ZF$	Unsigned $\leq$

Loop

do-while, while, and for loops can be implemented by combining conditional tests and jumps.

Switch

switch is compiled into a jump table. The execution time of the switch statement is irrelevant to the number of cases.

Transfer Control

Instruction	Description
`call Label`	Procedure Call (Direct)
`call *Operand`	Procedure Call (Indirect)
`ret`	Return from Procedure Call

Data Transfer

The registers' names depend on the size of the data type being passed.

If a function has more than 6 integer parameters, the additional arguments must be passed on the stack. Note that the argument 7 is located at the top of the stack. All stack-passed arguments are aligned to multiples of 8 bytes.

Pointer

Expression	Type	Value	Assembly Code
`E`	`int*`	$x_{E}$	`movq %rdx, %rax`
`E[i]`	`int`	$M [x_{E} + 4 i]$	`movl (%rdx, %rcx, 4), %eax`
`&E[i]`	`int*`	$x_{E} + 4 i$	`leaq 8(%rdx, %rcx, 4), %rax`
`&E[i] - E`	`long`	$i$	`movq %rcx, %rax`

Notes:

&E[i] - E yields type long (specifically ptrdiff_t) because pointer subtraction returns the number of elements between them as a signed integer.
int* uses movq (not movl) because pointers are 64-bit addresses in x86-64.

Nested Arrays

T D[R][C] $& D [R] [C] = x_{D} + L (C \cdot i + j)$ , $L$ is the size of data type T in bytes.

Struture & Union

Structure

$K$	Types
$1$	$char$
$2$	$short$
$4$	$int, float$
$8$	$long, double, char *$

Notes:

The address of any K-byte basic object must be a multiple of $K$ .
The offset of the first member is always 0 in a structure.
The compilers probably add some bytes at the end of a structure to ensure that each element in an array of structures could satisfy its alignment requirements.

Union

The total size of a union equals the size of its largest field.

Application

Buffer Overflow

Three protection mechanisms to thwart buffer overflow attacks:

Stack Randomization Address-Space Layout Randomization (ASLR)
Stack Corruption Detection Place a random canary value between local buffers and critical stack data to detect buffer overflows.
Limiting Executable Code Regions

Dynamic Stack Frame

%rbp is base pointer/frame pointer dynamic stack frame

Floating-point Code

Chapter 4 Processor Architecture

Logic Design and Hardware Control Language (HCL)

Logic Gate

CMOS

Process

$Fetch \to Decode \to Execute \to Memory \to Write Back \to Update$

For Y86-84 :

Fetch
- Read instruction bytes from memory using PC:
  - icode:ifun (instruction specifier)
  - rA, rB (register specifiers)
  - valC (8-byte constant)
- Compute next instruction address → valP
Decode
- Read up to two operands from registers
- rA, rB → valA, valB
- For popq, pushq, call, ret: also read from %rsp
Execute
- Compute (ALU/address/stack) → valE
- Set condition codes
- Conditional moves: check CC, update dest register
- Jumps: decide branch
Memory
- Write to memory
- Read from memory → valM
Write Back
- Write up to two results to registers
Update
- Set PC to next instruction address

Pipelining

Throughput the number of instructions in one second.

GIPS: Giga-Instructions Per Second
Circuit Retiming

Repositions registers within a design to optimize performance or area, without altering the system’s input/output behavior. It changes the state encoding by moving delays (registers) across combinational logic elements.

Hazards

The dependency between instructions may lead to incorrect computation results.

Data Hazards

Stalling

Stalling prevents data hazards by pausing the pipeline until required operands are ready. This approach is simple but reduces performance.
Data Forwarding (Bypassing)

Data forwarding allows a result computed in a later pipeline stage to be passed directly to an earlier stage as a source operand.
Load/Use Hazards

A load/use hazards occurs when an instruction requires a value that is still being loaded from memory by the previous instruction. Since the memory read completes late in the pipeline, forwarding cannot resolve the dependency in time, causing a stall.

Control Hazards

Control hazards occur when the processor cannot determine the next instruction's address during the fetch stage. In a pipelined processor, this typically happens for ret instructions and mispredicted conditional jumps.

ret Inserting bubbles to wait for the return address to be determined.
Conditional Jump Branch mispredictions are handled by cancelling the erroneously fetched instructions through the insertion of pipeline bubbles (or via instruction squashing) in the decode and execute stages.

Pipelined Y86-64 Implementations

Fetch Stage

Sequential Execution

halt, nop, rrmovq, irmovq , mrmovq, Opq, pushq, popq, covXX, addq

Jump Execution

call, jxx (The Select PC mechanism is employed to correct branch mispredictions.)

ret

Decode Stage

The selection logic chooses between the merged valA/valP signals and one of five forwarding sources (e_valE, m_valM, M_valE, W_valM, or W_valE) based on the instruction type and hazard detection. If no forwarding is triggered, d_rvalA (the register file read) is used as the default value.

Pipeline Control Logic

Stall

Keep the state of the pipeline registers unchanged. When the stall signal is set to 1, the register will retain its previous state.
Insert Bubbles

Setting the state of the pipeline registers to a value equivalent to a nop instruction. When the bubble signal is set to 1, the state of the register will be set to a fixed reset configuration.

Load/Use Data

A one-cycle pipeline stall is required between an instruction that reads a memory value in the execute stage and a subsequent instruction that uses that value in the decode stage.

Mispredicted Branches

When a branch is mispredicted as taken, the pipeline must squash the incorrectly fetched instructions from the target path and redirect fetching to the correct fall-through address.

Processing `ret`

The pipeline must stall until the ret instruction reaches the write-back stage.

Exception Handling

When an instruction triggers an exception, all subsequent instructions must be prevented from updating programmer-visible state, and execution must halt once the exception-causing instruction reaches the write-back stage.

Condition	F	D	E	M	W
Handling `ret`	Stall	Bubble	Normal	Normal	Normal
Load/Use Hazard	Stall	Stall	Bubble	Normal	Normal
Mispredicted Branch	Normal	Bubble	Bubble	Normal	Normal

Chapter 5 Optimizing Program Performance

Capabilities and Limitations of Optimizing Compilers

Memory Aliasing

The two pointers probably reference the same memory address.

Function Calls

Performance

CPE: Cycles Per Element.

Understanding Modern Processors

Instruction Control Unit (ICU) & Execution Unit (EU)

The ICU reads instructions from the instruction cache and uses branch prediction to speculatively fetch and decode future instructions in advance. This speculative execution mechanism allows the processor to start executing operations along the predicted program path before the actual branch outcome is known. If the prediction is incorrect, all speculative results are discarded and execution restarts from the correct branch target.

Decoded instructions are broken down into micro-operations, which are then sent to the EU for execution. The EU includes arithmetic units and dedicated load/store units, each with address calculation logic, to perform memory operations via the data cache.

All along, the retirement unit tracks instruction progress to ensure sequential program semantics are preserved. Instructions remain in a queue until their results are ready and all relevant branch predictions are verified. Only then are results committed to the architectural registers. If a branch was mispredicted, the affected instructions are flushed and their results discarded, preventing any erroneous state change in the program.

Register renaming allows out-of-order execution by assigning unique tags to instruction results. When a later instruction needs a register value, it uses the tag to get the result directly once ready, avoiding waits for register writes and enabling speculative execution.

Each operation is characterized by the following metrics:

latency It represents the total time required to complete the operation.
Issue time It indicates the minimum number of clock cycles needed between two consecutive operations of the same type.
Capacity It refers to the number of functional units capable of executing that operation.

CSC4303 Network Programming

Reference

Reference

Network Model

Layer	Description	Example
Application	Programs that use network service	HTTP, DNS, CDNs
Transport	Provides end-to-end data delivery	TCP, UDP
Network	Send packets over multiple networks	IP, NAT, BGP
Link	Send frames over one or more links	Ethernet, 802.11
Physical	Send bits using signals	wires, fiber, wireless

Transport Layer

User Datagram Protocol (UDP)

Connectionless

Each packet is independent.

Sender    Time       Receiver
  |                     |
  |----- Packet1 -----> |
  |                     |
  |----- Packet2 -----> |
  |                     |

Buffering

UDP buffering is a "temporary storage area" maintained by the operating system for each UDP port, where arriving packets queue up when applications can't process them immediately.

Application A  Application B   Application C
    ↓               ↓               ↓
[Port X]        [Port Y]        [Port Z]      ← Port mapping
    ↓               ↓               ↓
[Queue 1]       [Queue 2]       [Queue 3]     ← Independent UDP message queues
    │               │               │
    └───────┬───────┴───────┬───────┘
            ↓               ↓
    [Port Multiplexer/Demultiplexer]          ← Routes by port number
            ↓
    [Incoming UDP packets]

Header

Note that the Datagram length up to 64K.

     32-bit width (4 bytes per row)
0                  16                31
┌──────────────────┬──────────────────┐
│ Source Port      │ Destination Port │ ← Port addressing
│   (16 bits)      │   (16 bits)      │
├──────────────────┴──────────────────┤
│    Length        │    Checksum      │ ← Size & integrity
│    (16 bits)     │    (16 bits)     │
├─────────────────────────────────────┤
│                                     │
│           Application Data          │
│                                     │
└─────────────────────────────────────┘

Transmission Control Protocol (TCP)

Header

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        | ← Identifies sockets
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number (32)                   | ← Byte-based sequencing
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgment Number (32)                 | ← Next expected byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data Offset  |  0  |  Flags  |       Window Size (16)        | ← Control & flow control
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Checksum (16)             |       Urgent Pointer (16)     | ← Integrity & urgent data
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Options (variable)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Data                                |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Connection Establishment (Setup)

Three-Way Handshake

Both parties send Initial Sequence Numbers (ISNs) via SYNchronize segments. Each party acknowledges the other's sequence number using ACKnowledge segments.

Step	Initiator (Client)	Receiver (Server)	Description
First Handshake	Sends $SYN = 1, seq = x$	Waits for connection request	Client requests to establish a connection and sends its ISN $x$ .
Second Handshake	Waits for acknowledgment	Sends $SYN = 1, ACK = 1, seq = y, ack = x + 1$ It acknowledges receipt through sequence number $x$ .	Server agrees to connect, sends its own ISN $y$ , and acknowledges the receipt of $x$ .
Third Handshake	Sends $ACK = 1, seq = x + 1, ack = y + 1$	Connection established	Client acknowledges the receipt of $y$ , and the connection is formally established.

Three-way handshake prevents a server from wasting resources on stale or duplicate connection requests by requiring the client's final acknowledgment.

State Machine
- Client Path CLOSED → connect() → SYN_SENT → Receive SYN+ACK → ESTABLISHED
- Server Path CLOSED → listen() → LISTEN → Receive SYN → SYN_RCVD → Receive ACK → ESTABLISHED
- Both parties run instances of this state machine, and TCP allows for simultaneous open.

Connection Release (Teardown)

Four-Way Handshake (symmetric)

Step	Initiator (Active Closer)	Receiver (Passive Closer)	Description
First Wave	Sends $FIN = 1, seq = u$	Waits for close request	Initiator has finished sending data and requests to close its sending channel.
Second Wave	Waits for remaining data	Sends $ACK = 1, ack = u + 1$	Receiver acknowledges the close request but may still have data to send.
Third Wave	Waits for confirmation	Sends $FIN = 1, seq = v$	Receiver has finished sending data and requests to close its sending channel.
Fourth Wave	Sends $ACK = 1, ack = v + 1$	Connection closed	Initiator acknowledges the close request.

State Machine
- Active Closer Path ESTABLISHED → close() → FIN_WAIT_1 → Receive ACK → FIN_WAIT_2 → Receive FIN → TIME_WAIT → Timeout → CLOSED
- Passive Closer Path ESTABLISHED → Receive FIN → CLOSE_WAIT → close() → LAST_ACK → Receive ACK → CLOSED
- TIME_WAIT State
  - $2 \times MSL$ (Maximum Segment Lifetime).
  - Lost ACKs can be recovered.
  - Old segments won't confuse new connections.

Flow Control (Receiver processing limitation)

Automatic Repeat Query

ARQ with one message at a time is Stop-and-Wait. It allows only a single message to be outstanding from sender.
Sliding Window

It allows $W$ outstanding packets, enabling pipelining to send multiple packets per RTT for improved performance.

Sender buffers up to $W$ segments until they are acknowledged. The last frame sent minus last ack rec'd should no more than $W$ .
- Go-Back-N
  
  Only buffers next expected packet (LAS). Accepts only sequential packets, discards others, sends cumulative ACK.
- Selective Repeat
  
  Buffers entire window. Stores out-of-order packets, sends individual ACKs, retransmits only lost packets.
- Sequence Number
  
  $n$ bit counter wraps around at $2^{n} - 1$ . Let $LAR$ be Last Acknowledgement Received, $LAS$ be Last Acknowledgement Sent.
  
  Method Sender's Range Receiver's Range Min Number Needed to Avoid Overlap
  
  Go-Back-N $[LAR + 1, LAR + W]$ $LAS + 1$ $W + 1$
  
  Selective Repeat $[LAR + 1, LAR + W]$ $[LAS + 1, LAS + W]$ $2 W$

Method	Sender's Range	Receiver's Range	Min Number Needed to Avoid Overlap
Go-Back-N	$[LAR + 1, LAR + W]$	$LAS + 1$	$W + 1$
Selective Repeat	$[LAR + 1, LAR + W]$	$[LAS + 1, LAS + W]$	$2 W$

Pacing

ACK Clocking (sender)

ACK clocking is a feedback mechanism where the network itself determines the sending pace, preventing queue buildup and ensuring efficient, low-latency data flow.
Flow Control (Receiver)

Flow control uses the WIN field, calculated as WIN = ReceiveBuffer - (LastByteRcvd - LastByteRead), to dynamically limit the sender's window, preventing receiver buffer overflow.

$SEQ + length < ACK + WIN$

Adaptive Timeout

Name	Formula
$SRTT$ (average round‑trip time)	$SRTT_{n + 1} = (1 - α) \cdot SRTT_{n} + α \cdot RTT_{n + 1}$
$Svar$ (variability of RTT)	$Svar_{n + 1} = (1 - β) \cdot Svar_{n} + β \cdot ∣ RTT_{n + 1} - SRTT_{n + 1} ∣$
$Timeout$	$Timeout_{n} = SRTT_{n} + 4 \cdot Svar_{n}$

Congestion Control (Network capacity limitation) TCP congestion control uses a sliding window (cwnd), interprets packet loss as a congestion signal, and adjusts the window via AIMD to achieve an efficient and roughly fair bandwidth allocation.
- Max-Min fairness
  
  Increse the rate of one flow will decrease the rate of a smaller flow.
  
  Step 1 Initialize all flows at zero
  Step 2 Increase all flows equally
  Step 3 Freeze bottlenecked flows
  Step 4 Repeat for remaining flows
- Bandwidth allocation
  
  Network layer provides direct feedback & Transport layer reduces load [Network is distributed, no single party has an overall picture of its state.]
  - Models
    - Open loop [reserve] & Closed loop [use feedback to adjust rates]
    - Host support & Network support
    - Window based & Rate based
    TCP is a closed loop,host-driven, and window-based
  - AIMD(Additive Increase Multiplicative Decrease)
    - Hosts additively increase rate while network not congested
    - Hosts multiplicatively decrease rate when congested
- TCP Tahoe/Reno
  - Slow Start
    
    For each ACK received: $cwnd = cwnd + 1$ , $cwnd$ doubles every RTT.
  - Later Additive Increase (AI)
    
    For each ACK received: $cwnd \leftarrow cwnd + \frac{1}{cwnd}$ , roughly adds 1 packet per RTT.
  - Switching Threshold (Initially Infinity)
    
    Switch to AI when When $cwnd > ssthresh$ , and $ssthresh = \frac{cwnd}{2}$ after loss. Begin with slow start after timeout ( $cwnd = 1$ ).
  - Fast Retransmit
    
    TCP's cumulative ACKs allow duplicate ACKs to signal lost packets for fast retransmission.(Treat three duplicate ACKs as a loss.)
  - Fast Recovery (TCP Reno)
    
    Fast recovery keeps data flowing during retransmission by maintaining the ACK clock. It avoids resetting to Slow Start after packet loss. Set $ssthresh = \frac{cwnd}{2}$ and then continue AI directly.

UDP & TCP

Feature	TCP	UDP
Mode	Connections	Datagrams
Reliability	No loss, no duplicates, in-order delivery	May lose, reorder, or duplicate packets
Data Size	Unlimited	Limited
Flow Control	Flow control matches sender to receiver	Send regardless of receiver state
Congestion Control	Congestion control matches sender to network	Send regardless of network state
Example	Files, Web pages	Voice, video, DNS

Socket

An endpoint for network communication that allows an application to attach to a specific port on the local network interface, enabling data exchange with other applications across the network.

Process

Cilent socket () ------------------------> connect () -> I/O -> close ()

Server socket () -> bind () -> listen () -> accept () -> I/O -> close ()
1. The server blocks in accept() on listenfd for incoming connections.
2. The client calls connect() to initiate the TCP handshake, which also blocks.
3. Upon completion, the server's accept() returns a connfd and the client's connect() returns, establishing a bidirectional channel between clientfd and connfd.
Create int socket (int domain, int type, int protocol);
- domain AF_INET IPv4 / AF_INET6 IPv6.
- type SOCK_STREAM TCP / SOCK_DGRAM UDP.
- protocol 0 (automatically select the default protocol).
Bind int bind (int sockfd, sockaddr *addr, socklen_t addrlen);

When a server binds a socket to a specific address, it establishes that data arriving at that address is read from this socket, and data written to this socket is sent from that address.
- socket address
```
struct sockaddr
{
    uint_16 ss_family; // protocol family
    char ss_data[14]; // address data
}
```
- sockaddr_in
```
struct sockaddr_in
{
    uint16_t sin_family; // always AF_INET
    uint16_t sin_port; // in network byte order
    struct in_addr sin_addr; // in network byte order
    unsigned char sin_zero[8]; // pad to sizeof (struct sockaddr), placeholder
}
```
  Network data is transmitted in big-endian byte order. Use htons() and htonl() to convert host-ordered values into network-ordered short and long integers, respectively.
Listen int listen (int sockfd, int backlog);

listen () transforms a socket descriptor into a listening socket capable of accepting client connections. The backlog parameter suggests how many pending connections the kernel may queue before refusing new requests (typically around 128).
Accept int accpet (int listenfd, sockaddr *addr, int *addrlen);

accept() blocks until a client connects via listenfd, stores the client's address in addr, and returns a connfd for communication using standard Unix I/O functions.

Connect int connect (int clientfd, sockaddr *addr, socklen_t addrlen);

Attempts to establish a connection with server at socket address addr.

To convert a string-formatted IP address (e.g., SERVER_IP), use inet_pton (int af, const char *src, void *dst) to transform it into binary network format.

Feature	`listenfd`	`connfd`
Purpose	Accepts connection requests	Handles data exchange with a client
Creation	Once at server start	Per accepted connection
Lifetime	Entire server runtime	Duration of client service
Quantity	One per port	One per active client
Concurrent Use	Listens for all clients	Dedicated to single client

Send and Receive

ssize_t read (int fd, void *buf, size_t len); returns the number of bytes actually read (0 indicates the connection is closed, -1 indicates an error).

ssize_t write (int fa, const void *buf, size_t len); returns the number of bytes actually written (-1 indicates an error).
Close int close (int sockfd)
Concurrency
- Threading/Process Race conditions increase complexity!
- I/O Multiplexing
  - select()
  - poll()
    
    poll() is a system call used to monitor multiple file descriptors concurrently for I/O readiness. Unlike select(), it imposes no fixed limit on the number of file descriptors that can be monitored.
```
struct pollfd
{
    int fd; // file descriptor
    short events; // requested events
    short revents; // returned events
}   
```

Network Layer

Internet Protocol (IP)

IPV4 Header

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| IHL   |Type of Service|        Total Length (16)      | ← Basic identification (IHL : Header)
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        Identification (16)    |Flags|   Fragment Offset (13)  | ← Fragmentation control
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |    Protocol   |        Header Checksum (16)   | ← Routing & integrity (Time to Live : TTL)
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Source Address (32)                     | ← Sender IP
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Destination Address (32)                   | ← Receiver IP
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options (if any, variable)                 | ← Optional features
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Payload (Data)                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

IP Addresses An IP address is assigned to each network interface. Consequently, routers possess multiple interfaces, while most hosts have just one or two (wired and wireless).
IP Prefixes In a prefix of length L, the top L bits are identical across all addresses. In CIDR notation, such a prefix is denoted as IP address/prefix length (e.g. 128.13.0.0/16 is 128.13.0.0 to 128.13.255.255.)
IP Forwarding It uses longest prefix matching to select the most specific route.

IP fragmentation

MTU Max Transfer Size
MSS Maximum Segment Size

Each fragment is encapsulated with its own IP header.

Field	Size (bits)	Purpose
Identification	16	Unique identifier for the original datagram; all fragments of the same datagram share this value.
Flags	3	Control bits for fragmentation DF (Don't Fragment) If set 1, routers must not fragment the packet. MF (More Fragments) If set 1, more fragments follow, otherwise indicates that it is the last fragment.
Fragment Offset	13	Indicates the position of this fragment's data in 8‑byte units relative to the start of the original datagram.

Dynamic Host Configuration Protocol (DHCP) [Application Layer]

Uses UDP ports 67 (server) and 68 (client).

Step	Message	IP Header (Src → Dst)	Ethernet (Src → Dst)	Purpose
1	DISCOVER	`0.0.0.0` → `255.255.255.255`	`Client MAC` → `FF:FF:FF:FF:FF:FF`	Client broadcasts to find DHCP servers
2	OFFER	`Server's IP` → `255.255.255.255`	`Server MAC` → `Client MAC`	Server proposes IP configuration
3	REQUEST	`0.0.0.0` → `255.255.255.255`	`Client MAC` → `FF:FF:FF:FF:FF:FF`	Client accepts the offer
4	ACK	`Server's IP` → `255.255.255.255`	`Server MAC` → `Client MAC`	Server confirms and finalizes lease

Address Resolution Protocol (ARP)

Every network device possesses a unique MAC (Media Access Control) address, which operates at the link layer.

ARP resolves IP to MAC addresses locally. A node broadcasts an ARP request, receives a private reply, and caches the mapping in its ARP table.

Internet Control Message Protocol (ICMP)

When an error occurs, an ICMP error report is sent back to the source IP address and the problematic packet is discarded. The source host must then rectify the issue.

Message Format
- IP Header Src = router, Dst = A, Protocol = 1
- ICMP Header Type = X, Code = Y
- ICMP Data Src = A, Dst = B, ...
Type

Name	Code	Usage
Dest. Unreachable (Net or Host)	`3/0` or `3/1`	Lack of connectivity
Dest. Unreachable (Fragment)	`3/4`	Path MTU Discovery
Time Exceeded (Transit)	`11/0`	Traceroute ( $TTL = 0$ )
Echo Request or Reply	`8/0` or `0/0`	Ping

Traceroute

Traceroute sends probe packets with incrementing TTL. Each router that decrements TTL to zero replies with an ICMP Time Exceeded error, revealing its address.

Network Address Translation (NAT)

NAT maps multiple private IP:port pairs to a single public IP with unique external ports via a stateful translation table, enabling many internal devices to share one external address.

Routing and Forwarding

Fully Distributed Routing
- No central controller All routers are equal and make independent decisions.
- Local knowledge only Routers learn about the network by exchanging messages with directly connected neighbors.
- Concurrent operation
- Failure tolerance There may be node/link/message failures.
Hierarchical Routing

Introduce a larger routing unit. Route first to the region, then to the IP prefix within the region.
- Routing Table
  - Static Routing
  - Dynamic Routing RIP (Routing Information Protocol), OSPF (Open Shortest Path First), BGP (Border Gateway Protocol).
- IP Prefix Aggregation and Subnets (Longest prefix matching)
  
  Routers can change prefix lengths without affecting hosts.
  - Subnetting Split large prefix into smaller ones internally.
  - Aggregation Join small prefixes into one large prefix externally.

Routing

Sink Tree

Sink tree for a destination is the union of all shortest paths towards the destination. Each node only stores next hop (parent) towards root.
Distance Vector Routing

Each node shares distance vectors with neighbors, updates paths using shortest heard distance, and forwards via best next hop.

Slow convergence, count-to-infinity, and routing loops occur because nodes only know neighbor info, not full topology, causing outdated routes to propagate and persist.
- RIP (Routing Information Protocol)
Link-State Routing
- Flooding
  - Send an incoming message on to all other neigbors.
  - Remember the message so that it is only flooded once.
- Dijkstra

Application Layer

Peer to Peer

Leverage peer resources: computation, storage, bandwidth. Also emerging: mobility, coins (tokens), sensors. Decentralized, scalable architecture.

Storage networks replicate files anywhere, making search difficult, especially with node churn.

Framework
- Join How to start participating
- Publish How to advertise files
- Search How to find files
- Fetch How to retrieve files
Napster

Server does all porcessing & Server maintains $O (N)$ state & Single point of failure
- Join Client contacts central server on startup
- Publish Reports list of files to central server
- Search Query server → returns peer storing requested file
- Fetch Get file directly from peer
Gnutella

Complete decentralization & Query flooding

Search scope is $O (N)$ & Unpredictable search time & No guarantee of finding files (TTL-limited search only works well for haystacks)
- Join Client contacts a few existing nodes on startup → these become its "neighbors"
- Publish No need (no central index)
- Search Ask neighbors → neighbors ask their neighbors → when/if found, reply back along the path; TTL limits propagation
- Fetch Get file directly from peer
Gnutella/KaZaA [Two-level hierarchy]

Kept a centralized registration

Supernodes have better connection to Internet and act as temporary indexing servers for other nodes to help improve the stability fo the network.

Standard nodes connect to supernodes and report list od files.

Dimension	Napster	Gnutella	Gnutella/KaZaA (Two-Layer)
Index	Central server	No index	Supernodes maintain index (temporary central)
Search	Query central server	Flooding (ask neighbors)	Flooding between supernodes
Transfer	Direct peer-to-peer	Direct peer-to-peer	Direct peer-to-peer
Join	Contact central server	Find a few neighbors	Find and connect to a supernode

BitTorrent

Central tracker server needed to bootstrap swarm.
- File swarming Files split into chunks; download from multiple peers, upload to others simultaneously.
- Tracker Central server coordinates peers (joins, maintains lists), but doesn't store files.
- Tit-for-tat Upload to the fastest $N$ peers you download from. Encourages fairness, discourages freeloading.
- Optimistic unchoke Randomly let new peers download to prevents starvation, discovers better partners.
- Rarest first Prioritize distributing scarce chunks to improve file availability and resilience.
- Out-of-band search Find files via Google/torrent sites. Scalable, no single point of failure.
DHT (Distributed Hash Table)

Decentralized system using consistent hashing: keys and nodes hashed onto a ring (Chord). Each key stored on its successor node. Finger tables enable $O (lo g N)$ lookup (Entry $i$ which starts from $1$ points to the first node on the ring that is $\geq x + 2^{i - 1}$ ), improving from naive $O (N)$ .

Amazon Dynamo

Architecture
Partition
- Hash partitioning partition = hash(key) % partition_count
- Consistent hashing (better)
- Virtual nodes (vnodes) Virtual nodes assign multiple ring positions to each physical node, distributing data migration and improving load balancing across all nodes.
- Heterogeneity More powerful nodes can have more capacity, thus more vnodes.
- Replication Dynamo skips nodes to ensure replicas reside on different nodes (set Replication Factor, i.e. $RF$ ).
- Dynamo Reads/Writes $R / W$ is configurable. When $R + W > RF$ , the read and written will overlap, with last write wins (LWW) resolving conflicts.
- Dynamo API
  - get (k) returns value(s) and context. It returns multiple versions if conflicts exist. The context contains Timestamps to track version history.
  - put (k,context,value) uses the provided context to indicate which previous version(s) this new version supersedes or merges.

HyperText Transfer Protocol (HTTP)

Request

Method	Description
$GET$	Read a Web page
$HEAD$	Read a Web page's header
$POST$	Append to a Web page
$PUT$	Store a Web page
$DELETE$	Remove the Web page
$TRACE$	Echo the incoming request
$CONNECT$	Connect through a proxy
$OPTIONS$	Query options for a page

Response

Meaning	Example
Information	$100$ Server agrees to handle client's request $101$ Switching protocols
Success	$200$ Request succeeded $201$ New resource created $204$ No content present
Redirection	$301$ Page moved permanently $302$ Temporary redirect $304$ Cached page still valid
Client error	$400$ Bad request $401$ Authentication required $403$ Forbidden page $404$ Page not found $429$ Too many requests
Server error	$500$ Internal server error $502$ Bad gateway $503$ Service unavailable, try again later $504$ Gateway timeout

Page Load Time (PLT)

PLT measures web performance from click to page visible, influenced by page structure, HTTP/TCP protocols, and network RTT/bandwidth.

HTTP/1.0 used one TCP connection per web resource.
- Parallel Connections Browser runs multiple parallel HTTP instances, pulls in completion time of last fetch.
- Persistent Connections Parallel connections compete with each other for network resources. Make one TCP connection to one server and use it multiple HTTP requests.
Web Caching and Proxies
- Cached Content
  - Locally expiry information (expires header) & heuristics (cacheable, fresh, not modified recently) [Content is then available right away]
  - Server Last-Modified header & ETag” header [Content is available after 1 RTT (if connection open)]
- Web Proxies Place intermediary between clients and server.

Domain Name System (DNS)

DNS is a naming service to map between host names and their IP addresses (Distributed directory based on a hierarchical namespace and automated protocol to tie pieces together).

Top-Level Domains (TLDS)
DNS Resolution Start with the root nameserver and work down zones.
Local Nameservers and Root Nameservers

Local nameservers handle client queries and recursion, while root nameservers direct them to the appropriate top-level domain servers.

Root (dot) is served by 13 server names (a.root-servers.net to m.root-servers.net)
Caching Cached data periodically times out (set appropriate TTL).
DNS Security Extensions (DNSSEC)

To spoof, Trudy returns a fake DNS response that appears to be true.
- RRSIG for digital signatures of records
- DNSKEY for public keys for validation
- DS for public keys for delegation

Content Delivery Networks (CDNs)

A CDN uses DNS to direct each client to the nearest replica server based on their IP address.

Remote Prcedure Call (RPC)

Interface description languague Mechanism to pass procedur parameters and return values in a machine-independent way (use IDL compier, marshal and unmarshal).
At-Least-Once Scheme Upon timeout, the client stub retransmits the request. After several failed attempts, it returns an error.
At-Most-Once Scheme

Since identical calls may come from different clients, duplicate detection requires a unique xid per request.

The client includes "seen all replies $\leq x$ " with every RPC, allowing the server to discard outdated xids and prevent unbounded state growth.

A pending flag per executing RPC avoids re-running duplicates.

To survive crashes, both client and server persist pending/completed RPCs to disk.
Exactly-Once retransmission (At-Least-Once) + deduplication (At-Most-Once) + persistence on both sides for crash recovery. Limitation: not possible with external physical actions.

Distributed Storage System

The Google File System (GFS)

Target environment
- Files are huge, but not many.
- Write once, read many.
- I/O bandwidth is more important than latency.
Typical workloads : Bulk Synchronous Processing (BSP).
Design Decisions
- Chunk Use 64MB chunks amortize seek costs, approach disk transfer limits for streaming workloads, and reduce both metadata overhead on the master and RPC overhead for large reads/writes.
- Replication Replicate each chunk three times across racks, with one replica in the same rack for low write latency and two or more in other racks for fault tolerance against rack-level failures, trading off cross-rack bandwidth.
- Single master Centralize metadata only to avoid making the master a bottleneck, while keeping data transfer peer-to-peer between clients and chunkservers to maximize throughput and scalability.
- Record append Support concurrent appends.

General Architecture

Single master

Shadow masters provide read-only access with slightly stale metadata by replicating the primary's operation log, ensuring availability during primary failure.

GFS prevents master overload by minimizing its involvement: data never passes through it, large chunks reduce metadata ops, and chunk leases delegate write coordination authority to primary replicas.

Category	Responsibility	Description
Regular Management	Metadata Storage	Stores namespaces, file-to-chunk mappings, chunk locations in memory.
	Namespace Locking	Locks directory operations to handle concurrent accesses.
	Periodic Communication	Heartbeats with chunkservers to monitor health and exchange state.
	Chunk Lifecycle	Creates chunks (rack-aware); re-replicates when replicas low; rebalances load/space.
Background Cleanup	Garbage Collection	Logs deletion, renames files as hidden, lazily reclaims space.
Background Cleanup	Stale Replica Deletion	Uses version numbers to detect and remove outdated replicas.

Metadata
- file and chunk namespaces
- mappings from files to chunks (unique ID)
- locations of each chunk’s replicas (ask for chunkserver)
Chunkserver Chunkservers store 64MB chunks locally with version numbers and checksums while clients read/write via chunk handle and byte range from master without data caching.
Client GFS client sends control requests to master, caches metadata, but transfers data directly to/from chunkservers without caching file data.

File read and write
- Read Client asks master for chunk locations, then reads data directly from the nearest chunkserver and returns it to the application (choose one to read).
- Write Client gets locations, pushes data to all replicas' buffers, then primary serializes writes and coordinates secondaries before acknowledging (write all primary and secondaries).
Fault Tolerance

GFS uses heartbeats to detect failures, reduces replica counts, and re-replicates chunks elsewhere to restore full replication.

Time in Distributed Systems

Cristian Algorithm

$T_{c l i e n t} = T_{ser v er} + \frac{T _{1} - T _{0}}{2}$
The Lamport Clock Algorithm

Lamport clocks ignore physical time and only capture the happens-before relationship between events.

Each process $P_{i}$ maintains a local clock $C_{i}$ . For a local event, $C_{i} \leftarrow C_{i} + 1$ , then $C_{e v e n t} = C_{i}$ . When process $P_{i}$ sending an message $k$ , set $C_{k} = C_{i}$ . When process $P_{j}$ receives an message $k$ , $C_{j} \leftarrow max (c_{j}, c_{k}) + 1$ .

If $a \to b$ , then $T_{a} < T_{b}$ . However, the converse does not hold — Lamport timestamps alone cannot be used to infer causal relationships between events.
Totally-Ordered Multicast

Totally-ordered multicast ensures identical processing order. Replicas sort queues by timestamps (dynamically updating the head), broadcast ACKs strictly for this head, and execute it once globally acknowledged.
Vector clock

Initially all vectors are $[0, 0, \dots, 0]$ .

For each local event on process $i$ , increment local entry $c_{i}$ .

If process $j$ receives message with vector $[d_{1}, d_{2}, \dots, d_{n}]$ , then $c_{k} = max (c_{k}, d_{k})$ , and increment local entry $c_{j}$ .

Comparison	Lamport Clock	Vector Clock
Given	$C (a) < C (z)$	$V (a) < V (z)$
Conclusion	Either $a \to z$ or $a ∥ z$	$a \to z$
Precision	Cannot distinguish causality from concurrency	Precisely captures happens-before relation

Parallelism Basics and Collective Communication

Communication patterns
- Point-to-point communication
- Collective communication

Model

Cost of Communication $α + n β$ (or $α + M n β$ if a message encounters a link that simulaneously accommodates $M$ messages) ( $α$ Latency, $β$ Transfer time per byte, $n$ Message size in bytes)

Primitive

Primitive	Pattern	Description
Broadcast	One-to-all	One process sends the same data to all processes
Reduce(-to-one)	All-to-one	All processes perform a reduction operation (sum, max, etc.) and send the result to one process
Scatter	One-to-all	One process distributes distinct data chunks to all processes
Gather	All-to-one	All processes send their data to a single process
Allgather	All-to-all	All processes collect data from all processes (each process gets the full data set)
Reduce-scatter	All-to-all	Perform local reduction first, then scatter results (first half of Allreduce)
Allreduce	All-to-all	All processes perform a reduction operation, and the result is distributed to all processes

Minimum Spanning Tree Algorithm (emphasize low latency for small message)

Recursive halving broadcast splits nodes in half sends to opposite half recurses achieving $lo g N$ rounds optimizing latency for small messages.

Totoal cost:
- Broadcast $⌈ lo g p ⌉ (α + n β)$
- Reduce $⌈ lo g p ⌉ (α + n β + nγ)$ Compute overhead is required for reduce.
- Scatter/Gather $k = 1 \sum ⌈ l o g p ⌉ α + \frac{n}{2 ^{k}} β = α ⌈ lo g p ⌉ + \frac{p - 1}{p} n β$
Ring Algorithm (emphasize bandwidth utilization for large message)

The bandwidth term $n β$ now dominates.

ring algorithm can not be better for Scatter and Gather.

LLM

Parallelism in Distributed Machine Learning

Continuous batching

DDA3020 Machine Learning

Information Theory

Informaion

$I (x_{k}) = - lo g (p_{k})$

When $p_{k} = 1$ , $I (x_{k}) = 0$ , which means there is no uncertainty, then no information.

Entropy

Entropy is deﬁned as the expected value of information from a source.

$H_{p} (X) = E [I (x_{k})] = x_{k} \in X \sum p_{k} \cdot I (x_{k}) = - x_{k} \in X \sum p_{k} \cdot lo g (p_{k})$

Cross-entropy

Cross-entropy measures average bits needed to encode events from true distribution $P (X)$ using estimated distribution $Q (X)$ .

$H_{P, Q} (X) = E_{P} [I_{Q} (x)] = - x_{k} \in X \sum P (X = x_{k}) \cdot lo g (Q (X = x_{k}))$

Property 1 $H_{P, Q} (X) \geq 0$

Proof
$P (X = x_{k}) \in [0, 1], P (X = x_{k}) \in (0, 1] ⟹ H_{P, Q} \geq 0$ .

Property 2 $H_{P, Q} (X) \geq H_{P} (X)$

Proof

$H_{P, Q} (X) - H_{P} (X) = x_{k} \in X \sum P (x_{k}) lo g (\frac{P ( x _{k} )}{Q ( x _{k} )})$ By Jensen equality, for any convex function $f$ , $E [f (x)] \geq f (E [x])$ .
Since $lo g x$ is concave, we can implies $H_{P, Q} (X) - H_{P} (X) = - x_{k} \in X \sum P (x_{k}) lo g (\frac{Q ( x _{k} )}{P ( x _{k} )}) = E_{p} [- lo g (\frac{Q ( x _{k} )}{P ( x _{k} )})]$ Let $T = \frac{Q ( x )}{P ( x )}$ , then $E_{p} [- lo g T] \geq - lo g [E_{p} T] = - lo g [k \sum P (x_{k}) \frac{Q ( k )}{P ( x _{k} )}] = 0$ The equality holds only when $P = Q$ .

Kullback-Leibler divergence (KL-divergence)

The distance between two distributions.

$D_{K L} (P ∥ Q) = H_{P, Q} (X) - H_{P} (X) = \int_{x \in X} p_{X} (x) lo g (\frac{p _{X} ( x )}{q _{X} ( x )}) d x$

Property 1 $D_{K L} (P ∥ Q) \geq 0$

Proof $D_{K L} (P ∥ Q) = - \int_{x \in X} p_{X} (x) lo g (\frac{q _{X} ( x )}{p _{X} ( x )}) d x \geq - \int_{x \in X} p_{X} (x) (\frac{q _{X} ( x )}{p _{X} ( x )} - 1) d x = - \int_{x \in X} p_{X} (x) - q_{X} (x) d x = \int_{x \in X} q_{X} (x) d x - \int_{x \in X} p_{X} (x) d x = 0$

Property 2 $D_{K L} (P ∥ Q) \neq = D_{K L} (Q ∥ P)$

Linear Regression

Modeling of Linear Regression

Deterministic Perspective

Find $w$ which minimizes the following expression:

$\frac{1}{m} i = 1 \sum m (f_{w} (x_{i}) - y_{i})^{2}$

Probabilistic Perspective

$y = w^{⊤} x + e, where e \sim N (0, σ^{2})$

where $e$ is called obervation noise or residual error. The output $y$ can also be seen as a random variable, whose conditional probability is :

$p (y ∣ x, w) = N (w^{⊤} x, σ^{2})$

Given the training dataset $D = {(x_{i}, y_{i})}_{i = 1}^{m}$ , by maximum log-likehood estimaiton, we get

$lo g L (w; D) = lo g (i = 1 \prod m p (y_{i} ∣ x_{i}, w)) = i = 1 \sum m lo g N (w^{⊤} x, σ^{2}) = i = 1 \prod n lo g (\frac{1}{2 π σ ^{2}} exp (- \frac{( y _{i} - w ^{⊤} x _{i} ) ^{2}}{2 σ ^{2}})) = - m lo g (σ (2 π)^{\frac{1}{2}}) - \frac{1}{2 σ ^{2}} i = 1 \sum m (y_{i} - w^{⊤} x_{i})^{2}$

$w_{M L E} = arg max lo g L (w; D) = w arg min \frac{1}{2} i = 1 \sum m (y_{i} - w^{⊤} x_{i})^{2}$

Solutions of Linear Regression

Analytical solution

Let $e = Xw - y$ , then

$J (w) = i = 1 \sum m (f_{w} (x_{i}) - y_{i})^{2} = e^{⊤} e = w^{⊤} X^{⊤} Xw - 2 y^{⊤} Xw + y^{⊤} y$

Differentiating $J (w)$ with respect to $w$ and setting the result to $0$ :

$\frac{\partial}{\partial w} = 0 (X^{⊤} X is symmetric) ⟹ w^{⊤} X^{⊤} Xw = y^{⊤} Xw$

If $X^{⊤} X$ is invertible then $w = (X^{⊤} X)^{- 1} X^{⊤} y$

For linear regression with multiple outputs, $E = XW - Y$ , then

$J (W) = trace (E^{⊤} E)$

Gradient Descent

$J (w) = \frac{1}{2} i = 1 \sum m (y_{i} - w^{⊤} x_{i})^{2} = \frac{1}{2} ∥ Xw - y ∥^{2}$ , $w$ can be updated by gradient descent algorithm:

$w \leftarrow w - α \frac{\partial J ( w )}{\partial w} = w - α X^{⊤} (Xw - y)$

Classification

Binary Classification

If $X^{⊤} X$ is invertible then $w = (X^{⊤} X)^{- 1} X^{⊤} y$ , $f_{w} (x_{n e w}) = sgn (x_{n e w}^{⊤} w)$ .

Multi-Category Classification

Each ro in $Y$ has a one-hot assignment, then If $X^{⊤} X$ is invertible then $w = (X^{⊤} X)^{- 1} X^{⊤} Y$ , $f_{w} (x_{n e w}) = i = 1, \dots, C arg max (x_{n e w}^{⊤} W)$ .

Linear Regression Models

Ridge Regression

$J (w) = i = 1 \sum m (y_{i} - x_{i}^{⊤} w)^{2} + λ w^{⊤} w$

where $w = [0, w_{1}, w_{2}, \dots, w_{d}]^{⊤}$ (The bias $w_{0}$ should not be normalized).

To get $w min (Xw - y)^{⊤} (Xw - y) + λ w^{⊤} w$ ,

$\frac{\partial}{\partial w} (Xw - y)^{⊤} (Xw - y) + λ w^{⊤} w = 0 ⟹ w = (X^{⊤} X + λ \hat{I_{d}})^{- 1} X^{⊤} y (λ > 0)$ Where $\hat{I_{d}}$ is the identity matrix with the top-left element set to 0 to avoid penalizing the bias term $w_{0}$ .

We can assume that the parameter $w$ follow a zero-mean Gaussian prior $p (w) = N (w ∣ 0, τ^{2} I)$ (omit the bias $w_{0}$ ).

Utilizing this prior, we obtain the maximum a posteriori (MAP) estimation

$w_{M A P} z = p (w ∣ x, y) \propto p (y ∣ x, w) \cdot p (w) = w arg max [i = 1 \sum m lo g p (y_{i} ∣ x_{i}, w) + lo g p (w)] = w arg max [C - \frac{1}{2 σ ^{2}} i = 1 \sum m (y_{i} - x_{i}^{⊤} w)^{2} - \frac{1}{2 τ ^{2}} ∥ w ∥^{2}] \equiv w arg min [i = 1 \sum m (y_{i} - x_{i}^{⊤} w)^{2} + λ ∥ w ∥^{2}] (λ = \frac{σ ^{2}}{τ ^{2}})$

Polynomial Regression

$ϕ (x) = [1, x_{1}, \dots, x_{d}, \dots, x_{i} x_{j}, \dots, x_{i} x_{j} x_{k}, \dots]^{⊤} w = [w_{0}, w_{1}, \dots, w_{d}, \dots, w_{i, j}, \dots, w_{i, j, k}, \dots]^{⊤} f_{w} (x) = ϕ (x)^{⊤} w$

Note that $f_{w} (x)$ is still a linear function $w.r.t. w$ .

Lasso Regression

$p (w) = Lap (w ∣ 0, λ) = \frac{1}{2} exp (- \frac{∥ w ∥ _{1}}{λ})$ , then we have $w_{M A P} \equiv w arg min [i = 1 \sum m (y_{i} - x_{i}^{⊤} w)^{2} - α ∥ w ∥_{1}]$ .

Ridge uses L2 penalty, shrinking weights towards zero without feature selection. In contrast, Lasso uses L1 penalty, forcing some weights to exactly zero, achieving model sparsity and automatic feature selection.

Robust Linear Regression

Adpot the $l_{1}$ loss as follows $J (w) = i = 1 \sum m ∣ x_{i}^{⊤} w - y_{i} ∣$ .

Assuming that $p (y ∣ x, w, b) = (y ∣ w^{⊤} x, b) \propto exp (- \frac{1}{b} ∣ y - w^{⊤} x ∣)$ , $w_{M L E} \equiv w arg min \frac{1}{b} i = 1 \sum m ∣ x_{i}^{⊤} w - y_{i} ∣$ .

Two ways to make it differentiable:

$w, t min i \sum m t_{i} s.t. - t_{i} \leq x_{i}^{⊤} w - y_{i} \leq t_{i}, 1 \leq i \leq m$ .
By using the following equation $∣ a ∣ = μ > 0 min \frac{1}{2} (\frac{a ^{2}}{μ} + μ)$ .

Then it can be reformulated as follows:

$w min μ_{1}, \dots, μ_{m} > 0 min \frac{1}{2} i = 1 \sum m (\frac{( x _{i}^{⊤} w - y _{i} ) ^{2}}{μ _{i}} + μ_{i})$

Given $w$ , $μ_{i} = ∣ x_{i}^{⊤} w - y_{i} ∣$ ; Given $μ$ , $w = w arg min i = 1 \sum m \frac{1}{2} \frac{( x _{i}^{⊤} w - y _{i} ) ^{2}}{μ _{i}}$ .

Comparison

$p (y ∣ x, w)$	$p (w)$	Regression Method
Gaussian	Uniform	Least squares
Gaussian	Gaussian	Ridge regression
Gaussian	Laplace	Lasso regression
Laplace	Uniform	Robust regression
Student	Uniform	Robust regression

Logistic Regression

Hypothesis Function

$f_{w} (x) = P (y = 1 ∣ x; w) = g (w^{⊤} x) \in [0, 1] g (z) = \frac{1}{1 + exp ( - z )} g^{'} (z) = g (z) (1 - g (z))$

The decision boundary satisfies $g (z) \geq 0.5$ when $z \geq 0$ , and $g (z) < 0.5$ when $z < 0$ , $i.e.$ , it is determined by $w^{⊤} x + b = 0$ .

Cost Function

$ℓ_{2}$ Loss
- Linear Regression
  
  $J (w) = \frac{1}{2 m} i = 1 \sum m (f_{w} (x_{i}) - y_{i})^{2} = \frac{1}{2 m} ∥ Xw - y ∥^{2}$ , $H = \nabla_{w}^{2} J (w) = \frac{1}{m} X^{⊤} X ⪰ 0 ⟹ convex$
- Logistic Regression
  
  $J (w) = \frac{1}{2 m} i = 1 \sum m (g (w^{⊤} x_{i}) - y_{i})^{2} ⟹ non-convex$
Cross-entropy Loss

$cost (y (x), f_{w} (x)) = - P (y = 1 ∣ x) \cdot lo g P (y = 1 ∣ x; w) - P (y = 0 ∣ x) \cdot P (y = 0 ∣ x; w) = {- lo g (f_{w} (x)) y (x) = 1 - lo g (1 - f_{w} (x)) y (x) = 0$

$J (w) = \frac{1}{m} i = 1 \sum m cost (y_{i}, f_{w} (x_{i}))$

For $y_{i} = 1, J_{i} (w) = - lo g (σ_{i})$ , $\nabla_{w} J_{i} = - (1 - σ_{i}) x_{i}$ , $\nabla_{w}^{2} J_{i} = σ_{i} (1 - σ_{i}) x_{i} x_{i}^{⊤}$ .

For $y_{i} = 0, J_{i} (w) = - lo g (1 - σ_{i})$ , $\nabla_{w} J_{i} = σ_{i} \cdot x_{i}$ , $\nabla_{w}^{2} J_{i} = σ_{i} (1 - σ_{i}) x_{i} x_{i}^{⊤}$ .

$⟹ J (w) convex$

Gradient Descent

Assume $m$ samples in total. Learning $w$ by minimizing $J (w)$

$w^{*} = w arg min J (w) = - \frac{1}{m} i = 1 \sum m [y_{i} lo g (f_{w} (x_{i})) + (1 - y_{i}) lo g (1 - f_{w} (x_{i}))]$

By gradient descent

$w \leftarrow w - α \nabla J (w) = w - \frac{α}{m} i = 1 \sum m [\frac{y _{i}}{f _{w} ( x _{i} )} \cdot \nabla f_{w} (x_{i}) + \frac{1 - y _{i}}{1 - f _{w} ( x _{i} )} \cdot \nabla f_{w} (x_{i})] = w - \frac{α}{m} i = 1 \sum m [\frac{y _{i}}{f _{w} ( x _{i} )} - \frac{1 - y _{i}}{1 - f _{w} ( x _{i} )}] \cdot [f_{w} (x_{i}) (1 - f_{w} (x_{i}))] \cdot x_{i} = w - \frac{α}{m} i = 1 \sum m [f_{w} (x_{i}) - y_{i}] \cdot x_{i}$

Multi-class Classification

Softmax

$f_{W}^{(j)} (x) = \frac{exp w _{j}^{⊤} x}{c = 1 \sum C exp ( w _{c}^{⊤} x )} = P (y = j ∣ x, W)$

$\frac{\partial f _{w_{c}}}{\partial w _{j}} = {f_{w_{j}} (1 - f_{w_{j}}) x_{i} c = j - f_{w_{j}} f_{w_{c}} x_{i} c \neq = j$

Cost Function

$J (W) = - \frac{1}{m} i \sum m j \sum C [I (y_{i} = j) lo g (f_{w_{j}} (x_{i}))]$

$\frac{\partial J ( W )}{\partial w _{j}} = - \frac{1}{m} i \sum m \frac{I ( y _{i} = j )}{f _{w_{j}} ( x _{i} )} \cdot \frac{\nabla f _{w_{j}} ( x _{i} )}{\nabla w _{j}} + c \neq = j \sum C \frac{I ( y _{i} = c )}{f _{w_{c}} ( x _{i} )} \cdot \frac{\nabla f _{w_{c}} ( x _{i} )}{\nabla w _{j}} ⟹ \frac{1}{m} i \sum m (f_{w_{j}} (x_{i}) - I (y_{i} = j)) x_{i}$

Regularized Logistic Regression

$\overline{J} (w) = J (w) + \frac{λ}{2 m} j = 1 \sum d w_{j}^{2} = - \frac{1}{m} i \sum m [y_{i} lo g f_{w} (x_{i}) + (1 - y_{i}) lo g (1 - f_{w} x_{i})] + \frac{λ}{2 m} i = 1 \sum d w_{j}^{2}$

$w_{0} \leftarrow w_{0} - α \frac{\partial J ( w )}{\partial w _{j}} = w_{j} - \frac{α}{m} i = 1 \sum m (f_{w} (x_{i}) - y_{i}) x_{i} (0) where x_{i} (0) = 1$

Linear Regression vs Logistic Regression

Dimension	Linear Regression	Logistic Regression
Task Type	Regression task predicting continuous real values	Classification task predicting discrete class labels
Model Hypothesis	Linear function $f_{w} (x) = w^{⊤} x$ Output range $(- \infty, + \infty)$	Sigmoid-mapped linear function $f_{w} (x) = g (w^{⊤} x)$ Output range $[0, 1]$ (posterior probability of positive class)
Loss Function	Residual sum of squares (MSE) $\frac{1}{2 m} i = 1 \sum m (y_{i} - w^{⊤} x_{i})^{2}$	Cross-entropy loss $- \frac{1}{m} i = 1 \sum m [y_{i} lo g f_{w} (x_{i}) + (1 - y_{i}) lo g (1 - f_{w} (x_{i}))]$
Solution Method	Closed-form solution $w^{*} = (X^{⊤} X)^{- 1} X^{⊤} y$ or gradient descent	No closed-form solution, optimized via gradient descent

Supoort Vector Machine

Larger Marigin

Denote $f_{w, b} (x) = w^{⊤} x + b = 0$ is a hyperplane, then $w$ is orthogonal to the hyperplane, and $x$ has the distance $\frac{∣ f _{w, b} ( x ) ∣}{∥ w ∥}$ to the hyperplane.

Margin is defined as the distance between the hyperplane and the closest data point, which can be formulated as $γ = i min \frac{y _{i} f _{w, b} ( x )}{∥ w ∥}$ with $y_{i} \in {- 1, 1}$ .

Since $w, b max γ = w, b max i min \frac{y _{i} f _{w, b} ( x )}{∥ w ∥}$ . Set $y_{i} f_{w, b} (x) = 1$ , then $γ = \frac{1}{∥ w ∥}$ ,

Then the optimization problem can be formulated as follows:

$w, b min \frac{1}{2} ∥ w ∥^{2} s.t. y_{i} (w^{⊤} x_{i} + b) \geq 1, i = 1, 2, \dots, m$

Hinge Loss

Let $f_{w, b} (x) = \frac{1}{1 + exp - ( w ^{⊤} x + b )} = g (z)$ where $z = w^{⊤} x + b$ be the hypothesis function, then the cost function can be defined as follows:

$J (w) = \frac{1}{m} i \sum m [- δ_{y = 1} lo g (f_{w, b} (x)) - δ_{y = - 1} lo g (1 - f_{w, b} (x))] + \frac{λ}{2 m} j = 1 \sum n w_{j}^{2} \equiv C \cdot \frac{1}{m} i \sum m [- δ_{y = 1} lo g (f_{w, b} (x)) - δ_{y = - 1} lo g (1 - f_{w, b} (x))] + \frac{1}{2} j = 1 \sum n w_{j}^{2} (C = \frac{1}{λ})$

The hinge loss is defined as follows

$max (0, 1 - y_{i} (w^{⊤} x_{i} + b))$

Then the optimization problem can be reformulated as follows

$w, b min C \cdot i \sum m max (0, 1 - y_{i} (w^{⊤} x_{i} + b)) + \frac{1}{2} j = 1 \sum n w_{j}^{2}$

Largrange Duality

The objective function of SVM is

$w, b min \frac{1}{2} ∥ w ∥^{2} s.t. y_{i} (w^{⊤} x_{i} + b) \geq 1, i = 1, 2, \dots, m$

Its Lagrangian function is defined as follows

$L (w, b, α) = \frac{1}{2} ∥ w ∥^{2} - i = 1 \sum m α_{i} [1 - y_{i} (w^{⊤} x_{i} + b)]$

The KKT conditions are as follows

$⎩ ⎨ ⎧ Stationarity ⎩ ⎨ ⎧ \frac{\partial L ( w , b , α )}{\partial w} = 0 ⟹ w = i = 1 \sum m α_{i} y_{i} x_{i} \frac{\partial L ( w , b , α )}{\partial b} = 0 ⟹ i = 1 \sum m α_{i} y_{i} = 0 Feasibility α_{i} \geq 0, 1 - y_{i} (w^{⊤} x_{i} + b) \leq 0, \forall i Complementary Slackness α_{i} \cdot (1 - y_{i} (w^{⊤} x_{i} + b)) = 0, \forall i$

Then the Largangian function can be reformulated as follows

$L (w, b, α) = \frac{1}{2} ∥ w ∥^{2} + i = 1 \sum m α_{i} - i = 1 \sum, α_{i} y_{i} (j = 1 \sum m α_{j} y_{j} x_{j})^{⊤} x_{i} - i = 1 \sum m α_{i} y_{i} b = i = 1 \sum m α_{i} - \frac{1}{2} i = 1 \sum m j = 1 \sum m α_{i} α_{j} y_{i} y_{j} x_{i}^{⊤} x_{j}$

Then the dual problem is obtained as follows

$α max i = 1 \sum m α_{i} - \frac{1}{2} i = 1 \sum m j = 1 \sum m α_{i} α_{j} y_{i} y_{j} x_{i}^{⊤} x_{j} s.t. α_{i} \geq 0, i = 1 \sum m α_{i} y_{i} = 0, \forall i$

The data points with $α_{i} > 0$ are called support vectors, which locate on the margin and determine the decision boundary. Let $S = {i ∣ α_{i} > 0}$ be the set of support vectors, then the decision boundary can be determined by any support vector as follows:

$y_{i} (w^{⊤} x_{i} + b) = 1 ⟹ b = \frac{1}{∣ S ∣} i \in S \sum (y_{i} - w^{⊤} x_{i}) = \frac{1}{∣ S ∣} i \in S \sum (y_{i} - j = 1 \sum m α_{j} y_{j} x_{i}^{⊤} x_{j})$

SVM with Slack Variables

In this case, the optimization problem can be formulated as follows

$w, b, ξ min \frac{1}{2} ∥ w ∥^{2} + C i = 1 \sum m ξ_{i} s.t. 1 - ξ_{i} - y_{i} (w^{⊤} x_{i} + b) \leq 0, - ξ_{i} \leq 0, \forall i$

Its Lagrangian function is defined as follows

$L (w, b, ξ, α, β) = \frac{1}{2} ∥ w ∥^{2} + C i = 1 \sum m ξ_{i} - i = 1 \sum m α_{i} (1 - ξ_{i} - y_{i} (w^{⊤} x_{i} + b)) - i = 1 \sum m β_{i} (- ξ_{i})$

The KKT conditions are as follows

$⎩ ⎨ ⎧ Stationarity ⎩ ⎨ ⎧ \frac{\partial L ( w , b , ξ , α , β )}{\partial w} = 0 ⟹ w = i = 1 \sum m α_{i} y_{i} x_{i} \frac{\partial L ( w , b , ξ , α , β )}{\partial b} = 0 ⟹ i = 1 \sum m α_{i} y_{i} = 0 \frac{\partial L ( w , b , ξ , α , β )}{\partial ξ _{i}} = 0 ⟹ α_{i} + β_{i} = C, \forall i Feasibility α_{i} \geq 0, β_{i} \geq 0, 1 - ξ_{i} - y_{i} (w^{⊤} x_{i} + b) \leq 0, - ξ_{i} \leq 0, \forall i Complementary Slackness α_{i} \cdot (1 - ξ_{i} - y_{i} (w^{⊤} x_{i} + b)) = 0, β_{i} \cdot (- ξ_{i}) = 0, \forall i$

The the Lagrange function can be reformulated as follows

$L (α, β) = i = 1 \sum m α_{i} - \frac{1}{2} i = 1 \sum m j = 1 \sum m α_{i} α_{j} y_{i} y_{j} x_{i}^{⊤} x_{j}$

The dual problem can be reformulated as follows

$α max i = 1 \sum m α_{i} - \frac{1}{2} i = 1 \sum m j = 1 \sum m α_{i} α_{j} y_{i} y_{j} x_{i}^{⊤} x_{j} s.t. 0 \leq α_{i} \leq C, i = 1 \sum m α_{i} y_{i} = 0, \forall i$

Let $M = {i ∣ 0 < α_{i} < C}$ . The bias parameter $b$ can be determined by any support vector with $0 < α_{i} < C$ as follows

$i = 1 \sum m α_{i} y_{i} x_{i}^{⊤} x_{j} + b = y_{j}, where j \in M ⟹ b = \frac{1}{∣ M ∣} j \in M \sum (y_{j} - i = 1 \sum m α_{i} y_{i} x_{i}^{⊤} x_{j})$

Condition on $α_{i}$	Slack $ξ_{i}$	Contributes to classifier?	Correctly classified?	Location relative to margin / decision boundary
$α_{i} = 0$	$0$	No	Yes	Outside the margin
$0 < α_{i} < C$	$0$	Yes	Yes	On the margin
$α_{i} = C$	$0 \leq ξ_{i} \leq 1$	Yes	Yes	Inside the margin, not crossing decision boundary
$α_{i} = C$	$ξ_{i} > 1$	Yes	No	Crossing decision boundary

SVM with Kernels

Define the kernel function $k (x_{i}, x_{j}) = ϕ (x_{i})^{⊤} ϕ (x_{j})$ , then the dual problem can be reformulated as follows

$α max i = 1 \sum m α_{i} - \frac{1}{2} i = 1 \sum m j = 1 \sum m α_{i} α_{j} y_{i} y_{j} k (x_{i}, x_{j}) s.t. α_{i} \geq 0 i = 1 \sum m α_{i} y_{i} = 0, \forall i$

The solution $b$ becomes

$b = \frac{1}{∣ S ∣} j \in S \sum (y_{j} - i = 1 \sum m α_{i} y_{i} k (x_{i}, x_{j}))$

The prediction of new data $x$ becomes

$w^{⊤} ϕ (x) + b = i = 1 \sum m α_{i} y_{i} k (x_{i}, x) + b$

Some kernel functions

$⎩ ⎨ ⎧ Polynomial kernel k (x_{i}, x_{j}) = (1 + \frac{x _{i}^{⊤} x _{j}}{σ ^{2}})^{p}, p > 0 Gaussian RBF kernel k (x_{i}, x_{j}) = exp (- \frac{∥ x _{i} - x _{j} ∥ ^{2}}{2 σ ^{2}}) Sigmoid kernel k (x_{i}, x_{j}) = \frac{1}{1 + exp ( - γ x _{i}^{⊤} x _{j} + c )}$

Multi-class SVM

Predict the label of $x$ as $k \in {1, 2, \dots, K} arg max (w^{(k)})^{⊤} x + b^{(k)}$ , $i.e.$ , use one-vs-rest strategy to train $K$ binary SVM classifiers.

Tree-based Methods

Decsion Tree

A decision tree (non-parametric model) is a hierarchical model for supervised learning whereby the local region is identified in a sequence of recursive splits.

Classification Tree

Impurity Measure (Choose the attribute that maximizes the reduction in impurity.)

Let $S_{i}$ be the seef of instances of $S$ belonging to class $C_{i}$ ( $i = 1, 2, \dots, K$ ), then $\hat{P} (C_{i} ∣ x, S) \equiv p_{i} = \frac{∣ S _{i} ∣}{∣ S ∣}$ is the proportion of class $C_{i}$ in $S$ .

Entropy is the measurement of impurity, which is defined as follows

$Info (S) = - i = 1 \sum K p_{i} lo g p_{i}$

And the entropy of $S$ given the condition of attribute $A$ is defined as follows (Suppose attribute $A \in {a_{1}, a_{2}, \dots, a_{V}}$ to split $S$ into $V$ subsets $D_{1}, D_{2}, \dots, D_{V}$ )

$Info (S ∣ A) = v = 1 \sum V \frac{∣ S _{v} ∣}{∣ S ∣} Info (D_{v})$

Information gained by branching on atribute $A$ is defined as follows

$Gain (A) = Info (S) - Info (S ∣ A)$
Gini Index

Gini index is another measurement of impurity, which is defined as follows

$Gini (S) = 1 - i = 1 \sum K p_{i}^{2}$

It means the probability of different class labels for two randomly selected instances from $S$ .

And the Gini index of $S$ given the condition of attribute $A$ is defined as follows (Suppose attribute $A \in {a_{1}, a_{2}, \dots, a_{V}}$ to split $S$ into $V$ subsets $D_{1}, D_{2}, \dots, D_{V}$ )

$Gini (S ∣ A) = v = 1 \sum V \frac{∣ S _{v} ∣}{∣ S ∣} Gini (D_{v})$

Gini gain by branching on atribute $A$ is defined as follows

$Δ Gini (A) = Gini (S) - Gini (S ∣ A)$

Regression Tree

The regression tree grows by recursively selecting the binary split on any variable that most reduces the sum of squared errors.

$\overline{y}_{c} = \frac{i \in c \sum y _{i}}{N _{c}}$ is the prediction for leaf $c$ , where $N_{c}$ is the number of instances in leaf $c$ .

Mean Squared Error (MSE)

$MSE (c) = \frac{1}{N _{c}} i \in c \sum (y_{i} - \overline{y}_{c})^{2}$

The total MSE is $S = c \in leaves \sum MSE (c)$ .
Sum of squared Errors (SSE)

The total SSE is $S = c \in leaves \sum i \in c \sum (y_{i} - \overline{y}_{c})^{2}$ .

Overfitting and Pruning

Pruning of the decision tree is one by replacing a subtree with a leaf node. The subtree is replaced if the error of the subtree on the validation set is larger than the error of the leaf node on the validation set.

Ensemble Models

Bagging

Sample records with replacement from the training set to create multiple bootstrap samples, and train a base model on each bootstrap sample. The final prediction is obtained by averaging the predictions of all base models.

It aggreates the predictions of all single trees and produces many correlated trees, which the diversity of the trees is not high.

Random Forest

It chooses a random subset of features to split each node, which increases the diversity of the trees and reduces the correlation between trees.

Neural Networks

Activation Function

Perceptron Model

$y = f (w^{⊤} x + b) = Sign (w^{⊤} x + b)$ , the objective function is $J (w) = \frac{1}{2} (y - t)^{2}$ , and the update rule is $w \leftarrow w - η (y - t) x$ .

For a new data point $x$ , the prediction is $y = Sign (w^{⊤} x + b)$ , which is a linear classifier.

Multi-layer Feedforward Neural Networks

To solve XOR, a one‑hidden‑layer network learns non‑linear input features. Each layer re‑expresses the previous layer’s output, progressively extracting more abstract and discriminative features.

$y = f^{(L)} \circ f^{(L - 1)} \circ \dots \circ f^{(1)} (x)$ , where $f^{(l)} (z) = g (W^{(l)} z + b^{(l)})$ is the transformation of layer $l$ .

Backpropagation (using computational graph (CG))

Forward Pass

For each $i = 1, \dots, N$ , compute $v_{i}$ as a function of its parents.

Backward Pass

$\overline{v}_{N} = 1$ , then for $i = N - 1, \dots, 1$ , $\overline{v}_{i} = j \in children (i) \sum \overline{v}_{j} \cdot \frac{\partial v _{j}}{\partial v _{i}}$ . Also update parameters $w, b$ by

$w \leftarrow w - η \cdot \frac{\partial C _{i}^{(n + 1)}}{\partial w _{i}^{(n)}} b \leftarrow b - η \cdot \frac{\partial C _{i}^{(n + 1)}}{\partial b _{i}^{(n)}}$

where $C_{i}^{(n + 1)}$ is the cost function of layer $n + 1$ .

Convolutional Neural Networks

Convolutional Layer

$f \circ g (x) = \int f (t) g (x - t) d t$

Input volume is $W_{1} \times H_{1} \times C$ .

A convolutional layer has four hyperparameters: number of filters $K$ , filter size $F$ , stride $S$ , and zero padding $P$ (per side).

Output spatial size is $W_{2} = ⌊ \frac{W _{1} - F + 2 P}{S} ⌋ + 1$ and $H_{2} = ⌊ \frac{H _{1} - F + 2 P}{S} ⌋ + 1$ .

Output volume is $W_{2} \times H_{2} \times K$ .

Number of parameters is $F^{2} C K$ weights plus $K$ biases. parameters is $F^{2} C K$ weights plus $K$ biases.

Pooling Layer

Pooling layer has no learnable parameters. Hyperparameters are filter size $F$ and stride $S$ .

Output spatial size is $W_{2} = ⌊ \frac{W _{1} - F}{S} ⌋ + 1$ and $H_{2} = ⌊ \frac{H _{1} - F}{S} ⌋ + 1$ .

Pooling operates on each channel independently, so output volume is $W_{2} \times H_{2} \times C$ .

Common pooling functions are $Max$ and $Average$ .

Fully Connected Layer

A fully connected layer computes the class scores, resulting in an output volume of $1 \times 1 \times N_{c}$ , where $N_{c}$ is the number of classes, which contains neurons that connect to the entire input volume.

Recurrent Neural Networks and Transformer

RNN

Basic

$h_{t} = f_{W} (h_{t - 1}, x_{t})$ , $\overset{y}{^}_{t} = g_{W^{'}} (h_{t})$ , where $f_{W}$ is the transition function and $g_{W^{'}}$ is the output function. The cost/error function is $E (θ) = \frac{1}{T} t = 1 \sum T L (\overset{y}{^}_{t}, y_{t})$ , where $θ = {W, W^{'}}$ .

However, the basic RNN suffers from vanishing/exploding gradient problem, which makes it difficult to learn long-term dependencies.

Long Short-Term Memory (LSTM)

Limitations

Difficulty with Long-Term Dependencies
Limited Parallelization

Transformer

Attention Mechanism

Let $Q = X W_{Q}, K = X W_{K}, V = X W_{V}$ be the query, key, and value matrices, where $W_{Q}, W_{K}, W_{V} \in R^{d \times d^{'}}$ and $X \in R^{n \times d}$ is the input sequence. Then the attention output is defined as follows

$Z = attention(Q, K, V) = softmax (\frac{Q K ^{⊤}}{d ^{'}}) V$

Multi-head Attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, capturing diverse types of relationships in parallel.

Bias-variance Tradeoff

Let $D$ be training dataset drawn $i.i.d.$ from $P^{n}$ , $h_{D} = A (D)$ be the model trained by algorithm $A$ on $D$ .

$\overset{ˉ}{h} = E_{D \sim P^{n}} [h_{D}] = \int_{D} h_{D} p (D) d D$

expected test error for a fixed model $h_{D}$ $E_{(x, y) \sim P} [(h_{D} (x) - y)^{2}] = \int_{x} \int_{y} (h_{D} (x) - y)^{2} p (x, y) d x d y$ .
expected test error for the algorithm $A$ $E_{(x) \sim P, D \sim P^{n}} [(h_{D} (x) - y)^{2}] = \int_{D} \int_{x} \int_{y} (h_{D} (x) - y)^{2} p (x, y) p (D) d x d y d D$ .
the expected test error $E_{(x, y), D} [(h_{D} (x) - y)^{2}]$ . (mean squared error (MSE))

$E_{(x, y), D} [(h_{D} - \overset{ˉ}{h}) (\overset{ˉ}{h} - y)] = E_{(x, y)} [E_{D} [h_{D} - \overset{ˉ}{h}] \cdot (\overset{ˉ}{h} - y)] = E_{(x, y)} [(E_{D} [h_{D}] - \overset{ˉ}{h}) \cdot (\overset{ˉ}{h} - y)] = E_{(x, y)} [(\overset{ˉ}{h} - \overset{ˉ}{h}) \cdot (\overset{ˉ}{h} - y)] = 0$

$E_{(x, y), D} [(h_{D} (x) - y)^{2}] = E [(h_{D} - \overset{ˉ}{h})^{2}] + 2 E [(h_{D} - \overset{ˉ}{h}) (\overset{ˉ}{h} - y)] + E [(\overset{ˉ}{h} - y)^{2}] = E_{(x, y), D} [(h_{D} - \overset{ˉ}{h})^{2}] + E_{(x, y), D} [(\overset{ˉ}{h} - y)^{2}]$

Now decompose the second term $E [(\overset{ˉ}{h} - y)^{2}]$ , $y = t (x) + ϵ, ϵ \sim N (0, σ^{2})$ .

$E_{(x, y)} [(\overset{ˉ}{h} - t (x)) (t (x) - y)] = E_{x} [E_{y ∣ x} [(\overset{ˉ}{h} (x) - t (x)) (t (x) - y)]] = E_{x} [(\overset{ˉ}{h} (x) - t (x)) \cdot E_{y ∣ x} [t (x) - y]] = E_{x} [(\overset{ˉ}{h} (x) - t (x)) \cdot (t (x) - E [y ∣ x])] = E_{x} [(\overset{ˉ}{h} (x) - t (x)) \cdot (t (x) - t (x))] = 0$

$E [(\overset{ˉ}{h} - y)^{2}] = E [(\overset{ˉ}{h} - t (x))^{2}] + 2 E [(\overset{ˉ}{h} - t (x)) (t (x) - y)] + E [(t (x) - y)^{2}]$

$⟹ E_{(x, y), D} [(h_{D} (x) - y)^{2}] = Variance E_{(x, y), D} [(h_{D} (x) - \overset{ˉ}{h} (x))^{2}] + Bias^{2} E_{(x, y)} [(\overset{ˉ}{h} (x) - t (x))^{2}] + Noise E_{(x, y)} [(t (x) - y)^{2}]$

Comparsion

Term	Description
Variance	Measures how much the predictions of a model vary when trained on different training sets. (model's sensitivity to the specific training data)
Bias²	Measures how far the average prediction of a model is from the true value. (model's systematic error)
Noise	Measures how much the observed values vary around the true target function due to randomness in the data. (data's inherent unpredictability)

Performance Evaluation

Hyper-parameter Tuning

Determined typically outside the learning algorithms rather than earned by a learning algorithm based on the trainingset.

Cross-validation

Cross-validation is a statistical method of evaluating generalization performance that is more stable and thorough than using a split into a training and a test set.

K-fold cross validation

Split the train data into $K$ folds, try each fold as validation, while other folds as train. Note that $K$ is a hyper-parameter.

Evaluation Metrics

Regression

Mean Square Error $\frac{i = 1 \sum n ( y _{i} - y _{i} ) ^{2}}{n}$
Mean Absolute Error $\frac{i = 1 \sum n ∣ y _{i} - y _{i} ∣}{n}$

Classification

Confusion Matrix

Actual \ Predicted $P$ (Predicted) $N$ (Predicted)

$P$ (Actual) $TP$ $FN$

$N$ (Actual) $FP$ $TN$
- Accuracy The total number of correctly classiﬁed samples over all samples under evaluation.
  
  $\frac{TP + TN}{TP + TN + FP + FN}$
- Precision The proportion of correctly predicted positive samples among all samples predicted as positive.
  
  $\frac{TP}{TP + FP}$
- Recall The proportion of correctly predicted positive samples among all actual positive samples.
  
  $\frac{TP}{TP + FN}$
- Cost-sensitive Accuracy (different classes may have different importance)
  
  $Cost-Sensitive Accuracy = \frac{C _{p, p} \cdot TP + C _{n, n} \cdot TN}{C _{p, p} \cdot TP + C _{n, n} \cdot TN + C _{p, n} \cdot FN + C _{n, p} \cdot FP}$
- Binary Classification
  
  Metric Formula
  
  $TPR$ (True Positive Rate) $\frac{TP}{TP + FN}$
  
  $FNR$ (False Negative Rate) $\frac{FN}{TP + FN}$
  
  $TNR$ (True Negative Rate) $\frac{TN}{FP + TN}$
  
  $FPR$ (False Positive Rate) $\frac{FP}{FP + TN}$
  
  If positive and negative classes are balanced, then $Accuracy = \frac{TPR + TNR}{2} = 1 - \frac{FPR + FNR}{2}$ .
  
  As the decision threshold varies from $0$ to $1$ , FPR decreases while FNR increases, and their intersection point where $FPR$ equals $FNR$ is called the Equal Error Rate ( $EER$ ). The lower the $EER$ , the better the classifier.
Receiver Operating Characteristic Curve (ROC) The relationship between FNR and FPR.
Area Under the ROC (AUC) The probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. (It means the perfect prediction if it equals to $1$ .)

Consider a set of binary samples indexed by $i = 1, \dots, m^{+}$ for positive class, $j = 1, \dots, m^{-}$ and for negative class. Let $g (x)$ be a predictor and $e_{ij} = g (x_{i}^{+}) - g (x_{j}^{-})$ .

Consider also a Heaviside step function given by

$u (e) = ⎩ ⎨ ⎧ 0, 0.5, 1, x < 0 x = 0 x > 0$

Then AUC can be expressed by $AUC = \frac{1}{m ^{+} m ^{-}} i = 1 \sum m^{+} j = 1 \sum m^{-} u (e_{ij})$ .

Unsupervised Learning

Clustering

Clustering is a problem of learning to assign labels to examples by leveraging an unlabeled dataset.

Dimensionality Reduction

Principal componentanalysis (PCA) is a technique that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, called principal components.

Density Estimation

Estimate the distribution based on some observed data.

Kernel Density Estimation (KDE)

$\hat{f}_{b} (x) = \frac{1}{N b} i = 1 \sum N k (\frac{x - x _{i}}{b}) k (z) = \frac{1}{2 π} exp (- \frac{z ^{2}}{2})$

where $b > 0$ is a hyper-parameter called kernel size, which can be tuned using $K$ -fold cross-validation.

Autoencoder

An autoencoder is a type of artificial neural network used to learn efficient coding of unlabeled data.

Self-supervised Learning

Self-supervised learning (SSL) is a paradigm where a model learns from unlabeled data by automatically generating pseudo-labels from the data itself, without requiring human annotation.

K-means Clustering

Initialize $K$ centroids, assign each point to nearest centroid, update centroids by mean of assigned points, repeat until convergence.

In short, K-means will minimize the within-cluster variance, as follows:

$c, r min J (c, r) = c, r min i \sum n k \sum K r_{i, k} ∥ x_{i} - c_{k} ∥^{2} Subject to r \in {0, 1}^{n \times K}, k \sum K r_{ik} = 1$

where $r_{ik} = 1$ denotes $x_{i}$ is assigned to cluster $k$ .

Coordinate Descent

Assignment (Given $c$ to update $r$ )

Note that the assignment for each data $x_{i}$ can be solved independently, $i.e.$ , pr $r_{i} min k \sum K r_{ik} ∥ x_{i} - c_{k} ∥^{2} Subject to r_{i} \in {0, 1}^{1 \times K}, k \sum K r_{ik} = 1$

The solution can be obtained as follows

$k^{*} = ar g 1 \leq k \leq K min ∥ x_{i} - c_{k} ∥^{2}, r_{i k^{*}} = 1$

Refitting (Given $r$ to update $c$ )

Note that $c_{i}$ can be optimized independently, as follows

$c_{k} min i \sum n r_{ik} ∥ x_{i} - c_{k} ∥^{2}$

By setting the derivative $w.r.t. c_{k}$ as 0, $i.e.$ ,

$\frac{\partial}{\partial c _{k}} i \sum r_{ik} ∥ x_{i} - c_{k} ∥^{2} = 0 ⟹ i \sum 2 r_{ik} (x_{i} - c_{k}) = 0 ⟹ c_{k} = \frac{\sum _{i} r _{ik} x _{i}}{\sum _{i} r _{ik}}$

Since the objective function is non-convex, K-means may converge to a local minimum, and the final solution can be sensitive to the initial centroids. To mitigate this issue, it is common practice to run K-means multiple times with different random initializations and select the solution with the lowest within-cluster variance.

Performance Evaluation

Internal Evaluation Metrics (Silhouette Coefficient)

Define $a$ as the mean distance between a point and all other points in the same cluster. Define $b$ the smallest mean distance of a point to all points in any other cluster.

The silhouette coefficient for a single sample is formulated as follows:

$s = \frac{b - a}{max ( a , b )} ⟹ s = ⎩ ⎨ ⎧ 1 - \frac{a}{b} 0 \frac{b}{a} - 1 if a < b if a = b if a > b \in (- 1, 1)$

$s$ close to $1$ indicates that the sample is well clustered.

External evaluation metrics (Rand Index)

$RI = \frac{a + b}{a + b + c + d} = \frac{a + b}{\frac{n ( n - 1 )}{2}} \in [0, 1]$

Symbol	Definition
$a$	same in $X$ , same in $Y$
$b$	different in $X$ , different in $Y$
$c$	same in $X$ , different in $Y$
$d$	different in $X$ , same in $Y$

For random clusterings, the expected value of the Rand Index is not zero. Let $n_{ij} = ∣ X_{i} \cap Y_{j} ∣$ , and Adjusted Rand Index (ARI) is defined as follows:

$ARI = \frac{\sum _{ij} ( 2 n _{ij} ) - \frac{[ \sum _{i} ( 2 a _{i} ) \sum _{j} ( 2 b _{j} ) ]}{( 2 n )}}{\frac{[ \sum _{i} ( 2 a _{i} ) + \sum _{j} ( 2 b _{j} ) ]}{2} - \frac{[ \sum _{i} ( 2 a _{i} ) \sum _{j} ( 2 b _{j} ) ]}{( 2 n )}}$

Gaussian Mixture Models

$p (x) = k = 1 \sum K π_{k} N (x ∣ μ_{k}, Σ_{k})$

where $π_{k}$ are the mixing coefficients, $k = 1 \sum K π_{k} = 1$ , $π_{k} \geq 0$ , $\forall k$ , and

$N (x ∣ μ_{k}, Σ_{k}) = \frac{1}{( 2 π ) ^{\frac{d}{2}} ∣ Σ _{k} ∣ ^{\frac{1}{2}}} exp (- \frac{1}{2} (x - μ_{k})^{⊤} Σ_{k}^{- 1} (x - μ_{k}))$

Alternating Upadating Algorithm

Derivation from MLE

The log-likelihood is given by

$L := ln p (X ∣ π, μ, Σ) = n = 1 \sum N ln {k = 1 \sum K π_{k} N (x^{(n)} ∣ μ_{k}, Σ_{k})}$

Let the derivate of $L$ with respect to $π_{k}$ be zero:

$\frac{\partial L}{\partial π _{k}} = n = 1 \sum N \frac{1}{\sum _{j = 1}^{K} π _{j} N ( x ^{(n)} ∣ μ _{j} , Σ _{j} )} \cdot π_{k} \cdot \frac{\partial}{\partial μ _{k}} N (x^{(n)} ∣ μ_{k}, Σ_{k}) = n = 1 \sum N \frac{π _{k} N ( x ^{(n)} ∣ μ _{k} , Σ _{k} )}{\sum _{j = 1}^{K} π _{j} N ( x ^{(n)} ∣ μ _{j} , Σ _{j} )} \cdot Σ_{k}^{- 1} (x^{(n)} - μ_{k}) = n = 1 \sum N γ_{k}^{(n)} \cdot Σ_{k}^{- 1} (x^{(n)} - μ_{k}) = 0$

Let $N_{k} = \sum_{n = 1}^{N} γ_{k}^{(n)}$ , then

$\frac{\partial L}{\partial π _{k}} = 0 ⟹ μ_{k} = \frac{1}{N _{k}} n = 1 \sum N γ_{k}^{(n)} x^{(n)}$

Let the derivate of $L$ with respect to $Σ_{k}$ be zero:

$\frac{\partial L}{\partial Σ _{k}} = n = 1 \sum N \frac{1}{\sum _{j = 1}^{K} π _{j} N ( x ^{(n)} ∣ μ _{j} , Σ _{j} )} \cdot π_{k} \cdot \frac{\partial}{\partial Σ _{k}} N (x^{(n)} ∣ μ_{k}, Σ_{k}) = n = 1 \sum N \frac{π _{k} N ( x ^{(n)} ∣ μ _{k} , Σ _{k} )}{\sum _{j = 1}^{K} π _{j} N ( x ^{(n)} ∣ μ _{j} , Σ _{j} )} \cdot \frac{\partial ln N ( x ^{(n)} ∣ μ _{k} , Σ _{k} )}{\partial Σ _{k}} By using the following equation \frac{\partial N}{\partial Σ _{k}} = N \cdot \frac{\partial ln N}{\partial Σ _{k}} = n = 1 \sum N γ_{k}^{(n)} \cdot \frac{1}{2} [Σ_{k}^{- 1} (x^{(n)} - μ_{k}) (x^{(n)} - μ_{k})^{⊤} Σ_{k}^{- 1} - Σ_{k}^{- 1}] = 0 ⟹ Σ_{k} = \frac{1}{N _{k}} n = 1 \sum N γ_{k}^{(n)} (x^{(n)} - μ_{k}) (x^{(n)} - μ_{k})^{⊤}$

Considering the $π_{k}$ by using the Lagrange multiplier to handle the constraint $k \sum K π_{k} = 1$ . Maximize $\tilde{L} := L + λ (\sum_{k = 1}^{K} π_{k} - 1)$ , then

$\frac{\partial L ~}{\partial π _{k}} = n = 1 \sum N \frac{N ( x ^{(n)} ∣ μ _{k} , Σ _{k} )}{\sum _{j = 1}^{K} π _{j} N ( x ^{(n)} ∣ μ _{j} , Σ _{j} )} + λ = 0 ⟹ n = 1 \sum N γ_{k}^{(n)} + λ π_{k} = 0 ⟹ k = 1 \sum K n = 1 \sum N γ_{k}^{(n)} + λ k = 1 \sum K π_{k} = 0 ⟹ λ = - N, π_{k} = \frac{N _{k}}{N}$

Add a hidden variable $z^{(n)}$ to indicate which component generates $x^{(n)}$ , then the complete-data log-likelihood is given by

$L_{c} = ln p (X, Z ∣ π, μ, Σ) = n = 1 \sum N ln p (x^{(n)}, z^{(n)} ∣ π, μ, Σ) = n = 1 \sum N ln k = 1 \sum K p (x^{(n)} ∣ z^{(n)} = k; μ, Σ) \cdot p (z^{(n)} = k ∣ π)$

Algorithm

Initialize $π, μ, Σ$ .
E-step Compute the responsibilities

$γ_{k}^{(n)} = \frac{π _{k} N ( x ^{(n)} ∣ μ _{k} , Σ _{k} )}{\sum _{j = 1}^{K} π _{j} N ( x ^{(n)} ∣ μ _{j} , Σ _{j} )}, k = 1, \dots, K, n = 1, \dots, N$
M-step

First, compute the effective number of points assigned to component $k$ :

$N_{k} = n = 1 \sum N γ_{k}^{(n)}$

Then update the parameters:

$μ_{k} = \frac{1}{N _{k}} n = 1 \sum N γ_{k}^{(n)} x^{(n)} Σ_{k} = \frac{1}{N _{k}} n = 1 \sum N γ_{k}^{(n)} (x^{(n)} - μ_{k}) (x^{(n)} - μ_{k})^{⊤} π_{k} = \frac{N _{k}}{N}$
Repeat steps 2 and 3 until convergence.

The Expectation-Maximization Algorithm

Jensen's Inequality

For a convex function $f$ , we have $f (E [X]) \leq E [f (X)]$ .

For a concave function $f$ , we have $f (E [X]) \geq E [f (X)]$ .

Proof (for convex function $f$ )
When $n = 2$ , $X \in {x_{1}, x_{2}}$ , $E [X] = λ x_{1} + (1 - λ) x_{2}$ , where $λ = P (X = x_{1})$ . Then $f (E [X]) = f (λ x_{1} + (1 - λ) x_{2}) \leq λ f (x_{1}) + (1 - λ) f (x_{2}) = E [f (X)]$ .
When $n > 2$ , proof by induction applies. Assuming the inequality holds for $n - 1$ variables, consider $n$ variables. The following relation holds $E [X] = λ x_{1} + (1 - λ) i = 2 \sum n \frac{P ( X = x _{i} )}{1 - P ( X = x _{1} )} x_{i}$ Where $λ = P (X = x_{1})$ . Then $f (E [X]) = f (λ x_{1} + (1 - λ) i = 2 \sum n \frac{P ( X = x _{i} )}{1 - P ( X = x _{1} )} x_{i}) \leq λ f (x_{1}) + (1 - λ) f (i = 2 \sum n \frac{P ( X = x _{i} )}{1 - P ( X = x _{1} )} x_{i}) \leq λ f (x_{1}) + (1 - λ) i = 2 \sum n \frac{P ( X = x _{i} )}{1 - P ( X = x _{1} )} f (x_{i}) = E [f (X)]$

Auxiliary Distribution of Latent Variables

Let $x$ denote the observed data, $z$ denote the latent variables. The log-likelihood can be expressed as follows:

$lo g p (D; θ) = n = 1 \sum N lo g p (x^{(n)}; θ) = n = 1 \sum N lo g (z^{(n)} \sum p (x^{(n)}, z^{(n)}; θ))$

Let $q_{n} (z^{(n)})$ be an auxiliary distribution over the latent variables $z^{(n)}$ and they are independent, $i.e.$ , $q (z) = n = 1 \prod N q_{n} (z^{(n)})$ with constraint $z^{(n)} \sum K q_{n} (z^{n}) = 1, \forall n$ . Then by using Jensen's inequality, we have

$lo g p (x; θ) = E_{q (z)} [lo g (\frac{p ( x ; θ )}{q ( z )} \cdot q (z))] = E_{q (z)} [lo g (\frac{p ( x , z ; θ )}{q ( z )} \cdot \frac{q ( z )}{p ( z ∣ x ; θ )})] = E_{q (z)} [lo g \frac{p ( x , z ; θ )}{q ( z )}] + E_{q (z)} [lo g \frac{q ( z )}{p ( z ∣ x ; θ )}]$

To extend the above equation to the whole dataset, we have

$lo g p (D; θ) = n = 1 \sum N lo g p (x^{(n)}; θ) = n = 1 \sum N {E_{q_{n} (z)} [lo g \frac{p ( x ^{(n)} , z ^{(n)} ; θ )}{q _{n} ( z ^{(n)} )}] + E_{q_{n} (z)} [lo g \frac{q _{n} ( z ^{(n)} )}{p ( z ^{(n)} ∣ x ^{(n)} ; θ )}]} = n = 1 \sum N E_{q_{n} (z)} [lo g \frac{p ( x ^{(n)} , z ^{(n)} ; θ )}{q _{n} ( z ^{(n)} )}] + n = 1 \sum N E_{q_{n} (z)} [lo g \frac{q _{n} ( z ^{(n)} )}{p ( z ^{(n)} ∣ x ^{(n)} ; θ )}] = n = 1 \sum N E_{q_{n} (z)} [lo g \frac{p ( x ^{(n)} , z ^{(n)} ; θ )}{q _{n} ( z ^{(n)} )}] + n = 1 \sum N KL (q_{n} (z) ∥ p (z^{(n)} ∣ x^{(n)}; θ)) \geq n = 1 \sum N lo g E_{q (z)} [\frac{p ( x ^{(n)} , z ^{(n)} ; θ )}{q _{n} ( z ^{(n)} )}] + n = 1 \sum N KL (q_{n} (z) ∥ p (z^{(n)} ∣ x^{(n)}; θ)) By using Jensen’s inequality$

E Step

Given $θ$ , the optimal $q_{n} (z)$ is given by

$q_{n} (z) max L (q; θ) \equiv q_{n} (z) max n = 1 \sum N E_{q_{n} (z)} [lo g \frac{p ( x ^{(n)} , z ^{(n)} ; θ )}{q _{n} ( z ^{(n)} )}] \equiv q_{n} (z) max n = 1 \sum N {E_{q_{n} (z)} [lo g (\frac{p ( z ^{(n)} ∣ x ^{(n)} ; θ )}{q _{n} ( z ^{(n)} )} + lo g p (x^{(n)}; θ))]} \equiv min n = 1 \sum N E_{q_{n} (z)} [lo g \frac{q _{n} ( z ^{(n)} )}{p ( z ^{(n)} ∣ x ^{(n)} ; θ )}] \equiv min n = 1 \sum N KL (q_{n} (z^{(n)}) ∥ p (z^{(n)} ∣ x^{(n)}; θ))$

The optimal $q_{n} (z)$ is given by $q_{n} (z) = p (z^{(n)} ∣ x^{(n)}; θ)$ , $i.e.$ the KL divergence is going to be zero.

M Step

Given $q_{n} (z)$ , the optimal $θ$ is given by

$θ max L (q; θ) \equiv θ max n = 1 \sum N E_{q_{n} (z)} [lo g \frac{p ( x ^{(n)} , z ^{(n)} ; θ )}{q _{n} ( z ^{(n)} )}] \equiv θ max n = 1 \sum N E_{q_{n} (z)} [lo g p (x^{(n)}, z^{(n)}; θ)]$

Substituting $q_{n} (z) = p (z^{(n)} ∣ x^{(n)}; θ^{old})$ got in E step into the above equation,

$θ^{new} = ar g θ max n = 1 \sum N E_{p (z^{(n)} ∣ x^{(n)}; θ^{old})} [lo g p (x^{(n)}, z^{(n)}; θ)]$

EM Convergence

$lo g p (D; θ^{new}) \geq L (q; θ^{new}) Since lo g p (D; θ) \geq L (q; θ) \geq L (q; θ^{old}) Since θ^{new} = ar g θ max L (q; θ) = lo g p (D; θ^{old}) Since lo g p (D; θ^{old}) = L (q; θ^{old})$

Application to GMM

E-step computes the posterior responsibility of each Gaussian for each data point, and M-step updates the parameters using weighted averages based on those responsibilities.

Set $q_{n} (z) = p (z^{(n)} ∣ x^{(n)}; θ)$ , then compute the expected log-likelihood as follows:

$E_{P (z^{(n)} ∣ x^{(n)})} [lo g p (x^{(n)}, z^{(n)}; θ)] = n \sum k \sum γ_{k}^{(n)} [lo g (P (z^{(n)} = k) \cdot p (x^{(n)} ∣ z^{(n)} = k))] = n \sum k \sum γ_{k}^{(n)} [lo g π_{k} + lo g N (x^{(n)} ∣ μ_{k}, Σ_{k})]$

Next, maximize this expected log-likelihood with respect to $π_{k}, μ_{k}, Σ_{k}$ to obtain closed-form updates for the M-step.

Principal Component Analysis

Given a dataset $D = {x^{(1)}, \dots, x^{(N)}} \subset R^{D}$ , PCA aims to find a projection matrix $W \in R^{d \times k}$ that maps the original data to a lower-dimensional space while maximizing the variance of the projected data.

Projection onto a Subspace

Let $μ = \frac{1}{N} \sum_{n = 1}^{N} x^{(n)} \in R^{D}$ and $K$ -dimensional subspace $S$ is spanned by an orthonormal basis ${u_{k}}_{k = 1}^{K}$ . The projection of a data point $x$ onto the subspace $S$ is given by

$\tilde{x} = μ + k = 1 \sum K u_{k}^{⊤} (x - μ) u_{k}$

The vector $x - μ$ is orthogonal to the subspace $S$ , $i.e.$ , $U^{⊤} (x - μ) = 0$ .

Principal Component Analysis Algorithm

Variance Maximization

$U, U^{⊤} U = I max \frac{1}{N} n = 1 \sum N ∥ x^{(n)} - μ ∥^{2}$

Since $μ = \frac{1}{N} n = 1 \sum N x^{(n)}$ , the above equation can be rewritten as follows

$U, U^{⊤} U = I max \frac{1}{N} n = 1 \sum N ∥ x^{(n)} - μ ∥^{2} = U, U^{⊤} U = I max \frac{1}{N} n = 1 \sum N ∥ x^{(n)} - [μ + \frac{1}{N} U U^{⊤} n = 1 \sum N (x^{(n)} - μ)] ∥^{2} = U, U^{⊤} U = I max \frac{1}{N} n = 1 \sum N ∥ x^{(n)} - μ ∥^{2} = U, U^{⊤} U = I max \frac{1}{N} n = 1 \sum N ∥ μ + U U^{⊤} (x^{(n)} - μ) - μ ∣^{2} = U, U^{⊤} U = I max \frac{1}{N} n = 1 \sum N ∥ U^{⊤} (x^{(n)} - μ) ∥^{2} = U, U^{⊤} U = I max \frac{1}{N} n = 1 \sum N Trace (U^{⊤} (x^{(n)} - μ) (x^{(n)} - μ)^{⊤} U)$

Note that the following derivations are equivalent by the Pythagorean theorem

$U, U^{⊤} U = I max \frac{1}{N} n = 1 \sum N ∥ x^{(n)} - μ ∥^{2} \equiv U, U^{⊤} U = I min \frac{1}{N} n = 1 \sum N ∥ x^{(n)} - x^{(n)} ∥^{2}$

Define the empirical covariance matrix as follows

$Σ = \frac{1}{N} n = 1 \sum N (x^{(n)} - μ) (x^{(n)} - μ)^{⊤}$

Then the optimization problem can be rewritten as follows

$U, U^{⊤} U = I max Trace (U^{⊤} ΣU) = U, U^{⊤} U = I max k = 1 \sum K u_{k}^{⊤} Σ u_{k}, s.t. U^{⊤} U = I$

By using the Lagrange multiplier to handle the constraint $U^{⊤} U = I$ :

$L (U, Λ_{K}) = Trace (U^{⊤} ΣU) - Trace (Λ_{K} (U^{⊤} U - I))$

where $Λ_{K} = diag ([\hat{λ}_{1}, \dots, \hat{λ}_{K}]) \in R^{K \times K}$ .

Then the optimal $U$ can be obtained by setting the derivative of $L (U, Λ_{K})$ with respect to $U$ as zero:

$\frac{\partial L ( U , Λ _{K} )}{\partial U} = 2 ΣU - 2 U Λ_{K} = 0 ⟹ ΣU = U Λ_{K}$

Utilizing the SVD decomposition of $Σ$

$Σ = Q Λ_{D} Q^{⊤} = i = 1 \sum D λ_{i} q_{i} q_{i}^{⊤}$

where $λ_{1} \geq λ_{2} \geq \dots \geq λ_{D}$ are the eigenvalues of $Σ$ , and $q_{i}$ is the corresponding eigenvector.

Substituting the SVD decomposition of $Σ$ into the above equation:

$k = 1 \sum K u_{k}^{⊤} Σ u_{k} = k = 1 \sum K u_{k}^{⊤} (i = 1 \sum D λ_{i} q_{i} q_{i}^{⊤}) u_{k} = i = 1 \sum D λ_{i} k = 1 \sum K (q_{i}^{⊤} u_{k})^{2} = t \in T \sum λ_{t}$

where $T$ is the set of indices corresponding to the eigenvectors and $T \subset {1, \dots, D}$ with $∣ T ∣ = K$ .

The Uncorrelatedness of Coveriance Matrix of Projected Data

Let $U = [q_{1}, \dots, q_{K}] \in R^{D \times K}$ , $z^{(n)} = U^{⊤} (x^{(n)} - μ) \in R^{K}$ .

The covariance matrix of the projected data is given by

$Cov (z) = \frac{1}{N} n = 1 \sum N (z^{(n)} - μ_{z}) (z^{(n)} - μ_{z})^{⊤} = \frac{1}{N} n = 1 \sum N z^{(n)} (z^{(n)})^{⊤} Since μ_{z} = \frac{1}{N} n = 1 \sum N z^{(n)} = U^{⊤} (\frac{1}{N} n = 1 \sum N (x^{(n)} - μ)) = 0 = \frac{1}{N} n = 1 \sum N U^{⊤} (x^{(n)} - μ) (x^{(n)} - μ)^{⊤} U = U^{⊤} ΣU$

Let $Q = [q_{1}, \dots, q_{D}] \in R^{D \times D}$ , then $Λ_{D} = Q^{⊤} ΣQ$ , and $Λ_{K}$ is the top-left $K \times K$ block of $Λ_{D}$ . Then

$U^{⊤} Q = q_{1}^{⊤} ⋮ q_{K}^{⊤} [q_{1} \dots q_{D}] = 100 ⋮ 0 010 ⋮ 0 001 ⋮ 0 \dots \dots \dots ⋱ \dots 000 ⋮ 1 000 ⋮ 0 \dots \dots \dots \dots 000 ⋮ 0 = [I_{K} 0_{K \times (D - K)}]$

Then the covariance matrix of the projected data can be rewritten as follows

$Cov (z) = U^{⊤} ΣU = U^{⊤} Q Λ_{D} Q^{⊤} U = [I_{K} 0_{K \times (D - K)}] Λ_{D} [I_{K} 0_{(D - K) \times K}] = Λ_{K}$

where $Λ_{K} = diag (λ_{1}, \dots, λ_{K})$ is a diagonal matrix. Therefore, the covariance matrix of the projected data is diagonal, which means that the projected data are uncorrelated, $i.e.$ , $Cov (z_{i}, z_{j}) = 0$ for $i \neq = j$ .

CUHKSZ Academic Notes