DDA3020 Machine Learning

Information Theory

Information

The information (self-information) of an outcome $x$ is $I(x) = -\log p(x)$. When $p(x) = 1$, $I(x) = 0$, which means there is no uncertainty, and hence no information.

Entropy

Entropy is defined as the expected value of information from a source: $H(p) = \mathbb{E}[I(x)] = -\sum_x p(x) \log p(x)$.

Cross-entropy

Cross-entropy measures the average number of bits needed to encode events from the true distribution $p$ using the estimated distribution $q$: $H(p, q) = -\sum_x p(x) \log q(x)$.

Property 1

$H(p, p) = H(p)$.

Proof

By definition, $H(p, p) = -\sum_x p(x) \log p(x) = H(p)$.

Property 2

$H(p, q) \ge H(p)$, with equality if and only if $p = q$.

Proof

By Jensen's inequality, for any convex function $f$, $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$. Since $\log$ is concave, this implies $\mathbb{E}[\log X] \le \log \mathbb{E}[X]$. Let $X = q(x)/p(x)$ with $x \sim p$; then

$H(p) - H(p, q) = \sum_x p(x) \log \frac{q(x)}{p(x)} \le \log \sum_x p(x) \frac{q(x)}{p(x)} = \log \sum_x q(x) = \log 1 = 0.$

The equality holds only when $p = q$.

Kullback-Leibler divergence (KL-divergence)

KL-divergence measures the discrepancy between two distributions: $D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p)$. It is not a true distance metric, since it is asymmetric and does not satisfy the triangle inequality.

Property 1

$D_{\mathrm{KL}}(p \,\|\, q) \ge 0$, with equality if and only if $p = q$.

Proof

By Property 2 of cross-entropy, $D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p) \ge 0$.

Property 2

$D_{\mathrm{KL}}(p \,\|\, q) \ne D_{\mathrm{KL}}(q \,\|\, p)$ in general, i.e., the KL-divergence is asymmetric.
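The definitions above can be checked numerically. A minimal NumPy sketch (the function names are mine, not from the notes):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum p log p, with 0 log 0 treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def cross_entropy(p, q):
    """H(p, q) = -sum p log q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
# Property checks: H(p, p) = H(p), H(p, q) >= H(p), D_KL(p||q) >= 0
assert np.isclose(cross_entropy(p, p), entropy(p))
assert cross_entropy(p, q) >= entropy(p)
assert kl_divergence(p, q) >= 0
```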

Linear Regression

Modeling of Linear Regression

Deterministic Perspective

Find $w$ which minimizes the following expression: $J(w) = \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2$.

Probabilistic Perspective

$y = w^\top x + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is called observation noise or residual error. The output $y$ can also be seen as a random variable, whose conditional probability is: $p(y \mid x, w) = \mathcal{N}(y \mid w^\top x, \sigma^2)$.

Given the training dataset $\{(x_i, y_i)\}_{i=1}^N$, by maximum log-likelihood estimation, we get $\hat{w} = \arg\max_w \sum_{i=1}^N \log p(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^N (y_i - w^\top x_i)^2$, which coincides with the least-squares objective.

Solutions of Linear Regression

Analytical solution

Let $X \in \mathbb{R}^{N \times d}$ be the design matrix and $y \in \mathbb{R}^N$ the target vector, then

$J(w) = \frac{1}{2} \| y - Xw \|^2.$

Differentiating with respect to $w$ and setting the result to $0$:

$\nabla_w J(w) = X^\top (Xw - y) = 0 \;\Rightarrow\; X^\top X w = X^\top y.$

If $X^\top X$ is invertible then

$\hat{w} = (X^\top X)^{-1} X^\top y.$

For linear regression with multiple outputs, $Y \in \mathbb{R}^{N \times m}$, then

$\hat{W} = (X^\top X)^{-1} X^\top Y.$
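The normal equations can be verified on synthetic data; a minimal sketch (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # prepend bias column
w_true = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=N)  # small observation noise

# Normal equations: solve (X^T X) w = X^T y
# (np.linalg.lstsq is numerically safer in practice)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(w_hat, w_true, atol=0.1)
```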

Gradient Descent

$w$ can be updated by the gradient descent algorithm:

$w \leftarrow w - \eta \nabla_w J(w) = w - \eta X^\top (Xw - y),$

where $\eta > 0$ is the learning rate.

Classification

Binary Classification

Encode the labels as $y_i \in \{-1, +1\}$. If $X^\top X$ is invertible then $\hat{w} = (X^\top X)^{-1} X^\top y$, and a new point is classified as $\hat{y} = \mathrm{sign}(\hat{w}^\top x)$.

Multi-Category Classification

Each row in $Y \in \{0, 1\}^{N \times K}$ has a one-hot assignment. If $X^\top X$ is invertible then $\hat{W} = (X^\top X)^{-1} X^\top Y$, and a new point is classified as $\hat{y} = \arg\max_k (\hat{W}^\top x)_k$.

Linear Regression Models

Ridge Regression

$J(w) = \frac{1}{2} \| y - Xw \|^2 + \frac{\lambda}{2} \| w \|_2^2,$ where $\lambda \ge 0$ controls the strength of the penalty (the bias $w_0$ should not be penalized).

To get $\hat{w}$, set the gradient to zero:

$\hat{w} = (X^\top X + \lambda \tilde{I})^{-1} X^\top y,$

where $\tilde{I}$ is the identity matrix with the top-left element set to $0$ to avoid penalizing the bias term $w_0$.

We can assume that the parameters follow a zero-mean Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$ (omitting the bias $w_0$).

Utilizing this prior, we obtain the maximum a posteriori (MAP) estimation

$\hat{w}_{\mathrm{MAP}} = \arg\max_w \left[ \log p(y \mid X, w) + \log p(w) \right],$

which recovers the ridge objective with $\lambda = \sigma^2 / \tau^2$.
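The ridge closed form, with the bias left unpenalized, can be sketched as follows (a minimal illustration; `ridge_fit` is my name):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: w = (X^T X + lam * I~)^{-1} X^T y, where I~ is the
    identity with its top-left entry zeroed so the bias (first column of X,
    assumed to be all ones) is not penalized."""
    d = X.shape[1]
    I = np.eye(d)
    I[0, 0] = 0.0  # do not penalize the bias term
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)

rng = np.random.default_rng(1)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y = X @ np.array([0.5, 1.0, -1.0]) + 0.1 * rng.normal(size=50)
w0 = ridge_fit(X, y, 0.0)    # lam = 0 reduces to ordinary least squares
w1 = ridge_fit(X, y, 10.0)   # non-bias weights shrink toward zero
assert np.linalg.norm(w1[1:]) < np.linalg.norm(w0[1:])
```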

Polynomial Regression

Replace $x$ by a polynomial feature map $\phi(x) = (1, x, x^2, \dots, x^p)$. Note that $f(x) = w^\top \phi(x)$ is still a linear function of $w$.

Lasso Regression

$J(w) = \frac{1}{2} \| y - Xw \|^2 + \lambda \| w \|_1.$ Assuming a zero-mean Laplace prior $p(w_j) \propto \exp(-|w_j|/b)$, the MAP estimation yields this $\ell_1$-penalized objective.

Ridge uses L2 penalty, shrinking weights towards zero without feature selection. In contrast, Lasso uses L1 penalty, forcing some weights to exactly zero, achieving model sparsity and automatic feature selection.

Robust Linear Regression

Adopt the $\ell_1$ loss as follows: $J(w) = \sum_{i=1}^{N} |y_i - w^\top x_i|$.

Assuming that the noise follows a Laplace distribution, $\epsilon \sim \mathrm{Laplace}(0, b)$, maximum likelihood estimation yields the $\ell_1$ loss above.

Two ways to make it differentiable:

  • Huber loss: $L_\delta(r) = \frac{1}{2} r^2$ if $|r| \le \delta$, and $\delta |r| - \frac{1}{2}\delta^2$ otherwise, which is quadratic for small residuals and linear for large ones.

  • By using the following equation: $|r| = \min_{z > 0} \left( \frac{r^2}{2z} + \frac{z}{2} \right)$, attained at $z = |r|$.

    Then it can be reformulated as an alternating optimization:

    Given $z$, update $w$ by solving a weighted least-squares problem; given $w$, update $z_i = |y_i - w^\top x_i|$.

Comparison

| Likelihood | Prior | Regression Method |
| --- | --- | --- |
| Gaussian | Uniform | Least squares |
| Gaussian | Gaussian | Ridge regression |
| Gaussian | Laplace | Lasso regression |
| Laplace | Uniform | Robust regression |
| Student | Uniform | Robust regression |

Logistic Regression

Hypothesis Function

The hypothesis is $h_w(x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$. Predict $y = 1$ when $w^\top x \ge 0$ (i.e., $h_w(x) \ge 0.5$), and $y = 0$ when $w^\top x < 0$; the decision boundary is determined by $w^\top x = 0$.

Cost Function

  • Squared Loss

    • Linear Regression

      $J(w) = \frac{1}{2m} \sum_{i=1}^m (h_w(x_i) - y_i)^2$, which is convex in $w$.

    • Logistic Regression

      With $h_w(x) = \sigma(w^\top x)$, the squared loss is non-convex in $w$, so gradient descent may get stuck in local minima.

  • Cross-entropy Loss

    $\mathrm{Cost}(h_w(x), y) = -\log h_w(x)$ if $y = 1$, and $-\log(1 - h_w(x))$ if $y = 0$.

    For $y = 1$: the cost is $0$ if $h_w(x) = 1$, and the cost $\to \infty$ as $h_w(x) \to 0$.

    For $y = 0$: the cost is $0$ if $h_w(x) = 0$, and the cost $\to \infty$ as $h_w(x) \to 1$.

Gradient Descent

Assume $m$ samples in total. Learning by minimizing

$J(w) = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \log h_w(x_i) + (1 - y_i) \log (1 - h_w(x_i)) \right].$

By gradient descent:

$w \leftarrow w - \eta \cdot \frac{1}{m} \sum_{i=1}^m (h_w(x_i) - y_i) \, x_i.$
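The update above can be sketched on toy separable data (a minimal illustration; names and the toy data are mine):

```python
import numpy as np

def sigmoid(z):
    z = np.clip(z, -30, 30)  # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, lr=0.1, epochs=2000):
    """Minimize the cross-entropy loss by batch gradient descent:
    w <- w - lr * (1/m) * X^T (sigmoid(Xw) - y)."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / m
        w -= lr * grad
    return w

# Toy linearly separable data: label is 1 when x1 + x2 > 0
rng = np.random.default_rng(2)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = (X[:, 1] + X[:, 2] > 0).astype(float)
w = logistic_gd(X, y)
pred = (sigmoid(X @ w) >= 0.5).astype(float)
assert np.mean(pred == y) > 0.9
```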

Multi-class Classification

Softmax

$p(y = k \mid x) = \frac{\exp(w_k^\top x)}{\sum_{j=1}^{K} \exp(w_j^\top x)}, \quad k = 1, \dots, K.$

Cost Function

$J(W) = -\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \mathbb{1}[y_i = k] \log p(y_i = k \mid x_i).$
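A minimal NumPy sketch of the softmax and its cross-entropy cost (the max-subtraction trick is a standard numerical-stability device, not from the notes):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row max leaves the result
    unchanged but prevents overflow in exp."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy_loss(Z, y):
    """Mean cross-entropy: -(1/m) sum_i log p(y_i | x_i)."""
    P = softmax(Z)
    m = len(y)
    return -np.mean(np.log(P[np.arange(m), y]))

Z = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.2]])
P = softmax(Z)
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a distribution
# Correct labels (argmax classes) give a lower loss than wrong ones
assert cross_entropy_loss(Z, np.array([0, 1])) < cross_entropy_loss(Z, np.array([2, 2]))
```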

Regularized Logistic Regression

$J(w) = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \log h_w(x_i) + (1 - y_i) \log(1 - h_w(x_i)) \right] + \frac{\lambda}{2m} \sum_{j=1}^d w_j^2$ (the bias $w_0$ is not penalized).

Linear Regression vs Logistic Regression

| Dimension | Linear Regression | Logistic Regression |
| --- | --- | --- |
| Task Type | Regression task predicting continuous real values | Classification task predicting discrete class labels |
| Model Hypothesis | Linear function $h_w(x) = w^\top x$; output range $(-\infty, +\infty)$ | Sigmoid-mapped linear function $h_w(x) = \sigma(w^\top x)$; output range $(0, 1)$ (posterior probability of positive class) |
| Loss Function | Residual sum of squares (MSE) | Cross-entropy loss |
| Solution Method | Closed-form solution $(X^\top X)^{-1} X^\top y$ or gradient descent | No closed-form solution; optimized via gradient descent |

Support Vector Machine

Large Margin

Denote $w^\top x + b = 0$ a hyperplane; then $w$ is orthogonal to the hyperplane, and a point $x$ has the distance $\frac{|w^\top x + b|}{\|w\|}$ to the hyperplane.

Margin is defined as the distance between the hyperplane and the closest data point, which can be formulated as $\gamma = \min_i \frac{y_i (w^\top x_i + b)}{\|w\|}$ with $y_i \in \{-1, +1\}$.

Since rescaling $(w, b)$ does not change the hyperplane, set $\min_i y_i (w^\top x_i + b) = 1$; then $\gamma = \frac{1}{\|w\|}$, and maximizing the margin is equivalent to minimizing $\|w\|^2$.

Then the optimization problem can be formulated as follows:

$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \; i = 1, \dots, N.$

Hinge Loss

Let $h(x) = w^\top x + b$ be the hypothesis function, then the cost function can be defined as follows:

$J(w, b) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^N \ell(y_i, h(x_i)).$

The hinge loss is defined as follows:

$\ell(y, h(x)) = \max(0, 1 - y \, h(x)).$

Then the optimization problem can be reformulated as follows:

$\min_{w, b} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^N \max(0, 1 - y_i (w^\top x_i + b)).$

Lagrange Duality

The objective function of (hard-margin) SVM is

$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1.$

Its Lagrangian function is defined as follows:

$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^N \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right], \quad \alpha_i \ge 0.$

The KKT conditions are as follows:

$\alpha_i \ge 0, \quad y_i (w^\top x_i + b) - 1 \ge 0, \quad \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right] = 0,$

together with stationarity: $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$.

Then the Lagrangian function can be reformulated as follows:

$L(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \, x_i^\top x_j.$

Then the dual problem is obtained as follows:

$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad \alpha_i \ge 0, \; \sum_i \alpha_i y_i = 0.$

The data points with $\alpha_i > 0$ are called support vectors, which locate on the margin and determine the decision boundary. Let $S$ be the set of support vectors, then the bias can be determined by any support vector $x_s$ as follows:

$b = y_s - \sum_{i \in S} \alpha_i y_i \, x_i^\top x_s.$

SVM with Slack Variables

In this case, the optimization problem can be formulated as follows:

$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^N \xi_i \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0.$

Its Lagrangian function is defined as follows:

$L(w, b, \xi, \alpha, \mu) = \frac{1}{2} \|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i (w^\top x_i + b) - 1 + \xi_i \right] - \sum_i \mu_i \xi_i.$

The KKT conditions are as follows:

$\alpha_i \ge 0, \quad \mu_i \ge 0, \quad \alpha_i \left[ y_i (w^\top x_i + b) - 1 + \xi_i \right] = 0, \quad \mu_i \xi_i = 0, \quad \alpha_i + \mu_i = C,$

together with $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$.

Then the Lagrangian function can be reformulated as in the hard-margin case, and the dual problem becomes

$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \; \sum_i \alpha_i y_i = 0.$

Let $S = \{ i : 0 < \alpha_i < C \}$. The bias parameter can be determined by any support vector $x_s$ with $0 < \alpha_s < C$ as follows:

$b = y_s - \sum_{i} \alpha_i y_i \, x_i^\top x_s.$

| Condition on $\alpha_i$ | Slack $\xi_i$ | Contributes to classifier? | Correctly classified? | Location relative to margin / decision boundary |
| --- | --- | --- | --- | --- |
| $\alpha_i = 0$ | $\xi_i = 0$ | No | Yes | Outside the margin |
| $0 < \alpha_i < C$ | $\xi_i = 0$ | Yes | Yes | On the margin |
| $\alpha_i = C$ | $0 < \xi_i \le 1$ | Yes | Yes | Inside the margin, not crossing decision boundary |
| $\alpha_i = C$ | $\xi_i > 1$ | Yes | No | Crossing decision boundary |

SVM with Kernels

Define the kernel function $k(x, x') = \phi(x)^\top \phi(x')$, then the dual problem can be reformulated as follows:

$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \; \sum_i \alpha_i y_i = 0.$

The solution becomes

$w = \sum_i \alpha_i y_i \, \phi(x_i).$

The prediction of new data becomes

$f(x) = \mathrm{sign} \left( \sum_i \alpha_i y_i \, k(x_i, x) + b \right).$

Some kernel functions: linear $k(x, x') = x^\top x'$; polynomial $k(x, x') = (x^\top x' + c)^d$; Gaussian (RBF) $k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$.

Multi-class SVM

Train $K$ binary SVM classifiers $f_1, \dots, f_K$ with the one-vs-rest strategy, then predict the label of $x$ as $\hat{y} = \arg\max_k f_k(x)$.

Tree-based Methods

Decision Tree

A decision tree (non-parametric model) is a hierarchical model for supervised learning whereby the local region is identified in a sequence of recursive splits.

Classification Tree

  • Impurity Measure (Choose the attribute that maximizes the reduction in impurity.)

    Let $\mathcal{D}_k$ be the set of instances of $\mathcal{D}$ belonging to class $k$ ($k = 1, \dots, K$), then $p_k = |\mathcal{D}_k| / |\mathcal{D}|$ is the proportion of class $k$ in $\mathcal{D}$.

    Entropy is a measurement of impurity, which is defined as follows:

    $H(\mathcal{D}) = -\sum_{k=1}^{K} p_k \log_2 p_k.$

    And the entropy of $\mathcal{D}$ given the condition of attribute $a$ is defined as follows (suppose attribute $a$ splits $\mathcal{D}$ into subsets $\mathcal{D}^1, \dots, \mathcal{D}^V$):

    $H(\mathcal{D} \mid a) = \sum_{v=1}^{V} \frac{|\mathcal{D}^v|}{|\mathcal{D}|} H(\mathcal{D}^v).$

    Information gained by branching on attribute $a$ is defined as follows:

    $\mathrm{Gain}(\mathcal{D}, a) = H(\mathcal{D}) - H(\mathcal{D} \mid a).$

  • Gini Index

    Gini index is another measurement of impurity, which is defined as follows:

    $\mathrm{Gini}(\mathcal{D}) = 1 - \sum_{k=1}^{K} p_k^2.$

    It equals the probability that two instances randomly selected from $\mathcal{D}$ have different class labels.

    And the Gini index of $\mathcal{D}$ given the condition of attribute $a$ is defined as follows (suppose attribute $a$ splits $\mathcal{D}$ into subsets $\mathcal{D}^1, \dots, \mathcal{D}^V$):

    $\mathrm{Gini}(\mathcal{D} \mid a) = \sum_{v=1}^{V} \frac{|\mathcal{D}^v|}{|\mathcal{D}|} \mathrm{Gini}(\mathcal{D}^v).$

    Gini gain by branching on attribute $a$ is defined as follows:

    $\Delta \mathrm{Gini}(\mathcal{D}, a) = \mathrm{Gini}(\mathcal{D}) - \mathrm{Gini}(\mathcal{D} \mid a).$
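The impurity measures above can be sketched directly (function names are mine):

```python
import numpy as np

def entropy_impurity(labels):
    """H(D) = -sum_k p_k log2 p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini_impurity(labels):
    """Gini(D) = 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, splits, impurity=entropy_impurity):
    """Gain = impurity(parent) - sum_v (|D^v|/|D|) * impurity(D^v)."""
    n = len(parent)
    cond = sum(len(s) / n * impurity(s) for s in splits)
    return impurity(parent) - cond

parent = np.array([0, 0, 1, 1])
pure_split = [np.array([0, 0]), np.array([1, 1])]
assert np.isclose(entropy_impurity(parent), 1.0)          # maximally mixed binary set
assert np.isclose(gini_impurity(parent), 0.5)
assert np.isclose(information_gain(parent, pure_split), 1.0)  # perfect split
```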

Regression Tree

The regression tree grows by recursively selecting the binary split on any variable that most reduces the sum of squared errors.

$\hat{y}_\ell = \frac{1}{N_\ell} \sum_{x_i \in \ell} y_i$ is the prediction for leaf $\ell$, where $N_\ell$ is the number of instances in leaf $\ell$.

  • Mean Squared Error (MSE)

    The total MSE is $\frac{1}{N} \sum_{\ell} \sum_{x_i \in \ell} (y_i - \hat{y}_\ell)^2$.

  • Sum of Squared Errors (SSE)

    The total SSE is $\sum_{\ell} \sum_{x_i \in \ell} (y_i - \hat{y}_\ell)^2$.

Overfitting and Pruning

Pruning of the decision tree is done by replacing a subtree with a leaf node. The subtree is replaced if the error of the subtree on the validation set is larger than the error of the leaf node on the validation set.

Ensemble Models

Bagging

Sample records with replacement from the training set to create multiple bootstrap samples, and train a base model on each bootstrap sample. The final prediction is obtained by averaging the predictions of all base models.

It aggregates the predictions of all single trees but produces many correlated trees, so the diversity of the trees is not high.

Random Forest

It chooses a random subset of features to split each node, which increases the diversity of the trees and reduces the correlation between trees.

Neural Networks

Activation Function

Perceptron Model

$h(x) = \mathrm{sign}(w^\top x)$; the objective function is $J(w) = -\sum_{i \in \mathcal{M}} y_i \, w^\top x_i$, where $\mathcal{M}$ is the set of misclassified samples, and the update rule is $w \leftarrow w + \eta \, y_i x_i$ for each misclassified $x_i$.

For a new data point $x$, the prediction is $\hat{y} = \mathrm{sign}(w^\top x)$, which is a linear classifier.

Multi-layer Feedforward Neural Networks

To solve XOR, a one‑hidden‑layer network learns non‑linear input features. Each layer re‑expresses the previous layer’s output, progressively extracting more abstract and discriminative features.

$h^{(l)} = f^{(l)}(W^{(l)} h^{(l-1)} + b^{(l)})$, where $f^{(l)}$ is the transformation of layer $l$.

Backpropagation (using computational graph (CG))

Forward Pass

For each node $v_i$ in topological order, compute $v_i$ as a function of its parents.

Backward Pass

Set $\bar{v}_N = \frac{\partial E}{\partial v_N} = 1$ for the output node; then for $i = N-1, \dots, 1$,

$\bar{v}_i = \sum_{j \in \mathrm{children}(i)} \bar{v}_j \frac{\partial v_j}{\partial v_i}.$

Also update parameters by

$\theta \leftarrow \theta - \eta \, \bar{\theta},$

where $\bar{\theta}$ is the gradient of the cost with respect to the parameters of the corresponding layer.

Convolutional Neural Networks

Convolutional Layer

Input volume is $W_1 \times H_1 \times D_1$.

A convolutional layer has four hyperparameters: number of filters $K$, filter size $F$, stride $S$, and zero padding $P$ (per side).

Output spatial size is $W_2 = \frac{W_1 - F + 2P}{S} + 1$ and $H_2 = \frac{H_1 - F + 2P}{S} + 1$.

Output volume is $W_2 \times H_2 \times K$.

Number of parameters is $F \cdot F \cdot D_1 \cdot K$ weights plus $K$ biases.
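The shape arithmetic above can be checked with a tiny helper (the function is my own illustration):

```python
def conv_output_shape(W1, H1, D1, K, F, S, P):
    """Spatial size (W - F + 2P)/S + 1; output volume W2 x H2 x K;
    parameters: F*F*D1 weights per filter plus one bias each."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    n_params = (F * F * D1) * K + K
    return (W2, H2, K), n_params

# Example: 32x32x3 input, 10 filters of size 5x5, stride 1, padding 2
shape, n_params = conv_output_shape(32, 32, 3, K=10, F=5, S=1, P=2)
assert shape == (32, 32, 10)             # "same" padding preserves spatial size
assert n_params == 5 * 5 * 3 * 10 + 10   # 760 parameters
```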

Pooling Layer

Pooling layer has no learnable parameters. Hyperparameters are filter size $F$ and stride $S$.

Output spatial size is $W_2 = \frac{W_1 - F}{S} + 1$ and $H_2 = \frac{H_1 - F}{S} + 1$.

Pooling operates on each channel independently, so the output volume is $W_2 \times H_2 \times D_1$.

Common pooling functions are $\max$ and $\mathrm{average}$.

Fully Connected Layer

A fully connected layer computes the class scores, resulting in an output volume of $1 \times 1 \times C$, where $C$ is the number of classes; each of its neurons connects to the entire input volume.

Recurrent Neural Networks and Transformer

RNN

Basic

$h_t = f(h_{t-1}, x_t)$, $\hat{y}_t = g(h_t)$, where $f$ is the transition function and $g$ is the output function. The cost/error function is $E = \sum_t E_t$, where $E_t = \ell(\hat{y}_t, y_t)$.

However, the basic RNN suffers from vanishing/exploding gradient problem, which makes it difficult to learn long-term dependencies.

Long Short-Term Memory (LSTM)

Limitations

  • Difficulty with Long-Term Dependencies
  • Limited Parallelization

Transformer

Attention Mechanism

Let $Q = X W_Q$, $K = X W_K$, $V = X W_V$ be the query, key, and value matrices, where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$, $W_V \in \mathbb{R}^{d \times d_v}$, and $X \in \mathbb{R}^{n \times d}$ is the input sequence. Then the attention output is defined as follows:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V.$
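Scaled dot-product attention can be sketched as follows (a minimal single-head version on random matrices; the names are mine):

```python
import numpy as np

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # each row is a distribution
    return A @ V, A

rng = np.random.default_rng(3)
n, d_k, d_v = 4, 8, 5
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, A = attention(Q, K, V)
assert out.shape == (n, d_v)
assert np.allclose(A.sum(axis=1), 1.0)  # each query's weights sum to 1
```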

Multi-head Attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, capturing diverse types of relationships in parallel.

Bias-variance Tradeoff

Let $D$ be a training dataset drawn from distribution $P$, and $h_D$ be the model trained by algorithm $\mathcal{A}$ on $D$. Let $\bar{h} = \mathbb{E}_D[h_D]$ be the average model and $\bar{y}(x) = \mathbb{E}[y \mid x]$ the expected label.

  • $\mathbb{E}_{(x, y) \sim P} \left[ (h_D(x) - y)^2 \right]$: expected test error for a fixed model $h_D$.

  • $\mathbb{E}_{D} \, \mathbb{E}_{(x, y) \sim P} \left[ (h_D(x) - y)^2 \right]$: expected test error for the algorithm $\mathcal{A}$.

  • The expected test error (mean squared error (MSE)) decomposes as

    $\mathbb{E}_{D, (x, y)} \left[ (h_D(x) - y)^2 \right] = \underbrace{\mathbb{E} \left[ (h_D(x) - \bar{h}(x))^2 \right]}_{\text{variance}} + \underbrace{\mathbb{E} \left[ (\bar{h}(x) - \bar{y}(x))^2 \right]}_{\text{bias}^2} + \underbrace{\mathbb{E} \left[ (y - \bar{y}(x))^2 \right]}_{\text{noise}},$

    where the cross terms vanish in expectation.

  • Comparison

    | Term | Description |
    | --- | --- |
    | Variance | Measures how much the predictions of a model vary when trained on different training sets. (model's sensitivity to the specific training data) |
    | Bias² | Measures how far the average prediction of a model is from the true value. (model's systematic error) |
    | Noise | Measures how much the observed values vary around the true target function due to randomness in the data. (data's inherent unpredictability) |

Performance Evaluation

Hyper-parameter Tuning

Hyper-parameters are typically determined outside the learning algorithm rather than learned by the algorithm from the training set.

Cross-validation

Cross-validation is a statistical method of evaluating generalization performance that is more stable and thorough than using a split into a training and a test set.

  • K-fold cross validation

Split the training data into $K$ folds; use each fold in turn as the validation set, while the other $K - 1$ folds form the training set. Note that $K$ is a hyper-parameter.

Evaluation Metrics

Regression

  • Mean Squared Error: $\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
  • Mean Absolute Error: $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$

Classification

  • Confusion Matrix

    | Actual \ Predicted | Positive (Predicted) | Negative (Predicted) |
    | --- | --- | --- |
    | Positive (Actual) | TP | FN |
    | Negative (Actual) | FP | TN |
    • Accuracy: the total number of correctly classified samples over all samples under evaluation, $\frac{TP + TN}{TP + TN + FP + FN}$.

    • Precision: the proportion of correctly predicted positive samples among all samples predicted as positive, $\frac{TP}{TP + FP}$.

    • Recall: the proportion of correctly predicted positive samples among all actual positive samples, $\frac{TP}{TP + FN}$.

    • Cost-sensitive Accuracy: weight the errors of each class by its cost, since different classes may have different importance.

    • Binary Classification

      | Metric | Formula |
      | --- | --- |
      | TPR (True Positive Rate) | $TP / (TP + FN)$ |
      | FNR (False Negative Rate) | $FN / (TP + FN)$ |
      | TNR (True Negative Rate) | $TN / (TN + FP)$ |
      | FPR (False Positive Rate) | $FP / (TN + FP)$ |

      If positive and negative classes are balanced, then $\mathrm{Accuracy} = \frac{TPR + TNR}{2}$.

      As the decision threshold varies from $0$ to $1$, FPR decreases while FNR increases, and their intersection point, where FPR equals FNR, is called the Equal Error Rate (EER). The lower the EER, the better the classifier.

  • Receiver Operating Characteristic Curve (ROC) The curve of TPR against FPR as the decision threshold varies.

  • Area Under the ROC (AUC) The probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. (It means perfect prediction if it equals $1$.)

    Consider a set of binary samples indexed by $i = 1, \dots, m^+$ for the positive class, and $j = 1, \dots, m^-$ for the negative class. Let $f$ be a predictor, with scores $s_i^+ = f(x_i^+)$ and $s_j^- = f(x_j^-)$.

    Consider also a Heaviside step function given by

    $H(t) = \begin{cases} 1 & t > 0 \\ 1/2 & t = 0 \\ 0 & t < 0 \end{cases}$

    Then AUC can be expressed by

    $\mathrm{AUC} = \frac{1}{m^+ m^-} \sum_{i=1}^{m^+} \sum_{j=1}^{m^-} H(s_i^+ - s_j^-).$
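The pairwise formula can be sketched directly (`auc` is my name for this illustration):

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """AUC = (1 / (m+ m-)) * sum_i sum_j H(s_i^+ - s_j^-), where the
    Heaviside step H(t) = 1 if t > 0, 1/2 if t = 0, 0 if t < 0."""
    diff = np.subtract.outer(scores_pos, scores_neg)
    H = np.where(diff > 0, 1.0, np.where(diff == 0, 0.5, 0.0))
    return H.mean()

assert auc([0.9, 0.8], [0.1, 0.2]) == 1.0   # perfect ranking
assert auc([0.1, 0.2], [0.9, 0.8]) == 0.0   # perfectly wrong ranking
assert auc([0.5], [0.5]) == 0.5             # a tie counts one half
```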

Unsupervised Learning

Clustering

Clustering is a problem of learning to assign labels to examples by leveraging an unlabeled dataset.

Dimensionality Reduction

Principal component analysis (PCA) is a technique that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, called principal components.

Density Estimation

Estimate the distribution based on some observed data.

  • Kernel Density Estimation (KDE)

$\hat{p}(x) = \frac{1}{N h} \sum_{i=1}^{N} K\left( \frac{x - x_i}{h} \right),$ where $h$ is a hyper-parameter called kernel size (bandwidth), which can be tuned using $K$-fold cross-validation.

Autoencoder

An autoencoder is a type of artificial neural network used to learn efficient coding of unlabeled data.

Self-supervised Learning

Self-supervised learning (SSL) is a paradigm where a model learns from unlabeled data by automatically generating pseudo-labels from the data itself, without requiring human annotation.

K-means Clustering

Initialize centroids, assign each point to nearest centroid, update centroids by mean of assigned points, repeat until convergence.

In short, K-means will minimize the within-cluster variance, as follows:

$J(r, \mu) = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \| x_i - \mu_k \|^2,$

where $r_{ik} = 1$ denotes $x_i$ is assigned to cluster $k$ (and $r_{ik} = 0$ otherwise).

Coordinate Descent

Assignment (Given $\mu$ to update $r$)

Note that the assignment for each data point can be solved independently:

$\min_{r_i} \; \sum_{k=1}^{K} r_{ik} \| x_i - \mu_k \|^2.$

The solution can be obtained as follows:

$r_{ik} = 1$ if $k = \arg\min_j \| x_i - \mu_j \|^2$, and $r_{ik} = 0$ otherwise.

Refitting (Given $r$ to update $\mu$)

Note that each $\mu_k$ can be optimized independently, as follows:

$\min_{\mu_k} \; \sum_{i=1}^{N} r_{ik} \| x_i - \mu_k \|^2.$

By setting the derivative to $0$:

$\mu_k = \frac{\sum_{i} r_{ik} \, x_i}{\sum_{i} r_{ik}}.$

Since the objective function is non-convex, K-means may converge to a local minimum, and the final solution can be sensitive to the initial centroids. To mitigate this issue, it is common practice to run K-means multiple times with different random initializations and select the solution with the lowest within-cluster variance.
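The alternating steps above can be sketched as follows (a minimal implementation on a toy two-blob dataset of my own construction):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate assignment and refitting until convergence."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]  # initialize from data points
    for _ in range(n_iters):
        # Assignment step: nearest centroid for each point
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = d2.argmin(axis=1)
        # Refitting step: centroid = mean of assigned points
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, z

# Two well-separated blobs
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.2, (50, 2)), rng.normal(5.0, 0.2, (50, 2))])
mu, z = kmeans(X, K=2)
assert len(np.unique(z[:50])) == 1 and len(np.unique(z[50:])) == 1  # blobs separated
```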

Performance Evaluation

Internal Evaluation Metrics (Silhouette Coefficient)

Define $a(i)$ as the mean distance between point $i$ and all other points in the same cluster. Define $b(i)$ as the smallest mean distance of point $i$ to all points in any other cluster.

The silhouette coefficient for a single sample is formulated as follows:

$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \in [-1, 1].$

$s(i)$ close to $1$ indicates that the sample is well clustered.

External evaluation metrics (Rand Index)

| Symbol | Definition (over all pairs of points; clustering $C$, reference $C^*$) |
| --- | --- |
| $a$ | pairs in the same cluster in $C$, same cluster in $C^*$ |
| $b$ | pairs in different clusters in $C$, different clusters in $C^*$ |
| $c$ | pairs in the same cluster in $C$, different clusters in $C^*$ |
| $d$ | pairs in different clusters in $C$, same cluster in $C^*$ |

$\mathrm{RI} = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{N}{2}}.$

For random clusterings, the expected value of the Rand Index is not zero. The Adjusted Rand Index (ARI) is defined as follows:

$\mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]}.$

Gaussian Mixture Models

$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),$ where $\pi_k$ are the mixing coefficients, $\pi_k \ge 0$, $\sum_{k=1}^{K} \pi_k = 1$, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a Gaussian density with mean $\mu_k$ and covariance $\Sigma_k$.

Alternating Updating Algorithm

Derivation from MLE

The log-likelihood is given by

$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k).$

Let the derivative of the log-likelihood with respect to $\mu_k$ be zero:

$\sum_{n=1}^{N} \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \, \Sigma_k^{-1} (x_n - \mu_k) = 0.$

Let $\gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$ and $N_k = \sum_n \gamma(z_{nk})$, then

$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n.$

Let the derivative of the log-likelihood with respect to $\Sigma_k$ be zero:

$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^\top.$

Considering $\pi$, use a Lagrange multiplier to handle the constraint $\sum_k \pi_k = 1$. Maximize $\ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_k \pi_k - 1 \right)$, then

$\pi_k = \frac{N_k}{N}.$

Add a hidden variable $z_n$ (one-hot, with $z_{nk} = 1$ iff component $k$ generates $x_n$) to indicate which component generates $x_n$, then the complete-data log-likelihood is given by

$\ln p(X, Z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right].$

Algorithm

  1. Initialize $\pi_k, \mu_k, \Sigma_k$ for $k = 1, \dots, K$.

  2. E-step Compute the responsibilities

    $\gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.$

  3. M-step

    First, compute the effective number of points assigned to component $k$:

    $N_k = \sum_{n=1}^{N} \gamma(z_{nk}).$

    Then update the parameters:

    $\mu_k = \frac{1}{N_k} \sum_n \gamma(z_{nk}) \, x_n, \quad \Sigma_k = \frac{1}{N_k} \sum_n \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^\top, \quad \pi_k = \frac{N_k}{N}.$

  4. Repeat steps 2 and 3 until convergence.
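The E- and M-steps above can be sketched compactly. This version simplifies to isotropic covariances $\sigma_k^2 I$ (an assumption of mine, not from the notes) to keep the code short:

```python
import numpy as np

def gaussian_pdf(X, mu, var):
    """Isotropic Gaussian density N(x | mu, var*I)."""
    d = X.shape[1]
    norm = (2.0 * np.pi * var) ** (-d / 2.0)
    return norm * np.exp(-((X - mu) ** 2).sum(axis=1) / (2.0 * var))

def gmm_em(X, K, n_iters=50):
    """EM for a GMM with isotropic covariances var_k * I."""
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[np.linspace(0, N - 1, K).astype(int)].copy()  # spread initial means
    var = np.full(K, X.var())
    for _ in range(n_iters):
        # E-step: responsibilities gamma(z_nk) ∝ pi_k * N(x_n | mu_k, var_k)
        R = np.stack([pi[k] * gaussian_pdf(X, mu[k], var[k]) for k in range(K)], axis=1)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: effective counts N_k, then closed-form updates
        Nk = R.sum(axis=0)
        pi = Nk / N
        mu = (R.T @ X) / Nk[:, None]
        var = np.array([(R[:, k] * ((X - mu[k]) ** 2).sum(axis=1)).sum() / (d * Nk[k])
                        for k in range(K)])
    return pi, mu, var

# Two well-separated blobs; EM should place one mean near each center
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(4.0, 0.5, (100, 2))])
pi, mu, var = gmm_em(X, K=2)
assert np.isclose(pi.sum(), 1.0)
assert abs(min(mu[:, 0])) < 0.5 and abs(max(mu[:, 0]) - 4.0) < 0.5
```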

The Expectation-Maximization Algorithm

Jensen's Inequality

For a convex function $f$, we have $f\left( \sum_i \lambda_i x_i \right) \le \sum_i \lambda_i f(x_i)$ for $\lambda_i \ge 0$, $\sum_i \lambda_i = 1$.

For a concave function $f$, the inequality is reversed.

Proof (for convex function $f$)

When $n = 2$, $f(\lambda_1 x_1 + \lambda_2 x_2) \le \lambda_1 f(x_1) + \lambda_2 f(x_2)$, where $\lambda_1 + \lambda_2 = 1$, which is exactly the definition of convexity.

When $n > 2$, proof by induction applies. Assuming the inequality holds for $n - 1$ variables, consider $n$ variables. The following relation holds:

$\sum_{i=1}^{n} \lambda_i x_i = \lambda_n x_n + (1 - \lambda_n) \sum_{i=1}^{n-1} \frac{\lambda_i}{1 - \lambda_n} x_i,$

where $\sum_{i=1}^{n-1} \frac{\lambda_i}{1 - \lambda_n} = 1$. Then applying the two-point case followed by the induction hypothesis gives the result.

Auxiliary Distribution of Latent Variables

Let $x$ denote the observed data and $z$ denote the latent variables. The log-likelihood can be expressed as follows:

$\ln p(x \mid \theta) = \ln \sum_{z} p(x, z \mid \theta).$

Let $q(z)$ be an auxiliary distribution over the latent variables, with constraint $\sum_z q(z) = 1$. Then by using Jensen's inequality ($\ln$ is concave), we have

$\ln p(x \mid \theta) = \ln \sum_z q(z) \frac{p(x, z \mid \theta)}{q(z)} \ge \sum_z q(z) \ln \frac{p(x, z \mid \theta)}{q(z)} =: \mathcal{L}(q, \theta).$

To extend the above equation to the whole dataset, assuming the samples are independent, sum the bound over all samples:

$\sum_{n=1}^{N} \ln p(x_n \mid \theta) \ge \sum_{n=1}^{N} \sum_{z_n} q_n(z_n) \ln \frac{p(x_n, z_n \mid \theta)}{q_n(z_n)}.$

E Step

Given $\theta^{\mathrm{old}}$, the optimal $q$ is given by

$q^*(z) = \arg\max_q \mathcal{L}(q, \theta^{\mathrm{old}}) = p(z \mid x, \theta^{\mathrm{old}}),$

at which the KL divergence $\mathrm{KL}\left( q \,\|\, p(z \mid x, \theta^{\mathrm{old}}) \right)$ becomes zero and the bound is tight.

M Step

Given $q$, the optimal $\theta$ is given by

$\theta^{\mathrm{new}} = \arg\max_\theta \mathcal{L}(q, \theta).$

Substituting the $q^*$ obtained in the E step into the above equation,

$\theta^{\mathrm{new}} = \arg\max_\theta \sum_z p(z \mid x, \theta^{\mathrm{old}}) \ln p(x, z \mid \theta) =: \arg\max_\theta Q(\theta, \theta^{\mathrm{old}}).$

EM Convergence

Application to GMM

E-step computes the posterior responsibility of each Gaussian for each data point, and M-step updates the parameters using weighted averages based on those responsibilities.

Set $q(z_{nk}) = \gamma(z_{nk})$, then compute the expected complete-data log-likelihood as follows:

$Q(\theta, \theta^{\mathrm{old}}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left[ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right].$

Next, maximize this expected log-likelihood with respect to to obtain closed-form updates for the M-step.

Principal Component Analysis

Given a dataset $\{x_i\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^D$, PCA aims to find a projection matrix $U_M \in \mathbb{R}^{D \times M}$ ($M < D$) that maps the original data to a lower-dimensional space while maximizing the variance of the projected data.

Projection onto a Subspace

Let $x \in \mathbb{R}^D$ and let the $M$-dimensional subspace $\mathcal{U}$ be spanned by an orthonormal basis $\{u_1, \dots, u_M\}$. The projection of a data point $x$ onto the subspace is given by

$\tilde{x} = \sum_{m=1}^{M} (u_m^\top x) \, u_m.$

The vector $x - \tilde{x}$ is orthogonal to the subspace $\mathcal{U}$: $u_m^\top (x - \tilde{x}) = 0$ for $m = 1, \dots, M$.

Principal Component Analysis Algorithm

Variance Maximization

PCA seeks the direction $u$ with $u^\top u = 1$ that maximizes the variance of the projected data, $\frac{1}{N} \sum_i \left( u^\top (x_i - \bar{x}) \right)^2$, where $\bar{x} = \frac{1}{N} \sum_i x_i$.

Note that maximizing the projected variance is equivalent to minimizing the reconstruction error, by the Pythagorean theorem: $\|x\|^2 = \|\tilde{x}\|^2 + \|x - \tilde{x}\|^2$.

Define the empirical covariance matrix as follows:

$S = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^\top.$

Then the optimization problem can be rewritten as follows:

$\max_u \; u^\top S u \quad \text{s.t.} \quad u^\top u = 1.$

By using the Lagrange multiplier to handle the constraint $u^\top u = 1$:

$L(u, \lambda) = u^\top S u + \lambda (1 - u^\top u),$

where $\lambda$ is the Lagrange multiplier.

Then the optimal $u$ can be obtained by setting the derivative of $L$ with respect to $u$ as zero:

$S u = \lambda u,$

i.e., $u$ is an eigenvector of $S$, and the projected variance is $u^\top S u = \lambda$.

Utilizing the eigendecomposition (SVD) of $S$:

$S = U \Lambda U^\top = \sum_{d=1}^{D} \lambda_d \, u_d u_d^\top,$

where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_D \ge 0$ are the eigenvalues of $S$, and $u_d$ is the corresponding eigenvector.

Substituting the eigendecomposition of $S$ into the above equation, the optimal $M$-dimensional subspace is spanned by

$U_M = [u_1, \dots, u_M],$

the eigenvectors corresponding to the $M$ largest eigenvalues.

The Uncorrelatedness of the Covariance Matrix of Projected Data

Let $z_i = U_M^\top (x_i - \bar{x})$ be the projected data, $z_i \in \mathbb{R}^M$.

The covariance matrix of the projected data is given by

$S_z = \frac{1}{N} \sum_{i=1}^{N} z_i z_i^\top = U_M^\top S U_M.$

Let $S = U \Lambda U^\top$, then $U^\top S U = \Lambda$, and $U_M^\top S U_M = \Lambda_M$ is the top-left $M \times M$ block of $\Lambda$. Then the covariance matrix of the projected data can be rewritten as follows:

$S_z = \Lambda_M = \mathrm{diag}(\lambda_1, \dots, \lambda_M),$

which is a diagonal matrix. Therefore, the covariance matrix of the projected data is diagonal, which means that the projected data are uncorrelated, i.e., $\mathrm{cov}(z^{(i)}, z^{(j)}) = 0$ for $i \ne j$.
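The whole PCA pipeline, including the uncorrelatedness of the projected coordinates, can be checked in a few lines (a minimal sketch; the function name and toy data are mine):

```python
import numpy as np

def pca(X, M):
    """PCA via eigendecomposition of the empirical covariance matrix.
    Returns the top-M eigenvectors (columns) and the projected data."""
    Xc = X - X.mean(axis=0)                 # center the data
    S = Xc.T @ Xc / len(X)                  # empirical covariance
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:M]   # top-M directions by variance
    U = eigvecs[:, order]
    return U, Xc @ U

rng = np.random.default_rng(6)
# 3-D data with most variance along the first axis
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])
U, Z = pca(X, M=2)
assert Z.shape == (200, 2)
# Projected coordinates are uncorrelated: covariance of Z is diagonal
C = Z.T @ Z / len(Z)
assert abs(C[0, 1]) < 1e-8
```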