Machine Learning Syllabus

From Basic ANN to CNN

A Research-Backed, Math-Heavy Learning Path for Beginners.

How to Use This Syllabus

Each module contains:

■ Concept explanation: intuitive understanding
■ Mathematics: formal notation and derivations
■ Key Research Papers: with links and what to read in them
■ Suggested time: approximate weeks per module

Prerequisites: High school algebra, basic Python programming.

MODULE 0

Foundation Mathematics (Weeks 1–3)

"You cannot understand deep learning without understanding linear algebra, calculus, and statistics. These are not optional extras — they are the language in which neural networks are written."

0.1 Linear Algebra

Why it matters: Almost every neural network operation reduces to matrix multiplication under the hood.

Vectors and Matrices

A vector x ∈ ℝⁿ is a column of numbers. A matrix W ∈ ℝ^(m×n) transforms vectors:

y = Wx + b

where W is a weight matrix, x is input, b is a bias vector.

Dot Product

x · y = Σᵢ xᵢyᵢ = ||x|| ||y|| cos(θ)

Matrix Multiplication

If A ∈ ℝ^(m×k) and B ∈ ℝ^(k×n), then C = AB ∈ ℝ^(m×n) where:

Cᵢⱼ = Σₖ AᵢₖBₖⱼ

Transpose: (AB)ᵀ = BᵀAᵀ
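The definitions above can be checked on a small example. This is a minimal pure-Python sketch; the helper names `matmul` and `transpose` and the example matrices are illustrative assumptions, not part of the syllabus.

```python
def matmul(A, B):
    """Multiply an m×k matrix by a k×n matrix (both given as lists of rows)."""
    k = len(B)
    assert all(len(row) == k for row in A), "inner dimensions must match"
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

A = [[1, 2, 3],
     [4, 5, 6]]      # 2×3
B = [[1, 0],
     [0, 1],
     [2, 2]]         # 3×2
C = matmul(A, B)     # 2×2, C[i][j] = sum over k of A[i][k] * B[k][j]

# The transpose rule (AB)^T = B^T A^T holds for this example.
lhs = transpose(C)
rhs = matmul(transpose(B), transpose(A))
```

Here `C` is `[[7, 8], [16, 17]]`, and `lhs` equals `rhs`.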

Eigenvalues and Eigenvectors

For a matrix A: Av = λv, where λ is the eigenvalue and v is the eigenvector.

Used in: Principal Component Analysis (PCA), understanding optimization landscapes.

Norms

L1 norm: ||x||₁ = Σᵢ |xᵢ|
L2 norm: ||x||₂ = √(Σᵢ xᵢ²)
Frobenius norm (for matrices): ||A||_F = √(Σᵢⱼ Aᵢⱼ²)

📄 Resources for Linear Algebra

Book:

Goodfellow et al. (2016), Deep Learning, Chapter 2 — Linear Algebra

→ https://www.deeplearningbook.org/contents/linear_algebra.html

Interactive:

3Blue1Brown — 'Essence of Linear Algebra' series

→ https://www.3blue1brown.com/topics/linear-algebra

0.2 Calculus & Multivariable Differentiation

Why it matters: Training neural networks = minimizing a loss function using derivatives.

Derivatives

f'(x) = lim[h→0] (f(x+h) - f(x)) / h

Partial Derivatives

For f(x, y): ∂f/∂x treats y as a constant and differentiates w.r.t. x.

The Chain Rule

The most important rule in deep learning

If z = f(g(x)), then:

dz/dx = (dz/dg) · (dg/dx)

For multiple variables, if z = f(y₁, y₂, ..., yₙ) and each yᵢ = gᵢ(x):

∂z/∂x = Σᵢ (∂z/∂yᵢ)(∂yᵢ/∂x)

Gradient

Vector of partial derivatives; it points in the direction of steepest ascent:

∇f(x) = [∂f/∂x₁ ... ∂f/∂xₙ]ᵀ
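A gradient can be approximated numerically, which is a common way to sanity-check hand-derived derivatives. The sketch below uses central differences; the function `numerical_gradient`, the step size `h`, and the test function `f` are assumptions chosen for illustration.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate each partial derivative of f at x with a central difference."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

def f(x):
    # f(x0, x1) = x0^2 + 3*x0*x1, so the exact gradient is [2*x0 + 3*x1, 3*x0]
    return x[0] ** 2 + 3 * x[0] * x[1]

g = numerical_gradient(f, [1.0, 2.0])   # exact gradient at (1, 2) is [8, 3]
```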

Jacobian Matrix

For a vector-valued function f: ℝⁿ → ℝᵐ:

J = [∂fᵢ/∂xⱼ] (m×n)

Hessian Matrix

Second-order curvature matrix:

H_ij = ∂²f / ∂xᵢ∂xⱼ

📄 Resources for Calculus

Book:

Goodfellow et al., Deep Learning, Chapter 4 — Numerical Computation

→ https://www.deeplearningbook.org/contents/numerical.html

Paper:

Baydin et al. (2018), Automatic Differentiation in Machine Learning: a Survey

→ https://arxiv.org/abs/1502.05767

0.3 Probability & Statistics

Why it matters: Neural networks model probability distributions; loss functions derive from likelihood.

Probability Rules

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
P(A | B) = P(A ∩ B) / P(B)

Expectation and Variance

E[X] = Σₓ x · P(x)
Var(X) = E[(X - E[X])²] = E[X²] - (E[X])²

Bayes' Theorem

P(θ | X) = P(X | θ) · P(θ) / P(X)

P(θ | X): posterior, what we believe after seeing data
P(X | θ): likelihood, how probable the data is given parameters
P(θ): prior, what we believed before seeing data

Gaussian Distribution

P(x; μ, σ²) = (1/√(2πσ²)) · exp(-(x - μ)² / (2σ²))

KL Divergence

Measures how different two distributions are

D_KL(P || Q) = Σₓ P(x) log(P(x) / Q(x))

Information & Entropy

H(P) = -Σₓ P(x) log P(x)

Cross-Entropy

The foundation of classification loss

H(P, Q) = -Σₓ P(x) log Q(x)
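Entropy, cross-entropy and KL divergence are linked by the identity H(P, Q) = H(P) + D_KL(P || Q). The sketch below checks it on two small discrete distributions; the function names and the example values of `P` and `Q` are assumptions for illustration, and natural logarithms are used.

```python
import math

def entropy(p):
    """H(P) = -sum P(x) log P(x); terms with P(x) = 0 contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) log Q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum P(x) log(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]

# Cross-entropy exceeds entropy by exactly the KL divergence.
gap = cross_entropy(P, Q) - entropy(P)
kl = kl_divergence(P, Q)
```

Since H(P) does not depend on Q, minimizing cross-entropy with respect to Q is the same as minimizing the KL divergence.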

📄 Resources for Probability

Book:

Goodfellow et al., Deep Learning, Chapter 3 — Probability and Information Theory

→ https://www.deeplearningbook.org/contents/prob.html

0.4 Optimization Theory

Gradient Descent (Vanilla)

θ ← θ - α · ∇_θ L(θ)

where α is the learning rate and L is the loss.
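The update rule can be tried on a one-dimensional loss. This is a minimal sketch assuming the loss L(θ) = (θ - 3)², whose gradient is 2(θ - 3); the function name `gradient_descent` and the hyperparameter values are illustrative.

```python
def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    """Repeat the update theta <- theta - alpha * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

# L(theta) = (theta - 3)^2 has its minimum at theta = 3.
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
```

With α = 0.1 each step shrinks the distance to the minimum by a factor of 0.8, so `theta_star` ends up very close to 3.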

Convexity

A function f is convex if:

f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y), ∀ λ ∈ [0,1]

For a convex function, every local minimum is a global minimum. Neural network losses are non-convex, so in practice we settle for good local minima instead.

MODULE 1

Machine Learning Fundamentals (Weeks 4–5)

1.1 The Machine Learning Paradigm

Types of Learning

Type	Setup	Example
Supervised	Labeled data (X, Y)	Image classification
Unsupervised	Unlabeled data X only	Clustering, generation
Reinforcement	Agent + reward signal	Game playing

The Statistical Learning Framework

We want to find a function f: X → Y that minimizes expected risk:

R(f) = E[L(f(x), y)] = ∫ L(f(x), y) dP(x, y)

Since we can't compute this over all possible data, we minimize the empirical risk over the training set:

R̂(f) = (1/n) Σᵢ L(f(xᵢ), yᵢ)

1.2 The Bias-Variance Tradeoff

Under squared-error loss, the expected test error of an estimator decomposes as:

E[(y - f̂(x))²] = Bias(f̂(x))² + Var(f̂(x)) + σ²_noise

■ High bias = underfitting (model too simple)
■ High variance = overfitting (model memorizes training data)
■ σ²_noise = irreducible error

1.3 Key Evaluation Metrics

For Classification:

Accuracy: (TP + TN) / Total
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
F1 Score: 2 · (Precision · Recall) / (Precision + Recall)
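The four classification metrics can be computed directly from confusion-matrix counts. The counts below are made-up numbers for illustration, and the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
acc, prec, rec, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

For these counts accuracy is 0.7, precision 0.8, recall 2/3, and F1 is 8/11, which sits between precision and recall.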

For Regression:

MSE: (1/n) Σ (yᵢ - ŷᵢ)²
MAE: (1/n) Σ |yᵢ - ŷᵢ|
R²: 1 - SS_res / SS_tot

📄 Key Papers for ML Fundamentals

Paper:

LeCun, Bengio, Hinton (2015), 'Deep Learning', Nature

→ https://www.nature.com/articles/nature14539

Read: Section 1 (Introduction) and Section 2 (Supervised learning)

MODULE 2

The Perceptron & Single-Layer Networks (Week 6)

2.1 Biological Inspiration

The McCulloch-Pitts Neuron (1943) was the first mathematical model of a biological neuron:

output = 1 if Σᵢ wᵢxᵢ ≥ threshold

output = 0 otherwise

📄 Foundational Paper

Paper:

McCulloch, W.S. & Pitts, W. (1943), 'A Logical Calculus of Ideas Immanent in Nervous Activity'

→ https://link.springer.com/article/10.1007/BF02478259

2.2 Rosenblatt's Perceptron (1958)

The perceptron computes:

ŷ = sign(wᵀx + b) = sign(Σᵢ wᵢxᵢ + b)

Perceptron Learning Rule:

For each misclassified example (x, y):

w ← w + α · y · x

b ← b + α · y

Perceptron Convergence Theorem: If the data is linearly separable, the perceptron will converge in a finite number of steps.

Limitation: A single perceptron cannot solve XOR, because it can only separate linearly separable data. This limitation contributed to the first "AI winter."
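The learning rule can be sketched on the logical AND function, which is linearly separable. Labels are assumed to be in {-1, +1}, matching the sign output; the function names, the zero initialization, and treating a point exactly on the boundary as misclassified are choices made for this example.

```python
def train_perceptron(data, alpha=1.0, max_epochs=100):
    """Apply w <- w + alpha*y*x and b <- b + alpha*y on every misclassified example."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:          # wrong side of (or on) the boundary
                w = [wi + alpha * y * xi for wi, xi in zip(w, x)]
                b += alpha * y
                mistakes += 1
        if mistakes == 0:                    # a full pass with no errors: converged
            break
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(AND)
predictions = [predict(w, b, x) for x, _ in AND]
```

Running the same loop on XOR labels would never reach a mistake-free pass, because no single line separates the two classes.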

📄 Key Papers for Perceptron

Paper:

Rosenblatt, F. (1958), 'The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain'

→ https://psycnet.apa.org/doi/10.1037/h0042519

Book:

Minsky & Papert (1969), Perceptrons — proved XOR limitation

→ https://mitpress.mit.edu/9780262630221/perceptrons/

MODULE 3

Artificial Neural Networks (Weeks 7–10)

3.1 The Multi-Layer Perceptron (MLP)

A feedforward neural network with:

An input layer
One or more hidden layers
An output layer

Single Neuron (layer l, neuron j):

zⱼ^(l) = Σᵢ wⱼᵢ^(l) · aᵢ^(l-1) + bⱼ^(l)
aⱼ^(l) = σ(zⱼ^(l))

In Matrix Form:

z^(l) = W^(l) · a^(l-1) + b^(l)
a^(l) = σ(z^(l))

where:

W^(l) ∈ ℝ^(n_l × n_{l-1}) is the weight matrix
b^(l) ∈ ℝ^(n_l) is the bias vector
σ is the activation function applied element-wise

3.2 Activation Functions

Activation functions introduce non-linearity — without them, deep networks collapse to a single linear transformation.

Sigmoid

σ(z) = 1 / (1 + e^(-z))
σ'(z) = σ(z)(1 - σ(z)) ∈ (0, 0.25]

Output range: (0, 1)
Problem: Vanishing gradient — σ'(z) ≈ 0 for large |z|

Tanh

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
tanh'(z) = 1 - tanh²(z) ∈ (0, 1]

Output range: (-1, 1)
Zero-centered (better than sigmoid)
Still suffers from vanishing gradients

ReLU — Rectified Linear Unit

ReLU(z) = max(0, z)
ReLU'(z) = 1 if z > 0, else 0

Computationally efficient
Avoids vanishing gradient for z > 0
Problem: "Dying ReLU": a neuron whose pre-activation z is negative for every input gets zero gradient and stops updating

Leaky ReLU

f(z) = z if z > 0
f(z) = αz if z ≤ 0 (α ≈ 0.01)

ELU — Exponential Linear Unit

f(z) = z if z > 0
f(z) = α(e^z - 1) if z ≤ 0

Softmax (for multi-class output)

softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)

Properties: outputs sum to 1, interpretable as probabilities.
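A direct implementation of the softmax formula overflows for large inputs, so implementations usually subtract the maximum first; this leaves the result unchanged because the common factor cancels. The sketch below is one such version, with illustrative input values.

```python
import math

def softmax(z):
    """Softmax over a list of scores, shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])

# math.exp(1002.0) alone would overflow; the shifted version does not.
big = softmax([1000.0, 1001.0, 1002.0])
small = softmax([0.0, 1.0, 2.0])
```

`big` and `small` agree because softmax is invariant to adding a constant to every score.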

📄 Key Papers for Activations

Paper:

Nair, V. & Hinton, G.E. (2010), 'Rectified Linear Units Improve Restricted Boltzmann Machines'

→ https://icml.cc/Conferences/2010/papers/432.pdf

Paper:

Clevert, D., Unterthiner, T. & Hochreiter, S. (2015), 'Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)'

→ https://arxiv.org/abs/1511.07289

Paper:

Maas, A.L. et al. (2013), 'Rectifier Nonlinearities Improve Neural Network Acoustic Models'

→ https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf

3.3 Loss Functions

Mean Squared Error (Regression)

L_MSE = (1/n) Σᵢ (yᵢ - ŷᵢ)²

Derived from the negative log-likelihood of a Gaussian — minimizing MSE = maximizing likelihood under Gaussian noise.

Binary Cross-Entropy (Binary Classification)

L_BCE = -(1/n) Σᵢ [yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]

Derived from the Bernoulli likelihood.

Categorical Cross-Entropy (Multi-class)

L_CE = -(1/n) Σᵢ Σⱼ yᵢⱼ log(ŷᵢⱼ)

where yᵢⱼ ∈ {0,1} is one-hot encoded ground truth.

3.4 Forward Propagation

Full forward pass for an L-layer network:

a^(0) = x    (input)
for l = 1, 2, ..., L:
    z^(l) = W^(l) a^(l-1) + b^(l)
    a^(l) = σ(z^(l))
ŷ = a^(L)    (output)

3.5 Backpropagation

The most important algorithm in deep learning. Efficiently computes gradients using the chain rule. Goal: Compute ∂L/∂W^(l) and ∂L/∂b^(l) for all layers l.

Step 1 — Output layer error

δ^(L) = ∇_a L ⊙ σ'(z^(L))

where ⊙ is element-wise multiplication.

Step 2 — Backpropagate the error

δ^(l) = (W^(l+1))ᵀ δ^(l+1) ⊙ σ'(z^(l))

Step 3 — Compute gradients

∂L/∂W^(l) = δ^(l) (a^(l-1))ᵀ
∂L/∂b^(l) = δ^(l)

Intuition: δ^(l) represents how much each neuron in layer l contributed to the final error. We propagate this "blame" backwards through the network.
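The three steps can be traced on a tiny network and checked against a finite-difference estimate. This sketch assumes a 2-input network with one hidden layer of 2 sigmoid neurons, a single sigmoid output, and the loss L = ½(a - y)² on one example; the weights, the example, and all function names are arbitrary illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [sigmoid(z) for z in z1]
    a2 = sigmoid(sum(w * a for w, a in zip(W2, a1)) + b2)
    return a1, a2

def loss(params, x, y):
    return 0.5 * (forward(params, x)[1] - y) ** 2

def backprop(params, x, y):
    W1, b1, W2, b2 = params
    a1, a2 = forward(params, x)
    # Step 1: output error = dL/da * sigma'(z), with sigma'(z) = a(1 - a)
    delta2 = (a2 - y) * a2 * (1 - a2)
    # Step 2: propagate back through W2, then multiply by sigma'(z) of the hidden layer
    delta1 = [W2[j] * delta2 * a1[j] * (1 - a1[j]) for j in range(len(a1))]
    # Step 3: gradients are outer products of errors with the incoming activations
    dW2 = [delta2 * a for a in a1]
    dW1 = [[d * xi for xi in x] for d in delta1]
    return dW1, delta1, dW2, delta2

params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.2)
x, y = [1.0, 2.0], 1.0
dW1, db1, dW2, db2 = backprop(params, x, y)

# Finite-difference check of one entry, dL/dW1[0][1].
h = 1e-6
plus = ([[0.1, -0.2 + h], [0.4, 0.3]],) + params[1:]
minus = ([[0.1, -0.2 - h], [0.4, 0.3]],) + params[1:]
numeric = (loss(plus, x, y) - loss(minus, x, y)) / (2 * h)
```

`numeric` and `dW1[0][1]` should agree to many decimal places; a mismatch in such a check usually points to an error in the derivation.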

Computational Complexity: O(n_params), the same order of cost as a single forward pass.

📄 Foundational Papers for Backprop

Paper:

Rumelhart, D.E., Hinton, G.E. & Williams, R.J. (1986), 'Learning Representations by Back-propagating Errors', Nature

→ https://www.nature.com/articles/323533a0

THE paper that made neural networks practical. Must-read.

Paper:

LeCun, Y. (1988), 'A Theoretical Framework for Back-Propagation'

→ http://yann.lecun.com/exdb/publis/pdf/lecun-88.pdf

Paper:

Linnainmaa, S. (1976), 'Taylor Expansion of the Accumulated Rounding Error'

(First formulation of the reverse mode automatic differentiation)

3.6 Gradient Descent Variants

Batch Gradient Descent

θ ← θ - α · (1/n) Σᵢ ∇_θ L(xᵢ, yᵢ)

Uses entire dataset per update. Stable but slow.

Stochastic Gradient Descent (SGD)

θ ← θ - α · ∇_θ L(xᵢ, yᵢ) (one sample at a time)

Noisy but cheap per update; the noise can help escape poor local minima.

Mini-Batch Gradient Descent (Standard in practice)

θ ← θ - α · (1/B) Σ_{i ∈ batch} ∇_θ L(xᵢ, yᵢ)

where B is the batch size (typically 32, 64, 128, 256).

Momentum

Accumulates a velocity vector in directions of persistent gradient:

v ← β·v - α·∇_θ L
θ ← θ + v

Adam Optimizer

Combines momentum and adaptive learning rates:

m_t = β₁·m_{t-1} + (1-β₁)·g_t (1st moment)
v_t = β₂·v_{t-1} + (1-β₂)·g_t² (2nd moment)

m̂_t = m_t / (1 - β₁ᵗ) (bias correction)
v̂_t = v_t / (1 - β₂ᵗ) (bias correction)

θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε)

Default: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸
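The Adam update can be written out for a single scalar parameter. This is a sketch under the same toy assumption as a one-dimensional quadratic loss L(θ) = (θ - 3)²; the function name `adam` and the learning rate used in the example run are illustrative, while the default β₁, β₂ and ε follow the values listed above.

```python
import math

def adam(grad, theta, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Scalar Adam: moving averages of g and g^2, bias correction, scaled step."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # 1st moment
        v = beta2 * v + (1 - beta2) * g * g      # 2nd moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

grad = lambda t: 2 * (t - 3)

# After one step the bias-corrected ratio m_hat / sqrt(v_hat) is g / |g|,
# so the first update has size alpha regardless of the gradient's scale.
first = adam(grad, theta=0.0, alpha=0.1, steps=1)

theta_adam = adam(grad, theta=0.0, alpha=0.1, steps=500)
```

The first step moves θ from 0 to about 0.1 even though the gradient there is -6, which illustrates the per-parameter step normalization.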

📄 Key Papers for Optimization

Paper:

Kingma, D.P. & Ba, J. (2014), 'Adam: A Method for Stochastic Optimization'

→ https://arxiv.org/abs/1412.6980

One of the most cited ML papers ever.

Paper:

Sutton, R.S. (1986), 'Two Problems with Backpropagation and Other Steepest Descent Learning Procedures'

(Discusses the ravine problem of steepest descent, which momentum helps address)

Survey:

Ruder, S. (2016), 'An Overview of Gradient Descent Optimization Algorithms'

→ https://arxiv.org/abs/1609.04747

MODULE 4

Training Deep Networks (Weeks 11–13)

4.1 The Vanishing and Exploding Gradient Problem

During backprop, gradients are products of weight matrices and activation derivatives:

∂L/∂W^(1) ∝ (W^(L))ᵀ · σ'(z^(L)) · (W^(L-1))ᵀ · ... · σ'(z^(1))

Vanishing Gradients

If each term is < 1 → gradients shrink exponentially

Exploding Gradients

If each term is > 1 → gradients grow exponentially

Solutions:

✔ Better activation functions (ReLU)
✔ Better weight initialization
✔ Batch Normalization
✔ Residual connections (ResNet)
✔ Gradient clipping (for RNNs)

📄 Key Paper for Gradient Problems

Paper:

Hochreiter, S. (1991/1998), 'The Vanishing Gradient Problem During Learning Recurrent Neural Nets'

→ https://www.bioinf.jku.at/publications/older/2304.pdf

4.2 Weight Initialization

Why it matters: Bad initialization → dead neurons or exploding activations.

Zero Initialization

All weights = 0 → all neurons compute the same function → symmetry breaking fails. Never do this.

Random Initialization

Too large → exploding. Too small → vanishing.

Xavier/Glorot Initialization (for tanh/sigmoid)

W ~ Uniform[-√(6/(n_in + n_out)), √(6/(n_in + n_out))]

or equivalently: Var(W) = 2 / (n_in + n_out)

He Initialization (for ReLU)

W ~ N(0, 2/n_in)

Accounts for the fact that ReLU zeros out half the inputs.
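Both schemes can be sampled and their variances compared with the formulas above. This sketch uses only the standard library; the function names, the layer sizes (200 inputs, 100 outputs), and the fixed seed are assumptions for illustration.

```python
import math
import random

def he_normal(n_in, n_out, rng):
    """Sample an n_out × n_in matrix with W ~ N(0, 2/n_in)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def xavier_uniform(n_in, n_out, rng):
    """Sample W ~ Uniform[-limit, limit] with limit = sqrt(6 / (n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]

def variance(matrix):
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)

rng = random.Random(0)
W_he = he_normal(200, 100, rng)           # target variance 2/200 = 0.01
W_xavier = xavier_uniform(200, 100, rng)  # target variance 2/300
```

The empirical variances of the 20,000 sampled weights land close to 2/n_in and 2/(n_in + n_out) respectively.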

📄 Key Papers for Initialization

Paper:

Glorot, X. & Bengio, Y. (2010), 'Understanding the Difficulty of Training Deep Feedforward Neural Networks'

→ https://proceedings.mlr.press/v9/glorot10a.html

Paper:

He, K. et al. (2015), 'Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification'

→ https://arxiv.org/abs/1502.01852

4.3 Batch Normalization

Normalizes layer inputs to have zero mean and unit variance, then re-scales with learnable parameters:

μ_B = (1/m) Σᵢ xᵢ    (batch mean)

σ²_B = (1/m) Σᵢ (xᵢ - μ_B)²    (batch variance)

x̂ᵢ = (xᵢ - μ_B) / √(σ²_B + ε)    (normalize)

yᵢ = γ · x̂ᵢ + β    (scale & shift)
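The four equations translate almost line by line into code. The sketch below handles a single feature across a mini-batch in training mode only; the function name, the default ε = 1e-5, and the omission of the running statistics used at inference time are simplifying assumptions.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    m = len(xs)
    mu = sum(xs) / m                                      # batch mean
    var = sum((x - mu) ** 2 for x in xs) / m              # batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs] # normalize
    return [gamma * xh + beta for xh in x_hat]            # scale & shift

out = batch_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(out) / len(out)
out_var = sum((o - out_mean) ** 2 for o in out) / len(out)

shifted = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=5.0)
```

With γ = 1 and β = 0 the output has mean 0 and variance just under 1 (because of ε); with γ = 2 and β = 5 the mean moves to 5.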

Benefits:

• Reduces internal covariate shift
• Allows higher learning rates
• Acts as a regularizer (reduces need for Dropout)
• Reduces sensitivity to initialization

📄 Key Paper for Batch Norm

Paper:

Ioffe, S. & Szegedy, C. (2015), 'Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift'

→ https://arxiv.org/abs/1502.03167

4.4 Regularization Techniques

The Problem: Neural networks can memorize training data (overfit).

L2 Regularization (Weight Decay)

Add penalty to loss:

L_total = L_data + (λ/2) Σ_w w²

Gradient update becomes:

w ← w(1 - αλ) - α · ∂L_data/∂w

The factor (1 - αλ) "decays" weights toward zero.

L1 Regularization

Add penalty to loss:

L_total = L_data + λ Σ_w |w|

Promotes sparsity — many weights become exactly zero.

Dropout

During training, randomly set each neuron to zero with probability p:

aᵢ^(l) = aᵢ^(l) · Bernoulli(1-p) / (1-p)

The division by (1-p) ensures expected values remain unchanged. At test time, use all neurons.
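This "inverted dropout" scheme can be sketched as follows. The function name, the explicit `rng` argument, and the all-ones activation vector are assumptions made to keep the example reproducible.

```python
import random

def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p and rescale survivors by 1/(1-p)."""
    if not training:
        return list(activations)      # at test time all neurons are used unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
a = [1.0] * 10000
dropped = dropout(a, p=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)
```

Each entry of `dropped` is either 0.0 or 2.0, and `mean_after` stays close to the original mean of 1.0, which is the point of the rescaling.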

Intuition: Forces the network to learn redundant representations; prevents co-adaptation of neurons.

Early Stopping

Monitor validation loss; stop training when it starts increasing. Effectively limits model capacity.

📄 Key Papers for Regularization

Paper:

Srivastava, N. et al. (2014), 'Dropout: A Simple Way to Prevent Neural Networks from Overfitting'

→ https://jmlr.org/papers/v15/srivastava14a.html

Paper:

Krogh, A. & Hertz, J. (1992), 'A Simple Weight Decay Can Improve Generalization'

→ https://proceedings.neurips.cc/paper/1991/hash/8eefcfdf5990e441f0fb6f3fad709e21-Abstract.html

4.5 The Universal Approximation Theorem

Cybenko, 1989; Hornik, 1991

Theorem

A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of ℝⁿ, under mild assumptions on the activation function.

Formal Statement:

For any continuous f: [0,1]ⁿ → ℝ and any ε > 0, there exists a single-hidden-layer network F with N neurons such that:

|F(x) - f(x)| < ε for all x ∈ [0,1]ⁿ

Important Caveat

This theorem guarantees existence, not that gradient descent will find such a network, nor that it requires a practical number of neurons.

📄 Key Papers for Universal Approximation

Paper:

Cybenko, G. (1989), 'Approximation by Superpositions of a Sigmoidal Function'

→ https://link.springer.com/article/10.1007/BF02551274

Paper:

Hornik, K. (1991), 'Approximation Capabilities of Multilayer Feedforward Networks'

→ https://www.sciencedirect.com/science/article/pii/089360809190009T

MODULE 5

Convolutional Neural Networks (Weeks 14–18)

5.1 Motivation: Why Not Just Use MLPs for Images?

A 224×224 RGB image has 224 × 224 × 3 = 150,528 input dimensions. A first hidden layer with just 1,000 neurons needs 150 million parameters — just in layer 1. This is:

Computationally infeasible to train efficiently.

Data-hungry: needs billions of examples to avoid overfitting.

Ignores spatial structure: treats pixel (0,0) and pixel (200,200) as equally related.

CNNs exploit three key properties of natural images:

1. Locality — nearby pixels are correlated
2. Translation invariance — a cat is a cat whether it's top-left or bottom-right
3. Compositionality — complex features are built from simple ones

5.2 The Convolution Operation

1D Convolution (for intuition)

(f * g)[n] = Σₖ f[k] · g[n - k]

2D Discrete Convolution (for images)

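For input I and filter (kernel) K of size (2m+1) × (2n+1), with s running from -m to m and t from -n to n:

(I * K)[i, j] = Σₛ Σₜ I[i+s, j+t] · K[s, t]

Strictly, this is cross-correlation; true convolution flips the kernel and uses I[i-s, j-t]. Deep learning libraries implement the form above and call it convolution, and because the kernel is learned the distinction does not matter in practice.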

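Worked Example:

Only the top-left 3×3 patch of the 5×5 input and the 3×3 kernel are shown; both are read off the products below, assuming they are listed row by row.

INPUT patch (top-left 3×3 of the 5×5 input)

-1   2   3
-2   4   3
 2   3  -8

KERNEL (3×3)

 1   0  -1
 1   0  -1
 1   0  -1

OUTPUT (3×3)

Output[0,0] =

(-1)·1 + 2·0 + 3·(-1) + (-2)·1 + 4·0 + 3·(-1) + 2·1 + 3·0 + (-8)·(-1)

= 1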
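The Output[0,0] arithmetic can be reproduced in code. In this sketch the patch and kernel values are read off the nine products in the sum above, assuming they are listed row by row; the function name `conv2d_valid` is illustrative, and kernel indices start at 0 instead of being centered.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image with no padding and stride 1.

    Uses the same multiply-and-sum as the formula in the text
    (cross-correlation, no kernel flip).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + s][j + t] * kernel[s][t]
                 for s in range(kh) for t in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

patch = [[-1, 2, 3],
         [-2, 4, 3],
         [2, 3, -8]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
result = conv2d_valid(patch, kernel)

# A 5×5 input with a 3×3 kernel gives a 3×3 output.
blank = conv2d_valid([[0] * 5 for _ in range(5)], kernel)
```

`result` is `[[1]]`, matching the hand computation.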

Output size formula:

output_size = ⌊(input_size - kernel_size + 2·padding) / stride⌋ + 1
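The formula is easy to wrap in a helper for checking layer shapes. The function name is an assumption, and integer (floor) division is used for the case where the stride does not divide evenly.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

size_valid = conv_output_size(5, 3)                         # 5×5 input, 3×3 kernel
size_same = conv_output_size(224, 3, padding=1)             # padding 1 keeps the size
size_strided = conv_output_size(227, 11, stride=4)          # 11×11 kernel, stride 4
size_pooled = conv_output_size(55, 3, stride=2)             # 3×3 pooling, stride 2
```

These evaluate to 3, 224, 55 and 27 respectively.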

Parameters in a conv layer:

params = (kernel_h × kernel_w × in_channels + 1) × num_filters

vs. dense layer: input_size × output_size — orders of magnitude fewer parameters!
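To make the comparison concrete, the two counts can be computed side by side. The layer sizes below (a 3×3 conv with 64 filters on an RGB input, and a dense layer from a flattened 224×224×3 image to 1,000 units) are illustrative assumptions; the dense count ignores biases, as in the text.

```python
def conv_params(kernel_h, kernel_w, in_channels, num_filters):
    """Weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * num_filters

def dense_params(input_size, output_size):
    """Weights only."""
    return input_size * output_size

conv_count = conv_params(3, 3, in_channels=3, num_filters=64)
dense_count = dense_params(224 * 224 * 3, 1000)
```

Here `conv_count` is 1,792 and `dense_count` is 150,528,000. The conv count does not depend on the image size because the same kernel is reused at every position.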

5.3 Key CNN Components

Stride

Moving the kernel by stride S instead of 1 pixel at a time:

Reduces output spatial dimensions
Stride 2 ≈ halves the spatial size

Padding

Adding zeros around the input border:

Valid padding: No padding, output shrinks
Same padding: Pad so output size = input size (when stride=1)

Required same-padding: p = (k-1)/2 for odd kernel size k (stride 1)

Pooling

Reduces spatial dimensions while retaining important information.

Max Pooling (most common):

output[i,j] = max{ input[i·s+r, j·s+c] : 0 ≤ r,c < k }

Average Pooling:

output[i,j] = (1/k²) Σᵣ Σ_c input[i·s+r, j·s+c]

Properties of pooling:

No learnable parameters
Makes representation approximately translation-invariant
Reduces computation and memory

Feature Maps

Each convolutional filter produces one feature map (also called activation map). With C filters, we get C feature maps — each detecting a different feature (edges, curves, textures, etc.).

5.4 Receptive Field

The receptive field of a neuron is the region of the input image it can "see." After stacking multiple conv layers:

RF_l = RF_{l-1} + (k_l - 1) · Π_{i=1}^{l-1} s_i

where k_l is kernel size and sᵢ is stride at layer i.
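The recurrence can be evaluated layer by layer, starting from RF_0 = 1 (a single input pixel). The function name and the `(kernel_size, stride)` tuple format are assumptions for this sketch.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, first layer first."""
    rf = 1       # RF_0: one input pixel
    jump = 1     # product of the strides of all earlier layers
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

rf_two = receptive_field([(3, 1), (3, 1)])            # two stacked 3×3 convs
rf_three = receptive_field([(3, 1), (3, 1), (3, 1)])  # three stacked 3×3 convs
rf_strided = receptive_field([(3, 2), (3, 1)])        # stride 2 in the first layer
```

Two stacked 3×3 layers see a 5×5 region and three see 7×7; a stride of 2 in the first layer makes the second layer's contribution twice as large, giving 7.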

Insight: A stack of small 3×3 kernels reaches the same receptive field as a single large kernel, but with fewer parameters and more non-linearities.

5.5 Full CNN Architecture

A typical CNN pipeline:

Input Image

↓

[Conv → BatchNorm → ReLU → Pool] × N    (Feature Extraction)

↓

Flatten

↓

[Fully Connected → ReLU] × M    (Classification Head)

↓

[Fully Connected → Softmax]    (Output)

5.6 Backpropagation Through Convolutions

Gradient w.r.t. kernel weights:

∂L/∂K[s,t] = Σᵢ Σⱼ δ[i,j] · I[i+s, j+t]

Gradient w.r.t. input (for propagating to earlier layers):

∂L/∂I[i,j] = Σₛ Σₜ δ[i-s, j-t] · K[s,t]

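In the convention used for the forward pass, this is a "full" correlation of the zero-padded δ with the 180°-rotated kernel, which is the same as a true convolution of δ with K.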

MODULE 6

Landmark CNN Architectures (Weeks 19–22)

6.1 LeNet-5 (1998) — The Pioneer

Conv → Pool → Conv → Pool → FC → FC → Output

Key Contributions:

First practical deep CNN for digit recognition (MNIST)
Demonstrated CNNs for document recognition at scale
Used sigmoid/tanh activations

Input: 32×32 grayscale

↓

Output: 10 classes

📄 Key Paper for LeNet

Paper:

LeCun, Y. et al. (1998), 'Gradient-Based Learning Applied to Document Recognition'

→ http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf

Must-read: Sections I-III for the architecture, Section II.B for backpropagation in CNNs

6.2 AlexNet (2012) — The Deep Learning Revolution

Architecture: 5 Conv layers + 3 FC layers, ~60M parameters

Key Contributions:

Won ImageNet 2012 with 15.3% top-5 error (vs. 26.2% runner-up)
First large-scale use of ReLU (dramatically faster training)
Applied Dropout in the FC layers
Used GPU training (two GTX 580s)
Local Response Normalization (LRN)
Data augmentation (random crops, horizontal flips)

Architecture in detail:

Input: 224×224×3 (as stated in the paper; the 55×55 Conv1 output corresponds to a 227×227 input with no padding)

Conv1: 96 filters, 11×11, stride 4 → 55×55×96

MaxPool: 3×3, stride 2 → 27×27×96

Conv2: 256 filters, 5×5 → 27×27×256

MaxPool: 3×3, stride 2 → 13×13×256

Conv3: 384 filters, 3×3 → 13×13×384

Conv4: 384 filters, 3×3 → 13×13×384

Conv5: 256 filters, 3×3 → 13×13×256

MaxPool: 3×3, stride 2 → 6×6×256

Flatten → 9216

FC6: 4096, Dropout

FC7: 4096, Dropout

FC8: 1000 (softmax)

📄 Key Paper for AlexNet

Paper:

Krizhevsky, A., Sutskever, I. & Hinton, G.E. (2012), 'ImageNet Classification with Deep Convolutional Neural Networks'

→ https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html

The paper that started the modern deep learning era. Read entirely.

6.3 VGGNet (2014) — The Simplicity Principle

Architecture: Very deep networks using only 3×3 convolutions (VGG-16: 16 layers, VGG-19: 19 layers)

Key insight:

Multiple stacked 3×3 conv layers have the same receptive field as larger kernels but with fewer parameters and more non-linearities:

Two 3×3 layers ≈ one 5×5 layer

Three 3×3 layers ≈ one 7×7 layer

Parameter comparison:

1× 7×7 conv: 49 × C² parameters
3× 3×3 convs: 27 × C² parameters (+ 3 non-linearities!)

📄 Key Paper for VGG

Paper:

Simonyan, K. & Zisserman, A. (2014), 'Very Deep Convolutional Networks for Large-Scale Image Recognition'

→ https://arxiv.org/abs/1409.1556

Read: Table 1 (architectures), Section 2.1-2.3

6.4 GoogLeNet/Inception (2014) — Going Wider

Key innovation: The Inception Module

Instead of choosing between 1×1, 3×3, or 5×5 convolutions, do all of them in parallel and concatenate:

Input

↓↓↓↓

1×1  |  1×1 → 3×3  |  1×1 → 5×5  |  MaxPool → 1×1

↓↓↓↓

Concatenate along channel dim

1×1 convolutions act as a bottleneck to reduce channels before expensive 3×3/5×5 convolutions.

Other innovations:

Auxiliary classifiers at intermediate layers (helps gradient flow)
No FC layers at the end (uses global average pooling)
4M parameters (12× fewer than AlexNet!)

📄 Key Paper for GoogLeNet

Paper:

Szegedy, C. et al. (2014), 'Going Deeper with Convolutions'

→ https://arxiv.org/abs/1409.4842

6.5 ResNet (2015) — The Residual Revolution

The problem: Very deep networks (>20 layers) trained worse than shallower ones. The cause was not overfitting, since training error itself was higher, but optimization difficulty.

The solution: Residual (Skip) Connections

Instead of learning H(x), learn the residual F(x) = H(x) - x:

Output = F(x, {Wᵢ}) + x

where F is 2 or 3 conv layers.

Why it works:

Gradient highway — gradients flow directly through skip connections
Identity mapping is trivially learnable (set F = 0)
Allows networks of 50, 101, 152+ layers

Mathematical insight:

The gradient of the loss w.r.t. the input becomes:

∂L/∂x = ∂L/∂y · (1 + ∂F/∂x)

The "1" gives gradients a direct path through the skip connection, so the gradient reaching x does not vanish even when ∂F/∂x is small.

ResNet-50 Architecture:

Conv1: 7×7, 64 filters, stride 2

MaxPool: 3×3, stride 2

↓

Stage 1: 3× [1×1-64, 3×3-64, 1×1-256] bottleneck blocks

Stage 2: 4× [1×1-128, 3×3-128, 1×1-512] bottleneck blocks

Stage 3: 6× [1×1-256, 3×3-256, 1×1-1024] bottleneck blocks

Stage 4: 3× [1×1-512, 3×3-512, 1×1-2048] bottleneck blocks

↓

GlobalAvgPool → FC 1000 → Softmax

📄 Key Papers for ResNet

Paper:

He, K. et al. (2015), 'Deep Residual Learning for Image Recognition'

→ https://arxiv.org/abs/1512.03385

One of the most influential vision papers of the 2010s. Read entirely.

Paper:

He, K. et al. (2016), 'Identity Mappings in Deep Residual Networks'

→ https://arxiv.org/abs/1603.05027

Read: Section 3 (theoretical analysis of why residuals work)

6.6 DenseNet (2016) — Dense Connectivity

DenseNet connects each layer to every subsequent layer:

x_l = H_l([x_0, x_1, ..., x_{l-1}])

where [·] denotes concatenation of all previous feature maps.

Benefits:

Maximizes gradient flow
Feature reuse
Substantially fewer parameters than ResNet

📄 Key Paper for DenseNet

Paper:

Huang, G. et al. (2016), 'Densely Connected Convolutional Networks'

→ https://arxiv.org/abs/1608.06993

MODULE 7

Transfer Learning & Practical Applications (Weeks 23–25)

7.1 Transfer Learning

Idea: A model trained on a large dataset (e.g., ImageNet with 1.2M images, 1000 classes) learns general visual features that transfer to new tasks.

Why it works: Early CNN layers detect universal low-level features (edges, textures). Later layers are more task-specific.

1. Feature Extraction

Freeze all conv layers, train only the new head.

2. Fine-tuning

Unfreeze some/all layers and train with a small learning rate.

3. Full retraining

Use pretrained weights as initialization, train everything.

Rule of thumb:

	Small dataset	Large dataset
Similar domain:	Feature extraction	Fine-tune last layers
Different domain:	Careful fine-tuning	Possibly full retraining

📄 Key Papers for Transfer Learning

Paper:

Yosinski, J. et al. (2014), 'How Transferable are Features in Deep Neural Networks?'

→ https://arxiv.org/abs/1411.1792

Survey:

Pan, S.J. & Yang, Q. (2009), 'A Survey on Transfer Learning'

→ https://ieeexplore.ieee.org/document/5288526

7.2 Data Augmentation

Artificially expand training data with label-preserving transformations:

Geometric: Random crop, flip, rotation, scaling, shearing
Photometric: Brightness, contrast, hue, saturation changes
Noise: Gaussian noise, cutout, random erasing
Advanced: Mixup (α-blend two images + labels), CutMix

Mixup Formula:

x̃ = λ·xᵢ + (1-λ)·xⱼ
ỹ = λ·yᵢ + (1-λ)·yⱼ

where λ ~ Beta(α, α)
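The mixup formula blends both the inputs and their one-hot labels with the same λ. The sketch below works on plain lists; the function name, the default α = 0.2, and the two-element toy "images" are assumptions for illustration.

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two examples and their labels with lambda ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam

# Two toy examples from different classes, with one-hot labels.
x_mix, y_mix, lam = mixup([1.0, 0.0], [1.0, 0.0],
                          [0.0, 1.0], [0.0, 1.0],
                          alpha=0.2, rng=random.Random(0))
```

The mixed label `y_mix` is `[lam, 1 - lam]`, so it still sums to 1 and can be used directly with a cross-entropy loss.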

📄 Key Paper for Advanced Augmentation

Paper:

Zhang, H. et al. (2017), 'mixup: Beyond Empirical Risk Minimization'

→ https://arxiv.org/abs/1710.09412

MODULE 8

Modern and Efficient Architectures (Weeks 26–28)

8.1 Depthwise Separable Convolutions (MobileNet)

Standard conv: C_in × k × k → C_out (one step)

Depthwise separable conv (Two steps):

1. Depthwise

Apply one k×k filter per input channel (C_in filters total)

2. Pointwise

Apply C_out 1×1 filters

Parameter reduction factor:

Standard:	C_in · k² · C_out
Depthwise separable:	C_in · k² + C_in · C_out
Ratio (separable / standard):	1/C_out + 1/k², i.e. roughly 8-9× fewer parameters for k=3 and large C_out
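The reduction factor can be verified numerically. The channel counts below (128 in, 256 out) are arbitrary example values, and biases are ignored as in the table.

```python
def standard_conv_params(c_in, c_out, k):
    return c_in * k * k * c_out

def depthwise_separable_params(c_in, c_out, k):
    depthwise = c_in * k * k      # one k×k filter per input channel
    pointwise = c_in * c_out      # C_out filters of size 1×1×C_in
    return depthwise + pointwise

standard = standard_conv_params(128, 256, 3)
separable = depthwise_separable_params(128, 256, 3)
ratio = separable / standard      # should equal 1/C_out + 1/k^2
```

For these sizes the standard layer has 294,912 weights and the separable one 33,920, a reduction of about 8.7×.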

📄 Key Paper for MobileNet

Paper:

Howard, A.G. et al. (2017), 'MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications'

→ https://arxiv.org/abs/1704.04861

8.2 Neural Architecture Search (NAS) — EfficientNet

EfficientNet proposes a principled way to scale CNNs on three dimensions simultaneously:

1. Depth: Number of layers (d)
2. Width: Number of channels (w)
3. Resolution: Input image size (r)

With a compound coefficient φ:

d = α^φ | w = β^φ | r = γ^φ

subject to: α · β² · γ² ≈ 2

α ≥ 1, β ≥ 1, γ ≥ 1

📄 Key Paper for EfficientNet

Paper:

Tan, M. & Le, Q.V. (2019), 'EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks'

→ https://arxiv.org/abs/1905.11946

MODULE 9

Beyond CNNs: Vision Transformers (Weeks 29–30)

9.1 Self-Attention (The Mechanism Powering Transformers)

For input sequence X, compute:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

where Q = XW_Q, K = XW_K, V = XW_V are linear projections.
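Scaled dot-product attention is a few matrix operations. The sketch below takes Q, K and V as already-projected matrices (lists of rows) and omits the learned projections W_Q, W_K, W_V; the helper names and the example values are assumptions for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 1.0], [1.0, 1.0]]     # identical keys: every query scores them equally
V = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(Q, K, V)
```

Because both keys are identical, each query attends to them equally and every output row is the average of the value rows, `[0.5, 0.5]`.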

Multi-Head Attention:

MultiHead(Q, K, V) = Concat(head₁, ..., head_h) · W_O

where headᵢ = Attention(QWᵢ_Q, KWᵢ_K, VWᵢ_V)

9.2 Vision Transformer (ViT)

Idea: Treat image patches as "tokens" and apply transformer attention

1. Split image into N patches of size 16×16
2. Embed each patch linearly
3. Add positional embeddings
4. Process with transformer encoder
5. Use [CLS] token for classification

[Image]
↓
[Patches]
↓
[Linear Projection]
+ [Positional Emb]
↓
[Transformer]
↓
[MLP Head]

📄 Key Papers for Transformers

Paper:

Vaswani, A. et al. (2017), 'Attention Is All You Need'

→ https://arxiv.org/abs/1706.03762

(Original Transformer)

Paper:

Dosovitskiy, A. et al. (2020), 'An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale'

→ https://arxiv.org/abs/2010.11929

(ViT)

Complete Curated Reading List

📚 Textbooks

Book	Authors	Notes
Deep Learning	Goodfellow, Bengio, Courville	https://www.deeplearningbook.org
Neural Networks and Deep Learning	Nielsen	
Pattern Recognition and ML	Bishop	Free PDF available online
Mathematics for Machine Learning	Deisenroth et al.	
