Machine Learning Syllabus

From Basic ANN to CNN

A Research-Backed, Math-Heavy Learning Path for Beginners.

How to Use This Syllabus

Each module contains:

  • Concept explanation — intuitive understanding
  • Mathematics — formal notation and derivations
  • Key Research Papers — with links and what to read in them
  • Suggested time — approximate weeks per module
Prerequisites: High school algebra, basic Python programming.
MODULE 0

Foundation Mathematics (Weeks 1–3)

"You cannot understand deep learning without understanding linear algebra, calculus, and statistics. These are not optional extras — they are the language in which neural networks are written."

0.1 Linear Algebra

Why it matters: Every neural network operation is a matrix multiplication under the hood.

Vectors and Matrices

A vector x ∈ ℝⁿ is a column of numbers. A matrix W ∈ ℝ^(m×n) transforms vectors:

y = Wx + b

where W is a weight matrix, x is input, b is a bias vector.

Dot Product

x · y = Σᵢ xᵢyᵢ = ||x|| ||y|| cos(θ)

Matrix Multiplication

If A ∈ ℝ^(m×k) and B ∈ ℝ^(k×n), then C = AB ∈ ℝ^(m×n) where:

Cᵢⱼ = Σₖ AᵢₖBₖⱼ
Transpose: (AB)ᵀ = BᵀAᵀ
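
The index formula and the transpose identity above can be checked numerically. The sketch below uses plain Python lists instead of a numerical library, purely for readability; the helper names `matmul` and `transpose` are illustrative and not part of the syllabus.

```python
def matmul(A, B):
    """C[i][j] = sum_k A[i][k] * B[k][j] for matrices stored as lists of rows."""
    m, k, n = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]


A = [[1, 2, 3],
     [4, 5, 6]]            # 2×3
B = [[1, 0],
     [0, 1],
     [2, 2]]               # 3×2

C = matmul(A, B)                                  # 2×2 product AB
lhs = transpose(C)                                # (AB)ᵀ
rhs = matmul(transpose(B), transpose(A))          # BᵀAᵀ
print(C, lhs == rhs)
```

Note the shapes: A is 2×3 and B is 3×2, so the inner dimension k = 3 disappears and C is 2×2.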

Eigenvalues and Eigenvectors

For a matrix A: Av = λv, where λ is the eigenvalue and v is the eigenvector.

Used in: Principal Component Analysis (PCA), understanding optimization landscapes.

Norms

  • L1 norm: ||x||₁ = Σᵢ |xᵢ|
  • L2 norm: ||x||₂ = √(Σᵢ xᵢ²)
  • Frobenius norm (for matrices): ||A||_F = √(Σᵢⱼ Aᵢⱼ²)

0.2 Calculus & Multivariable Differentiation

Why it matters: Training neural networks = minimizing a loss function using derivatives.

Derivatives

f'(x) = lim[h→0] (f(x+h) - f(x)) / h

Partial Derivatives

For f(x, y): ∂f/∂x treats y as a constant and differentiates w.r.t. x.

The Chain Rule

The most important rule in deep learning

If z = f(g(x)), then:

dz/dx = (dz/dg) · (dg/dx)

For multiple variables, if z = f(y₁, y₂, ..., yₙ) and each yᵢ = gᵢ(x):

∂z/∂x = Σᵢ (∂z/∂yᵢ)(∂yᵢ/∂x)

Gradient

The vector of partial derivatives; it points in the direction of steepest ascent:

∇f(x) = [∂f/∂x₁ ... ∂f/∂xₙ]ᵀ
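
A common sanity check is to compare a hand-derived gradient against a finite-difference estimate built from the limit definition of the derivative. The example function f(x₁, x₂) = x₁² + 3x₁x₂ and the step size h are assumptions chosen for illustration.

```python
def f(x):
    """Example scalar function f(x1, x2) = x1^2 + 3*x1*x2."""
    return x[0] ** 2 + 3 * x[0] * x[1]


def analytic_grad(x):
    """Gradient derived by hand: [2*x1 + 3*x2, 3*x1]."""
    return [2 * x[0] + 3 * x[1], 3 * x[0]]


def numeric_grad(func, x, h=1e-6):
    """Central-difference estimate of each partial derivative."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h          # nudge only coordinate i, hold the others constant
        down[i] -= h
        grad.append((func(up) - func(down)) / (2 * h))
    return grad


point = [1.0, 2.0]
g_exact = analytic_grad(point)
g_approx = numeric_grad(f, point)
print(g_exact, g_approx)
```

The two results should agree to several decimal places; a large mismatch usually means the hand derivation is wrong.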

Jacobian Matrix

For a vector-valued function f: ℝⁿ → ℝᵐ:

J = [∂fᵢ/∂xⱼ] (m×n)

Hessian Matrix

Second-order curvature matrix:

H_ij = ∂²f / ∂xᵢ∂xⱼ

0.3 Probability & Statistics

Why it matters: Neural networks model probability distributions; loss functions derive from likelihood.

Probability Rules

  • P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
  • P(A | B) = P(A ∩ B) / P(B)

Expectation and Variance

  • E[X] = Σₓ x · P(x)
  • Var(X) = E[(X - E[X])²] = E[X²] - (E[X])²

Bayes' Theorem

P(θ | X) = P(X | θ) · P(θ) / P(X)

  • P(θ | X), posterior: what we believe after seeing data
  • P(X | θ), likelihood: how probable the data is given parameters
  • P(θ), prior: what we believed before seeing data

Gaussian Distribution

P(x; μ, σ²) = (1/√(2πσ²)) · exp(-(x - μ)² / (2σ²))

KL Divergence

Measures how different two distributions are:

D_KL(P || Q) = Σₓ P(x) log(P(x) / Q(x))

Information & Entropy

H(P) = -Σₓ P(x) log P(x)

Cross-Entropy

The foundation of classification loss:

H(P, Q) = -Σₓ P(x) log Q(x)
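
The three quantities above are linked by the identity H(P, Q) = H(P) + D_KL(P || Q), which follows directly from the definitions. A minimal sketch that checks it on two made-up distributions (P and Q below are arbitrary example values):

```python
import math


def entropy(p):
    """H(P) = -sum_x P(x) log P(x); terms with P(x) = 0 contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.7, 0.2, 0.1]   # "true" distribution (example values)
Q = [0.5, 0.3, 0.2]   # model distribution (example values)

h_p = entropy(P)
h_pq = cross_entropy(P, Q)
kl_pq = kl_divergence(P, Q)
print(h_pq, h_p + kl_pq)
```

Because H(P) does not depend on the model, minimizing cross-entropy with respect to Q is the same as minimizing the KL divergence.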

0.4 Optimization Theory

Gradient Descent (Vanilla)

θ ← θ - α · ∇_θ L(θ)

where α is the learning rate and L is the loss.
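
As a minimal sketch, the update rule can be run on a one-dimensional loss L(θ) = (θ - 3)², whose gradient is 2(θ - 3) and whose minimum is at θ = 3. The loss, the starting point, and α = 0.1 are assumptions chosen for illustration.

```python
def gradient_descent(grad_fn, theta, alpha, steps):
    """Repeatedly apply theta <- theta - alpha * grad L(theta)."""
    for _ in range(steps):
        theta = theta - alpha * grad_fn(theta)
    return theta


def grad_loss(theta):
    """Gradient of the example loss L(theta) = (theta - 3)^2."""
    return 2 * (theta - 3.0)


theta_star = gradient_descent(grad_loss, theta=0.0, alpha=0.1, steps=100)
print(theta_star)
```

With this loss each step shrinks the distance to the minimum by a factor of 1 - 2α = 0.8, so the iterate converges geometrically. A learning rate above 1.0 would make the same loop diverge.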

Convexity

A function f is convex if:

f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y), ∀ λ ∈ [0,1]

For a convex function, every local minimum is also a global minimum. Neural network losses are non-convex, so in practice we settle for good local minima instead.

MODULE 1

Machine Learning Fundamentals (Weeks 4–5)

1.1 The Machine Learning Paradigm

Types of Learning

Type            Setup                     Example
Supervised      Labeled data (X, Y)       Image classification
Unsupervised    Unlabeled data X only     Clustering, generation
Reinforcement   Agent + reward signal     Game playing

The Statistical Learning Framework

We want to find a function f: X → Y that minimizes expected risk:

R(f) = E[L(f(x), y)] = ∫ L(f(x), y) dP(x, y)

Since we can't compute this over all possible data, we minimize the empirical risk over the training set:

R̂(f) = (1/n) Σᵢ L(f(xᵢ), yᵢ)

1.2 The Bias-Variance Tradeoff

Under squared-error loss, the expected test error of an estimator decomposes as:

E[(y - f̂(x))²] = Bias(f̂(x))² + Var(f̂(x)) + σ²_noise
  • High bias = underfitting (model too simple)
  • High variance = overfitting (model memorizes training data)
  • σ²_noise = irreducible error

1.3 Key Evaluation Metrics

For Classification:
  • Accuracy: (TP + TN) / Total
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1 Score: 2 · (Precision · Recall) / (Precision + Recall)
For Regression:
  • MSE: (1/n) Σ (yᵢ - ŷᵢ)²
  • MAE: (1/n) Σ |yᵢ - ŷᵢ|
  • R²: 1 - SS_res / SS_tot
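
The classification metrics above can be computed directly from the four confusion-matrix counts. A minimal sketch with made-up counts (the function name and the numbers are illustrative):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Example: 100 samples, 60 actual positives, 50 predicted positives.
accuracy, precision, recall, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(accuracy, precision, recall, f1)
```

Here precision (0.8) is higher than recall (about 0.67): the classifier is usually right when it predicts positive, but it misses a third of the actual positives.
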
MODULE 2

The Perceptron & Single-Layer Networks (Week 6)

2.1 Biological Inspiration

The McCulloch-Pitts Neuron (1943) was the first mathematical model of a biological neuron:

output = 1 if Σᵢ wᵢxᵢ ≥ threshold
output = 0 otherwise

2.2 Rosenblatt's Perceptron (1958)

The perceptron computes:

ŷ = sign(wᵀx + b) = sign(Σᵢ wᵢxᵢ + b)

Perceptron Learning Rule:

For each misclassified example (x, y):

w ← w + α · y · x
b ← b + α · y
Perceptron Convergence Theorem: If the data is linearly separable, the perceptron will converge in a finite number of steps.
Limitation: The perceptron cannot solve XOR — it only separates linearly separable data. This led to the first "AI winter."
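
Both the learning rule and the XOR limitation can be seen in a few lines. The sketch below assumes labels in {-1, +1}, a learning rate of 1, and a cap of 100 epochs; these choices and the function names are illustrative.

```python
def train_perceptron(data, alpha=1.0, max_epochs=100):
    """Apply the perceptron rule until an epoch passes with no mistakes."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * z <= 0:                       # misclassified or on the boundary
                w = [wi + alpha * y * xi for wi, xi in zip(w, x)]
                b += alpha * y
                mistakes += 1
        if mistakes == 0:
            return w, b, True                    # converged
    return w, b, False                           # gave up after max_epochs


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
XOR = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

w_and, b_and, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
print(and_converged, xor_converged)
```

AND is linearly separable, so training stops after a handful of epochs. XOR is not, so every epoch contains at least one mistake and the loop runs until the cap.
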
MODULE 3

Artificial Neural Networks (Weeks 7–10)

3.1 The Multi-Layer Perceptron (MLP)

A feedforward neural network with:

  • An input layer
  • One or more hidden layers
  • An output layer

Single Neuron (layer l, neuron j):

zⱼ^(l) = Σᵢ wⱼᵢ^(l) · aᵢ^(l-1) + bⱼ^(l)
aⱼ^(l) = σ(zⱼ^(l))

In Matrix Form:

z^(l) = W^(l) · a^(l-1) + b^(l)
a^(l) = σ(z^(l))

where:

  • W^(l) ∈ ℝ^(n_l × n_{l-1}) is the weight matrix
  • b^(l) ∈ ℝ^(n_l) is the bias vector
  • σ is the activation function applied element-wise

3.2 Activation Functions

Activation functions introduce non-linearity — without them, deep networks collapse to a single linear transformation.

Sigmoid

σ(z) = 1 / (1 + e^(-z))
σ'(z) = σ(z)(1 - σ(z)) ∈ (0, 0.25]
  • Output range: (0, 1)
  • Problem: Vanishing gradient — σ'(z) ≈ 0 for large |z|

Tanh

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
tanh'(z) = 1 - tanh²(z) ∈ (0, 1]
  • Output range: (-1, 1)
  • Zero-centered (better than sigmoid)
  • Still suffers from vanishing gradients

ReLU — Rectified Linear Unit

ReLU(z) = max(0, z)
ReLU'(z) = 1 if z > 0, else 0
  • Computationally efficient
  • Avoids vanishing gradient for z > 0
  • Problem: "Dying ReLU": a neuron whose pre-activation z is negative for every input receives zero gradient and stops updating

Leaky ReLU

f(z) = z if z > 0
f(z) = αz if z ≤ 0 (α ≈ 0.01)

ELU — Exponential Linear Unit

f(z) = z if z > 0
f(z) = α(e^z - 1) if z ≤ 0

Softmax (for multi-class output)

softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)

Properties: outputs sum to 1, interpretable as probabilities.
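
A direct implementation of the formula overflows for large logits, so a common trick is to subtract the maximum logit first. This leaves the result unchanged because the factor e^(-max) cancels between numerator and denominator. A minimal sketch (the input values are arbitrary examples):

```python
import math


def softmax(z):
    """Numerically stable softmax: shift by max(z) before exponentiating."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
big = softmax([1000.0, 1001.0, 1002.0])   # math.exp(1000) alone would overflow
small = softmax([0.0, 1.0, 2.0])          # same logits shifted by 1000
print(probs, big)
```

Shifting all logits by the same constant does not change the output, which is why `big` and `small` agree.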

3.3 Loss Functions

Mean Squared Error (Regression)

L_MSE = (1/n) Σᵢ (yᵢ - ŷᵢ)²

Derived from the negative log-likelihood of a Gaussian — minimizing MSE = maximizing likelihood under Gaussian noise.

Binary Cross-Entropy (Binary Classification)

L_BCE = -(1/n) Σᵢ [yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]

Derived from the Bernoulli likelihood.

Categorical Cross-Entropy (Multi-class)

L_CE = -(1/n) Σᵢ Σⱼ yᵢⱼ log(ŷᵢⱼ)

where yᵢⱼ ∈ {0,1} is one-hot encoded ground truth.

3.4 Forward Propagation

Full forward pass for an L-layer network:

a^(0) = x                          (input)
for l = 1, 2, ..., L:
    z^(l) = W^(l) a^(l-1) + b^(l)
    a^(l) = σ(z^(l))
ŷ = a^(L)                          (output)

3.5 Backpropagation

The most important algorithm in deep learning. It efficiently computes gradients using the chain rule.

Goal: Compute ∂L/∂W^(l) and ∂L/∂b^(l) for all layers l.

Step 1 — Output layer error

δ^(L) = ∇_a L ⊙ σ'(z^(L))

where ⊙ is element-wise multiplication.

Step 2 — Backpropagate the error

δ^(l) = (W^(l+1))ᵀ δ^(l+1) ⊙ σ'(z^(l))

Step 3 — Compute gradients

∂L/∂W^(l) = δ^(l) (a^(l-1))ᵀ
∂L/∂b^(l) = δ^(l)
Intuition: δ^(l) represents how much each neuron in layer l contributed to the final error. We propagate this "blame" backwards through the network.
Computational Complexity: O(n_params), the same order of cost as a single forward pass.
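
The three steps can be traced on a tiny network and checked against a finite-difference estimate. The sketch below assumes a 2-2-1 network with sigmoid activations and the loss L = ½ Σ (a - y)², so that ∇_a L = a - y. The weights, input, and target are arbitrary example values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(W1, b1, W2, b2, x):
    """Forward pass of a 2-layer sigmoid network; returns both activations."""
    a1 = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    a2 = [sigmoid(sum(w * ai for w, ai in zip(row, a1)) + b)
          for row, b in zip(W2, b2)]
    return a1, a2


def loss(a2, y):
    return 0.5 * sum((a - t) ** 2 for a, t in zip(a2, y))


def backward(W2, x, a1, a2, y):
    """Steps 1-3, using sigma'(z) = a * (1 - a) for the sigmoid."""
    # Step 1: output-layer error
    d2 = [(a - t) * a * (1 - a) for a, t in zip(a2, y)]
    # Step 2: propagate the error back through the transpose of W2
    d1 = [sum(W2[k][j] * d2[k] for k in range(len(d2))) * a1[j] * (1 - a1[j])
          for j in range(len(a1))]
    # Step 3: weight gradients are outer products; bias gradients are the deltas
    grad_W2 = [[dk * aj for aj in a1] for dk in d2]
    grad_W1 = [[dj * xi for xi in x] for dj in d1]
    return grad_W1, d1, grad_W2, d2


W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]
W2, b2 = [[0.5, -0.6]], [0.2]
x, y = [1.0, 2.0], [1.0]

a1, a2 = forward(W1, b1, W2, b2, x)
grad_W1, grad_b1, grad_W2, grad_b2 = backward(W2, x, a1, a2, y)


def loss_with_w1_01(value):
    """Loss as a function of the single weight W1[0][1], used for checking."""
    W1_mod = [row[:] for row in W1]
    W1_mod[0][1] = value
    return loss(forward(W1_mod, b1, W2, b2, x)[1], y)


h = 1e-6
numeric = (loss_with_w1_01(W1[0][1] + h) - loss_with_w1_01(W1[0][1] - h)) / (2 * h)
print(grad_W1[0][1], numeric)
```

The finite-difference check needs two extra forward passes per weight, whereas backpropagation produces every gradient in a single backward sweep.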

3.6 Gradient Descent Variants

Batch Gradient Descent

θ ← θ - α · (1/n) Σᵢ ∇_θ L(xᵢ, yᵢ)

Uses entire dataset per update. Stable but slow.

Stochastic Gradient Descent (SGD)

θ ← θ - α · ∇_θ L(xᵢ, yᵢ) (one sample at a time)

Noisy but fast, and the noise can help the iterate escape poor local minima.

Mini-Batch Gradient Descent (Standard in practice)

θ ← θ - α · (1/B) Σ_{i ∈ batch} ∇_θ L(xᵢ, yᵢ)

where B is the batch size (typically 32, 64, 128, 256).

Momentum

Accumulates a velocity vector in directions of persistent gradient:

v ← β·v - α·∇_θ L
θ ← θ + v
Adam Optimizer

Combines momentum and adaptive learning rates:

m_t = β₁·m_{t-1} + (1-β₁)·g_t (1st moment)
v_t = β₂·v_{t-1} + (1-β₂)·g_t² (2nd moment)

m̂_t = m_t / (1 - β₁ᵗ) (bias correction)
v̂_t = v_t / (1 - β₂ᵗ) (bias correction)

θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε)

Default: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸
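
A minimal scalar sketch of the update equations, applied to the example loss L(θ) = (θ - 3)². The learning rate α = 0.1 and the 500 steps are assumptions chosen so the toy problem converges quickly. They are not recommended settings.

```python
import math


def adam_minimize(grad_fn, theta, steps, alpha=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar Adam following the update equations above."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g           # 1st moment estimate
        v = beta2 * v + (1 - beta2) * g * g       # 2nd moment estimate
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def grad_loss(theta):
    """Gradient of the example loss L(theta) = (theta - 3)^2."""
    return 2 * (theta - 3.0)


first_step = adam_minimize(grad_loss, theta=0.0, steps=1)
theta_adam = adam_minimize(grad_loss, theta=0.0, steps=500)
print(first_step, theta_adam)
```

After bias correction the very first step has magnitude close to α regardless of the gradient's scale, because m̂₁ / √v̂₁ = g / |g|.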

MODULE 4

Training Deep Networks (Weeks 11–13)

4.1 The Vanishing and Exploding Gradient Problem

During backprop, gradients are products of weight matrices and activation derivatives:

∂L/∂W^(1) ∝ (W^(L))ᵀ · σ'(z^(L)) · (W^(L-1))ᵀ · ... · σ'(z^(1))

Vanishing Gradients

If each term is < 1 → gradients shrink exponentially

Exploding Gradients

If each term is > 1 → gradients grow exponentially

Solutions:

  • Better activation functions (ReLU)
  • Better weight initialization
  • Batch Normalization
  • Residual connections (ResNet)
  • Gradient clipping (for RNNs)

4.2 Weight Initialization

Why it matters: Bad initialization → dead neurons or exploding activations.

Zero Initialization

All weights = 0 → all neurons compute the same function → symmetry breaking fails. Never do this.

Random Initialization

Too large → exploding. Too small → vanishing.

Xavier/Glorot Initialization (for tanh/sigmoid)

W ~ Uniform[-√(6/(n_in + n_out)), √(6/(n_in + n_out))]

or equivalently: Var(W) = 2 / (n_in + n_out)

He Initialization (for ReLU)

W ~ N(0, 2/n_in)

Accounts for the fact that ReLU zeros out half the inputs.

4.3 Batch Normalization

Normalizes layer inputs to have zero mean and unit variance, then re-scales with learnable parameters:

μ_B = (1/m) Σᵢ xᵢ                        (batch mean)
σ²_B = (1/m) Σᵢ (xᵢ - μ_B)²              (batch variance)
x̂ᵢ = (xᵢ - μ_B) / √(σ²_B + ε)            (normalize)
yᵢ = γ · x̂ᵢ + β                          (scale & shift)
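
The four equations translate line by line into code. The sketch below normalizes a single feature across a batch of four example values; ε = 1e-5 is an assumed default, and the running statistics used at inference time are left out.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-time batch normalization of one feature across a batch."""
    m = len(xs)
    mu = sum(xs) / m                                         # batch mean
    var = sum((x - mu) ** 2 for x in xs) / m                 # batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]    # normalize
    return [gamma * xh + beta for xh in x_hat]               # scale & shift


batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)
shifted = batch_norm(batch, gamma=2.0, beta=3.0)

mean_out = sum(normalized) / len(normalized)
var_out = sum((v - mean_out) ** 2 for v in normalized) / len(normalized)
print(mean_out, var_out)
```

With γ = 1 and β = 0 the output has mean 0 and variance just under 1 (the ε in the denominator keeps it slightly below). Other values of γ and β let the network restore whatever scale and shift it needs.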

Benefits:

  • Reduces internal covariate shift
  • Allows higher learning rates
  • Acts as a regularizer (reduces need for Dropout)
  • Reduces sensitivity to initialization

4.4 Regularization Techniques

The Problem: Neural networks can memorize training data (overfit).

L2 Regularization (Weight Decay)

Add penalty to loss:

L_total = L_data + (λ/2) Σ_w w²

Gradient update becomes:

w ← w(1 - αλ) - α · ∂L_data/∂w

The factor (1 - αλ) "decays" weights toward zero.

L1 Regularization

Add penalty to loss:

L_total = L_data + λ Σ_w |w|

Promotes sparsity — many weights become exactly zero.

Dropout

During training, randomly set each neuron to zero with probability p:

aᵢ^(l) = aᵢ^(l) · Bernoulli(1-p) / (1-p)

The division by (1-p) ensures expected values remain unchanged. At test time, use all neurons.
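
The effect of the 1/(1-p) rescaling can be checked empirically. This is a minimal sketch of "inverted" dropout on a vector of ones, with a fixed random seed so the run is reproducible; the helper name and sample size are illustrative.

```python
import random


def dropout(activations, p, rng):
    """Zero each activation with probability p; rescale survivors by 1/(1-p)."""
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
acts = [1.0] * 100_000
dropped = dropout(acts, p=0.5, rng=rng)

mean_after = sum(dropped) / len(dropped)
print(mean_after)
```

Each surviving activation becomes 2.0 and the rest become 0.0, so the average stays close to the original value of 1.0. This is why no rescaling is needed at test time.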

Intuition: Forces the network to learn redundant representations; prevents co-adaptation of neurons.

Early Stopping

Monitor validation loss; stop training when it starts increasing. Effectively limits model capacity.

4.5 The Universal Approximation Theorem

Cybenko, 1989; Hornik, 1991

Theorem

A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of ℝⁿ, under mild assumptions on the activation function.

Formal Statement:

For any continuous f: [0,1]ⁿ → ℝ and any ε > 0, there exists a single-hidden-layer network F with N neurons such that:

|F(x) - f(x)| < ε for all x ∈ [0,1]ⁿ

Important Caveat

This theorem guarantees existence, not that gradient descent will find such a network, nor that it requires a practical number of neurons.

MODULE 5

Convolutional Neural Networks (Weeks 14–18)

5.1 Motivation: Why Not Just Use MLPs for Images?

A 224×224 RGB image has 224 × 224 × 3 = 150,528 input dimensions. A first hidden layer with just 1,000 neurons needs 150 million parameters — just in layer 1. This is:

  • Computationally infeasible to train efficiently.
  • Data-hungry: needs billions of examples to avoid overfitting.
  • Ignores spatial structure: treats pixel (0,0) and pixel (200,200) as equally related.

CNNs exploit three key properties of natural images:

  • 1. Locality — nearby pixels are correlated
  • 2. Translation invariance — a cat is a cat whether it's top-left or bottom-right
  • 3. Compositionality — complex features are built from simple ones

5.2 The Convolution Operation

1D Convolution (for intuition)

(f * g)[n] = Σₖ f[k] · g[n - k]

2D Discrete Convolution (for images)

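For input I and filter (kernel) K of size (2m+1) × (2n+1):

(I * K)[i, j] = Σₛ Σₜ I[i+s, j+t] · K[s, t], where s runs over -m..m and t over -n..n

Strictly speaking this is cross-correlation: unlike the 1D formula above, the kernel is not flipped. Deep learning libraries call it convolution anyway, and because the kernel is learned, the distinction does not change what the layer can represent.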

Worked Example:

INPUT (5×5)
-1   2   3   4   5
-2   4   3  -1   1
 2   3  -8   5   6
 1  -2   3   4   5
 5   4   0   2   1

KERNEL (3×3)
 1   0  -1
 1   0  -1
 1   0  -1

OUTPUT (3×3), filled in as the kernel slides over the input
 1   ?   ?
 ?   ?   ?
 ?   ?   ?

Output[0,0] = (-1)·1 + 2·0 + 3·(-1) + (-2)·1 + 4·0 + 3·(-1) + 2·1 + 3·0 + (-8)·(-1) = 1
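
The remaining output entries follow from sliding the same 3×3 window across the input. Below is a minimal sketch of that sliding-window sum (no padding, stride 1, kernel not flipped), applied to the 5×5 input and 3×3 kernel of the worked example. The function name is illustrative.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding, no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + s][j + t] * kernel[s][t]
                 for s in range(kh) for t in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


image = [[-1,  2,  3,  4, 5],
         [-2,  4,  3, -1, 1],
         [ 2,  3, -8,  5, 6],
         [ 1, -2,  3,  4, 5],
         [ 5,  4,  0,  2, 1]]

kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

output = conv2d_valid(image, kernel)
for row in output:
    print(row)
```

This kernel subtracts the right column of each window from the left column, so it responds to horizontal changes in intensity (vertical edges). The full result is [[1, 1, -14], [3, -3, -14], [13, -6, -17]].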

Output size formula:

output_size = ⌊(input_size - kernel_size + 2·padding) / stride⌋ + 1

Parameters in a conv layer:

params = (kernel_h × kernel_w × in_channels + 1) × num_filters

vs. dense layer: input_size × output_size — orders of magnitude fewer parameters!
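
Both formulas are easy to wrap in helper functions for checking layer shapes by hand. A minimal sketch (function names and example sizes are illustrative; integer division implements the floor when the stride does not divide evenly):

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


def conv_params(kernel_h, kernel_w, in_channels, num_filters):
    """Weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * num_filters


size_valid = conv_output_size(5, 3)                         # 5×5 input, 3×3 kernel
size_same = conv_output_size(32, 3, padding=1)              # padding keeps the size
size_strided = conv_output_size(32, 3, padding=1, stride=2)
n_params = conv_params(3, 3, in_channels=3, num_filters=64)
print(size_valid, size_same, size_strided, n_params)
```

The parameter count does not depend on the image size at all, which is where the savings over a dense layer come from.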

5.3 Key CNN Components

Stride

Moving the kernel by stride S instead of 1 pixel at a time:

  • Reduces output spatial dimensions
  • Stride 2 ≈ halves the spatial size

Padding

Adding zeros around the input border:

  • Valid padding: No padding, output shrinks
  • Same padding: Pad so output size = input size (when stride=1)
Required same-padding: p = (k-1)/2 for an odd kernel size k

Pooling

Reduces spatial dimensions while retaining important information.

Max Pooling (most common):
output[i,j] = max{ input[i·s+r, j·s+c] : 0 ≤ r,c < k }
Average Pooling:
output[i,j] = (1/k²) Σᵣ Σ_c input[i·s+r, j·s+c]
Properties of pooling:
  • No learnable parameters
  • Makes representation approximately translation-invariant
  • Reduces computation and memory

Feature Maps

Each convolutional filter produces one feature map (also called activation map). With C filters, we get C feature maps — each detecting a different feature (edges, curves, textures, etc.).

5.4 Receptive Field

The receptive field of a neuron is the region of the input image it can "see." After stacking multiple conv layers:

RF_l = RF_{l-1} + (k_l - 1) · Π_{i=1}^{l-1} s_i

where k_l is kernel size and sᵢ is stride at layer i.

Insight: A stack of small 3×3 kernels reaches the same receptive field as a single large kernel, with fewer parameters and more non-linearities.
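
The recurrence can be evaluated layer by layer by tracking the running product of strides. A minimal sketch, assuming RF₀ = 1 (a single input pixel) and describing each layer as a (kernel size, stride) pair; the function name is illustrative.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered input to output."""
    rf, jump = 1, 1            # jump = product of the strides of earlier layers
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf


rf_two = receptive_field([(3, 1), (3, 1)])            # two stacked 3×3 convs
rf_three = receptive_field([(3, 1), (3, 1), (3, 1)])  # three stacked 3×3 convs
rf_strided = receptive_field([(3, 2), (3, 1)])        # stride 2 in the first layer
print(rf_two, rf_three, rf_strided)
```

Two stacked 3×3 layers see a 5×5 region and three see 7×7. Adding a stride of 2 early on makes every later layer grow the receptive field twice as fast.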

5.5 Full CNN Architecture

A typical CNN pipeline:

Input Image
[Conv → BatchNorm → ReLU → Pool] × N      (Feature Extraction)
Flatten
[Fully Connected → ReLU] × M              (Classification Head)
[Fully Connected → Softmax]               (Output)

5.6 Backpropagation Through Convolutions

Gradient w.r.t. kernel weights:

∂L/∂K[s,t] = Σᵢ Σⱼ δ[i,j] · I[i+s, j+t]

Gradient w.r.t. input (for propagating to earlier layers):

∂L/∂I[i,j] = Σₛ Σₜ δ[i-s, j-t] · K[s,t]

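With the forward pass defined without a kernel flip, this is a full convolution of δ with K, or equivalently a full cross-correlation of δ with the 180°-rotated kernel.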

MODULE 6

Landmark CNN Architectures (Weeks 19–22)

6.1 LeNet-5 (1998) — The Pioneer

Conv → Pool → Conv → Pool → FC → FC → Output

Key Contributions:

  • First practical deep CNN for digit recognition (MNIST)
  • Demonstrated CNNs for document recognition at scale
  • Used sigmoid/tanh activations
Input: 32×32 grayscale
Output: 10 classes

6.2 AlexNet (2012) — The Deep Learning Revolution

Architecture: 5 Conv layers + 3 FC layers, ~60M parameters

Key Contributions:

  • Won ImageNet 2012 with 15.3% top-5 error (vs. 26.2% runner-up)
  • First large-scale use of ReLU (dramatically faster training)
  • Applied Dropout in the FC layers
  • Used GPU training (two GTX 580s)
  • Local Response Normalization (LRN)
  • Data augmentation (random crops, horizontal flips)
Architecture in detail:
Input: 227×227×3 (the paper states 224×224×3, but 227 is the size that yields 55×55 after Conv1)
Conv1: 96 filters, 11×11, stride 4 → 55×55×96
MaxPool: 3×3, stride 2 → 27×27×96
Conv2: 256 filters, 5×5, pad 2 → 27×27×256
MaxPool: 3×3, stride 2 → 13×13×256
Conv3: 384 filters, 3×3, pad 1 → 13×13×384
Conv4: 384 filters, 3×3, pad 1 → 13×13×384
Conv5: 256 filters, 3×3, pad 1 → 13×13×256
MaxPool: 3×3, stride 2 → 6×6×256
Flatten → 9216
FC6: 4096, Dropout
FC7: 4096, Dropout
FC8: 1000 (softmax)

6.3 VGGNet (2014) — The Simplicity Principle

Architecture: Very deep networks using only 3×3 convolutions (VGG-16: 16 layers, VGG-19: 19 layers)

Key insight:

Multiple stacked 3×3 conv layers have the same receptive field as larger kernels but with fewer parameters and more non-linearities:

Stacked layers        Equivalent receptive field
Two 3×3 layers        one 5×5 layer
Three 3×3 layers      one 7×7 layer
Parameter comparison:
  • 1× 7×7 conv: 49 × C² parameters
  • 3× 3×3 convs: 27 × C² parameters (+ 3 non-linearities!)

6.4 GoogLeNet/Inception (2014) — Going Wider

Key innovation: The Inception Module

Instead of choosing between 1×1, 3×3, or 5×5 convolutions, do all of them in parallel and concatenate:

Input
  Branch 1: 1×1
  Branch 2: 1×1 → 3×3
  Branch 3: 1×1 → 5×5
  Branch 4: MaxPool → 1×1
Concatenate along channel dim

1×1 convolutions act as a bottleneck to reduce channels before expensive 3×3/5×5 convolutions.

Other innovations:

  • Auxiliary classifiers at intermediate layers (helps gradient flow)
  • No FC layers at the end (uses global average pooling)
  • 4M parameters (12× fewer than AlexNet!)

6.5 ResNet (2015) — The Residual Revolution

The problem: Very deep networks (>20 layers) trained worse than shallower ones. Their training error was higher too, so the cause is optimization difficulty rather than overfitting.

The solution: Residual (Skip) Connections

Instead of learning H(x), learn the residual F(x) = H(x) - x:

Output = F(x, {Wᵢ}) + x

where F is 2 or 3 conv layers.

Why it works:
  • Gradient highway — gradients flow directly through skip connections
  • Identity mapping is trivially learnable (set F = 0)
  • Allows networks of 50, 101, 152+ layers
Mathematical insight:

The gradient of the loss w.r.t. the input becomes:

∂L/∂x = ∂L/∂y · (1 + ∂F/∂x)

The "1" gives the gradient a direct path around the residual branch: ∂L/∂y reaches x unchanged even when ∂F/∂x is close to zero, which counteracts vanishing gradients.

ResNet-50 Architecture:
Conv1: 7×7, 64 filters, stride 2
MaxPool: 3×3, stride 2
Stage 1: 3× [1×1-64, 3×3-64, 1×1-256] bottleneck blocks
Stage 2: 4× [1×1-128, 3×3-128, 1×1-512] bottleneck blocks
Stage 3: 6× [1×1-256, 3×3-256, 1×1-1024] bottleneck blocks
Stage 4: 3× [1×1-512, 3×3-512, 1×1-2048] bottleneck blocks
GlobalAvgPool → FC 1000 → Softmax

6.6 DenseNet (2016) — Dense Connectivity

DenseNet connects each layer to every subsequent layer:

x_l = H_l([x_0, x_1, ..., x_{l-1}])

where [·] denotes concatenation of all previous feature maps.

Benefits:

  • Maximizes gradient flow
  • Feature reuse
  • Substantially fewer parameters than ResNet
MODULE 7

Transfer Learning & Practical Applications (Weeks 23–25)

7.1 Transfer Learning

Idea: A model trained on a large dataset (e.g., ImageNet with 1.2M images, 1000 classes) learns general visual features that transfer to new tasks.

Why it works: Early CNN layers detect universal low-level features (edges, textures). Later layers are more task-specific.

1. Feature Extraction

Freeze all conv layers, train only the new head.

2. Fine-tuning

Unfreeze some/all layers and train with a small learning rate.

3. Full retraining

Use pretrained weights as initialization, train everything.

Rule of thumb:
                     Small dataset          Large dataset
Similar domain:      Feature extraction     Fine-tune last layers
Different domain:    Careful fine-tuning    Possibly full retraining

7.2 Data Augmentation

Artificially expand training data with label-preserving transformations:

  • Geometric: Random crop, flip, rotation, scaling, shearing
  • Photometric: Brightness, contrast, hue, saturation changes
  • Noise: Gaussian noise, cutout, random erasing
  • Advanced: Mixup (α-blend two images + labels), CutMix

Mixup Formula:

x̃ = λ·xᵢ + (1-λ)·xⱼ
ỹ = λ·yᵢ + (1-λ)·yⱼ

where λ ~ Beta(α, α)
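
A minimal sketch of the Mixup formula for one pair of examples, using flat lists for the inputs and one-hot lists for the labels. The values of α, the seed, and the toy data are assumptions for illustration.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha, rng):
    """Blend two examples and their labels with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


rng = random.Random(0)
x_cat, y_cat = [0.2, 0.8, 0.5], [1.0, 0.0]    # toy "image" with one-hot label
x_dog, y_dog = [0.9, 0.1, 0.4], [0.0, 1.0]

x_mix, y_mix, lam = mixup(x_cat, y_cat, x_dog, y_dog, alpha=0.4, rng=rng)
print(lam, x_mix, y_mix)
```

The mixed label is a soft target that still sums to 1, so it can be used with the categorical cross-entropy loss unchanged.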

MODULE 8

Modern and Efficient Architectures (Weeks 26–28)

8.1 Depthwise Separable Convolutions (MobileNet)

Standard conv: C_in × k × k → C_out (one step)

Depthwise separable conv (Two steps):

1. Depthwise

Apply one k×k filter per input channel (C_in filters total)

2. Pointwise

Apply C_out 1×1 filters

Parameter reduction factor:
Standard:                      C_in · k² · C_out
Depthwise separable:           C_in · k² + C_in · C_out
Ratio (separable / standard):  1/C_out + 1/k², about 8-9× fewer parameters with k=3
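
The ratio can be confirmed with concrete numbers. The sketch below counts weights only (biases ignored, as in the formulas above) for an assumed layer with 128 input channels, 256 output channels, and a 3×3 kernel.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k×k convolution."""
    return c_in * k * k * c_out


def separable_conv_params(c_in, c_out, k):
    """Depthwise (one k×k filter per input channel) plus pointwise (1×1) weights."""
    return c_in * k * k + c_in * c_out


c_in, c_out, k = 128, 256, 3
standard = standard_conv_params(c_in, c_out, k)
separable = separable_conv_params(c_in, c_out, k)
ratio = separable / standard
print(standard, separable, round(1 / ratio, 2))
```

For these sizes the separable version uses 33,920 weights instead of 294,912, a reduction of about 8.7×.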

8.2 Neural Architecture Search (NAS) — EfficientNet

EfficientNet proposes a principled way to scale CNNs on three dimensions simultaneously:

  • 1. Depth: Number of layers (d)
  • 2. Width: Number of channels (w)
  • 3. Resolution: Input image size (r)

With a compound coefficient φ:

d = α^φ  | w = β^φ  | r = γ^φ
subject to: α · β² · γ² ≈ 2

α ≥ 1, β ≥ 1, γ ≥ 1
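
To make the constraint concrete, the sketch below plugs in α = 1.2, β = 1.1, γ = 1.15. These are the constants reported in the EfficientNet paper and are treated here as an assumption, since the text above does not list them. The product α · β² · γ² then approximates how much the compute cost grows per unit increase in φ.

```python
# Assumed constants (values reported in the EfficientNet paper).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Depth, width and resolution multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2      # should be close to 2
d1, w1, r1 = compound_scale(1)
d3, w3, r3 = compound_scale(3)
print(round(flops_factor, 3), (d3, w3, r3))
```

Width and resolution enter squared because doubling either one roughly quadruples the work in a convolution, while doubling depth only doubles it.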

MODULE 9

Beyond CNNs: Vision Transformers (Weeks 29–30)

9.1 Self-Attention (The Mechanism Powering Transformers)

For input sequence X, compute:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

where Q = XW_Q, K = XW_K, V = XW_V are linear projections.

Multi-Head Attention:

MultiHead(Q, K, V) = Concat(head₁, ..., head_h) · W_O
where headᵢ = Attention(QWᵢ_Q, KWᵢ_K, VWᵢ_V)
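
A minimal single-head sketch of the attention formula in plain Python. Q, K and V are passed in directly as lists of row vectors (the projections W_Q, W_K, W_V are omitted), and the example values are arbitrary.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q Kᵀ / sqrt(d_k)) · V, computed one query row at a time."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)               # one weight per key, summing to 1
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)

# If all keys are identical, every value gets equal weight.
uniform = attention([[5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [2.0, 4.0]])
print(out, uniform)
```

Each output row is a weighted average of the rows of V, with more weight on the values whose keys best match the query.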

9.2 Vision Transformer (ViT)

Idea: Treat image patches as "tokens" and apply transformer attention

  • 1. Split image into N patches of size 16×16
  • 2. Embed each patch linearly
  • 3. Add positional embeddings
  • 4. Process with transformer encoder
  • 5. Use [CLS] token for classification

[Image] → [Patches] → [Linear Projection] + [Positional Emb] → [Transformer] → [MLP Head]

Complete Curated Reading List

📚 Textbooks

Book                                  Authors
Deep Learning                         Goodfellow, Bengio, Courville
Neural Networks and Deep Learning     Nielsen
Pattern Recognition and ML            Bishop (free PDF available online)
Mathematics for Machine Learning      Deisenroth et al.
