Machine Learning Syllabus
From Basic ANN to CNN
A Research-Backed, Math-Heavy Learning Path for Beginners.
How to Use This Syllabus
Each module contains:
- Concept explanation: intuitive understanding
- Mathematics: formal notation and derivations
- Key Research Papers: with links and what to read in them
- Suggested time: approximate weeks per module
Foundation Mathematics (Weeks 1–3)
"You cannot understand deep learning without understanding linear algebra, calculus, and statistics. These are not optional extras — they are the language in which neural networks are written."
0.1 Linear Algebra
Why it matters: Every neural network operation is a matrix multiplication under the hood.
Vectors and Matrices
A vector x ∈ ℝⁿ is a column of numbers. A matrix W ∈ ℝ^(m×n) transforms vectors:
y = Wx + b
where W is a weight matrix, x is the input, and b is a bias vector.
Dot Product
x · y = xᵀy = Σᵢ xᵢ yᵢ
Matrix Multiplication
If A ∈ ℝ^(m×k) and B ∈ ℝ^(k×n), then C = AB ∈ ℝ^(m×n) where:
Cᵢⱼ = Σ_l Aᵢₗ Bₗⱼ (sum over l = 1, ..., k)
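The matrix product and the affine map a layer computes (weight matrix times input, plus bias) can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the helper names `matvec` and `matmul` and all numeric values are made up for the example.

```python
def matvec(W, x):
    """Multiply an m×n matrix (a list of rows) by a length-n vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    """Return C = AB, where C[i][j] is the sum over l of A[i][l] * B[l][j]."""
    k = len(B)
    assert all(len(row) == k for row in A), "inner dimensions must match"
    return [[sum(A[i][l] * B[l][j] for l in range(k)) for j in range(len(B[0]))]
            for i in range(len(A))]

W = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 3×2 weight matrix (illustrative values)
x = [2.0, 1.0]                              # input vector in R^2
b = [0.5, 0.5, 0.5]                         # bias vector in R^3
y = [wx + b_i for wx, b_i in zip(matvec(W, x), b)]  # affine map: W x + b

C = matmul([[1, 2], [3, 4]], [[0, 1], [1, 0]])  # right-multiplying swaps the columns
```

In practice a library such as NumPy would do this with a single operator; the loops are spelled out here only to mirror the index formula.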
Eigenvalues and Eigenvectors
For a matrix A: Av = λv, where λ is the eigenvalue and v is the eigenvector.
Used in: Principal Component Analysis (PCA), understanding optimization landscapes.
Norms
- L1 norm: ||x||₁ = Σᵢ |xᵢ|
- L2 norm: ||x||₂ = √(Σᵢ xᵢ²)
- Frobenius norm (for matrices): ||A||_F = √(Σᵢⱼ Aᵢⱼ²)
0.2 Calculus & Multivariable Differentiation
Why it matters: Training neural networks = minimizing a loss function using derivatives.
Derivatives
f'(x) = lim_{h→0} [f(x + h) - f(x)] / h
Partial Derivatives
For f(x, y): ∂f/∂x treats y as a constant and differentiates w.r.t. x.
The Chain Rule
The most important rule in deep learning
If z = f(g(x)), then:
dz/dx = f'(g(x)) · g'(x)
For multiple variables, if z = f(y₁, y₂, ..., yₙ) and each yᵢ = gᵢ(x):
dz/dx = Σᵢ (∂z/∂yᵢ) · (dyᵢ/dx)
Gradient
The vector of partial derivatives, which points in the direction of steepest ascent:
∇f(x) = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]ᵀ
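A quick way to build trust in the chain rule is to compare a hand-derived gradient with a finite-difference estimate. The sketch below does that for an arbitrary composite function chosen for illustration; `numerical_gradient`, `f`, and `analytic_gradient` are hypothetical names, not part of any library.

```python
def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of the gradient of f at the point x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

def f(v):
    """Composite function: outer u**2 applied to inner u = x*y + 1."""
    x, y = v
    return (x * y + 1.0) ** 2

def analytic_gradient(v):
    """Chain rule: df/dx = 2u * du/dx = 2u*y, and df/dy = 2u*x."""
    x, y = v
    u = x * y + 1.0
    return [2 * u * y, 2 * u * x]

point = [1.5, -2.0]
approx = numerical_gradient(f, point)
exact = analytic_gradient(point)  # u = -2, so the gradient is [8.0, -6.0]
```

The same comparison, usually called a gradient check, is a common way to debug hand-written backpropagation code.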
Jacobian Matrix
For a vector-valued function f: ℝⁿ → ℝᵐ:
Jᵢⱼ = ∂fᵢ/∂xⱼ, with J ∈ ℝ^(m×n)
Hessian Matrix
Second-order curvature matrix of a scalar function f: ℝⁿ → ℝ:
Hᵢⱼ = ∂²f / (∂xᵢ ∂xⱼ)
0.3 Probability & Statistics
Why it matters: Neural networks model probability distributions; loss functions derive from likelihood.
Probability Rules
- P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
- P(A | B) = P(A ∩ B) / P(B)
Expectation and Variance
- E[X] = Σₓ x · P(x)
- Var(X) = E[(X - E[X])²] = E[X²] - (E[X])²
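Bayes' Theorem
P(θ | D) = P(D | θ) · P(θ) / P(D)
- Posterior P(θ | D): what we believe after seeing data
- Likelihood P(D | θ): how probable the data is given parameters
- Prior P(θ): what we believed before seeing data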
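Gaussian Distribution
p(x) = (1 / √(2πσ²)) · exp(-(x - μ)² / (2σ²))
KL Divergence
D_KL(P ‖ Q) = Σₓ P(x) · log(P(x) / Q(x))
Measures how different two distributions are. It is not symmetric: D_KL(P ‖ Q) ≠ D_KL(Q ‖ P) in general.
Information & Entropy
H(P) = -Σₓ P(x) · log P(x)
Cross-Entropy
H(P, Q) = -Σₓ P(x) · log Q(x) = H(P) + D_KL(P ‖ Q)
The foundation of classification loss.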
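The three quantities are closely linked: cross-entropy equals entropy plus KL divergence. The following sketch checks that identity numerically for two made-up discrete distributions `p` and `q`; the function names are illustrative and natural logarithms are assumed.

```python
import math

def entropy(p):
    """H(P) = -sum p log p, skipping zero-probability outcomes."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum p log q."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum p log(p / q)."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.7, 0.2, 0.1]  # "true" distribution (illustrative)
q = [0.5, 0.3, 0.2]  # model distribution (illustrative)

gap = cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))  # should be ~0
```

Because H(P) does not depend on the model, minimizing cross-entropy with respect to Q is the same as minimizing the KL divergence from P to Q.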
0.4 Optimization Theory
Gradient Descent (Vanilla)
θ ← θ - α · ∇_θ L(θ)
where α is the learning rate and L is the loss.
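As a minimal sketch of the update rule, the loop below minimizes the toy loss L(θ) = (θ - 3)², whose gradient is 2(θ - 3). The loss, the starting point, and the function name `gradient_descent` are assumptions made for illustration.

```python
def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    """Repeat theta <- theta - alpha * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

# Toy loss L(theta) = (theta - 3)^2 has its minimum at theta = 3.
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```

With α = 0.1 the distance to the minimum shrinks by a factor of 0.8 per step; a learning rate above 1.0 would make this particular iteration diverge.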
Convexity
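A function f is convex if, for all x, y and all λ ∈ [0, 1]:
f(λx + (1-λ)y) ≤ λ·f(x) + (1-λ)·f(y)
In a convex problem every local minimum is a global minimum. Neural network losses are non-convex, so in practice we settle for good local minima.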
Machine Learning Fundamentals (Weeks 4–5)
1.1 The Machine Learning Paradigm
Types of Learning
| Type | Setup | Example |
|---|---|---|
| Supervised | Labeled data (X, Y) | Image classification |
| Unsupervised | Unlabeled data X only | Clustering, generation |
| Reinforcement | Agent + reward signal | Game playing |
The Statistical Learning Framework
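We want to find a function f: X → Y that minimizes expected risk, where ℓ is a loss function and P is the data distribution:
R(f) = E_{(x,y)~P} [ℓ(f(x), y)]
Since we can't compute this over all possible data, we minimize empirical risk over a training set of n examples:
R̂(f) = (1/n) Σᵢ ℓ(f(xᵢ), yᵢ)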
1.2 The Bias-Variance Tradeoff
Under squared-error loss, the expected test error of an estimator f̂ decomposes as:
E[(y - f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²_noise
- High bias = underfitting (model too simple)
- High variance = overfitting (model memorizes training data)
- σ²_noise = irreducible error
1.3 Key Evaluation Metrics
- Accuracy: (TP + TN) / Total
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1 Score: 2 · (Precision · Recall) / (Precision + Recall)
- MSE: (1/n) Σ (yᵢ - ŷᵢ)²
- MAE: (1/n) Σ |yᵢ - ŷᵢ|
- R²: 1 - SS_res / SS_tot
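The classification metrics above follow directly from the four confusion-matrix counts. A small sketch, with made-up counts and a hypothetical helper name:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

Here accuracy is 0.7 while recall is only about 0.67, which shows why a single metric rarely tells the whole story. The sketch does not guard against zero denominators.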
Read: Section 1 (Introduction) and Section 2 (Supervised learning)
The Perceptron & Single-Layer Networks (Week 6)
2.1 Biological Inspiration
The McCulloch-Pitts Neuron (1943) was the first mathematical model of a biological neuron:
2.2 Rosenblatt's Perceptron (1958)
The perceptron computes:
ŷ = sign(wᵀx + b)
Perceptron Learning Rule:
For each misclassified example (x, y), with labels y ∈ {-1, +1} and learning rate η:
w ← w + η · y · x
b ← b + η · y
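The learning rule can be sketched as a short training loop. This example assumes labels in {-1, +1}, a learning rate `eta` of 1, and the logical AND function as a tiny linearly separable dataset; all of these are illustrative choices.

```python
def predict(w, b, x):
    """Return +1 if w.x + b is positive, otherwise -1."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if score > 0 else -1

def train_perceptron(data, eta=1.0, max_epochs=100):
    """Apply the perceptron update to every misclassified example until none remain."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            score = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or exactly on the boundary)
                w = [w_i + eta * y * x_i for w_i, x_i in zip(w, x)]
                b += eta * y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: converged
            break
    return w, b

# Logical AND with labels in {-1, +1}: linearly separable, so training converges.
and_data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w, b = train_perceptron(and_data)
predictions = [predict(w, b, x) for x, _ in and_data]
```

On data that is not linearly separable (XOR is the classic case) the loop would simply run until `max_epochs` without reaching zero mistakes.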
Artificial Neural Networks (Weeks 7–10)
3.1 The Multi-Layer Perceptron (MLP)
A feedforward neural network with:
- An input layer
- One or more hidden layers
- An output layer
Single Neuron (layer l, neuron j):
zⱼ^(l) = Σᵢ wⱼᵢ^(l) aᵢ^(l-1) + bⱼ^(l)
aⱼ^(l) = σ(zⱼ^(l))
In Matrix Form:
z^(l) = W^(l) a^(l-1) + b^(l)
a^(l) = σ(z^(l))
where:
- W^(l) ∈ ℝ^(n_l × n_{l-1}) is the weight matrix
- b^(l) ∈ ℝ^(n_l) is the bias vector
- σ is the activation function applied element-wise
3.2 Activation Functions
Activation functions introduce non-linearity — without them, deep networks collapse to a single linear transformation.
Sigmoid
σ(z) = 1 / (1 + e^(-z))
σ'(z) = σ(z)(1 - σ(z)) ∈ (0, 0.25]
- Output range: (0, 1)
- Problem: Vanishing gradient — σ'(z) ≈ 0 for large |z|
Tanh
tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
tanh'(z) = 1 - tanh²(z) ∈ (0, 1]
- Output range: (-1, 1)
- Zero-centered (better than sigmoid)
- Still suffers from vanishing gradients
ReLU (Rectified Linear Unit)
ReLU(z) = max(0, z)
ReLU'(z) = 1 if z > 0, else 0
- Computationally efficient
- Avoids vanishing gradient for z > 0
- Problem: "Dying ReLU". A neuron whose pre-activation is negative for every input receives zero gradient and stops updating
Leaky ReLU
f(z) = z if z > 0
f(z) = αz if z ≤ 0 (α ≈ 0.01)
ELU (Exponential Linear Unit)
f(z) = z if z > 0
f(z) = α(e^z - 1) if z ≤ 0
Softmax (for multi-class output)
softmax(z)ᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)
Properties: outputs are positive, sum to 1, and are interpretable as probabilities.
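The activation functions in this section are short enough to write out directly. The sketch below uses only the standard library; the numerically stable forms of sigmoid and softmax (branching on the sign of z, and subtracting the maximum before exponentiating) are common implementation choices added here, not something the definitions require.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def softmax(zs):
    """Softmax over a list; subtracting max(zs) leaves the result unchanged."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
big = softmax([1000.0, 1000.0])  # would overflow without the max subtraction
```

A naive `math.exp(1000.0)` raises `OverflowError`, which is why the shifted form is the usual choice in practice.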
3.3 Loss Functions
Mean Squared Error (Regression)
L = (1/n) Σᵢ (yᵢ - ŷᵢ)²
Derived from the negative log-likelihood of a Gaussian: minimizing MSE is equivalent to maximizing likelihood under Gaussian noise.
Binary Cross-Entropy (Binary Classification)
L = -(1/n) Σᵢ [yᵢ log ŷᵢ + (1 - yᵢ) log(1 - ŷᵢ)]
Derived from the Bernoulli likelihood.
Categorical Cross-Entropy (Multi-class)
L = -(1/n) Σᵢ Σⱼ yᵢⱼ log ŷᵢⱼ
where yᵢⱼ ∈ {0,1} is the one-hot encoded ground truth and ŷᵢⱼ is the predicted probability of class j for example i.
3.4 Forward Propagation
Full forward pass for an L-layer network:
a^(0) = x (input)
for l = 1, 2, ..., L:
    z^(l) = W^(l) a^(l-1) + b^(l)
    a^(l) = σ(z^(l))
ŷ = a^(L) (output)
3.5 Backpropagation
The most important algorithm in deep learning. Efficiently computes gradients using the chain rule.
Goal: Compute ∂L/∂W^(l) and ∂L/∂b^(l) for all layers l.
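Step 1: Output layer error
δ^(L) = ∇_a L ⊙ σ'(z^(L))
where ⊙ is element-wise multiplication and ∇_a L is the gradient of the loss with respect to the output activations a^(L).
Step 2: Backpropagate the error
δ^(l) = ((W^(l+1))ᵀ δ^(l+1)) ⊙ σ'(z^(l))
Step 3: Compute gradients
∂L/∂W^(l) = δ^(l) (a^(l-1))ᵀ
∂L/∂b^(l) = δ^(l)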
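The three steps can be traced on a network small enough to write out by hand. The sketch below assumes a 2-2-1 network with sigmoid activations and the loss L = ½(ŷ - y)²; the weights, input, and target are arbitrary illustrative numbers. It then compares one backpropagated gradient with a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass of a 2-2-1 sigmoid network; returns z1, a1, z2, a2."""
    W1, b1, W2, b2 = params
    z1 = [sum(w * x_i for w, x_i in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [sigmoid(z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    return z1, a1, z2, sigmoid(z2)

def loss(params, x, y):
    return 0.5 * (forward(params, x)[3] - y) ** 2

def backward(params, x, y):
    """Backpropagation; uses sigma'(z) = a * (1 - a) for the sigmoid."""
    W1, b1, W2, b2 = params
    z1, a1, z2, a2 = forward(params, x)
    delta2 = (a2 - y) * a2 * (1.0 - a2)                    # step 1: output error
    delta1 = [W2[j] * delta2 * a1[j] * (1.0 - a1[j])       # step 2: propagate back
              for j in range(len(a1))]
    dW2 = [delta2 * a for a in a1]                         # step 3: gradients
    dW1 = [[d * x_i for x_i in x] for d in delta1]
    return dW1, list(delta1), dW2, delta2

params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.2)
x, y = [1.0, 2.0], 1.0
dW1, db1, dW2, db2 = backward(params, x, y)

# Finite-difference check on the single weight W1[0][1].
h = 1e-6
plus = ([[0.1, -0.2 + h], [0.4, 0.3]],) + params[1:]
minus = ([[0.1, -0.2 - h], [0.4, 0.3]],) + params[1:]
numeric = (loss(plus, x, y) - loss(minus, x, y)) / (2 * h)
```

The backward pass reuses the activations stored during the forward pass, which is what makes backpropagation cheaper than differentiating each weight separately.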
THE paper that made neural networks practical. Must-read.
(First formulation of the reverse mode automatic differentiation)
3.6 Gradient Descent Variants
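Batch Gradient Descent
θ ← θ - α · (1/N) Σᵢ ∇_θ Lᵢ(θ) (sum over all N training examples; Lᵢ is the loss on example i)
Uses the entire dataset per update. Stable but slow.
Stochastic Gradient Descent (SGD)
θ ← θ - α · ∇_θ Lᵢ(θ) (one randomly chosen example i)
Noisy but cheap per update; the noise can help the iterate escape sharp local minima and saddle points.
Mini-Batch Gradient Descent (Standard in practice)
θ ← θ - α · (1/B) Σ_{i ∈ batch} ∇_θ Lᵢ(θ)
where B is the batch size (typically 32, 64, 128, 256).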
Momentum
Accumulates a velocity vector in directions of persistent gradient:
v ← β·v - α · ∇_θ L(θ)
θ ← θ + v
where β is the momentum coefficient (typically 0.9).
Adam Optimizer
Combines momentum and adaptive learning rates:
g_t = ∇_θ L(θ_{t-1}) (gradient)
m_t = β₁·m_{t-1} + (1-β₁)·g_t (1st moment)
v_t = β₂·v_{t-1} + (1-β₂)·g_t² (2nd moment)
m̂_t = m_t / (1 - β₁ᵗ) (bias correction)
v̂_t = v_t / (1 - β₂ᵗ) (bias correction)
θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε)
Default: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸
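The Adam update translates almost line for line into code. The sketch below applies it to a single scalar parameter and the toy loss (θ - 3)²; the loss, the learning rate of 0.1, and the step count are illustrative assumptions.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grad           # 1st moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # 2nd moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)                 # bias correction
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

def grad_fn(theta):
    """Gradient of the toy loss (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

theta, m, v = 0.0, 0.0, 0.0
theta_after_first_step = None
for t in range(1, 2001):
    theta, m, v = adam_step(theta, grad_fn(theta), m, v, t, alpha=0.1)
    if t == 1:
        theta_after_first_step = theta
```

On the very first step the bias-corrected ratio m̂/√v̂ is the sign of the gradient, so the parameter moves by almost exactly α regardless of the gradient's magnitude.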
One of the most cited ML papers ever.
(Introduces momentum)
Training Deep Networks (Weeks 11–13)
4.1 The Vanishing and Exploding Gradient Problem
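During backprop, gradients are products of weight matrices and activation derivatives. Unrolling the error recursion from the output layer L down to layer 1:
δ^(1) = [ Π_{l=1}^{L-1} diag(σ'(z^(l))) (W^(l+1))ᵀ ] δ^(L)
Vanishing Gradients
If the norm of each factor is below 1, gradients shrink exponentially with depth.
Exploding Gradients
If the norm of each factor is above 1, gradients grow exponentially with depth.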
Solutions:
- ✔ Better activation functions (ReLU)
- ✔ Better weight initialization
- ✔ Batch Normalization
- ✔ Residual connections (ResNet)
- ✔ Gradient clipping (for RNNs)
4.2 Weight Initialization
Why it matters: Bad initialization → dead neurons or exploding activations.
Zero Initialization
All weights = 0 → all neurons compute the same function → symmetry breaking fails. Never do this.
Random Initialization
Wᵢⱼ ~ N(0, σ²) for a hand-picked σ
If σ is too large, activations and gradients explode; if it is too small, they vanish.
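Xavier/Glorot Initialization (for tanh/sigmoid)
W ~ U[-√(6 / (n_in + n_out)), +√(6 / (n_in + n_out))]
or equivalently: Var(W) = 2 / (n_in + n_out)
He Initialization (for ReLU)
W ~ N(0, 2 / n_in)
The factor of 2 accounts for the fact that ReLU zeros out roughly half of its inputs.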
4.3 Batch Normalization
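Normalizes layer inputs to have zero mean and unit variance over each mini-batch, then re-scales with learnable parameters γ and β:
μ = (1/B) Σᵢ xᵢ
σ² = (1/B) Σᵢ (xᵢ - μ)²
x̂ᵢ = (xᵢ - μ) / √(σ² + ε)
yᵢ = γ · x̂ᵢ + β
where the sums run over the B examples in the mini-batch and ε is a small constant for numerical stability.
Benefits:
- Reduces internal covariate shift
- Allows higher learning rates
- Acts as a regularizer (reduces need for Dropout)
- Reduces sensitivity to initialization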
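The training-time computation of batch normalization for a single feature can be sketched as follows. The example covers only the mini-batch statistics; the running averages used at inference time are left out, and the function name and sample values are illustrative.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale by gamma and shift by beta."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

batch = [2.0, 4.0, 6.0, 8.0]  # one feature over a mini-batch of four examples
normalized = batch_norm(batch)

out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
```

With γ = 1 and β = 0 the output has mean 0 and variance just under 1 (because of ε); learning γ and β lets the network undo the normalization where that helps.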
4.4 Regularization Techniques
The Problem: Neural networks can memorize training data (overfit).
L2 Regularization (Weight Decay)
Add penalty to loss:
L_reg = L + (λ/2) · Σᵢ wᵢ²
Gradient update becomes:
w ← (1 - αλ) · w - α · ∂L/∂w
The factor (1 - αλ) "decays" weights toward zero.
L1 Regularization
Add penalty to loss:
L_reg = L + λ · Σᵢ |wᵢ|
Promotes sparsity — many weights become exactly zero.
Dropout
During training, randomly set each neuron to zero with probability p:
mᵢ ~ Bernoulli(1 - p)
ã = (m ⊙ a) / (1 - p)
The division by (1-p) ensures expected values remain unchanged. At test time, use all neurons.
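This scheme is often called inverted dropout. A minimal sketch, assuming a flat list of activations and a seeded random generator for reproducibility (the function signature is hypothetical):

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Zero each activation with probability p and scale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)  # test time: use all neurons unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the example is reproducible
train_out = dropout([1.0] * 10000, p=0.5, rng=rng)
eval_out = dropout([1.0, 2.0, 3.0], p=0.5, training=False)

train_mean = sum(train_out) / len(train_out)  # close to 1.0 on average
```

Scaling during training rather than at test time keeps inference code free of any dropout-specific logic.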
Early Stopping
Monitor validation loss; stop training when it starts increasing. Effectively limits model capacity.
4.5 The Universal Approximation Theorem
Theorem
A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of ℝⁿ, under mild assumptions on the activation function.
Formal Statement:
For any continuous f: [0,1]ⁿ → ℝ and any ε > 0, there exists a single-hidden-layer network F with N neurons such that:
sup_{x ∈ [0,1]ⁿ} |F(x) - f(x)| < ε
Important Caveat
This theorem guarantees existence, not that gradient descent will find such a network, nor that it requires a practical number of neurons.
Convolutional Neural Networks (Weeks 14–18)
5.1 Motivation: Why Not Just Use MLPs for Images?
A 224×224 RGB image has 224 × 224 × 3 = 150,528 input dimensions. A first hidden layer with just 1,000 neurons needs about 150 million weights in layer 1 alone.
CNNs exploit three key properties of natural images:
1. Locality: nearby pixels are correlated
2. Translation invariance: a cat is a cat whether it's top-left or bottom-right
3. Compositionality: complex features are built from simple ones
5.2 The Convolution Operation
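1D Convolution (for intuition)
(f * g)[n] = Σ_m f[m] · g[n - m]
2D Discrete Convolution (for images)
For input I and filter (kernel) K of size (2m+1) × (2n+1):
(I * K)[i, j] = Σ_{u=-m}^{m} Σ_{v=-n}^{n} I[i - u, j - v] · K[u, v]
Output size formula:
O = ⌊(N - k + 2P) / S⌋ + 1
where N is the input width (or height), k the kernel size, P the padding, and S the stride.
Parameters in a conv layer:
(k_h · k_w · C_in + 1) · C_out
where k_h × k_w is the kernel size, C_in and C_out are the input and output channel counts, and the +1 is the bias of each filter.
vs. dense layer: input_size × output_size, which is orders of magnitude more parameters.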
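The output-size rule, floor((input - kernel + 2·padding) / stride) + 1, and the parameter count of a conv layer are easy to check with two small helpers. The function names and the example layer shapes below are illustrative.

```python
def conv_output_size(n, k, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (n - k + 2 * padding) // stride + 1

def conv_params(k_h, k_w, c_in, c_out, bias=True):
    """Learnable parameters: one k_h x k_w x c_in filter (plus optional bias) per output channel."""
    return (k_h * k_w * c_in + (1 if bias else 0)) * c_out

same = conv_output_size(224, 3, padding=1, stride=1)   # 3x3 kernel, padding 1 keeps the size
valid = conv_output_size(32, 5)                         # no padding: output shrinks
strided = conv_output_size(224, 3, padding=1, stride=2)

conv_count = conv_params(3, 3, c_in=3, c_out=64)        # 3x3 conv, RGB input, 64 filters
dense_count = 224 * 224 * 3 * 1000                      # dense layer, 1,000 units, weights only
```

The conv layer's parameter count does not depend on the image size at all, which is the main reason it is so much smaller than the dense layer.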
5.3 Key CNN Components
Stride
Moving the kernel by stride S instead of 1 pixel at a time:
- Reduces output spatial dimensions
- Stride 2 ≈ halves the spatial size
Padding
Adding zeros around the input border:
- Valid padding: No padding, output shrinks
- Same padding: Pad so output size = input size (when stride=1)
Pooling
Reduces spatial dimensions while retaining important information.
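Max Pooling (most common):
y[i, j] = max of x[u, v] over all (u, v) in the pooling window at (i, j)
Average Pooling:
y[i, j] = mean of x[u, v] over all (u, v) in the pooling window at (i, j)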
Properties of pooling:
- No learnable parameters
- Makes representation approximately translation-invariant
- Reduces computation and memory
Feature Maps
Each convolutional filter produces one feature map (also called activation map). With C filters, we get C feature maps — each detecting a different feature (edges, curves, textures, etc.).
5.4 Receptive Field
The receptive field of a neuron is the region of the input image it can "see." After stacking multiple conv layers:
RF_l = RF_{l-1} + (k_l - 1) · Π_{i=1}^{l-1} sᵢ, with RF_0 = 1
where k_l is the kernel size at layer l and sᵢ is the stride at layer i.
Insight: A stack of small 3×3 kernels reaches the same receptive field as one large kernel while using fewer parameters and adding more non-linearities.
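The receptive field grows by (kernel size - 1) times the product of all earlier strides at each layer. A short sketch of that recurrence, with layers given as hypothetical `(kernel, stride)` pairs:

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers given as (kernel_size, stride) pairs."""
    rf, jump = 1, 1  # jump = product of the strides of all earlier layers
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

three_small = receptive_field([(3, 1), (3, 1), (3, 1)])  # three stacked 3x3 convs
one_large = receptive_field([(7, 1)])                    # a single 7x7 conv
with_stride = receptive_field([(3, 1), (3, 2), (3, 1)])  # stride enlarges later layers' reach

C = 64  # channel count, chosen only for illustration
params_small = 3 * (3 * 3 * C * C)   # 27 * C^2 weights
params_large = 7 * 7 * C * C         # 49 * C^2 weights
```

Three 3×3 layers and one 7×7 layer both see a 7×7 region, but the stacked version uses 27·C² instead of 49·C² weights (biases ignored).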
5.5 Full CNN Architecture
A typical CNN pipeline:
Input → [Conv → ReLU → Pool] × N → Flatten → Fully Connected → Softmax
5.6 Backpropagation Through Convolutions
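With the forward pass O[i, j] = Σ_u Σ_v I[i - u, j - v] · K[u, v] and δ[i, j] = ∂L/∂O[i, j]:
Gradient w.r.t. kernel weights:
∂L/∂K[u, v] = Σ_i Σ_j δ[i, j] · I[i - u, j - v]
Gradient w.r.t. input (for propagating to earlier layers):
∂L/∂I[x, y] = Σ_u Σ_v δ[x + u, y + v] · K[u, v]
This is a full convolution of δ with the (180°-rotated) kernel.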
Landmark CNN Architectures (Weeks 19–22)
6.1 LeNet-5 (1998) — The Pioneer
Key Contributions:
- First practical deep CNN for digit recognition (MNIST)
- Demonstrated CNNs for document recognition at scale
- Used sigmoid/tanh activations
Must-read: Sections I-III for the architecture, Section II.B for backpropagation in CNNs
6.2 AlexNet (2012) — The Deep Learning Revolution
Architecture: 5 Conv layers + 3 FC layers, ~60M parameters
Key Contributions:
- Won ImageNet 2012 with 15.3% top-5 error (vs. 26.2% runner-up)
- First large-scale use of ReLU (dramatically faster training)
- Applied Dropout in the FC layers to reduce overfitting
- Used GPU training (two GTX 580s)
- Local Response Normalization (LRN)
- Data augmentation (random crops, horizontal flips)
The paper that started the modern deep learning era. Read entirely.
6.3 VGGNet (2014) — The Simplicity Principle
Architecture: Very deep networks using only 3×3 convolutions (VGG-16: 16 layers, VGG-19: 19 layers)
Key insight:
Multiple stacked 3×3 conv layers have the same receptive field as larger kernels but with fewer parameters and more non-linearities:
- 2 stacked 3×3 convs → 5×5 receptive field
- 3 stacked 3×3 convs → 7×7 receptive field
Parameter comparison:
- 1× 7×7 conv: 49 × C² parameters
- 3× 3×3 convs: 27 × C² parameters (+ 3 non-linearities!)
Read: Table 1 (architectures), Section 2.1-2.3
6.4 GoogLeNet/Inception (2014) — Going Wider
Key innovation: The Inception Module
Instead of choosing between 1×1, 3×3, or 5×5 convolutions, do all of them in parallel and concatenate:
output = Concat(1×1 conv, 3×3 conv, 5×5 conv, 3×3 max pool) along the channel axis
1×1 convolutions act as a bottleneck to reduce channels before expensive 3×3/5×5 convolutions.
Other innovations:
- Auxiliary classifiers at intermediate layers (helps gradient flow)
- No FC layers at the end (uses global average pooling)
- 4M parameters (12× fewer than AlexNet!)
6.5 ResNet (2015) — The Residual Revolution
The solution: Residual (Skip) Connections
Instead of learning H(x), learn the residual F(x) = H(x) - x:
y = F(x) + x
where F is 2 or 3 conv layers.
Why it works:
- Gradient highway — gradients flow directly through skip connections
- Identity mapping is trivially learnable (set F = 0)
- Allows networks of 50, 101, 152+ layers
Mathematical insight:
For a block y = F(x) + x, the gradient of the loss w.r.t. the input becomes:
∂L/∂x = ∂L/∂y · (1 + ∂F/∂x)
The identity term "1" gives the gradient a direct path to earlier layers, so it does not vanish even when ∂F/∂x is small.
The most influential vision paper of the 2010s. Read entirely.
Read: Section 3 (theoretical analysis of why residuals work)
6.6 DenseNet (2016) — Dense Connectivity
DenseNet connects each layer to every subsequent layer:
x_l = H_l([x_0, x_1, ..., x_{l-1}])
where [·] denotes concatenation of all previous feature maps and H_l is the transformation applied by layer l.
Benefits:
- Maximizes gradient flow
- Feature reuse
- Substantially fewer parameters than a ResNet of comparable accuracy
Transfer Learning & Practical Applications (Weeks 23–25)
7.1 Transfer Learning
Idea: A model trained on a large dataset (e.g., ImageNet with 1.2M images, 1000 classes) learns general visual features that transfer to new tasks.
Why it works: Early CNN layers detect universal low-level features (edges, textures). Later layers are more task-specific.
1. Feature Extraction
Freeze all conv layers, train only the new head.
2. Fine-tuning
Unfreeze some/all layers and train with a small learning rate.
3. Full retraining
Use pretrained weights as initialization, train everything.
| | Small dataset | Large dataset |
|---|---|---|
| Similar domain | Feature extraction | Fine-tune last layers |
| Different domain | Careful fine-tuning | Possibly full retraining |
7.2 Data Augmentation
Artificially expand training data with label-preserving transformations:
- Geometric: Random crop, flip, rotation, scaling, shearing
- Photometric: Brightness, contrast, hue, saturation changes
- Noise: Gaussian noise, cutout, random erasing
- Advanced: Mixup (α-blend two images + labels), CutMix
Mixup Formula:
x̃ = λ·xᵢ + (1-λ)·xⱼ
ỹ = λ·yᵢ + (1-λ)·yⱼ
where λ ~ Beta(α, α)
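Mixup blends two examples and their one-hot labels with the same weight λ. The sketch below treats an image as a flat list of pixel values and draws λ with the standard library's `random.betavariate`; the helper name, the tiny "images", and α = 0.2 are illustrative assumptions.

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with lambda ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam

rng = random.Random(42)  # fixed seed for reproducibility
x_cat, y_cat = [0.0, 0.5, 1.0], [1.0, 0.0]   # tiny stand-in "image" and one-hot label
x_dog, y_dog = [1.0, 0.5, 0.0], [0.0, 1.0]
x_mix, y_mix, lam = mixup(x_cat, y_cat, x_dog, y_dog, rng=rng)
```

Small values of α push λ toward 0 or 1, so most mixed samples stay close to one of the two originals.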
Modern and Efficient Architectures (Weeks 26–28)
8.1 Depthwise Separable Convolutions (MobileNet)
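Depthwise separable conv (two steps):
1. Depthwise: apply one k×k filter per input channel (C_in filters total)
2. Pointwise: apply C_out 1×1 filters to mix the channels
| Convolution | Parameters (no bias) |
|---|---|
| Standard | C_in · k² · C_out |
| Depthwise separable | C_in · k² + C_in · C_out |
| Ratio (separable / standard) | 1/C_out + 1/k², about 8-9× fewer with k=3 |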
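The parameter savings can be verified with a quick calculation. The channel counts below (32 in, 64 out) are arbitrary illustrative values, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return c_in * k * k * c_out

def separable_conv_params(c_in, c_out, k):
    """Weights in a depthwise (k x k per channel) plus pointwise (1 x 1) convolution."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

c_in, c_out, k = 32, 64, 3
standard = standard_conv_params(c_in, c_out, k)
separable = separable_conv_params(c_in, c_out, k)
ratio = separable / standard            # matches 1/c_out + 1/k^2
closed_form = 1 / c_out + 1 / k ** 2
```

For these sizes the separable layer needs 2,336 weights instead of 18,432, a reduction of roughly 8×.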
8.2 Neural Architecture Search (NAS) — EfficientNet
EfficientNet proposes a principled way to scale CNNs on three dimensions simultaneously:
1. Depth: Number of layers (d)
2. Width: Number of channels (w)
3. Resolution: Input image size (r)
With a compound coefficient φ:
d = α^φ, w = β^φ, r = γ^φ
subject to α · β² · γ² ≈ 2, with α ≥ 1, β ≥ 1, γ ≥ 1
Beyond CNNs: Vision Transformers (Weeks 29–30)
9.1 Self-Attention (The Mechanism Powering Transformers)
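For input sequence X, compute:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
where Q = XW_Q, K = XW_K, V = XW_V are linear projections and d_k is the dimension of each key vector.
Multi-Head Attention:
MultiHead(X) = Concat(head₁, ..., head_h) · W_O
where headᵢ = Attention(XW_Q^(i), XW_K^(i), XW_V^(i)) uses its own projection matrices.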
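A single attention head (scaled dot-product attention) fits in a few lines when Q, K, and V are given directly as small lists of rows. The sketch below skips the learned projections and the multi-head concatenation; the matrices are made-up values chosen so the result is easy to verify by hand.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    weights = [
        softmax([sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_k) for k in K])
        for q in Q
    ]
    output = [
        [sum(w * v[c] for w, v in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0]]                 # two queries
K = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]     # three identical keys
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # three values
out, attn = attention(Q, K, V)
```

Because all keys are identical, every query attends uniformly and each output row is simply the average of the value rows.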
9.2 Vision Transformer (ViT)
Idea: Treat image patches as "tokens" and apply transformer attention
1. Split image into N patches of size 16×16
2. Embed each patch linearly
3. Add positional embeddings
4. Process with transformer encoder
5. Use [CLS] token for classification
Image → [Patches] → [Linear Projection] + [Positional Emb] → [Transformer] → [MLP Head]