Neural Networks, CNNs, and RNNs

Introduction

Neural networks have revolutionized fields ranging from computer vision to natural language processing. This blog provides an intuitive and mathematical understanding of Neural Networks (NNs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs).


Neural Networks: The Foundation

A neural network is a computational model inspired by the human brain. It consists of layers of interconnected “neurons,” each performing a weighted summation followed by a non-linear activation function.

Mathematical Representation

Consider an input vector \(\mathbf{x}\in\mathbb{R}^n\), weights \(\mathbf{W}\in\mathbb{R}^{m\times n}\), biases \(\mathbf{b}\in\mathbb{R}^m\), and activation function \(f(\cdot)\). The output of a single layer is:

\[\mathbf{h} = f(\mathbf{W}\mathbf{x} + \mathbf{b})\]

Stacking these layers allows the network to learn hierarchical representations:

\[\mathbf{h}^{(k+1)} = f(\mathbf{W}^{(k)} \mathbf{h}^{(k)} + \mathbf{b}^{(k)})\]
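
As a concrete illustration, here is a minimal NumPy sketch of a two-layer forward pass; the layer sizes and the choice of ReLU are arbitrary assumptions for the example, not part of the original post.

import numpy as np

def dense(x, W, b, f):
    """Single layer: weighted summation followed by a non-linearity, h = f(Wx + b)."""
    return f(W @ x + b)

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                            # input vector in R^4 (size is arbitrary)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)     # first layer: R^4 -> R^8
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)     # second layer: R^8 -> R^3

h1 = dense(x, W1, b1, relu)    # h^(1) = f(W^(0) x + b^(0))
h2 = dense(h1, W2, b2, relu)   # h^(2) = f(W^(1) h^(1) + b^(1))
print(h2.shape)                # (3,)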

The figure below illustrates this “stacking” process:

Key Intuitions

Neural networks approximate functions by learning parameters (weights and biases) through backpropagation, minimizing a loss function (e.g., Mean Squared Error or Cross-Entropy Loss).
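
To make the training idea concrete, here is a small sketch that fits a single linear layer to toy data by minimizing Mean Squared Error with plain gradient descent. It is the simplest possible case (no hidden layer, so backpropagation reduces to one gradient step); the data, learning rate, and iteration count are made-up assumptions.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                 # 100 toy samples with 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # noisy linear targets

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(200):
    pred = X @ w + b
    err = pred - y
    loss = np.mean(err ** 2)                  # Mean Squared Error
    grad_w = 2 * X.T @ err / len(y)           # dL/dw
    grad_b = 2 * err.mean()                   # dL/db
    w -= lr * grad_w                          # gradient descent step
    b -= lr * grad_b

print(np.round(w, 2))                         # close to [2.0, -1.0, 0.5]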


Convolutional Neural Networks (CNNs): Spatial Data Specialists

CNNs are specialized for data with spatial structure, such as images. Instead of fully connected layers, they use convolutional layers to extract local patterns, such as edges, which deeper layers combine into more complex shapes. Several types of layers are involved in this process:

graph LR
    Input[Input] --> Conv[Convolutional Layer<br>Feature Maps]
    Conv --> Pooling[Pooling<br>Layer]
    Pooling --> Flatten[Flattening]
    Flatten --> FC[Fully Connected<br>Layer]
    FC --> Output[Output<br>Layer]

Mathematical Representation

A convolution operation involves a filter (or kernel) \(\mathbf{K} \in \mathbb{R}^{k \times k}\) sliding over the input \(\mathbf{X} \in \mathbb{R}^{n \times n}\):

\[(\mathbf{X} * \mathbf{K})_{ij} = \sum_{p=0}^{k-1}\sum_{q=0}^{k-1} \mathbf{X}_{i+p, j+q} \mathbf{K}_{p, q}\]

The output is called a feature map. The figure below illustrates this process:
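
Here is a minimal NumPy sketch of this operation, assuming a square input and kernel, no padding, and stride 1 (the "valid" cross-correlation form written above); the helper name and toy inputs are illustrative.

import numpy as np

def conv2d(X, K):
    """(X * K)_{ij} = sum_{p,q} X[i+p, j+q] * K[p, q]  (no padding, stride 1)."""
    n, k = X.shape[0], K.shape[0]
    out = np.zeros((n - k + 1, n - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + k, j:j + k] * K)
    return out

X = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
K = np.array([[1.0, 0.0, -1.0]] * 3)           # simple vertical-edge filter
print(conv2d(X, K).shape)                      # (3, 3) feature map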

Pooling layers (e.g., max pooling) then downsample these feature maps, reducing dimensionality:
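
For example, 2×2 max pooling with stride 2 can be sketched as follows (a minimal example, assuming the feature-map height and width are divisible by 2):

import numpy as np

def max_pool_2x2(F):
    """Downsample a feature map by taking the max over non-overlapping 2x2 windows."""
    H, W = F.shape
    return F.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

F = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(F))   # 2x2 output; each entry is the max of one 2x2 block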

The flattening step converts the feature maps in \(\mathbb{R}^{H \times W \times D}\) into a one-dimensional vector in \(\mathbb{R}^{HWD}\):

Lastly, a fully connected layer connects the vector to the output layer:

\[\mathbf{h}^{(k+1)} = f(\mathbf{W}^{(k)} \mathbf{h}^{(k)} + \mathbf{b}^{(k)})\]
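
Continuing the sketch, flattening and the final fully connected layer might look like this; the 4×4×8 feature-map shape, the 10 output classes, and the softmax at the end are arbitrary assumptions for illustration.

import numpy as np

rng = np.random.default_rng(2)
feature_maps = rng.normal(size=(4, 4, 8))             # H x W x D feature maps
v = feature_maps.reshape(-1)                          # flatten to a vector in R^{HWD} = R^128

W, b = rng.normal(size=(10, v.size)), np.zeros(10)    # fully connected layer: R^128 -> R^10
logits = W @ v + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                  # softmax over the 10 output classes
print(probs.shape)                                    # (10,)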

Key Intuitions

  • Locality: Convolutions focus on local patches of data.
  • Shared Weights: The same filter is applied across the input, reducing the number of parameters.
  • Hierarchical Features: Layers capture increasingly complex features (e.g., edges → textures → objects).

Recurrent Neural Networks (RNNs): Temporal Data Specialists

RNNs excel at processing sequential data, such as time series or text. They maintain a “memory” of previous inputs via a hidden state, allowing them to model temporal dependencies.

Mathematical Representation

RNNs process data sequentially. At each time step \(t\), an input \(\mathbf{x}_t\) is provided; the full sequence is \(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T\), where \(T\) is the total number of time steps. RNNs also maintain a hidden state \(\mathbf{h}_t\), which acts as the network's memory and is updated at each time step from the current input and the previous hidden state.

Given input \(\mathbf{x}_t\), previous hidden state \(\mathbf{h}_{t-1}\), and weights \(\mathbf{W}_{xh}\), \(\mathbf{W}_{hh}\), the hidden state is updated as:

\[\mathbf{h}_t = f(\mathbf{W}_{xh} \mathbf{x}_t + \mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{b}).\]

The output \(\mathbf{y}_t\) is computed as:

\[\mathbf{y}_t = g(\mathbf{W}_{hy} \mathbf{h}_t + \mathbf{c}).\]

The diagram below shows the network unrolled over three time steps:

flowchart LR
    X1["Input x1"] --> H1["Hidden State h1"]
    H1 --> O1["Output y1"]
    X2["Input x2"] --> H2["Hidden State h2"]
    H1 --> H2
    H2 --> O2["Output y2"]
    X3["Input x3"] --> H3["Hidden State h3"]
    H2 --> H3
    H3 --> O3["Output y3"]
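
The recurrence above can be sketched directly in NumPy. The dimensions, the tanh activation, and the identity output function \(g\) are assumptions for illustration; note that the same weight matrices are reused at every time step.

import numpy as np

rng = np.random.default_rng(3)
n_in, n_hid, n_out, T = 4, 6, 2, 5               # arbitrary sizes and sequence length

W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_out)

xs = rng.normal(size=(T, n_in))                  # input sequence x_1 ... x_T
h = np.zeros(n_hid)                              # h_0
for x_t in xs:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)       # h_t = f(W_xh x_t + W_hh h_{t-1} + b)
    y_t = W_hy @ h + c                           # y_t = g(W_hy h_t + c), with g = identity here
print(y_t.shape)                                 # (2,)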

Key Intuitions

  • Memory: RNNs use the hidden state to retain information from previous time steps.
  • Shared Parameters: Parameters are shared across time steps, making RNNs computationally efficient.

Limitations and Variants

Standard RNNs struggle with long-term dependencies due to vanishing gradients. Variants like LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) address this by introducing gating mechanisms.

Long Short-Term Memory (LSTMs)

LSTMs are a special type of RNN that use gates to control the flow of information. These gates help the network decide which information to remember, forget, or output at each time step, enabling LSTMs to capture long-term dependencies effectively. The graph below illustrates this sequential structure:

Mathematical Representation

At each time step \(t\), the LSTM maintains a long-term memory called the cell state (\(\mathbf{C}_t\)) and a short-term memory called the hidden state (\(\mathbf{h}_t\)).

In the equations below, \(\sigma(\cdot)\) is the sigmoid activation, \(\tanh(\cdot)\) is the hyperbolic tangent, and \(\odot\) denotes element-wise multiplication. LSTMs introduce three types of gates:

  1. Forget Gate (\(\mathbf{f}_t\)): Decides which information to discard from the cell state:

    \[\mathbf{f}_t = \sigma(\mathbf{W}_f \mathbf{x}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f).\]
  2. Input Gate (\(\mathbf{i}_t\)): Decides which new information to add to the cell state:

    \[\mathbf{i}_t = \sigma(\mathbf{W}_i \mathbf{x}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i).\]

    Candidate Cell State: The candidate cell state (\(\tilde{\mathbf{C}}_t\)) is computed as:

    \[\tilde{\mathbf{C}}_t = \tanh(\mathbf{W}_C \mathbf{x}_t + \mathbf{U}_C \mathbf{h}_{t-1} + \mathbf{b}_C).\]
  3. Output Gate (\(\mathbf{o}_t\)): Decides which parts of the cell state are exposed as the new hidden state:

    \[\mathbf{o}_t = \sigma(\mathbf{W}_o \mathbf{x}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o)\]

Updating the States

  1. Cell State Update:

    \[\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t\]

    The forget gate decides what to discard, and the input gate decides what to add.

  2. Hidden State Update:

    \[\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{C}_t)\]

The graph below illustrates this updating process:
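
Putting the gate and state equations together, a single LSTM step can be sketched in NumPy as follows; the sizes, initialization, and the parameter-dictionary layout are assumptions for this example, not the post's own code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, P):
    """One LSTM time step following the equations above. P holds the weights."""
    f_t = sigmoid(P["W_f"] @ x_t + P["U_f"] @ h_prev + P["b_f"])      # forget gate
    i_t = sigmoid(P["W_i"] @ x_t + P["U_i"] @ h_prev + P["b_i"])      # input gate
    C_tilde = np.tanh(P["W_C"] @ x_t + P["U_C"] @ h_prev + P["b_C"])  # candidate cell state
    o_t = sigmoid(P["W_o"] @ x_t + P["U_o"] @ h_prev + P["b_o"])      # output gate
    C_t = f_t * C_prev + i_t * C_tilde                                # cell state update
    h_t = o_t * np.tanh(C_t)                                          # hidden state update
    return h_t, C_t

rng = np.random.default_rng(4)
n_in, n_hid = 3, 5                                                    # arbitrary sizes
P = {f"W_{g}": rng.normal(scale=0.1, size=(n_hid, n_in)) for g in "fiCo"}
P |= {f"U_{g}": rng.normal(scale=0.1, size=(n_hid, n_hid)) for g in "fiCo"}
P |= {f"b_{g}": np.zeros(n_hid) for g in "fiCo"}

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(4, n_in)):                                # short toy sequence
    h, C = lstm_step(x_t, h, C, P)
print(h.shape, C.shape)                                               # (5,) (5,)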

Gated Recurrent Units (GRUs)

GRUs are a simpler alternative to LSTMs, with fewer parameters. GRUs combine the forget and input gates into a single update gate and use a reset gate to control the influence of the previous hidden state.

Mathematical Representation

At each time step \(t\), GRUs maintain a single hidden state \(\mathbf{h}_t\).

  1. Update Gate (\(\mathbf{z}_t\)): Determines how much of the past hidden state to retain:

    \[\mathbf{z}_t = \sigma(\mathbf{W}_z \mathbf{x}_t + \mathbf{U}_z \mathbf{h}_{t-1} + \mathbf{b}_z)\]
  2. Reset Gate (\(\mathbf{r}_t\)): Controls how much of the previous hidden state to forget:

    \[\mathbf{r}_t = \sigma(\mathbf{W}_r \mathbf{x}_t + \mathbf{U}_r \mathbf{h}_{t-1} + \mathbf{b}_r)\]
  3. Candidate Hidden State (\(\tilde{\mathbf{h}}_t\)): The candidate for the current hidden state, computed as:

    \[\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h \mathbf{x}_t + \mathbf{r}_t \odot (\mathbf{U}_h \mathbf{h}_{t-1}) + \mathbf{b}_h)\]
  4. Hidden State Update: The hidden state is updated using the update gate:

    \[\mathbf{h}_t = \mathbf{z}_t \odot \mathbf{h}_{t-1} + (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_t\]
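
As with the LSTM, here is a minimal NumPy sketch of one GRU step following these equations; the sizes, initialization, and parameter-dictionary layout are again arbitrary assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, P):
    """One GRU time step following the equations above. P holds the weights."""
    z_t = sigmoid(P["W_z"] @ x_t + P["U_z"] @ h_prev + P["b_z"])              # update gate
    r_t = sigmoid(P["W_r"] @ x_t + P["U_r"] @ h_prev + P["b_r"])              # reset gate
    h_tilde = np.tanh(P["W_h"] @ x_t + r_t * (P["U_h"] @ h_prev) + P["b_h"])  # candidate state
    return z_t * h_prev + (1.0 - z_t) * h_tilde                               # hidden state update

rng = np.random.default_rng(5)
n_in, n_hid = 3, 5                                                            # arbitrary sizes
P = {f"W_{g}": rng.normal(scale=0.1, size=(n_hid, n_in)) for g in "zrh"}
P |= {f"U_{g}": rng.normal(scale=0.1, size=(n_hid, n_hid)) for g in "zrh"}
P |= {f"b_{g}": np.zeros(n_hid) for g in "zrh"}

h = np.zeros(n_hid)
for x_t in rng.normal(size=(4, n_in)):                                        # short toy sequence
    h = gru_step(x_t, h, P)
print(h.shape)                                                                # (5,)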

Intuition Behind LSTMs and GRUs

Both LSTMs and GRUs improve on standard RNNs by controlling the flow of information:

  • LSTMs: Use gates to selectively remember, forget, or output information, allowing the network to retain long-term dependencies.
  • GRUs: Simplify the gating mechanism while maintaining similar effectiveness, making them computationally more efficient.

Comparison

Feature           | NN                      | CNN                        | RNN
------------------|-------------------------|----------------------------|------------------------
Data Type         | General                 | Spatial (e.g., images)     | Sequential (e.g., text)
Key Operation     | Weighted Summation      | Convolution                | Recurrence
Parameter Sharing | None                    | Across spatial locations   | Across time steps
Strength          | Universal Approximation | Spatial Pattern Extraction | Temporal Dependencies