Optimizing Gradient Descent

A deep dive into momentum, adaptive learning rates, and modern optimization techniques for training neural networks.

Gradient descent is the workhorse of modern machine learning. At its core, it's beautifully simple: compute the gradient of your loss function with respect to your parameters, then take a step in the opposite direction. Yet this simplicity belies the rich landscape of optimization techniques that have emerged over the years.

The Vanilla Algorithm

Standard gradient descent updates parameters according to a simple rule: θ = θ - α∇L(θ), where α is the learning rate and ∇L(θ) is the gradient of the loss. This works, but has well-documented issues with saddle points, local minima, and sensitivity to the learning rate.

def gradient_descent(params, grad_fn, lr=0.01, steps=1000):
    for _ in range(steps):
        grads = grad_fn(params)        # gradient of the loss at the current parameters
        params = params - lr * grads   # step in the direction of steepest descent
    return params

Momentum: Physics to the Rescue

Momentum adds a "velocity" term that accumulates gradient information over time. Think of a ball rolling down a hill – it doesn't just follow the local slope, it builds up speed in consistent directions.

"The key insight is that gradient descent is not just about finding the steepest descent at each point, but about navigating the overall loss landscape efficiently."

The momentum update rule introduces a velocity term (a short code sketch follows this list):

  • v = βv + ∇L(θ) — accumulate the gradient with decay β
  • θ = θ - αv — update parameters using velocity
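
In code, this is a small change to the vanilla loop above. Here is a minimal NumPy sketch of that heavy-ball update, assuming params is a NumPy array (the name sgd_momentum and its default hyperparameters are illustrative choices):

import numpy as np

def sgd_momentum(params, grad_fn, lr=0.01, beta=0.9, steps=1000):
    v = np.zeros_like(params)          # velocity starts at rest
    for _ in range(steps):
        g = grad_fn(params)
        v = beta * v + g               # accumulate the gradient with decay beta
        params = params - lr * v       # step along the accumulated velocity
    return params

With beta=0 this reduces to plain gradient descent; values around 0.9 are a common starting point.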

Adam: Adaptive Moments

Adam (Adaptive Moment Estimation) combines the best of momentum with per-parameter adaptive learning rates. It maintains exponentially decaying averages of both the gradient (first moment) and squared gradient (second moment).

import numpy as np

def adam(params, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = np.zeros_like(params)  # First moment (decaying average of gradients)
    v = np.zeros_like(params)  # Second moment (decaying average of squared gradients)
    t = 0

    for _ in range(steps):
        t += 1
        g = grad_fn(params)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2

        # Bias correction: counteract the zero initialization of m and v
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)

        params = params - lr * m_hat / (np.sqrt(v_hat) + eps)

    return params
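
As a quick sanity check, here is how these routines might be called on a toy quadratic loss (the target vector, starting point, and hyperparameters are arbitrary illustrative choices):

import numpy as np

# Toy loss L(theta) = ||theta - target||^2, whose gradient is 2 * (theta - target)
target = np.array([3.0, -1.0])
grad_fn = lambda theta: 2.0 * (theta - target)

theta0 = np.zeros(2)
print(gradient_descent(theta0, grad_fn))           # ~ [ 3., -1.]
print(adam(theta0, grad_fn, lr=0.01, steps=2000))  # ~ [ 3., -1.]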

When to Use What

In practice, Adam is often the default choice for deep learning. However, recent work has shown that SGD with momentum can generalize better in some cases, particularly for image classification tasks. The key is understanding your problem's loss landscape.

Practical Tips

  • Start with Adam at lr=0.001 for quick experimentation
  • Use learning rate warmup for large batch training (a short sketch follows this list)
  • Consider SGD+momentum for final fine-tuning
  • Monitor gradient norms to detect instability
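
For the warmup and monitoring tips, a minimal sketch might look like the following; the warmup length, base learning rate, and norm threshold are illustrative values, not recommendations:

import numpy as np

def warmup_lr(step, base_lr=0.001, warmup_steps=1000):
    # Ramp the learning rate linearly up to base_lr over the first warmup_steps steps
    return base_lr * min(1.0, (step + 1) / warmup_steps)

def grad_norm(g):
    # L2 norm of the gradient; a sudden spike often signals training instability
    return float(np.sqrt(np.sum(g ** 2)))

# Inside a training loop:
#   lr = warmup_lr(step)
#   if grad_norm(g) > 10.0:  # threshold chosen from the typical norms you observe
#       print(f"step {step}: large gradient norm {grad_norm(g):.2f}")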

The optimization landscape continues to evolve. Recent developments like LAMB, AdaFactor, and Lion offer promising alternatives for specific use cases. The best optimizer is often problem-dependent – understanding the fundamentals helps you make informed choices.
