Adam

phasic.svgd.Adam(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08)

Adam optimizer for SVGD with per-parameter adaptive learning rates.

Adam maintains running estimates of the first moment (mean) and second moment (uncentered variance) of gradients, using these to adaptively scale updates per-parameter. This is especially useful when:

- Gradients have vastly different scales across parameters
- Dataset size causes large gradient magnitudes
- Optimization landscape has varying curvature
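
For reference, the update applied to each particle element follows the standard rule from Kingma & Ba (2014). The NumPy sketch below is illustrative only; the function name and signature are not part of the phasic API.

>>> import numpy as np
>>>
>>> # Standard Adam update for one timestep (illustrative sketch, not phasic internals).
>>> # grad, m, v are arrays of shape (n_particles, theta_dim); t is the timestep (starting at 1).
>>> def adam_update(grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
...     m = beta1 * m + (1 - beta1) * grad            # first moment (mean)
...     v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (uncentered variance)
...     m_hat = m / (1 - beta1 ** t)                  # bias correction for early steps
...     v_hat = v / (1 - beta2 ** t)
...     update = lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive scaling
...     return update, m, v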

Parameters

learning_rate : float or StepSizeSchedule = 0.001

Base learning rate (α in the Adam paper). Can be a schedule (e.g., ExpStepSize, WarmupExpStepSize) for learning rate decay during optimization.

beta1 : float or StepSizeSchedule = 0.9

Exponential decay rate for first moment estimates (momentum). Higher = more smoothing, slower adaptation. Can be a schedule for advanced warmup strategies.

beta2 : float or StepSizeSchedule = 0.999

Exponential decay rate for second moment estimates (gradient variance). Higher = longer memory of gradient magnitudes. Can be a schedule.

epsilon : float = 1e-8

Small constant for numerical stability in division.

Attributes

m : array or None

First moment estimate (shape: n_particles, theta_dim)

v : array or None

Second moment estimate (shape: n_particles, theta_dim)

t : int

Current timestep (for bias correction)

Examples

>>> from phasic import SVGD, Adam, ExpStepSize, WarmupExpStepSize
>>>
>>> # Create optimizer with default settings
>>> optimizer = Adam(learning_rate=0.01)
>>>
>>> # Use with SVGD
>>> svgd = SVGD(
...     model=model,
...     observed_data=observations,
...     theta_dim=2,
...     optimizer=optimizer,
...     n_particles=50,
...     n_iterations=200
... )
>>> svgd.fit()
>>>
>>> # Exponential decay learning rate
>>> optimizer = Adam(learning_rate=ExpStepSize(first_step=0.01, last_step=0.001, tau=500))
>>>
>>> # Warmup + decay (recommended for large models)
>>> optimizer = Adam(learning_rate=WarmupExpStepSize(peak_lr=0.01, warmup_steps=70))

References

Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980. https://arxiv.org/abs/1412.6980

Notes

When using Adam, the learning_rate parameter passed to SVGD is ignored in favor of the optimizer’s learning rate.

Methods

Name    Description
reset   Reset optimizer state for given particle shape.
step    Compute Adam update given SVGD gradient direction.

reset

phasic.svgd.Adam.reset(shape)

Reset optimizer state for given particle shape.

Called at the start of optimization to initialize moment estimates.

Parameters

shape : tuple

Shape of particles array (n_particles, theta_dim) or (n_particles, learnable_dim) if fixed parameters are used.
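
For illustration, resetting state for 50 two-dimensional particles could look like the following (the shape values here are arbitrary):

>>> optimizer = Adam(learning_rate=0.01)
>>> optimizer.reset((50, 2))  # initializes moment estimates for 50 particles with 2 learnable parameters each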

step

phasic.svgd.Adam.step(phi, particles=None)

Compute Adam update given SVGD gradient direction.

Parameters

phi : array(n_particles, theta_dim)

SVGD gradient direction: (K @ grad_log_p + sum(grad_K)) / n_particles. This is the direction of steepest ascent in the RKHS.

particles : array(n_particles, theta_dim) = None

Current particle positions. Not used by base Adam, but available for subclasses (e.g., Adamelia jitter).

Returns

update : array(n_particles, theta_dim)

Scaled update to add to particles. Each element is adaptively scaled based on the history of gradients for that parameter.
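
As a rough sketch of how step fits into a manual update loop (SVGD.fit() normally does this internally; compute_phi, particles, and n_iterations below are hypothetical placeholders):

>>> optimizer = Adam(learning_rate=0.01)
>>> optimizer.reset(particles.shape)
>>> for _ in range(n_iterations):
...     phi = compute_phi(particles)  # placeholder for (K @ grad_log_p + sum(grad_K)) / n_particles
...     particles = particles + optimizer.step(phi, particles)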