LLM with the JAX ecosystem from scratch - Part 3

In this last part of the LLM With The JAX Ecosystem From Scratch series, I share my experience of using Maximal Update Parameterization (μP) and the scaling laws to train LLMs. I'll first give a brief overview of μP, mostly from a practitioner's perspective, and describe how I implemented it. Then I'll explain how I used the scaling laws to figure out the largest model I can scale up to.

At every step of building the training pipeline and settling on the model architecture and hyperparameters, there are many things to explore and optimize. However, the purpose of this series is to get hands-on practice using the JAX ecosystem to train modern LLMs, and to get a good sense of the high-level landscape of research and engineering directions in academia and industry. So I won't go any deeper in any one direction than the series already does. Instead, it serves as a good starting point for those explorations.

Maximal update parameterization#

One of the most compute-intensive problems in training LLMs is finding the optimal learning rate. One typically has to sweep across a large range of values. To make things worse, the optimal learning rate does not transfer when moving to larger models, so as we scale up the model to fit a scaling law, we would have to repeat this sweep at every scale.

μP solves this problem by providing a way to scale the model size, parameter initialization standard deviations, and the learning rate using a single scale factor $m_p$, defined as the ratio of the model width over the base model width:

$$m_p = \frac{d_{\text{model}}}{d_{\text{base}}},$$

where $d_{\cdot}$ denotes the model width, namely the hidden state size of the LLM.

The $m_p$-dependent scaling of the various hyperparameters ensures that, throughout training, the norms (more precisely, the spectral norms) of the activations, weights, and gradients are independent of $m_p$, staying at $\Theta(1)$. Since the scaling is only a function of $m_p$, one no longer needs to tune the hyperparameters at every scale. Instead, the base model hyperparameters can be reused in, or transferred to, larger-scale models. The following is a comparison of the training loss vs. learning rate plots under standard parameterization and μP, from the Tensor Programs V paper.

muP vs standard parameterization training loss vs learning rate plots from the Tensor Programs V paper

As we can see, with standard parameterization the optimal learning rate shifts as the model size changes, while it stays stable with μP.

The exact derivation of the scaling functions is rather mathy and complicated. It was first derived in the Tensor Programs series of papers (starting with Tensor Programs I). A more accessible introduction is here. For obvious reasons, I'll skip the derivation in this series and just cite the scaling functions relevant to my application. Another helpful source for learning μP is this blog post from Cerebras and EleutherAI.

| Parameter | Standard Parameterization | μP |
| --- | --- | --- |
| Embedding initialization variance | $\sigma^2_{\text{base}}$ | $\sigma^2_{\text{base}}$ |
| Embedding learning rate (Adam) | $\eta_{\text{base}}$ | $\eta_{\text{base}}$ |
| Embedding activation | $xW_{\text{emb}}$ | $\alpha_{\text{input}} \cdot xW_{\text{emb}}$ |
| Hidden layer initialization variance | $\sigma^2_{\text{base}}$ | $\sigma^2_{\text{base}} / m_p$ |
| Hidden layer learning rate (Adam) | $\eta_{\text{base}}$ | $\eta_{\text{base}} / m_p$ |
| Output/logits activation | $xW_{\text{out}}$ | $\alpha_{\text{output}} \cdot xW_{\text{out}} / m_p$ |
| Attention logits | $Q^T K / \sqrt{d_{\text{head}}}$ | $Q^T K / d_{\text{head}}$ |
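The table above can be packed into a small helper. The following is a sketch assuming Adam and width-only scaling; the dictionary structure and names are illustrative, not the actual code of this series:

```python
def mup_hparams(d_model, d_base, eta_base, sigma2_base,
                alpha_input=1.0, alpha_output=1.0):
    """Scale base hyperparameters per the muP column of the table above."""
    m_p = d_model / d_base
    return {
        "embedding": {"init_var": sigma2_base, "lr": eta_base,
                      "act_scale": alpha_input},
        "hidden": {"init_var": sigma2_base / m_p, "lr": eta_base / m_p},
        "output": {"lr": eta_base / m_p, "act_scale": alpha_output / m_p},
        # note: 1/d_head instead of 1/sqrt(d_head), see next section
        "attn_logit_scale": "1/d_head",
    }
```

For example, transferring a base model tuned at $d_{\text{base}} = 256$ to $d_{\text{model}} = 1280$ divides the hidden layer learning rate and initialization variance by $m_p = 5$, while the embedding hyperparameters stay fixed.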

Attention logits normalization#

One interesting thing to note is the difference in the normalization factor of the attention logits. In SP it is $1/\sqrt{d_{\text{head}}}$, but in μP it becomes $1/d_{\text{head}}$. The short answer: $\sqrt{d_{\text{head}}}$ is derived under the assumption that the elements of the query ($Q$) and key ($K$) vectors are independent random variables, which is exactly true only at initialization. During training, weight updates cause the vectors to become correlated, and when vectors are correlated their dot product scales linearly with the dimension ($d_{\text{head}}$), not its square root.

In the case of SP, at initialization, the projection matrices $W_q$ and $W_k$ are typically initialized from a Gaussian distribution, so the resulting query and key vectors $q$ and $k$ have entries that are independent, identically distributed (i.i.d.) random variables with a mean of zero.

Let $q_m$ and $k_m$ be the $m$-th coordinates of $q$ and $k$, both with $\Theta(1)$ variance. The dot product is:

$$q \cdot k = \sum_{m=1}^{d} q_m k_m$$

Because $q_m$ and $k_m$ are independent and zero-mean:

$$\mathbb{E}[q_m k_m] = \mathbb{E}[q_m]\,\mathbb{E}[k_m] = 0$$

According to the Central Limit Theorem (CLT), the sum of $d$ independent zero-mean variables behaves like a random walk. While the expectation remains $0$, the variance grows linearly with $d$:

$$\text{Var}(q \cdot k) = \sum_{m=1}^{d} \text{Var}(q_m k_m) = \Theta(d)$$

To keep the attention logits (pre-softmax values) at a stable $\Theta(1)$ variance so the softmax doesn't immediately saturate, standard parameterization divides by the standard deviation:

$$\text{Logits}_{\text{SP}} = \frac{q \cdot k}{\sqrt{d}}$$
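This is easy to check numerically. Below is a small sketch (not actual attention code) with i.i.d. unit-variance Gaussian entries, mimicking the state at initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_variance(d, trials=20000):
    # q, k with i.i.d. zero-mean, unit-variance entries, as at initialization
    q = rng.standard_normal((trials, d))
    k = rng.standard_normal((trials, d))
    return (q * k).sum(axis=1).var()

# Var(q . k) grows linearly with d, so dividing by sqrt(d) keeps it Theta(1):
# dot_variance(d) / d stays close to 1 for any d
print(dot_variance(64) / 64, dot_variance(1024) / 1024)
```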

However, as training progresses, $W_q$ and $W_k$ become more and more correlated, and so do $q$ and $k$; they eventually align along meaningful feature directions. Because of this correlation, the expected value of their coordinate-wise product is no longer zero:

$$\mathbb{E}[q_m k_m] = c \neq 0$$

Now we must apply the Law of Large Numbers (LLN) rather than the CLT. The sum of $d$ variables with a non-zero mean $c$ scales linearly with $d$:

$$q \cdot k = \sum_{m=1}^{d} q_m k_m = d \left( \frac{1}{d} \sum_{m=1}^{d} q_m k_m \right)$$

As $d \to \infty$, the sample mean converges to the expected value:

$$q \cdot k \approx d \cdot c = \Theta(d)$$

In μP, we still want to keep the logits, and hence the softmax, well-behaved, so we divide by $d$ instead of $\sqrt{d}$:

$$\text{Logits}_{\mu\text{P}} = \frac{q \cdot k}{d} = \frac{\Theta(d)}{d} = \Theta(1)$$
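The correlated case can be simulated the same way. Here is a sketch where each key coordinate is constructed to have correlation $c$ with the matching query coordinate (the value of $c$ is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials, c = 512, 10000, 0.5  # c: per-coordinate correlation after training

q = rng.standard_normal((trials, d))
# k has unit variance and correlation c with q, coordinate-wise
k = c * q + np.sqrt(1 - c**2) * rng.standard_normal((trials, d))
dots = (q * k).sum(axis=1)

# E[q . k] = c * d, so only dividing by d (not sqrt(d)) keeps logits Theta(1):
print(dots.mean() / d)  # close to c = 0.5
```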

Learning rate for the output/logits layer#

You might also wonder why the learning rate for the output/logits layer is still scaled by $1/m_p$ when the activation is already scaled by $1/m_p$. The answer lies in the fact that we use Adam as the optimizer, whose update rule is:

$$\Delta w = -\frac{\alpha_t}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t,$$

where $\alpha_t$ is the learning rate adjusted by the optimization steps taken, $\hat{m}_t$ and $\hat{v}_t$ are the first and second moment estimates of the gradients, and $\epsilon$ is a small constant to prevent division by zero. Since $\hat{v}_t$ scales with $m_p$ as $\hat{v}_t \to \hat{v}_t / m_p^2$, and $\hat{m}_t$ scales as $\hat{m}_t \to \hat{m}_t / m_p$, the update step scales as $\Delta w \to \Delta w \cdot \frac{1}{m_p} \cdot \sqrt{m_p^2} = \Delta w$, which has no dependency on $m_p$. Therefore we need an extra $1/m_p$ factor in the learning rate itself, even though the output activation is already scaled by $1/m_p$.
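Adam's invariance to gradient rescaling can be verified numerically. The following toy sketch takes a single Adam step from zero moment state, with $\epsilon = 0$ so the invariance is exact:

```python
import numpy as np

def adam_step(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=0.0):
    # one Adam step starting from zero moment estimates, with bias correction
    m_hat = ((1 - beta1) * g) / (1 - beta1)
    v_hat = ((1 - beta2) * g**2) / (1 - beta2)
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

g = np.array([0.3, -1.2, 4.0])
# shrinking every gradient by 1/m_p leaves the update unchanged,
# which is why muP must scale the learning rate itself by 1/m_p
print(np.allclose(adam_step(g), adam_step(g / 5.0)))  # True
```

For this one-step case the update reduces to $-\alpha \cdot \text{sign}(g)$, which makes the gradient-scale independence obvious.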

Scaling law#

There is a lot to say about scaling laws, such as pretraining scaling, RL scaling, inference-time scaling, and agent scaling. But for the purpose of this series, I'll mention just one thing that guides how I pick the largest model size given the compute budget I have. For more details on pretraining scaling laws, check out these resources:

The part I need for this exercise is the rule of thumb connecting the amount of data needed to the number of model parameters. Various careful studies of scaling laws for modern LLMs found that, to train a model to good quality, one typically needs $X$ times as many tokens as model parameters. $X$ was about 20 until recently, when it went up significantly, to values such as 150 or 200.

The reason for those higher ratios is that inference is much more expensive than training in aggregate, so it is preferable to train a smaller model for much longer (to reach performance similar to larger models trained on less data) so that the inference-time cost (which accrues repeatedly throughout the model's deployment lifespan) is reduced, even if the training budget is not spent optimally. But for my little experiment, 20 is a good rule of thumb.

Putting it together#

So now we have two nice guidelines to help us find the optimal training recipe.

  • The scaling law tells us, given the compute budget, what size of model to train.
  • μP tells us, for that model size, what the training hyperparameters should be, such as the learning rate and parameter initialization standard deviation.

For my little exercise, the recipe can then be divided into two stages: finding the optimal hyperparameters for the base model, and scaling it up to the largest model I can train given my compute budget, which is 2 hours on 8x H100 SXM.

Base model architecture#

The first step is to find a reasonable base model size that can be trained cheaply, such as on my laptop with an RTX 5080. The reason is that we will need to sweep the μP hyperparameters, which requires tens, if not hundreds, of training runs of the base model.

The model size is relatively easy to compute given the model architecture. I built this util to compute the various numbers associated with a model and its inference and training cost, such as the number of parameters, the memory cost of the model state and optimizer state, the memory cost of activations, and the FLOPs cost of one inference or training pass.
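For the parameter count alone, a minimal version of such a util might look like the sketch below, assuming a standard decoder-only transformer with tied embeddings and ignoring biases and norm parameters (the author's actual util is linked above and covers much more):

```python
def num_params(d_model, n_layers, vocab_size, d_ff=None):
    """Rough parameter count for a decoder-only transformer."""
    # per layer: Q, K, V, O projections (4 * d^2) + two MLP matrices (2 * d * d_ff)
    d_ff = d_ff if d_ff is not None else 4 * d_model
    per_layer = 4 * d_model**2 + 2 * d_model * d_ff
    # plus one tied embedding/output matrix
    return n_layers * per_layer + vocab_size * d_model
```

With a hypothetical vocab size of 32,000 and the base model settings below (d_model = 256, 36 layers), this gives roughly 37M parameters.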

Anything about the model itself is easy to estimate (number of parameters, model state and optimizer state memory cost, etc.), but activation memory and FLOPs are harder because, at runtime, various implementation details and optimizations can change the cost footprint. I initially tried a grid search over the number of layers and the batch size to find a good base model setup. If the estimates were accurate, I could easily find an appropriate batch size and layer count combo from a plot like this:

muP base model batch size num of layers grid search

But my util wildly underestimated the memory cost, so I ended up reducing the number of layers slightly and using microbatching to fit the base model training on my laptop's 5080. The grid search at least provided a reasonable starting point. I eventually settled on the following parameters:

  • $d_{\text{base}} = 256$. This can't be too small, so that the effects of the law of large numbers kick in (for μP).
  • Number of layers = 36.
  • Batch size = 16.
  • Max context window size = 1024.

The other parameters can be found in any of the sweep runs, such as this one.

Hyperparameter sweep#

As seen in the μP section, the hyperparameters we need to tune are:

  • Base learning rate $\eta_{\text{base}}$.
  • Base standard deviation for model parameter initialization $\sigma_{\text{base}}$.
  • Multiplicative factor $\alpha_{\text{input}}$ for the embedding layer.
  • Multiplicative factor $\alpha_{\text{output}}$ for the output/logits layer.
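A sweep over these four can be organized as a simple grid. The following sketch is illustrative only; the values shown are not the actual ranges used in the sweeps:

```python
import itertools

# hypothetical sweep grid; values are placeholders, not the real sweep ranges
grid = {
    "eta_base": [0.002, 0.004, 0.008, 0.016],
    "sigma_base": [0.05, 0.1, 0.25, 0.5],
    "alpha_input": [0.8, 1.6, 3.2],
    "alpha_output": [0.8, 1.6, 3.2],
}
configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
print(len(configs))  # 4 * 4 * 3 * 3 = 144 base-model runs for a full grid
```

Even this modest grid is 144 runs, which is why the base model has to be small enough to train cheaply.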

I did three rounds of sweeps (sweep 1, 2, 3) for the base model above, and found the following hyperparameters:

  • $\eta_{\text{base}} = 0.008$.
  • $\sigma_{\text{base}} = 0.25$.
  • $\alpha_{\text{input}} = 1.6$.
  • $\alpha_{\text{output}} = 3.2$.

Scaling up#

Given all the preparation in this post and the previous ones (in particular, parallel training and all the plumbing work), all that is left is to figure out what $m_p$ we should scale up to. This is determined by the compute budget, which is 2 hours on 8x H100 SXM. Based on the H100 SXM spec, the total FLOPs I have is $1979 \times 10^{12} \times 8 \times 2 \times 3600 \times \lambda = 1.14 \times 10^{20}\,\lambda$, where $\lambda$ is the Model FLOPS Utilization (MFU).

The FLOPs of a training step can be approximated by its matrix multiplications, which gives a total training FLOPs estimate of $6 \times N_P \times N_T$, where $N_P$ is the number of model parameters and $N_T$ is the number of training tokens. With the tokens-per-parameter ratio of 20 from above, $N_T = 20 N_P$, so the FLOPs as a function of $N_P$ is $120 N_P^2$. Note how FLOPs grow quadratically with the number of parameters.

Setting the two equal, $120 N_P^2 = 1.14 \times 10^{20}\,\lambda$, gives a rough estimate of the largest model I can train. $\lambda$ can be estimated by actually running the training pipeline: from the theoretical FLOP/s on the spec we can derive a theoretical steps/sec, and $\lambda$ is then the actual steps/sec (as measured in test runs) divided by that theoretical value. In my case, it turned out to be about $1/7$. This puts my max $m_p$ at 5, which just shows how expensive this way of training LLMs is.
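The arithmetic above can be worked out in a few lines (peak bf16 throughput from the H100 spec; the MFU of 1/7 and the tokens-per-parameter ratio of 20 are the values discussed above):

```python
import math

peak = 1979e12 * 8 * 2 * 3600   # total peak FLOPs over 2 hours on 8x H100 SXM
mfu = 1 / 7                     # measured model FLOPS utilization

# training FLOPs ~ 6 * N_P * N_T with N_T = 20 * N_P, i.e. 120 * N_P^2,
# so the largest trainable model solves 120 * N_P^2 = peak * mfu
n_p = math.sqrt(peak * mfu / 120)
print(f"{n_p:.2e}")             # a few hundred million trainable parameters
```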

The following are the loss curves of my large run with this setup on 8x H100 SXM.

training loss of the largest training run

validation loss of the largest training run

Later I made some further optimizations, such as gradient checkpointing and moving to fp8 quantization. I'm sure they can improve the training pipeline's efficiency and hence the final model quality, but I'll leave that to future experiments when I have more time and hobby funds.

Closing#

Wheeew, that’s a wrap.

whew meme

I hope you enjoyed reading this series as much as I enjoyed writing it, and can take something useful away with you.

And please feel free to leave a comment below and let me know if you have any questions or suggestions! If you're interested, you can find the full code for this series in my repo: djwenren/llm-with-jax-practice.

Modern AI is fun. Let’s keep exploring!

LLM with the JAX ecosystem from scratch - Part 3
https://www.djwenren.com/posts/llm-with-the-jax-ecosystem-from-scratch-part-3/
Author
Danjie Wenren
Published at
2026-04-09
License
CC BY-NC-SA 4.0