LLM with the JAX ecosystem from scratch - Part 1

This is the first one of a series of posts documenting my journey of building LLMs using the JAX ecosystem from scratch.

Why would I want to do such a thing, you might ask? After all, the vast majority of open source model releases and libraries are based on PyTorch. Even the Transformers library from Hugging Face dropped support for the JAX ecosystem last year (see this and this). Well, here are the main reasons:

  • The ease of setting up parallel training with JAX is amazing. Instead of writing explicit collective communications like this, I only need to specify the sharding of the arrays/tensors, and the compiler figures out how to do the communication. Of course, JAX/XLA also provides the flexibility of coding those communications manually.
  • The performance increase and cost reduction reported by various sources, thanks to jit compilation, though there is also torch.compile in PyTorch. If I had an extra hobby budget to spare, it would be interesting to do some benchmarking on this. For now, I'll take the position that I can squeeze more compute out of my hobby budget with JAX.
  • Just because I can out of curiosity.
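To make the first point concrete, here is a minimal sketch of sharding-driven parallelism: we only annotate how an array is laid out over a device mesh, and under jit the compiler inserts whatever collectives the computation needs. The mesh axis name and toy function below are my own, not from the repo.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One "data" axis spanning all available devices (just 1 on a CPU-only host).
mesh = Mesh(np.array(jax.devices()), ("data",))

# Shard the batch dimension along the "data" axis; replicate the features.
x = jax.device_put(jnp.ones((8, 4)), NamedSharding(mesh, P("data", None)))

@jax.jit
def mean_per_feature(x):
    # This reduction crosses the sharded batch axis; under jit, XLA inserts
    # the necessary cross-device communication itself.
    return x.mean(axis=0)

y = mean_per_feature(x)
```

With explicit collectives one would instead write the per-device partial sums and an all-reduce by hand; here that is entirely the compiler's job.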

This is also partly based on the Stanford CS 336 class I took last year (unofficially ofc). What I’m basically trying to do is to reproduce most of the interesting things there, plus a few other things such as muP and scaling the model to 8 GPUs with some experiments such as FSDP + TP training.

My code is in this repo:

djwenren/llm-with-jax-practice

Most of the implementation is done as of March 2026. I have yet to do proper tuning (such as architecture and training hyperparameters) and to implement the RLVR training pipelines similar to the one explained in my other post.

The structure of this series is roughly as follows.

  • The basic components, including layers, data loader, checkpoint management, optimizers, and the basic pre-training pipeline.
  • Sharding and parallel training.
  • Scaling up. This part includes memory and flops estimation and maximal update parameterization (muP).
  • RLVR training. This part is yet to be done as of mid-March 2026. I might have to use some open source models and potentially port them into JAX, since the low-budget LLM I pretrained from scratch might not be capable enough to demonstrate any interesting post-training behavior.

Layers

For the layers and the framework in general, I used Flax NNX. I very briefly tried an earlier version of Flax a few years back. At the time, it felt very non-Pythonic, and from that perspective PyTorch was clearly the better framework. But the latest Flax NNX has matured significantly, and it now feels pretty natural to implement neural net layers with it.

The states (parameters, generic variables, native Python numbers, etc.) are now contained in the modules, instead of users having to manage them explicitly. From JAX's perspective, the modules are basically computations with the class members (states) captured. The resulting implementation is quite similar to PyTorch and reads descriptively. See here for more details.

One interesting thing to note about JAX is the use of vmap and jax.lax.scan for defining and calling the layers of transformer blocks. A naive implementation is to wrap the blocks in an nnx.List and call them in a for loop, similar to how one would do it in PyTorch:

class TransformerLm(nnx.Module):
    """Transformer language model."""

    def __init__(
        self,
        config: TransformerConfig,
        rngs: nnx.Rngs,
        dtype: jnp.dtype = jnp.float32,
    ):
        self.transformer_blocks = nnx.List(
            [
                L.TransformerBlock(
                    d_model=config.d_model,
                    num_heads=config.num_heads,
                    d_ff=self._get_d_ff(
                        d_model=config.d_model,
                        d_ff_to_d_model=config.d_ff_to_d_model,
                        d_ff=config.d_ff,
                    ),
                    rngs=rngs,
                    rope=self.rope,
                    dtype=dtype,
                )
                for _ in range(config.num_layers)
            ]
        )

    def __call__(
        self, input_tokens: Int[jnp.ndarray, "... seq_len"]
    ) -> Float[jnp.ndarray, "... seq_len vocab_size"]:
        # Embedding and output projection omitted for brevity.
        for transformer_block in self.transformer_blocks:
            activation = transformer_block(
                in_features=activation,
                token_positions=token_positions,
            )

The sharp bit ("sharp bit" in the same spirit as it's used in this guide) is that the JAX/XLA compiler might unroll the for loop in an effort to optimize the code. This often leads to a very large jit compilation output, with very little actual efficiency improvement. One way to fix it is to define the stack of blocks using nnx.vmap and call them using jax.lax.scan:

class TransformerLm(nnx.Module):
    """Transformer language model."""

    def __init__(
        self,
        config: TransformerConfig,
        rngs: nnx.Rngs,
        dtype: jnp.dtype = jnp.float32,
    ):
        @nnx.vmap(transform_metadata={nnx.PARTITION_NAME: None}, in_axes=(0,))
        def _create_transformer_block(rngs: nnx.Rngs) -> L.TransformerBlock:
            return L.TransformerBlock(
                d_model=config.d_model,
                num_heads=config.num_heads,
                d_ff=_get_d_ff(
                    d_model=config.d_model,
                    d_ff_to_d_model=config.d_ff_to_d_model,
                    d_ff=config.d_ff,
                ),
                rngs=rngs,
                dtype=dtype,
                sharding=sharding.transformer_blocks,
                use_mu_p=False,
                attn_std=None,
                ffn_std=None,
            )

        self.transformer_blocks = _create_transformer_block(
            rngs.fork(split=config.num_layers)
        )

    def __call__(
        self, input_tokens: Int[jnp.ndarray, "... seq_len"]
    ) -> Float[jnp.ndarray, "... seq_len vocab_size"]:
        # Embedding and output projection omitted for brevity.
        def scan_over_transformer_blocks(activation, transformer_block):
            return (
                transformer_block(
                    in_features=activation,
                    token_positions=token_positions,
                    rope=self.rope,
                ),
                None,
            )

        activation, _ = jax.lax.scan(
            scan_over_transformer_blocks, activation, self.transformer_blocks
        )

This keeps the compiler from unrolling the loop over layers. If one wishes to tune runtime efficiency further, jax.lax.scan even exposes an unroll parameter for explicit control.
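As a toy illustration of that knob, here is a scan whose body stands in for a transformer block, with unroll set explicitly (this is a generic jax.lax.scan example, not code from the repo):

```python
import jax
import jax.numpy as jnp

def body(carry, x):
    # One "layer" step: an accumulating sum standing in for a block call.
    return carry + x, None

# unroll=4 asks XLA to unroll four scan steps per loop iteration, trading
# a larger compiled program for fewer loop trips at runtime. unroll=1
# (the default) keeps a compact rolled loop.
total, _ = jax.lax.scan(body, 0.0, jnp.arange(8.0), unroll=4)
```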

Data loader

For loading the training and validation datasets, I used the Grain library. The things I like about it are:

  • Natural integration with the rest of the JAX ecosystem.
  • Support for checkpointing, so training can resume at the same point in the training data stream.
  • Support for multithreaded prefetching, making it more efficient.

The datasets are still the TinyStories and OpenWebText datasets used in CS 336. I didn't re-implement the BPE tokenizer and instead used the tokens directly from there, in the form of NumPy array dumps. I only needed to write a custom Grain data source and configure a dataset batch iterator.
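A data source over such token dumps can be sketched roughly as follows. Grain's random-access data sources follow the plain `__len__`/`__getitem__` protocol, so the sketch below needs nothing Grain-specific; the class name and window layout are my own, not necessarily what the repo uses.

```python
import numpy as np

class TokenDumpSource:
    """Random-access source over a NumPy token dump, following the
    __len__/__getitem__ protocol that Grain data sources use.
    (Hypothetical sketch; names are not from the repo.)"""

    def __init__(self, path: str, context_length: int):
        # memmap avoids loading the whole token dump into RAM.
        self._tokens = np.load(path, mmap_mode="r")
        self._context_length = context_length

    def __len__(self) -> int:
        # Number of non-overlapping (input, target) windows in the stream.
        return (len(self._tokens) - 1) // self._context_length

    def __getitem__(self, idx: int):
        start = idx * self._context_length
        window = self._tokens[start : start + self._context_length + 1]
        # Next-token prediction: targets are the inputs shifted by one.
        return np.asarray(window[:-1]), np.asarray(window[1:])
```

Such a source can then be handed to Grain's dataset/loader machinery for batching and prefetching.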

One difference from data loading in the PyTorch ecosystem is that there doesn't seem to be a need for calling .pin_memory().to(device, non_blocking=True), at least according to my conversation with Gemini. This is because the JAX/XLA runtime already handles memory management: when a NumPy array is passed to JAX (either via jax.device_put or as an argument to a jit-compiled function), the runtime handles the transfer efficiently and may use pinned memory internally.
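In other words, a host-side NumPy batch can be fed straight into a jitted function. A minimal illustration (the toy function is mine):

```python
import numpy as np
import jax
import jax.numpy as jnp

# A host-side NumPy batch, as it would come out of the data loader.
batch = np.arange(8, dtype=np.int32)

@jax.jit
def token_sum(tokens):
    return jnp.sum(tokens)

# Passing the NumPy array straight in: the JAX runtime performs the
# host-to-device transfer itself; no pin_memory/to(device) dance needed.
out = token_sum(batch)
```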

Checkpoint management

For checkpoint management, I used the Orbax library for similar reasons as above. For my use case, I need to save the following in the checkpoints:

  • Model state.
  • Optimizer state.
  • Metadata, such as train config and model config, so that the same configs can be loaded from checkpoint too.

Later when using muP, there are two optimizers used (one for embedding, one for the other model parts), but the overall structure of the checkpoint manager remains the same.

The critical requirement when training large models is that when restoring from checkpoints, we shouldn’t first materialize a placeholder model and optimizer, and then immediately update their parameters from the checkpoint. Instead, we can pass in abstract models returned from nnx.eval_shape, such as the following

abstract_model = nnx.eval_shape(
    lambda: transformer.TransformerLm(
        config=model_config, rngs=nnx.Rngs(jax.random.key(42)), sharding=sharding
    )
)

In this way, the parameters of the model are replaced by jax.ShapeDtypeStruct placeholders, and since I used explicit sharding, the sharding information is retained as well.
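The same mechanism can be seen with plain jax.eval_shape; nnx.eval_shape does the analogous thing for a whole module. A minimal sketch (the stand-in parameter dict is mine):

```python
import jax
import jax.numpy as jnp

def make_params():
    # Stand-in for model construction; in reality this would build the
    # full parameter tree of the transformer.
    return {"w": jnp.zeros((512, 512)), "b": jnp.zeros((512,))}

# Traces make_params without running it: no device memory is allocated,
# and every leaf comes back as a jax.ShapeDtypeStruct.
abstract = jax.eval_shape(make_params)
```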

With the abstract model, we can create an abstract optimizer, and then load real parameters from the checkpoint to replace their abstract parameters (of type jax.ShapeDtypeStruct). So throughout the process, we only materialize the parameters once. See the following snippet

class CheckpointManager(BaseCheckpointManager):
    def restore(
        self,
        step: int,
        abstract_model: nnx.Module,
        **kwargs,
    ) -> tuple[nnx.Module, PyTree[Any], ...]:
        """Restores the checkpoint."""
        assert "tx" in kwargs, "tx must be provided"
        tx = kwargs["tx"]
        assert isinstance(
            tx, optax.GradientTransformation
        ), "tx must be an instance of optax.GradientTransformation"
        # 1. Create abstract optimizer on top of abstract model.
        # Since abstract_model contains ShapeDtypeStructs, no real arrays are allocated here.
        abstract_optimizer = nnx.Optimizer(abstract_model, tx, wrt=nnx.Param)
        # 2. Split both together to get a unified GraphDef and combined abstract state.
        # This allows us to restore both in one merge call, ensuring correct linking.
        # Path 0: optimizer state, Path 1: model state.
        opt_model_graph_def, abstract_combined_state = nnx.split(
            (abstract_optimizer, abstract_model)
        )
        abstract_combined_state = _canonicalize_sharding(abstract_combined_state)
        # 3. Restore using fixed shardings from their respective checkpoint slots.
        restored_args = self._ocp_checkpoint_manager.restore(
            step=step,
            args=ocp.args.Composite(
                model_state=ocp.args.StandardRestore(abstract_combined_state[1]),
                optimizer_state=ocp.args.StandardRestore(abstract_combined_state[0]),
                metadata=ocp.args.JsonRestore(),
            ),
        )
        # 4. Merge everything back into real objects in one go.
        # This bypasses optax.init() and prevents materializing zero-filled states.
        full_restored_state = nnx.State(
            {0: restored_args.optimizer_state, 1: restored_args.model_state}
        )
        restored_optimizer, restored_model = nnx.merge(
            opt_model_graph_def, full_restored_state
        )
        return restored_model, restored_args.metadata, restored_optimizer
Side note about sharding

It is possible that nnx.eval_shape also abstracts the mesh specified in the shardings of the parameters; in other words, the shardings of the parameters in the abstract model may carry a jax.sharding.AbstractMesh. Orbax's checkpoint restoration can only take shardings with physical meshes (as of mid-March 2026), so I needed to canonicalize the shardings (replace the abstract mesh with the physical one).

Optimizers

For optimizers, I used the Optax library for the same reasons explained above. The unique thing about this library is that it treats every optimizer as a gradient transformation, or a chain of gradient transformations. That's all a user needs to implement; the library handles applying the final output of that gradient transformation to the model parameters.

I re-implemented Adam, weight decay, and the cosine learning rate schedule, similar to CS 336. One sharp bit is to be careful about what we mark as an nnx.data state parameter versus an nnx.static one. The difference is that nnx.static state parameters are treated as static when the update method is jitted: in later calls to the jitted update method, static state parameters always keep the value from the very first call. This is a trap for the step state parameter. I naively thought at first that it could simply be a native Python int, but that would keep it at 0 throughout training. The solution is to wrap it in a jax.Array and mark it with nnx.data.
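As a sketch of the schedule piece, here is roughly what a warmup-plus-cosine learning rate function looks like. Parameter names are my own, not necessarily the repo's; note the use of jnp.where instead of Python branching, which keeps the function jit-friendly when step is a traced jax.Array rather than a static Python int.

```python
import jax.numpy as jnp

def cosine_schedule(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    # Linear warmup phase: 0 -> max_lr over warmup_steps.
    warmup_lr = max_lr * step / warmup_steps
    # Cosine decay phase: max_lr -> min_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine_lr = min_lr + 0.5 * (max_lr - min_lr) * (1.0 + jnp.cos(jnp.pi * progress))
    # jnp.where (not `if`) so the branch works on traced step values.
    return jnp.where(step < warmup_steps, warmup_lr, cosine_lr)
```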

Pre-training pipeline

The pre-training pipeline is fairly similar to one in PyTorch, such as this one from CS 336. Here are two interesting things unique to JAX.

JIT fresh model and optimizer creation

Jitting the creation of a fresh model and optimizer makes it more efficient. For example, when we specify shardings of the model across multiple devices, without jitting the model parameters are first created on the host (CPU/RAM) and then sent to their corresponding devices. With jitting, the compiler generates code that creates those parameters directly on the right devices. Doing this is as simple as the following:

def _get_sp_model_and_optimizer(
    train_config: _train_config.TrainConfig,
    model_config: transformer.TransformerConfig,
    sharding: _sharding.TransformerLmSharding,
    ckpt_manager: checkpoint.CheckpointManager,
) -> tuple[nnx.Module, nnx.Optimizer]:
    """Gets the model and optimizers."""
    # tx and latest_step are set up earlier in the function (omitted here).

    @nnx.jit
    def _get_fresh_model_and_optimizer():
        model = transformer.TransformerLm(
            config=model_config, rngs=nnx.Rngs(jax.random.key(42)), sharding=sharding
        )
        optimizer = nnx.Optimizer(model, tx, wrt=nnx.Param)
        return model, optimizer

    if latest_step is None:
        return _get_fresh_model_and_optimizer()

This falls under the broader topic of buffer donation. From the guide:

When JAX executes a computation it uses buffers on the device for all inputs and outputs. If you know that one of the inputs is not needed after the computation, and if it matches the shape and element type of one of the outputs, you can specify that you want the corresponding input buffer to be donated to hold an output.

In the case of the train step, such as the following:

@nnx.jit(donate_argnames=("local_model", "local_optimizer"))
def _train_step(
    local_model: nnx.Module,
    local_optimizer: nnx.Optimizer,
    input_seq: Int[jnp.ndarray, "batch_size context_length"],
    target_seq: Int[jnp.ndarray, "batch_size context_length"],
) -> tuple[Float[jnp.ndarray, ""], Float[jnp.ndarray, ""]]:
    """Trains the model for one step."""
    loss, grads = nnx.value_and_grad(loss_fn)(local_model, input_seq, target_seq)
    local_optimizer.update(local_model, grads)
    return (
        loss,
        # Compute the total L2 norm of the gradients.
        jnp.sqrt(
            jax.tree.reduce(
                lambda acc, x: acc + jnp.sum(jnp.square(x)),
                grads,
                0,
            )
        ),
    )

local_model and local_optimizer are both the inputs and the outputs. Therefore, by donating these two arguments, we effectively save half of the HBM cost by having them updated in place. This issue and optimization are studied and discussed in detail in this GitHub issue.
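The same mechanism can be demonstrated with plain jax.jit on a toy SGD step (a sketch; the function is mine, and on CPU backends donation is simply ignored with a warning, though the semantics are the same):

```python
import jax
import jax.numpy as jnp

def sgd_step(params, grads):
    # The new params match the old ones in shape and dtype, so the input
    # buffer is eligible to be donated to hold the output.
    return params - 0.1 * grads

# Donate argument 0 (params): XLA may reuse its buffer for the result
# instead of allocating a second copy of the parameters.
donating_step = jax.jit(sgd_step, donate_argnums=(0,))

params = jnp.ones((4,))
new_params = donating_step(params, jnp.ones((4,)))
# After donation, `params` may be invalidated; only use `new_params`.
```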

Closing

With the above, I was able to start running pre-training on one device. Here is an example training curve:

Example pre-training learning curve

In part 2, I’ll share how I implemented sharding for various parallelisms, such as Fully Sharded Data Parallel (FSDP) and/or Tensor Parallelism (TP) ✌️

Author: Danjie Wenren
Published: 2026-03-18
License: CC BY-NC-SA 4.0
https://www.djwenren.com/posts/llm-with-the-jax-ecosystem-from-scratch-part-1/