PatronusAI/llada_2.1_world_model_v3

PatronusAI/llada_2.1_world_model_v3 is a LLaDA2-style Mixture-of-Experts (MoE) language model checkpoint trained on conversational, world-modeling-style data using the dFactory training stack.

This repository contains the Hugging Face checkpoint exported from the final training step of the most recent local W&B run:

  • Run ID: fc0mdstv
  • Date: March 11, 2026
  • Training entrypoint: tasks/train_llada2_bd.py
  • Training config: configs/sft/llada2_mini_bd_sft.yaml
  • Git commit: 92b6890808088b112fcf5fc73a341b78b6ab76bf

Model details

  • Architecture: LLaDA2MoeModelLM
  • Model type: custom llada2_moe
  • Approximate size: 16B-class model
  • Layers: 20
  • Hidden size: 2048
  • Attention heads: 16
  • Key/value heads: 4
  • Experts: 256 total, 8 experts routed per token
  • Shared experts: 1
  • Vocabulary size: 157,184
  • Checkpoint dtype: bfloat16
  • Saved max position embeddings: 16,384

The repository includes custom modeling files:

  • configuration_llada2_moe.py
  • modeling_llada2_moe.py

trust_remote_code=True is required when loading through transformers.

Training data

The checkpoint was trained from the local dataset at:

  • /workspace/dFactory/world_modeling_datasets/world_modeling_train.jsonl

Based on the dataset sample and config, this is a conversation-format JSONL dataset using the messages field. The examples appear to be multi-turn agent traces with system, user, assistant, and tool-interaction style content.

The exact dataset semantics are inferred from the local files and the run config; this repository does not include a separate dataset card.
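For illustration, a record in this format would look roughly like the following. This is a hypothetical sketch: only the messages/role/content layout is taken from the description above, and the content strings are placeholders, not real training data.

```python
import json

# Hypothetical record sketching the messages-format JSONL layout described
# above; the content strings are placeholders, not real training data.
record = {
    "messages": [
        {"role": "system", "content": "You are an agent operating in a simulated environment."},
        {"role": "user", "content": "Open the door and describe the room."},
        {"role": "assistant", "content": "I try the door handle. The room contains ..."},
    ]
}

# Each line of the .jsonl file is one such record serialized as a single JSON object.
line = json.dumps(record)
parsed = json.loads(line)
```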

Training recipe

The model was trained with block-diffusion SFT settings from configs/sft/llada2_mini_bd_sft.yaml.

  • Objective: block-diffusion conversational fine-tuning
  • Sequence length used for training: 8192
  • Epochs: 1
  • Train steps: 4067
  • Global batch size: 64
  • Micro batch size: 2
  • Optimizer: AdamW
  • Learning rate: 1e-5 with cosine decay to 1e-7
  • Warmup ratio: 0.03
  • Weight decay: 0.1
  • Gradient clipping: 1.0
  • Block diffusion mode: enabled
  • Block size: 32
  • Noise range: 0.3 to 0.8
  • Mixed precision: enabled
  • Gradient checkpointing: enabled
  • Parallelism: FSDP2 over 8 GPUs
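The block-diffusion noising settings above (block size 32, noise range 0.3 to 0.8) can be sketched as follows. This is an illustrative reimplementation based only on the listed hyperparameters, not the actual dFactory training code: it samples a per-block mask rate in the configured range and masks that fraction of token positions within each block.

```python
import random

def make_block_noise_mask(seq_len, block_size=32, noise_min=0.3, noise_max=0.8, seed=0):
    """Sketch: for each block, draw a mask rate uniformly from
    [noise_min, noise_max] and mask that fraction of positions in the block."""
    rng = random.Random(seed)
    mask = [False] * seq_len
    for start in range(0, seq_len, block_size):
        positions = list(range(start, min(start + block_size, seq_len)))
        rate = rng.uniform(noise_min, noise_max)
        k = round(rate * len(positions))
        for pos in rng.sample(positions, k):
            mask[pos] = True
    return mask

mask = make_block_noise_mask(128)
```

During training, masked positions would be replaced with a mask token and the model trained to reconstruct them block by block.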

Training hardware

  • GPUs: 8x NVIDIA H200
  • CUDA: 12.8
  • Python: 3.11.10
  • transformers: 4.56.2

Final run metrics

Metrics below come from wandb/run-20260311_054957-fc0mdstv/files/wandb-summary.json.

  • Final training loss: 0.4174
  • Final grad norm: 1.2001
  • Consumed tokens: 2.132B
  • Throughput: 0.0673M tokens/second (~67.3K tokens/s)
  • Runtime: 32147.65s (~8.93h)
  • Max allocated GPU memory: 101.46 GB
  • Max reserved GPU memory: 125.39 GB
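As a quick sanity check, the consumed-token count and wall-clock runtime above are roughly consistent with the reported throughput; the small gap is expected if the reported rate averages over training steps only, excluding setup and checkpointing time.

```python
# Figures taken from wandb-summary.json as listed above.
consumed_tokens = 2.132e9
runtime_s = 32147.65

avg_tokens_per_s = consumed_tokens / runtime_s
print(f"{avg_tokens_per_s:,.0f} tokens/s")  # ~66.3K, close to the reported 0.0673M
```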

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "PatronusAI/llada_2.1_world_model_v3"

tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

For chat-style prompting, use the bundled tokenizer and chat template in this repository.

Limitations

  • This is a specialized checkpoint trained on world-modeling style conversational traces, not a general-purpose safety-tuned assistant.
  • The training data appears to include agent and tool-use transcripts, so generations may imitate tool-calling or system-prompt patterns.
  • No formal evaluation results or benchmark scores are included in this repository yet.
  • Because this model uses custom code, downstream environments must allow remote code loading or vendor the modeling files locally.

License

This model is released under the Apache 2.0 license, matching the repository license observed locally in dFactory.
