PatronusAI/llada_2.1_world_model_v3

PatronusAI/llada_2.1_world_model_v3 is a LLaDA2-style Mixture-of-Experts (MoE) language model checkpoint trained on conversational, world-modeling-style data using the dFactory training stack.

This repository contains the Hugging Face checkpoint exported from the final training step of the most recent local W&B run:

  • Run ID: fc0mdstv
  • Date: March 11, 2026
  • Training entrypoint: tasks/train_llada2_bd.py
  • Training config: configs/sft/llada2_mini_bd_sft.yaml
  • Git commit: 92b6890808088b112fcf5fc73a341b78b6ab76bf

Model details

  • Architecture: LLaDA2MoeModelLM
  • Model type: custom llada2_moe
  • Approximate size: 16B-class model
  • Layers: 20
  • Hidden size: 2048
  • Attention heads: 16
  • Key/value heads: 4
  • Experts: 256 total, 8 experts routed per token
  • Shared experts: 1
  • Vocabulary size: 157,184
  • Checkpoint dtype: bfloat16
  • Saved max position embeddings: 16,384

The repository includes custom modeling files:

  • configuration_llada2_moe.py
  • modeling_llada2_moe.py

trust_remote_code=True is required when loading through transformers.

Training data

The checkpoint was trained from the local dataset at:

  • /workspace/dFactory/world_modeling_datasets/world_modeling_train.jsonl

Based on the dataset sample and config, this is a conversation-format JSONL dataset using the messages field. The examples appear to be multi-turn agent traces with system, user, assistant, and tool-interaction style content.

The exact dataset semantics are inferred from the local files and the run config; this repository does not include a separate dataset card.
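For illustration, a record in this format would look roughly like the following. This is a hypothetical sketch: only the messages/role/content layout is taken from the description above, and the content strings are placeholders, not real training data.

```python
import json

# Hypothetical record sketching the messages-format JSONL layout described
# above; the content strings are placeholders, not real training data.
record = {
    "messages": [
        {"role": "system", "content": "You are an agent operating in a simulated environment."},
        {"role": "user", "content": "Open the door and describe the room."},
        {"role": "assistant", "content": "I try the door handle. The room contains ..."},
    ]
}

# Each line of the .jsonl file is one such record serialized as a single JSON object.
line = json.dumps(record)
parsed = json.loads(line)
```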

Training recipe

The model was trained with block-diffusion SFT settings from configs/sft/llada2_mini_bd_sft.yaml.

  • Objective: block-diffusion conversational fine-tuning
  • Sequence length used for training: 8192
  • Epochs: 1
  • Train steps: 4067
  • Global batch size: 64
  • Micro batch size: 2
  • Optimizer: AdamW
  • Learning rate: 1e-5 with cosine decay to 1e-7
  • Warmup ratio: 0.03
  • Weight decay: 0.1
  • Gradient clipping: 1.0
  • Block diffusion mode: enabled
  • Block size: 32
  • Noise range: 0.3 to 0.8
  • Mixed precision: enabled
  • Gradient checkpointing: enabled
  • Parallelism: FSDP2 over 8 GPUs
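The block-diffusion noising settings above (block size 32, noise range 0.3 to 0.8) can be sketched as follows. This is an illustrative reimplementation based only on the listed hyperparameters, not the actual dFactory training code: it samples a per-block mask rate in the configured range and masks that fraction of token positions within each block.

```python
import random

def make_block_noise_mask(seq_len, block_size=32, noise_min=0.3, noise_max=0.8, seed=0):
    """Sketch: for each block, draw a mask rate uniformly from
    [noise_min, noise_max] and mask that fraction of positions in the block."""
    rng = random.Random(seed)
    mask = [False] * seq_len
    for start in range(0, seq_len, block_size):
        positions = list(range(start, min(start + block_size, seq_len)))
        rate = rng.uniform(noise_min, noise_max)
        k = round(rate * len(positions))
        for pos in rng.sample(positions, k):
            mask[pos] = True
    return mask

mask = make_block_noise_mask(128)
```

During training, masked positions would be replaced with a mask token and the model trained to reconstruct them block by block.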

Training hardware

  • GPUs: 8x NVIDIA H200
  • CUDA: 12.8
  • Python: 3.11.10
  • transformers: 4.56.2

Final run metrics

Metrics below come from wandb/run-20260311_054957-fc0mdstv/files/wandb-summary.json.

  • Final training loss: 0.4174
  • Final grad norm: 1.2001
  • Consumed tokens: 2.132B
  • Throughput: 0.0673M tokens/second (~67.3K tokens/s)
  • Runtime: 32147.65s (~8.93h)
  • Max allocated GPU memory: 101.46 GB
  • Max reserved GPU memory: 125.39 GB
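As a quick sanity check, the consumed-token count and wall-clock runtime above are roughly consistent with the reported throughput; the small gap is expected if the reported rate averages over training steps only, excluding setup and checkpointing time.

```python
# Figures taken from wandb-summary.json as listed above.
consumed_tokens = 2.132e9
runtime_s = 32147.65

avg_tokens_per_s = consumed_tokens / runtime_s
print(f"{avg_tokens_per_s:,.0f} tokens/s")  # ~66.3K, close to the reported 0.0673M
```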

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "PatronusAI/llada_2.1_world_model_v3"

tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

For chat-style prompting, use the bundled tokenizer and chat template in this repository.

Limitations

  • This is a specialized checkpoint trained on world-modeling style conversational traces, not a general-purpose safety-tuned assistant.
  • The training data appears to include agent and tool-use transcripts, so generations may imitate tool-calling or system-prompt patterns.
  • No formal evaluation results or benchmark scores are included in this repository yet.
  • Because this model uses custom code, downstream environments must allow remote code loading or vendor the modeling files locally.

License

This model is released under the Apache 2.0 license, matching the repository license observed locally in dFactory.
