Pretrain Datasets Collection
Datasets we use for pretraining large language models • 12 items • Updated Oct 2, 2025