Article: 2. Attention Optimizations: From Standard Attention to FlashAttention
Model: meta-llama/Llama-3.2-11B-Vision (Image-Text-to-Text, 11B parameters, updated Sep 27, 2024)
Paper: Efficient Memory Management for Large Language Model Serving with PagedAttention (arXiv:2309.06180, published Sep 12, 2023)
Paper: SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion (arXiv:2503.11576, published Mar 14, 2025)
Article: Performant local mixture-of-experts CPU inference with GPU acceleration in llama.cpp
Article: LightOnOCR-2-1B: a lightweight high-performance end-to-end OCR model family
Article: Tokenization in Transformers v5: Simpler, Clearer, and More Modular (published Dec 18, 2025)
Article: Shrinking Giants: The Quantization Mathematics Making LLMs Accessible (published May 3, 2025)
Article: A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using transformers, accelerate and bitsandbytes (published Aug 17, 2022)
Space: The Smol Training Playbook 📚 (the secrets to building world-class LLMs)
Space: The Ultra-Scale Playbook 🌌 (the ultimate guide to training LLMs on large GPU clusters)