# Tiny Audio

A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with Tiny Audio, a minimal, hackable ASR framework.
## Quick Start

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
result = pipe("audio.wav")
print(result["text"])
```
## Usage Examples

### Basic Transcription
```python
from transformers import pipeline
import numpy as np

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# From a local file
result = pipe("audio.wav")
print(result["text"])

# From a URL
result = pipe("https://example.com/audio.mp3")

# From a numpy array (must be 16kHz)
audio = np.random.randn(16000).astype(np.float32)  # 1 second of noise
result = pipe(audio)
```
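When you pass a raw array, the pipeline cannot resample for you, so convert the audio to 16kHz first. A minimal sketch using librosa, which resamples on load (the filename is a placeholder):

```python
import librosa

# sr=16000 tells librosa to resample to the model's expected rate
audio, sr = librosa.load("interview_44khz.wav", sr=16000)
result = pipe(audio)
print(result["text"])
```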
### Batch Processing

```python
# Process multiple files in one call
files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(files, batch_size=4)
for r in results:
    print(r["text"])
```
### Word-Level Timestamps

```python
result = pipe("audio.wav", return_timestamps="word")
# Returns:
# {
#     "text": "hello world",
#     "chunks": [
#         {"text": "hello", "timestamp": (0.0, 0.5)},
#         {"text": "world", "timestamp": (0.6, 1.0)}
#     ]
# }
```
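The chunk format maps directly onto subtitle formats. Below is a hedged sketch, not part of the library, that converts word chunks into a one-word-per-cue SRT string, assuming each chunk carries the `(start, end)` tuple shown above:

```python
def chunks_to_srt(chunks):
    """Render word-level chunks as one-word-per-cue SRT."""
    def fmt(t):  # seconds -> HH:MM:SS,mmm
        ms = int(round(t * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    cues = []
    for i, c in enumerate(chunks, start=1):
        start, end = c["timestamp"]
        cues.append(f"{i}\n{fmt(start)} --> {fmt(end)}\n{c['text']}\n")
    return "\n".join(cues)

print(chunks_to_srt(result["chunks"]))
```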
### Streaming Inference

```python
import librosa
from tiny_audio import ASRModel, ASRProcessor

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load and process audio (librosa resamples to 16kHz on load)
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Stream tokens as they are generated
for token in model.generate_streaming(inputs["input_features"]):
    print(token, end="", flush=True)
```
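Streaming is useful for measuring responsiveness as well as for live display. A small sketch, assuming `generate_streaming` yields decoded text fragments as in the loop above, that records time to first token while accumulating the full transcript:

```python
import time

start = time.perf_counter()
pieces = []
for i, token in enumerate(model.generate_streaming(inputs["input_features"])):
    if i == 0:
        print(f"time to first token: {time.perf_counter() - start:.2f}s")
    pieces.append(token)

print("".join(pieces))
```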
### Using with torch directly

```python
import librosa
import torch
from tiny_audio import ASRModel, ASRProcessor

# Load model and processor
model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load audio at 16kHz
audio, sr = librosa.load("audio.wav", sr=16000)

# Process into model inputs
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate token IDs
with torch.no_grad():
    output = model.generate(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256,
    )

# Decode to text
text = processor.batch_decode(output, skip_special_tokens=True)[0]
print(text)
```
### GPU Inference

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    device="cuda",  # or device=0
)
```
### Half Precision

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device="cuda",
)
```
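Half precision roughly halves memory use and usually speeds up inference on recent GPUs, at a small risk of numerical differences. A quick, hedged way to measure the effect on your own hardware (the audio file is a placeholder):

```python
import time

def seconds_per_clip(pipe, path, runs=5):
    """Average wall-clock seconds per transcription, after one warmup call."""
    pipe(path)  # warmup: CUDA init and first-call overhead
    start = time.perf_counter()
    for _ in range(runs):
        pipe(path)
    return (time.perf_counter() - start) / runs

print(f"{seconds_per_clip(pipe, 'audio.wav'):.2f}s per clip")
```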
## Architecture

```
Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
```

Only the projector is trained (~12M parameters). The encoder and decoder remain frozen, leveraging their pretrained knowledge.
| Component | Model | Parameters | Status |
|---|---|---|---|
| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
| Projector | 2-layer MLP | ~12M | Trained |
| Language Model | Qwen3-0.6B | ~600M | Frozen |
### How It Works

1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio

The projector reduces sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1`
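For intuition, here is a minimal sketch of a frame-stacking projector with this shape. The 768-dim input matches the encoder output above; the 1024-dim output, hidden width, and activation are assumptions for illustration, not the released weights:

```python
import torch
import torch.nn as nn

class FrameStackingProjector(nn.Module):
    """Concatenate 5 consecutive encoder frames, then map them into the LM embedding space."""

    def __init__(self, enc_dim=768, lm_dim=1024, stack=5, hidden=2048):
        super().__init__()
        self.stack = stack
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim * stack, hidden),
            nn.GELU(),
            nn.Linear(hidden, lm_dim),
        )

    def forward(self, x):  # x: (batch, input_len, enc_dim)
        # Non-overlapping windows of 5 frames -> (batch, (input_len - 5) // 5 + 1, enc_dim, 5)
        windows = x.unfold(dimension=1, size=self.stack, step=self.stack)
        # Concatenate the 5 frames in each window -> (batch, output_len, enc_dim * 5)
        windows = windows.transpose(2, 3).flatten(start_dim=2)
        return self.mlp(windows)

proj = FrameStackingProjector()
frames = torch.randn(1, 100, 768)  # 100 encoder frames
print(proj(frames).shape)          # torch.Size([1, 20, 1024]): (100 - 5) // 5 + 1 = 20
```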
## Model Specifications
| Specification | Value |
|---|---|
| Input | Audio (16kHz mono) |
| Output | Text transcription |
| Max Audio Length | ~30 seconds (limited by encoder) |
| Vocabulary | Qwen3 tokenizer |
| Languages | English only |
| Generation | Greedy decoding (num_beams=1, do_sample=False) |
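Greedy decoding is the default, but the pipeline passes standard generation arguments through `generate_kwargs` if you want to experiment; whether alternatives such as beam search actually help this model is untested here:

```python
# Override the default greedy decoding via the pipeline's generate_kwargs pass-through
result = pipe(
    "audio.wav",
    generate_kwargs={"max_new_tokens": 256, "num_beams": 4},  # beam search instead of greedy
)
print(result["text"])
```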
## Training Details

| Setting | Value |
|---|---|
| Dataset | LoquaciousSet (25,000 hours) |
| Hardware | Single NVIDIA A40 |
| Time | ~24 hours |
| Cost | ~$12 |
| Optimizer | AdamW |
| Learning Rate | 1e-4 |
| Batch Size | 4 |
| Steps | 50,000 |
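Because only the projector is trained, the optimizer sees just ~12M parameters. A hedged sketch of that setup using the hyperparameters above; the `projector` attribute name, the loss interface, and `train_loader` are assumptions for illustration:

```python
import torch
from tiny_audio import ASRModel

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")

# Freeze everything, then unfreeze only the projector
# (the `projector` attribute name is an assumption about the model's structure)
for p in model.parameters():
    p.requires_grad = False
for p in model.projector.parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# train_loader: hypothetical DataLoader yielding batches of 4 (audio features + transcript tokens)
for step, batch in zip(range(50_000), train_loader):
    loss = model(**batch).loss  # assumed HF-style forward returning a causal LM loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```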
## Limitations

- **English only**: Not trained on other languages
- **Sample rate**: Expects 16kHz audio (other rates are resampled automatically)
- **Audio length**: Best for clips under 30 seconds
- **Accuracy**: May degrade on:
  - Heavily accented speech
  - Noisy or low-quality audio
  - Domain-specific terminology
  - Overlapping speakers
- **No punctuation**: Output is lowercase without punctuation by default
## Requirements

```
transformers>=4.40.0
torch>=2.0.0
torchaudio>=2.0.0
```

Optional for streaming:

```
librosa
soundfile
```
## Files

| File | Description |
|---|---|
| `config.json` | Model configuration |
| `model.safetensors` | Projector weights (~48MB) |
| `preprocessor_config.json` | Audio preprocessing config |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `special_tokens_map.json` | Special tokens |
Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective HuggingFace repos.
## Citation

If you use this model, please cite:

```bibtex
@misc{tinyaudio2024,
  author = {Alex Kroman},
  title = {Tiny Audio: Minimal ASR Training},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/alexkroman/tiny-audio}
}
```
## Links

- [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
- Free 3.5-hour Course - Learn ASR from scratch
- Live Demo - Try it in your browser
## Acknowledgments

- GLM-ASR for the audio encoder
- Qwen3 for the language model
- LoquaciousSet for the training data
## License

MIT