Tiny Audio

A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with Tiny Audio, a minimal, hackable ASR framework.

Quick Start

from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
result = pipe("audio.wav")
print(result["text"])

Usage Examples

Basic Transcription

from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# From file
result = pipe("audio.wav")
print(result["text"])

# From URL
result = pipe("https://example.com/audio.mp3")

# From numpy array (must be 16kHz)
import numpy as np
audio = np.random.randn(16000).astype(np.float32)  # 1 second
result = pipe(audio)
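
If your source audio is not 16kHz, resample it before passing the array. A minimal sketch using torchaudio (already in the requirements), continuing from the pipeline above; the filename is illustrative:

import torchaudio

# Load at the file's native rate, then resample to 16 kHz if needed
wav, sr = torchaudio.load("speech_44k.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)

audio = wav.mean(dim=0).numpy()  # downmix to mono, convert to numpy
result = pipe(audio)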

Batch Processing

# Process multiple files
files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(files, batch_size=4)
for r in results:
    print(r["text"])

Word-Level Timestamps

result = pipe("audio.wav", return_timestamps="word")
# Returns:
# {
#   "text": "hello world",
#   "chunks": [
#     {"text": "hello", "timestamp": (0.0, 0.5)},
#     {"text": "world", "timestamp": (0.6, 1.0)}
#   ]
# }

Streaming Inference

from tiny_audio import ASRModel, ASRProcessor
import librosa

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load and process audio at 16 kHz
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Print tokens as they are generated
for token in model.generate_streaming(inputs["input_features"]):
    print(token, end="", flush=True)

Using PyTorch Directly

from tiny_audio import ASRModel, ASRProcessor
import torch
import librosa

# Load model and processor
model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load audio (16kHz)
audio, sr = librosa.load("audio.wav", sr=16000)

# Process
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate
with torch.no_grad():
    output = model.generate(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256
    )

# Decode
text = processor.batch_decode(output, skip_special_tokens=True)[0]
print(text)

GPU Inference

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    device="cuda"  # or device=0
)

Half Precision

import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device="cuda"
)
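
The projector checkpoint is stored in bfloat16, so on GPUs that support it (Ampere or newer), torch.bfloat16 is a natural alternative:

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device="cuda"
)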

Architecture

Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text

Only the projector is trained (~12M params). The encoder and decoder remain frozen, leveraging their pretrained knowledge.

Component       Model              Parameters  Status
Audio Encoder   GLM-ASR-Nano-2512  ~600M       Frozen
Projector       2-layer MLP        ~12M        Trained
Language Model  Qwen3-0.6B         ~600M       Frozen
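
A minimal sketch of this freezing setup. The encoder/decoder attribute names are hypothetical, chosen to illustrate the pattern rather than the repo's actual module layout:

import torch
from tiny_audio import ASRModel

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")

def freeze(module: torch.nn.Module) -> None:
    # Disable gradients so the optimizer never updates these weights
    for p in module.parameters():
        p.requires_grad_(False)

freeze(model.encoder)  # GLM-ASR encoder (hypothetical attribute name)
freeze(model.decoder)  # Qwen3 LM (hypothetical attribute name)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")  # expect ~12M (the projector)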

How It Works

  1. Audio Encoder: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
  2. Projector: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
  3. Language Model: Qwen3 generates text autoregressively, conditioned on the projected audio

The projector reduces sequence length via frame stacking: output_len = (input_len - 5) // 5 + 1
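
A minimal sketch of such a projector, assuming 768-dim encoder frames, a stack factor of 5, and a 1024-dim LM embedding space (Qwen3-0.6B's hidden size); the hidden width and activation are illustrative, not the repo's exact choices:

import torch
import torch.nn as nn

class FrameStackProjector(nn.Module):
    def __init__(self, enc_dim=768, lm_dim=1024, stack=5, hidden=2048):
        super().__init__()
        self.stack = stack
        # 2-layer MLP over stacks of consecutive encoder frames
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim * stack, hidden),
            nn.GELU(),
            nn.Linear(hidden, lm_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, seq_len, enc_dim)
        b, t, d = frames.shape
        t_out = (t - self.stack) // self.stack + 1  # matches the formula above
        x = frames[:, : t_out * self.stack].reshape(b, t_out, d * self.stack)
        return self.mlp(x)  # (batch, t_out, lm_dim)

The projected embeddings land in the LM's embedding space, so Qwen3 can condition on them the same way it conditions on text tokens.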

Model Specifications

Specification     Value
Input             Audio (16kHz mono)
Output            Text transcription
Max Audio Length  ~30 seconds (limited by encoder)
Vocabulary        Qwen3 tokenizer
Languages         English only
Generation        Greedy decoding (num_beams=1, do_sample=False)
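
Greedy decoding is the default; the pipeline also accepts generate_kwargs if you want to set this explicitly or experiment with other decoding strategies:

result = pipe("audio.wav", generate_kwargs={"num_beams": 1, "do_sample": False})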

Training Details

Dataset        LoquaciousSet (25,000 hours)
Hardware       Single NVIDIA A40
Time           ~24 hours
Cost           ~$12
Optimizer      AdamW
Learning Rate  1e-4
Batch Size     4
Steps          50,000

Limitations

  • English only: Not trained on other languages
  • Sample rate: Expects 16kHz audio (other rates resampled automatically)
  • Audio length: Best for clips under 30 seconds (see the chunking sketch after this list)
  • Accuracy: May degrade on:
    • Heavily accented speech
    • Noisy or low-quality audio
    • Domain-specific terminology
    • Overlapping speakers
  • No punctuation: Output is lowercase without punctuation by default
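
For longer recordings, the Transformers ASR pipeline can split audio into chunks and stitch the transcripts. A sketch, assuming this model's custom pipeline supports the standard chunk_length_s argument:

result = pipe("long_audio.wav", chunk_length_s=30)
print(result["text"])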

Requirements

transformers>=4.40.0
torch>=2.0.0
torchaudio>=2.0.0

Optional for streaming:

librosa
soundfile

Files

File                      Description
config.json               Model configuration
model.safetensors         Projector weights (~48MB)
preprocessor_config.json  Audio preprocessing config
tokenizer.json            Tokenizer
tokenizer_config.json     Tokenizer config
special_tokens_map.json   Special tokens

Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective HuggingFace repos.

Citation

If you use this model, please cite:

@misc{tinyaudio2024,
  author = {Alex Kroman},
  title = {Tiny Audio: Minimal ASR Training},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/alexkroman/tiny-audio}
}

License

MIT
