# Sarvam-30B 8-Bit (BitsAndBytes)
This repository provides an 8-bit quantized version of the base model sarvamai/sarvam-30b using bitsandbytes.
8-bit quantization roughly halves GPU memory usage while keeping model quality close to the FP16 original.
- **Base model:** `sarvamai/sarvam-30b`
- **Architecture:** `SarvamMoEForCausalLM`
## Quantization Details
Quantization method: BitsAndBytes 8-bit
Configuration used:

- `load_in_8bit=True`
Approximate GPU memory usage:
| Model | GPU VRAM |
|---|---|
| FP16 original | ~60 GB |
| 8-bit | ~30 GB |
This version provides near-FP16 quality while using roughly half the memory.
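The figures in the table follow from simple parameter arithmetic: FP16 stores about 2 bytes per parameter and INT8 about 1 byte per parameter, with some extra for activations, the KV cache, and non-quantized layers. A back-of-the-envelope check:

```python
# Rough VRAM estimate from parameter count and bytes per parameter.
# 30e9 is taken from the model name; real usage runs somewhat higher
# due to activations, the KV cache, and non-quantized layers.
params = 30e9

fp16_gb = params * 2 / 1e9   # 2 bytes per parameter -> ~60 GB
int8_gb = params * 1 / 1e9   # 1 byte per parameter  -> ~30 GB

print(f"FP16: ~{fp16_gb:.0f} GB, 8-bit: ~{int8_gb:.0f} GB")
```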
## Installation
Install dependencies:

```bash
pip install transformers accelerate bitsandbytes torch safetensors
```
## Loading the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "neuralnets/sarvam-30b-8bit",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "neuralnets/sarvam-30b-8bit",
    trust_remote_code=True,
)
```
## Example Inference
```python
# Assumes `model` and `tokenizer` were loaded as shown above.
prompt = "Explain mixture of experts in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
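Generation can be tuned beyond `max_new_tokens` using the standard Transformers `GenerationConfig`. A hedged sketch — the sampling values below are illustrative, not tuned for this model:

```python
from transformers import GenerationConfig

# Illustrative sampling settings; adjust for your use case.
gen_config = GenerationConfig(
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
```

Pass it to generation with `model.generate(**inputs, generation_config=gen_config)`.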
## Hardware Requirements
Recommended GPUs:
- A100 40GB or 80GB
- RTX 4090
- RTX 3090
CPU RAM recommendation:
- 32 GB or more
## Notes
- Uses bitsandbytes 8-bit quantization integrated with Hugging Face Transformers.
- Requires `trust_remote_code=True` due to the custom Sarvam architecture.
- Suitable for high-quality inference.
## Base Model
Original model repository:
sarvamai/sarvam-30b
Refer to the base model page for detailed information about training and architecture.
## License
This repository distributes a quantized derivative of the upstream model.
Users must comply with the license of the original model:
sarvamai/sarvam-30b