# RHOMBUS — Official Hugging Face Organization
Clean geometry. Bold ideas. Practical AI. Building compact, efficient intelligence for everyone — from student laptops to planetary-scale clusters.

## 🔷 Who We Are
Rhombus is an independent AI research & engineering studio focused on small, efficient, and reasoning-strong models. We prototype new architectures, build high-quality datasets, and ship production tools that work offline, on low compute, and in real-world constraints.
Our guiding principles:
- Geometry over noise: clear structure, measurable outcomes, minimal bloat.
- Small-first: design models that outperform their size class.
- Reasoning-centric: prioritise logic, reliability, and controllability.
- Accessible: reproducible, transparent, documented for students & startups.
## 🧭 Mission
- Re-think model architecture beyond classic Transformers for efficiency and robustness.
- Compress intelligence — make 50M–2B parameter models reason like much larger ones.
- Democratise training with tooling that runs on consumer GPUs and CPU-only environments.
- Ship pragmatic AI — tools that solve real problems in coding, data, education, and research.
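As a rough illustration of the small-first constraint (a back-of-the-envelope sketch, not an official sizing tool), the memory needed just to hold model weights scales with parameter count times bytes per parameter:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory (GB) to hold model weights alone,
    ignoring activations, KV cache, and optimizer state."""
    return n_params * bytes_per_param / 1e9

# A hypothetical 2.2B-parameter model (the Kishor v3 target size):
fp16 = weight_memory_gb(2.2e9, 2)  # ~4.4 GB -> fits an 8 GB consumer GPU
int8 = weight_memory_gb(2.2e9, 1)  # ~2.2 GB -> within CPU-only RAM budgets
print(f"fp16: {fp16:.1f} GB, int8: {int8:.1f} GB")
```

This is why quantized checkpoints matter for the "consumer GPUs and CPU-only environments" goal: halving bytes per parameter halves the weight footprint.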
## 📦 Key Projects (Active/Planned)
### 🧪 Architectures & Models
- Brahma — a post-transformer research line targeting minimal compute with strong reasoning and robustness. Goal: beat GPT-2 class baselines with a fraction of compute.
- Water v0.x — early Brahma-based series (5M → 100M) proving the concept with rigorous evals.
- Karta 135M — fine-tuned SmolLM-based series for compact instruction following.
- Kishor — a multilingual reasoning LLM family (Hindi/Hinglish/English) with strong coding skills. v3 target: ~2.2B params, balanced for edge + server.
### 🖼️ Generative & Multimodal
- Klaa — text‑to‑image model (Mk 2.5+) with robust prompt alignment and style control.
- Rhombus TTS (R&D) — lightweight text‑to‑speech optimized for clarity on consumer GPUs.
### 🧰 Tooling
- Rhombus CorpusForge (aka DataCrafter) — offline dataset factory: dedup, filtering, chunking, quality lift, export for training.
- Project Fruit — dataset classification + high‑quality custom instruction sets for fine‑tuning compact models.
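The dataset-factory steps named above (dedup, filtering, chunking) can be sketched in a few lines. This is an illustrative outline only, not the actual CorpusForge implementation, and the thresholds are invented:

```python
import hashlib

def dedup(docs):
    """Exact-duplicate removal via content hashing."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def quality_filter(docs, min_chars=20):
    """Drop fragments too short to be useful training text (threshold is illustrative)."""
    return [d for d in docs if len(d) >= min_chars]

def chunk(text, size=512):
    """Split text into fixed-size character chunks for export."""
    return [text[i:i + size] for i in range(0, len(text), size)]

docs = ["A clean training document with real content.",
        "A clean training document with real content.",  # exact duplicate
        "too short"]
clean = quality_filter(dedup(docs))
print(len(clean))  # 1
```

A production pipeline would add near-duplicate detection (e.g., MinHash) and token-aware chunking, but the offline, composable shape is the same.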
## 🗺️ Roadmap Snapshot
- 2025: Brahma research notes, Water v0.2 (100M), CorpusForge alpha, Karta 135M releases.
- 2026: Kishor v3 training; robust multilingual/coding evals; Klaa Mk 3.
- 2027: Brahma v1 reference; inference SDK; offline QA/coding assistant.
- 2028–2030: Scaled Brahma family; unified multimodal small models; education-first deployments.
Detailed per-quarter milestones live in the organization Projects board.
## 🧩 Organization Layout
We keep repos single-purpose, well‑documented, and tagged.
```
Rhombus/
├─ brahma/         # core research, papers, reference impls
├─ water/          # Water v0.x experimental models (Brahma-based)
├─ kishor/         # multilingual reasoning LLMs
├─ karta-135m/     # smol fine-tunes (instruction)
├─ klaa/           # text-to-image models & training
├─ corpusforge/    # dataset factory & CLI
├─ project-fruit/  # data classification + curation pipelines
├─ eval/           # evaluation harness & leaderboards
├─ datasets/       # dataset cards, loaders, governance
└─ docs/           # org-wide specs, style guides, templates
```
### Tagging & Naming
- Repos: `area-name` (e.g., `architecture-brahma`, `tooling-corpusforge`).
- Branches: `main` (stable), `dev` (active), `exp/<topic>` (short‑lived).
- Releases: semantic tags `vMAJOR.MINOR.PATCH`, with optional `+` training/build metadata.
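A release tag of the form `vMAJOR.MINOR.PATCH+metadata` can be validated mechanically in CI. A minimal sketch; the regex, helper name, and example metadata are ours, not an official Rhombus tool:

```python
import re

# Matches v<major>.<minor>.<patch> with optional +build-metadata suffix.
TAG_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:\+([0-9A-Za-z.-]+))?$")

def parse_release_tag(tag: str):
    """Return (major, minor, patch, metadata) or None if the tag is malformed."""
    m = TAG_RE.match(tag)
    if not m:
        return None
    major, minor, patch, meta = m.groups()
    return int(major), int(minor), int(patch), meta

print(parse_release_tag("v0.2.0+tokens.100M"))  # (0, 2, 0, 'tokens.100M')
print(parse_release_tag("0.2.0"))               # None (missing the leading 'v')
```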
## 📊 Evaluation & Benchmarks
We care about reasoning over raw next-token loss. Our standard evals:
- Language: MMLU, ARC, HellaSwag, TruthfulQA, WinoGrande, BIG‑bench (selected), XNLI subset (Hindi/English)
- Coding: HumanEval+, MBPP, Codeforces-style synthetic stacks
- Safety: jailbreak suites, refusal correctness, harmful content filters
- Image (Klaa): FID-like proxies, CLIP‑score, prompt adherence, style robustness
We publish exact prompts, seeds, decoders, and compute for reproducibility.
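That reproducibility claim can be made concrete: an eval run should emit a record carrying its own prompts, seed, and decoder settings, not just a score. A minimal sketch with a toy stand-in model; none of these names come from the actual harness:

```python
import json
import random

def run_eval(model_fn, examples, seed=1234, decoder=None):
    """Evaluate model_fn on (prompt, gold) pairs and return a record
    bundling accuracy with everything needed to reproduce the run."""
    decoder = decoder or {"temperature": 0.0, "max_tokens": 16}
    random.seed(seed)  # pin any sampling the model might do
    correct = sum(model_fn(p) == gold for p, gold in examples)
    return {
        "accuracy": correct / len(examples),
        "seed": seed,
        "decoder": decoder,
        "prompts": [p for p, _ in examples],
    }

# Toy stand-in for a real checkpoint:
examples = [("2 + 2 =", "4"), ("Capital of France?", "Paris")]
record = run_eval(lambda p: "4" if "2 + 2" in p else "Paris", examples)
print(json.dumps(record, indent=2))
```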
## 🔐 Safety, Security & Governance
- Alignment: instruction tuning with preference data; safety rail prompts; content filters on output.
- Security: supply-chain checksums, signed releases, deterministic builds when possible.
- Privacy: strict dataset licensing review; PII scrubbing; opt‑out channels.
- Ethics: transparent data sources; clear intended use; red‑line misuse policy.
## 📄 Licenses
- Code: Apache-2.0 (preferred) or MIT when noted.
- Models: Apache-2.0 / OpenRAIL / custom Responsible AI license depending on risk profile.
- Datasets: Original content under CC‑BY‑4.0 or CC‑BY‑SA‑4.0; third‑party per-source.
Each repo contains a clear `LICENSE` and `NOTICE` with third‑party attributions.
## 🧪 Reproducibility Policy
For every release we strive to provide:
- Training recipe: data mix, token count, curriculum, batch schedulers.
- Compute: GPU/TPU type, hours, energy notes.
- Exact checkpoints: with SHA256, quantized variants, and safetensors.
- Configs: tokenizer, architecture params, decoder settings.
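Checksum verification of checkpoints is simple to do well. A sketch of the idea, assuming nothing beyond the Python standard library (the demo file and its contents are placeholders):

```python
import hashlib
import tempfile

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks so large
    checkpoints never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

# Demo on a throwaway file; a real release would publish the digest
# next to the checkpoint so users can verify their download.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"checkpoint bytes")
    path = f.name
digest = sha256_of(path)
print(digest)
```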
## 🧱 Contribution Guide (Quick Start)
**1) Discuss**
Open an issue in the relevant repo with a clear proposal. Use the proposal template.
**2) Develop**
- Fork the repo and create an `exp/<topic>` branch.
- Follow code style (ruff/black for Python; mypy optional).
- Add/update docs and unit tests.
**3) Submit**
Open a PR to `dev` with:
- motivation, design notes
- benchmarks (even small‑scale)
- safety considerations
**4) Review & Merge**
- 2 approvals minimum for core repos
- CI must pass (lint, tests, basic eval sanity)
See `CONTRIBUTING.md` in each repo for details.
## 🧾 Templates
Below are copy‑ready card templates you can use across Rhombus repositories.
### 📘 Model Card (template)
```yaml
---
license: apache-2.0
language:
  - en
  - hi
library_name: transformers
tags:
  - reasoning
  - small-language-model
  - multilingual
  - rhombus
  - brahma
model-index:
  - name: <MODEL_NAME>
    results:
      - task:
          type: text-generation
        dataset:
          name: <DATASET or MIX>
          type: <hf-dataset-id>
        metrics:
          - name: MMLU
            type: mmlu
            value: <score>
          - name: ARC
            type: arc
            value: <score>
---
```
# <MODEL_NAME>
## Summary
One‑paragraph description, positioning, and key capabilities.
## Intended Use
- **Primary:** education, coding assistant, offline QA
- **Out of scope / disallowed:** …
## Training
- **Tokens:** <N>
- **Data mix:** <list>
- **Compute:** <GPU/TPU, hours>
## Evaluation
Report exact prompts, seeds, decoders, and links to scripts.
## Safety
Known limitations, bias notes, and refusal behavior.
## License
Apache-2.0 (see `LICENSE`).
### 📗 Dataset Card (template)
```yaml
---
license: cc-by-4.0
tags:
  - dataset
  - rhombus
  - instruction
language:
  - en
  - hi
pretty_name: <DATASET_NAME>
---
```
# <DATASET_NAME>
## Summary
High‑level description and purpose.
## Source & Collection
List all sources, filters, dedup steps, and justification.
## Processing
Tokenization, chunking, cleaning, quality enhancement (if any).
## Structure
- **Splits:** train/val/test sizes
- **Fields:** schema and examples
## Licensing
Origin licenses with links; redistribution terms.
## Ethical Considerations
PII policy, redactions, opt‑out mechanism.
### 📙 Space Card (template)
# <SPACE_NAME>
Interactive demo for `<MODEL_NAME>`.
## Features
- Prompt presets
- Safety toggle
- Quantized checkpoints for CPU
## Run Locally
```bash
pip install -r requirements.txt
python app.py
```

---
## 🛠️ Tooling & Dev Environment
- **Languages**: Python, Rust (perf‑critical), TypeScript (tools/UI)
- **Core libs**: PyTorch/Transformers, JAX (experiments), ONNX, ggml/gguf
- **Eval**: Eleuther evals, custom harness in `eval/`
- **CI**: pre‑commit, pytest, minimal eval sanity per PR
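The "minimal eval sanity per PR" gate can be as small as a pytest-style assertion that a tiny fixed prompt set still clears a floor. This is an illustrative sketch, not the harness in `eval/`:

```python
def sanity_accuracy(model_fn, examples):
    """Fraction of smoke-test (prompt, gold) pairs answered correctly."""
    return sum(model_fn(p) == gold for p, gold in examples) / len(examples)

def test_eval_sanity():
    """CI gate: a trivial echo model must clear the floor on trivial prompts.
    In a real PR check, model_fn would wrap the candidate checkpoint."""
    examples = [("ping", "ping"), ("pong", "pong")]
    assert sanity_accuracy(lambda p: p, examples) >= 0.9

test_eval_sanity()
print("eval sanity passed")
```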
---
## 📬 Contact & Community
- **Issues**: GitHub/HF Issues (preferred)
- **Security**: [email protected] (PGP available)
- **General**: [email protected]
- **Updates**: Follow our HF org and star repos to get release notifications.
> Want to collaborate on **education‑grade** AI that runs on modest hardware? We’d love to hear from you.
---
## 🏷️ Badges (optional examples)
Add these to repo READMEs as needed:
[](#)
[](#)
[](#)
---
## 🧭 Quick Links
- **Releases**: see repo tags
- **Changelogs**: `CHANGELOG.md` in each project
- **Docs**: `/docs` (org‑wide), per‑repo READMEs
---
<p align="center">Made with ◇ by the Rhombus team — *geometry over noise.*</p>