# RHOMBUS — Official Hugging Face Organization

Clean geometry. Bold ideas. Practical AI. Building compact, efficient intelligence for everyone — from student laptops to planetary-scale clusters.



## 🔷 Who We Are

Rhombus is an independent AI research & engineering studio focused on small, efficient, and reasoning-strong models. We prototype new architectures, build high-quality datasets, and ship production tools that work offline, on low compute, and under real-world constraints.

Our guiding principles:

- **Geometry over noise:** clear structure, measurable outcomes, minimal bloat.
- **Small-first:** design models that outperform their size class.
- **Reasoning-centric:** prioritise logic, reliability, and controllability.
- **Accessible:** reproducible, transparent, and documented for students & startups.

## 🧭 Mission

1. Re-think model architecture beyond classic Transformers for efficiency and robustness.
2. Compress intelligence — make 50M–2B parameter models reason like much larger ones.
3. Democratise training with tooling that runs on consumer GPUs and CPU-only environments.
4. Ship pragmatic AI — tools that solve real problems in coding, data, education, and research.

## 📦 Key Projects (Active/Planned)

### 🧪 Architectures & Models

- **Brahma** — a post-transformer research line targeting minimal compute with strong reasoning and robustness. Goal: beat GPT-2-class baselines with a fraction of the compute.
- **Water v0.x** — early Brahma-based series (5M → 100M) proving the concept with rigorous evals.
- **Karta 135M** — fine-tuned SmolLM-based series for compact instruction following.
- **Kishor** — a multilingual reasoning LLM family (Hindi/Hinglish/English) with strong coding skills. v3 target: ~2.2B params, balanced for edge + server.

### 🖼️ Generative & Multimodal

- **Klaa** — text‑to‑image model (Mk 2.5+) with robust prompt alignment and style control.
- **Rhombus TTS** (R&D) — lightweight text‑to‑speech optimized for clarity on consumer GPUs.

### 🧰 Tooling

- **Rhombus CorpusForge** (aka DataCrafter) — offline dataset factory: dedup, filtering, chunking, quality lift, and export for training.
- **Project Fruit** — dataset classification + high‑quality custom instruction sets for fine‑tuning compact models.
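The dataset-factory stages named above (dedup, filtering, chunking) can be sketched in a few lines. This is an illustrative toy pipeline, not the actual CorpusForge API; every function name here is hypothetical:

```python
import hashlib

def dedup(docs):
    """Drop exact duplicates by content hash (a real factory also does fuzzy dedup)."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def quality_filter(docs, min_chars=20):
    """Toy quality gate: drop near-empty documents."""
    return [d for d in docs if len(d.strip()) >= min_chars]

def chunk(text, size=200, overlap=20):
    """Fixed-size character chunks with overlap, ready for tokenization."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

docs = ["hello world, this is a sample document." * 2,
        "hello world, this is a sample document." * 2,  # exact duplicate
        "too short"]
clean = quality_filter(dedup(docs))
chunks = [c for d in clean for c in chunk(d, size=40, overlap=10)]
print(len(clean), len(chunks))
```

A production pipeline would add language ID, quality scoring, and near-duplicate detection, but the shape — filter, then chunk, then export — stays the same.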

## 🗺️ Roadmap Snapshot

- **2025:** Brahma research notes, Water v0.2 (100M), CorpusForge alpha, Karta 135M releases.
- **2026:** Kishor v3 training; robust multilingual/coding evals; Klaa Mk 3.
- **2027:** Brahma v1 reference; inference SDK; offline QA/coding assistant.
- **2028–2030:** Scaled Brahma family; unified multimodal small models; education-first deployments.

Detailed per-quarter milestones live on the organization Projects board.


## 🧩 Organization Layout

We keep repos single-purpose, well‑documented, and tagged.

```
Rhombus/
├─ brahma/                   # core research, papers, reference impls
├─ water/                    # Water v0.x experimental models (Brahma-based)
├─ kishor/                   # multilingual reasoning LLMs
├─ karta-135m/               # smol fine-tunes (instruction)
├─ klaa/                     # text-to-image models & training
├─ corpusforge/              # dataset factory & CLI
├─ project-fruit/            # data classification + curation pipelines
├─ eval/                     # evaluation harness & leaderboards
├─ datasets/                 # dataset cards, loaders, governance
└─ docs/                     # org-wide specs, style guides, templates
```

### Tagging & Naming

- **Repos:** `area-name` (e.g., `architecture-brahma`, `tooling-corpusforge`).
- **Branches:** `main` (stable), `dev` (active), `exp/<topic>` (short‑lived).
- **Releases:** semantic tags `vMAJOR.MINOR.PATCH` + training/build metadata.
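The release-tag convention can be enforced with a small CI check. The sketch below assumes build metadata is appended as `+metadata`, loosely following SemVer's build-metadata grammar (that grammar is our assumption, not something the convention above spells out):

```python
import re

# vMAJOR.MINOR.PATCH with optional +metadata (e.g. training/build info).
TAG_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(\+[0-9A-Za-z.-]+)?$")

def parse_release_tag(tag):
    """Return (major, minor, patch, metadata) or raise on a malformed tag."""
    m = TAG_RE.match(tag)
    if not m:
        raise ValueError(f"not a valid release tag: {tag!r}")
    major, minor, patch = (int(m.group(i)) for i in (1, 2, 3))
    return major, minor, patch, m.group(4)

print(parse_release_tag("v1.4.0+tokens-300B"))
```

Branch names like `exp/<topic>` deliberately fail this check, so experiment branches can never be mistaken for releases.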

## 📊 Evaluation & Benchmarks

We care about reasoning over raw next-token loss. Our standard evals:

- **Language:** MMLU, ARC, HellaSwag, TruthfulQA, WinoGrande, BIG‑bench (selected), XNLI subset (Hindi/English)
- **Coding:** HumanEval+, MBPP, Codeforces-style synthetic stacks
- **Safety:** jailbreak suites, refusal correctness, harmful-content filters
- **Image (Klaa):** FID-like proxies, CLIP‑score, prompt adherence, style robustness

We publish exact prompts, seeds, decoder settings, and compute details for reproducibility.
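One way to make that pledge concrete is to pin every eval run to a published config and fingerprint it. This is an illustrative sketch, not the API of our harness; `eval_config` and `config_fingerprint` are hypothetical names:

```python
import hashlib
import json
import random

def eval_config(task, seed=1234, temperature=0.0, top_p=1.0, max_new_tokens=256):
    """Bundle everything needed to rerun an eval: task, seed, decoder settings."""
    return {"task": task, "seed": seed,
            "decoder": {"temperature": temperature, "top_p": top_p,
                        "max_new_tokens": max_new_tokens}}

def config_fingerprint(cfg):
    """Stable short hash of the config; publish it alongside the scores."""
    blob = json.dumps(cfg, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]

cfg = eval_config("mmlu")
random.seed(cfg["seed"])  # seed every RNG the harness touches
print(cfg["task"], config_fingerprint(cfg))
```

Anyone re-running the eval can then compare fingerprints before comparing scores: if the fingerprints differ, the runs are not comparable.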


## 🔐 Safety, Security & Governance

- **Alignment:** instruction tuning with preference data; safety-rail prompts; content filters on output.
- **Security:** supply-chain checksums, signed releases, deterministic builds when possible.
- **Privacy:** strict dataset-licensing review; PII scrubbing; opt‑out channels.
- **Ethics:** transparent data sources; clear intended use; red‑line misuse policy.
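PII scrubbing at its simplest is pattern-based redaction. The sketch below is a toy with two illustrative patterns only; production scrubbing needs far broader coverage (names, addresses, IDs) and review:

```python
import re

# Illustrative patterns only, not a complete PII inventory.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}

def scrub(text):
    """Replace detected PII spans with typed placeholders."""
    for label, pat in PII_PATTERNS.items():
        text = pat.sub(f"[{label}]", text)
    return text

print(scrub("Reach me at jane.doe@example.com or +91 98765 43210."))
```

Typed placeholders (`[EMAIL]`, `[PHONE]`) rather than plain deletion keep the scrubbed text usable for training while making redactions auditable.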

## 📄 Licenses

- **Code:** Apache-2.0 (preferred) or MIT where noted.
- **Models:** Apache-2.0 / OpenRAIL / custom Responsible AI license, depending on risk profile.
- **Datasets:** original content under CC‑BY‑4.0 or CC‑BY‑SA‑4.0; third‑party content per source.

Each repo contains a clear `LICENSE` and `NOTICE` with third‑party attributions.


## 🧪 Reproducibility Policy

For every release we strive to provide:

- **Training recipe:** data mix, token count, curriculum, batch schedules.
- **Compute:** GPU/TPU type, hours, energy notes.
- **Exact checkpoints:** with SHA256 digests, quantized variants, and safetensors.
- **Configs:** tokenizer, architecture params, decoder settings.
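Checkpoint digests can be produced and verified with a short streaming hash. A minimal sketch, where `model.safetensors` is a stand-in file written for the demo, not a real checkpoint:

```python
import hashlib
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Stream the file so multi-GB checkpoints don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def verify(path, expected):
    """Compare against the digest published in the release notes."""
    return sha256_file(path) == expected

ckpt = Path("model.safetensors")   # stand-in file for the demo
ckpt.write_bytes(b"\x00" * 1024)   # fake checkpoint bytes
digest = sha256_file(ckpt)
print(verify(ckpt, digest))        # True
```

Publishing the digest next to the download link lets users detect corrupted or tampered checkpoints before loading them.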

## 🧱 Contribution Guide (Quick Start)

### 1) Discuss

Open an issue in the relevant repo with a clear proposal. Use the proposal template.

### 2) Develop

- Fork the repo and create an `exp/<topic>` branch.
- Follow the code style (ruff/black for Python; mypy optional).
- Add or update docs and unit tests.

### 3) Submit

Open a PR to `dev` with:

- motivation and design notes
- benchmarks (even small‑scale)
- safety considerations

### 4) Review & Merge

- 2 approvals minimum for core repos.
- CI must pass (lint, tests, basic eval sanity).

See `CONTRIBUTING.md` in each repo for details.


## 🧾 Templates

Below are copy‑ready card templates you can use across Rhombus repositories.

### 📘 Model Card (template)

```markdown
---
license: apache-2.0
language:
  - en
  - hi
library_name: transformers
tags:
  - reasoning
  - small-language-model
  - multilingual
  - rhombus
  - brahma
model-index:
  - name: <MODEL_NAME>
    results:
      - task: {type: text-generation}
        dataset: {name: <DATASET or MIX>, type: <hf-dataset-id>}
        metrics:
          - name: MMLU
            type: mmlu
            value: <score>
          - name: ARC
            type: arc
            value: <score>
---

# <MODEL_NAME>

## Summary
One‑paragraph description, positioning, and key capabilities.

## Intended Use
- **Primary:** education, coding assistant, offline QA
- **Out of scope / disallowed:**

## Training
- **Tokens:** <N>
- **Data mix:** <list>
- **Compute:** <GPU/TPU, hours>

## Evaluation
Report exact prompts, seeds, decoders, and links to scripts.

## Safety
Known limitations, bias notes, and refusal behavior.

## License
Apache-2.0 (see `LICENSE`).
```

### 📗 Dataset Card (template)

```markdown
---
license: cc-by-4.0
tags:
  - dataset
  - rhombus
  - instruction
language:
  - en
  - hi
pretty_name: <DATASET_NAME>
---

# <DATASET_NAME>

## Summary
High‑level description and purpose.

## Source & Collection
List all sources, filters, dedup steps, and justification.

## Processing
Tokenization, chunking, cleaning, quality enhancement (if any).

## Structure
- **Splits:** train/val/test sizes
- **Fields:** schema and examples

## Licensing
Origin licenses with links; redistribution terms.

## Ethical Considerations
PII policy, redactions, opt‑out mechanism.
```

### 📙 Space Card (template)

````markdown
# <SPACE_NAME>

Interactive demo for `<MODEL_NAME>`.

## Features
- Prompt presets
- Safety toggle
- Quantized checkpoints for CPU

## Run Locally
```bash
pip install -r requirements.txt
python app.py
```
````
---

## 🛠️ Tooling & Dev Environment
- **Languages**: Python, Rust (perf‑critical), TypeScript (tools/UI)
- **Core libs**: PyTorch/Transformers, JAX (experiments), ONNX, ggml/gguf
- **Eval**: Eleuther evals, custom harness in `eval/`
- **CI**: pre‑commit, pytest, minimal eval sanity per PR

---

## 📬 Contact & Community
- **Issues**: GitHub/HF Issues (preferred)
- **Security**: [email protected] (PGP available)  
- **General**: [email protected]  
- **Updates**: Follow our HF org and star repos to get release notifications.

> Want to collaborate on **education‑grade** AI that runs on modest hardware? We’d love to hear from you.

---

## 🏷️ Badges (optional examples)
Add these to repo READMEs as needed:

[![HF Spaces](https://img.shields.io/badge/🤗-Spaces-blue.svg)](#)
[![Models](https://img.shields.io/badge/Models-Release-brightgreen.svg)](#)
[![Datasets](https://img.shields.io/badge/Datasets-Live-orange.svg)](#)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-black.svg)](#)
[![Twitter Follow](https://img.shields.io/twitter/follow/rhombus_ai?style=social)](#)

---

## 🧭 Quick Links
- **Releases**: see repo tags
- **Changelogs**: `CHANGELOG.md` in each project
- **Docs**: `/docs` (org‑wide), per‑repo READMEs

---

<p align="center">Made with ◇ by the Rhombus team — *geometry over noise.*</p>

---
title: README
emoji: 🐠
colorFrom: red
colorTo: green
sdk: static
pinned: true
license: apache-2.0
---