# RHOMBUS — Official Hugging Face Organization
Clean geometry. Bold ideas. Practical AI. Building compact, efficient intelligence for everyone — from student laptops to planetary-scale clusters.

## 🔷 Who We Are
Rhombus is an independent AI research & engineering studio focused on small, efficient, and reasoning-strong models. We prototype new architectures, build high-quality datasets, and ship production tools that work offline, on low compute, and in real-world constraints.
Our guiding principles:
- Geometry over noise: clear structure, measurable outcomes, minimal bloat.
- Small-first: design models that outperform their size class.
- Reasoning-centric: prioritise logic, reliability, and controllability.
- Accessible: reproducible, transparent, documented for students & startups.
## 🧭 Mission
- Re-think model architecture beyond classic Transformers for efficiency and robustness.
- Compress intelligence — make 50M–2B parameter models reason like much larger ones.
- Democratise training with tooling that runs on consumer GPUs and CPU-only environments.
- Ship pragmatic AI — tools that solve real problems in coding, data, education, and research.
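As a rough illustration of the small-first constraint (a back-of-the-envelope sketch, not an official sizing tool), the memory needed just to hold model weights scales with parameter count times bytes per parameter:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory (GB) to hold model weights alone,
    ignoring activations, KV cache, and optimizer state."""
    return n_params * bytes_per_param / 1e9

# A hypothetical 2.2B-parameter model (the Kishor v3 target size):
fp16 = weight_memory_gb(2.2e9, 2)  # ~4.4 GB -> fits an 8 GB consumer GPU
int8 = weight_memory_gb(2.2e9, 1)  # ~2.2 GB -> within CPU-only RAM budgets
print(f"fp16: {fp16:.1f} GB, int8: {int8:.1f} GB")
```

This is why quantized checkpoints matter for the "consumer GPUs and CPU-only environments" goal: halving bytes per parameter halves the weight footprint.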
## 📦 Key Projects (Active/Planned)
### 🧪 Architectures & Models
- Brahma — a post-transformer research line targeting minimal compute with strong reasoning and robustness. Goal: beat GPT-2 class baselines with a fraction of compute.
- Water v0.x — early Brahma-based series (5M → 100M) proving the concept with rigorous evals.
- Karta 135M — fine-tuned SmolLM-based series for compact instruction following.
- Kishor — a multilingual reasoning LLM family (Hindi/Hinglish/English) with strong coding skills. v3 target: ~2.2B params, balanced for edge + server.
### 🖼️ Generative & Multimodal
- Klaa — text‑to‑image model (Mk 2.5+) with robust prompt alignment and style control.
- Rhombus TTS (R&D) — lightweight text‑to‑speech optimized for clarity on consumer GPUs.
### 🧰 Tooling
- Rhombus CorpusForge (aka DataCrafter) — offline dataset factory: dedup, filtering, chunking, quality lift, export for training.
- Project Fruit — dataset classification + high‑quality custom instruction sets for fine‑tuning compact models.
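The dataset-factory steps named above (dedup, filtering, chunking) can be sketched in a few lines. This is an illustrative outline only, not the actual CorpusForge implementation, and the thresholds are invented:

```python
import hashlib

def dedup(docs):
    """Exact-duplicate removal via content hashing."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def quality_filter(docs, min_chars=20):
    """Drop fragments too short to be useful training text (threshold is illustrative)."""
    return [d for d in docs if len(d) >= min_chars]

def chunk(text, size=512):
    """Split text into fixed-size character chunks for export."""
    return [text[i:i + size] for i in range(0, len(text), size)]

docs = ["A clean training document with real content.",
        "A clean training document with real content.",  # exact duplicate
        "too short"]
clean = quality_filter(dedup(docs))
print(len(clean))  # 1
```

A production pipeline would add near-duplicate detection (e.g., MinHash) and token-aware chunking, but the offline, composable shape is the same.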
## 🗺️ Roadmap Snapshot
- 2025: Brahma research notes, Water v0.2 (100M), CorpusForge alpha, Karta 135M releases.
- 2026: Kishor v3 training; robust multilingual/coding evals; Klaa Mk 3.
- 2027: Brahma v1 reference; inference SDK; offline QA/coding assistant.
- 2028–2030: Scaled Brahma family; unified multimodal small models; education-first deployments.
Detailed per-quarter milestones live in the organization Projects board.
## 🧩 Organization Layout
We keep repos single-purpose, well‑documented, and tagged.
```
Rhombus/
├─ brahma/         # core research, papers, reference impls
├─ water/          # Water v0.x experimental models (Brahma-based)
├─ kishor/         # multilingual reasoning LLMs
├─ karta-135m/     # smol fine-tunes (instruction)
├─ klaa/           # text-to-image models & training
├─ corpusforge/    # dataset factory & CLI
├─ project-fruit/  # data classification + curation pipelines
├─ eval/           # evaluation harness & leaderboards
├─ datasets/       # dataset cards, loaders, governance
└─ docs/           # org-wide specs, style guides, templates
```
### Tagging & Naming
- Repos: `area-name` (e.g., `architecture-brahma`, `tooling-corpusforge`).
- Branches: `main` (stable), `dev` (active), `exp/<topic>` (short‑lived).
- Releases: semantic tags `vMAJOR.MINOR.PATCH`, with optional `+` training/build metadata.
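A release tag of the form `vMAJOR.MINOR.PATCH+metadata` can be validated mechanically in CI. A minimal sketch; the regex, helper name, and example metadata are ours, not an official Rhombus tool:

```python
import re

# Matches v<major>.<minor>.<patch> with optional +build-metadata suffix.
TAG_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:\+([0-9A-Za-z.-]+))?$")

def parse_release_tag(tag: str):
    """Return (major, minor, patch, metadata) or None if the tag is malformed."""
    m = TAG_RE.match(tag)
    if not m:
        return None
    major, minor, patch, meta = m.groups()
    return int(major), int(minor), int(patch), meta

print(parse_release_tag("v0.2.0+tokens.100M"))  # (0, 2, 0, 'tokens.100M')
print(parse_release_tag("0.2.0"))               # None (missing the leading 'v')
```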
## 📊 Evaluation & Benchmarks
We care about reasoning over raw next-token loss. Our standard evals:
- Language: MMLU, ARC, HellaSwag, TruthfulQA, WinoGrande, BIG‑bench (selected), XNLI subset (Hindi/English)
- Coding: HumanEval+, MBPP, Codeforces-style synthetic stacks
- Safety: jailbreak suites, refusal correctness, harmful content filters
- Image (Klaa): FID-like proxies, CLIP‑score, prompt adherence, style robustness
We publish exact prompts, seeds, decoders, and compute for reproducibility.
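That reproducibility claim can be made concrete: an eval run should emit a record carrying its own prompts, seed, and decoder settings, not just a score. A minimal sketch with a toy stand-in model; none of these names come from the actual harness:

```python
import json
import random

def run_eval(model_fn, examples, seed=1234, decoder=None):
    """Evaluate model_fn on (prompt, gold) pairs and return a record
    bundling accuracy with everything needed to reproduce the run."""
    decoder = decoder or {"temperature": 0.0, "max_tokens": 16}
    random.seed(seed)  # pin any sampling the model might do
    correct = sum(model_fn(p) == gold for p, gold in examples)
    return {
        "accuracy": correct / len(examples),
        "seed": seed,
        "decoder": decoder,
        "prompts": [p for p, _ in examples],
    }

# Toy stand-in for a real checkpoint:
examples = [("2 + 2 =", "4"), ("Capital of France?", "Paris")]
record = run_eval(lambda p: "4" if "2 + 2" in p else "Paris", examples)
print(json.dumps(record, indent=2))
```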
## 🔐 Safety, Security & Governance
- Alignment: instruction tuning with preference data; safety rail prompts; content filters on output.
- Security: supply-chain checksums, signed releases, deterministic builds when possible.
- Privacy: strict dataset licensing review; PII scrubbing; opt‑out channels.
- Ethics: transparent data sources; clear intended use; red‑line misuse policy.
## 📄 Licenses
- Code: Apache-2.0 (preferred) or MIT when noted.
- Models: Apache-2.0 / OpenRAIL / custom Responsible AI license depending on risk profile.
- Datasets: Original content under CC‑BY‑4.0 or CC‑BY‑SA‑4.0; third‑party per-source.
Each repo contains a clear `LICENSE` and `NOTICE` with third‑party attributions.
## 🧪 Reproducibility Policy
For every release we strive to provide:
- Training recipe: data mix, token count, curriculum, batch schedulers.
- Compute: GPU/TPU type, hours, energy notes.
- Exact checkpoints: with SHA256, quantized variants, and safetensors.
- Configs: tokenizer, architecture params, decoder settings.
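Checksum verification of checkpoints is simple to do well. A sketch of the idea, assuming nothing beyond the Python standard library (the demo file and its contents are placeholders):

```python
import hashlib
import tempfile

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks so large
    checkpoints never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

# Demo on a throwaway file; a real release would publish the digest
# next to the checkpoint so users can verify their download.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"checkpoint bytes")
    path = f.name
digest = sha256_of(path)
print(digest)
```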
## 🧱 Contribution Guide (Quick Start)
**1) Discuss**
Open an issue in the relevant repo with a clear proposal. Use the proposal template.
**2) Develop**
- Fork the repo and create an `exp/<topic>` branch.
- Follow code style (ruff/black for Python; mypy optional).
- Add/update docs and unit tests.
**3) Submit**
Open a PR to `dev` with:
- motivation, design notes
- benchmarks (even small‑scale)
- safety considerations
**4) Review & Merge**
- 2 approvals minimum for core repos
- CI must pass (lint, tests, basic eval sanity)
See `CONTRIBUTING.md` in each repo for details.
## 🧾 Templates
Below are copy‑ready card templates you can use across Rhombus repositories.
### 📘 Model Card (template)
```yaml
---
license: apache-2.0
language:
  - en
  - hi
library_name: transformers
tags:
  - reasoning
  - small-language-model
  - multilingual
  - rhombus
  - brahma
model-index:
  - name: <MODEL_NAME>
    results:
      - task:
          type: text-generation
        dataset:
          name: <DATASET or MIX>
          type: <hf-dataset-id>
        metrics:
          - name: MMLU
            type: mmlu
            value: <score>
          - name: ARC
            type: arc
            value: <score>
---
```
# <MODEL_NAME>
## Summary
One‑paragraph description, positioning, and key capabilities.
## Intended Use
- **Primary:** education, coding assistant, offline QA
- **Out of scope / disallowed:** …
## Training
- **Tokens:** <N>
- **Data mix:** <list>
- **Compute:** <GPU/TPU, hours>
## Evaluation
Report exact prompts, seeds, decoders, and links to scripts.
## Safety
Known limitations, bias notes, and refusal behavior.
## License
Apache-2.0 (see `LICENSE`).
### 📗 Dataset Card (template)
```yaml
---
license: cc-by-4.0
tags:
  - dataset
  - rhombus
  - instruction
language:
  - en
  - hi
pretty_name: <DATASET_NAME>
---
```
# <DATASET_NAME>
## Summary
High‑level description and purpose.
## Source & Collection
List all sources, filters, dedup steps, and justification.
## Processing
Tokenization, chunking, cleaning, quality enhancement (if any).
## Structure
- **Splits:** train/val/test sizes
- **Fields:** schema and examples
## Licensing
Origin licenses with links; redistribution terms.
## Ethical Considerations
PII policy, redactions, opt‑out mechanism.
### 📙 Space Card (template)
# <SPACE_NAME>
Interactive demo for `<MODEL_NAME>`.
## Features
- Prompt presets
- Safety toggle
- Quantized checkpoints for CPU
## Run Locally
```bash
pip install -r requirements.txt
python app.py
```

---
## 🛠️ Tooling & Dev Environment
- **Languages**: Python, Rust (perf‑critical), TypeScript (tools/UI)
- **Core libs**: PyTorch/Transformers, JAX (experiments), ONNX, ggml/gguf
- **Eval**: Eleuther evals, custom harness in `eval/`
- **CI**: pre‑commit, pytest, minimal eval sanity per PR
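The "minimal eval sanity per PR" gate can be as small as a pytest-style assertion that a tiny fixed prompt set still clears a floor. This is an illustrative sketch, not the harness in `eval/`:

```python
def sanity_accuracy(model_fn, examples):
    """Fraction of smoke-test (prompt, gold) pairs answered correctly."""
    return sum(model_fn(p) == gold for p, gold in examples) / len(examples)

def test_eval_sanity():
    """CI gate: a trivial echo model must clear the floor on trivial prompts.
    In a real PR check, model_fn would wrap the candidate checkpoint."""
    examples = [("ping", "ping"), ("pong", "pong")]
    assert sanity_accuracy(lambda p: p, examples) >= 0.9

test_eval_sanity()
print("eval sanity passed")
```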
---
## 📬 Contact & Community
- **Issues**: GitHub/HF Issues (preferred)
- **Security**: [email protected] (PGP available)
- **General**: [email protected]
- **Updates**: Follow our HF org and star repos to get release notifications.
> Want to collaborate on **education‑grade** AI that runs on modest hardware? We’d love to hear from you.
---
## 🏷️ Badges (optional examples)
Add these to repo READMEs as needed:
[](#)
[](#)
[](#)
---
## 🧭 Quick Links
- **Releases**: see repo tags
- **Changelogs**: `CHANGELOG.md` in each project
- **Docs**: `/docs` (org‑wide), per‑repo READMEs
---
<p align="center">Made with ◇ by the Rhombus team — *geometry over noise.*</p>