SentiV

A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding

This repository releases datasets, code, and pretrained checkpoints for SentiV, a benchmark for Vietnamese emotion understanding across text, speech, and multimodal settings, as described in our paper.

📄 Paper: SentiV: A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding

1. Overview

SentiV focuses on realistic low-resource evaluation for Vietnamese emotion recognition under:

Label imbalance
Limited supervision (1–100% label budgets)
Cross-dataset and cross-modal generalization
Explicit label-space alignment between text and speech

We release:

Text emotion dataset (data + code + checkpoints)
Speech emotion annotations (labels + code + checkpoints)
Reproducible training and evaluation scripts

2. Repository Structure

sentiv/
├── text-training/
│   ├── model/                # Text model checkpoints
│   ├── train_PhoBERT.py      # Training script (PhoBERT)
│   ├── train.xlsx            # Labeled text data
│   └── readme.MD
│
├── voice-training/
│   ├── hubert-large-ls960/   # Speech model checkpoints
│   ├── label/                # Emotion labels and split manifests
│   ├── train_hubert.py       # HuBERT fine-tuning script
│   └── readme.MD
│
└── README.md                 # This file

3. Tasks and Label Space

Task A: Text Emotion Classification

Labels (7): Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise
Dataset: social media text (comments, posts)
Evaluation: Macro-F1, Accuracy
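To make the two metrics concrete, here is a minimal plain-Python sketch of how Macro-F1 and Accuracy can be computed. The function name and the toy labels are illustrative assumptions, not taken from the released scripts. The example shows why Macro-F1 is the more informative number under label imbalance: a majority-class predictor keeps a high accuracy but a low Macro-F1.

```python
from collections import Counter

def macro_f1_and_accuracy(y_true, y_pred, labels):
    """Return (macro-F1, accuracy) over the given label set."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    f1_scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # A class that is never predicted and never correct gets F1 = 0.
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    accuracy = sum(tp.values()) / len(y_true)
    return sum(f1_scores) / len(labels), accuracy

# Made-up toy data: always predicting the majority class.
y_true = ["Neutral"] * 8 + ["Fear"] * 2
y_pred = ["Neutral"] * 10
macro_f1, acc = macro_f1_and_accuracy(y_true, y_pred, ["Neutral", "Fear"])
print(f"macro-F1={macro_f1:.3f} accuracy={acc:.3f}")
```

Every class contributes equally to the macro average, so the missed minority class pulls the score down even though most samples are classified correctly.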

Task B: Speech Emotion Classification

Labels (6): Anger (includes Disgust), Enjoyment, Fear, Neutral, Sadness, Surprise
Disgust is merged into Anger due to extreme scarcity in speech data

Task C: Multimodal Speech–Text Classification

Same 6-label space as speech
Late fusion over text and speech logits
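Because the text model predicts 7 labels and the speech model 6, the text outputs have to be brought into the 6-label space before fusion. The sketch below shows one way to do this, assuming the same Disgust-to-Anger merge used for speech. The log-sum-exp projection and all function names are assumptions for illustration; the released code may align the label spaces differently.

```python
import math

TEXT_LABELS = ["Anger", "Disgust", "Enjoyment", "Fear", "Neutral", "Sadness", "Surprise"]
SPEECH_LABELS = ["Anger", "Enjoyment", "Fear", "Neutral", "Sadness", "Surprise"]
MERGE = {"Disgust": "Anger"}  # stated in this README: Disgust is merged into Anger

def to_speech_label(label):
    """Map a 7-label text annotation to the 6-label space."""
    return MERGE.get(label, label)

def project_text_logits(text_logits):
    """Collapse 7 text logits to 6 by log-sum-exp over merged classes.

    Log-sum-exp keeps probabilities consistent: after softmax, the merged
    Anger probability equals P(Anger) + P(Disgust) of the 7-way model.
    """
    groups = {name: [] for name in SPEECH_LABELS}
    for name, logit in zip(TEXT_LABELS, text_logits):
        groups[to_speech_label(name)].append(logit)
    projected = []
    for name in SPEECH_LABELS:
        m = max(groups[name])
        projected.append(m + math.log(sum(math.exp(v - m) for v in groups[name])))
    return projected

print(project_text_logits([0.0] * 7))
```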

4. Text Modality (text-training)

Data

Source: public Vietnamese social media posts
Size: 265,011 labeled samples
Average length: ~20 words
Labels: 7 emotions
Anonymized and released strictly for research use

Model

Backbone: PhoBERT (vinai/phobert-base)
Loss: Focal Loss with class reweighting
Max sequence length: 256
Metric: Macro-F1
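As a rough illustration of the loss, the sketch below writes out a class-weighted focal loss for a single sample in plain Python so that the formula stays visible. The inverse-frequency weighting, the value gamma = 2.0, and the function names are assumptions; this README does not state the exact settings used in train_PhoBERT.py, where the same expression would be applied to batched tensors.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """Per-class weights proportional to 1 / class count, scaled to mean 1."""
    counts = Counter(labels)
    raw = [1.0 / counts[c] if counts[c] else 0.0 for c in range(num_classes)]
    scale = num_classes / sum(raw)
    return [w * scale for w in raw]

def focal_loss(logits, target, class_weights, gamma=2.0):
    """Weighted focal loss for one sample: -w_t * (1 - p_t)^gamma * log(p_t)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_pt = logits[target] - log_z  # log-probability of the true class
    pt = math.exp(log_pt)
    return -class_weights[target] * (1.0 - pt) ** gamma * log_pt

# Made-up toy labels: class 0 is three times as frequent as class 1.
weights = inverse_frequency_weights([0, 0, 0, 1], num_classes=2)
easy = focal_loss([4.0, 0.0], target=0, class_weights=weights)
hard = focal_loss([0.0, 4.0], target=0, class_weights=weights)
print(weights, easy, hard)
```

With gamma = 0 and unit weights the expression reduces to ordinary cross-entropy; larger gamma shrinks the contribution of samples the model already classifies confidently.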

Training

cd text-training
python train_PhoBERT.py

The script supports:

Class imbalance handling
Oversampling
Low-resource label budgets
Fixed train/dev/test splits

5. Speech Modality (voice-training)

Data

Source audio: VietSpeech dataset (batches 0–10)
We release:
- Emotion labels
- Split manifests
- Training code
Raw audio must be obtained from the original VietSpeech source under its license

Label Mapping

Disgust is merged into Anger because it is too scarce in the speech data to train on stably
Final label space: 6 emotions

Model

Backbone: HuBERT Large (ls960)
Input: 16 kHz audio, max 8 seconds
Loss: Weighted Cross-Entropy
Sampler: WeightedRandomSampler
Metric: Macro-F1
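The sampler setting can be illustrated with a short sketch of how per-sample weights are typically derived from class counts. In PyTorch, such a list is what `torch.utils.data.WeightedRandomSampler` takes as its `weights` argument; the plain-Python draw below only mimics that behavior. The inverse-frequency rule, the helper name, and the toy counts are assumptions, not read from train_hubert.py.

```python
import random
from collections import Counter

def sample_weights(labels):
    """One weight per sample, equal to 1 / (count of that sample's class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Made-up toy distribution: 90 Neutral clips, 10 Fear clips.
labels = ["Neutral"] * 90 + ["Fear"] * 10
weights = sample_weights(labels)

# Draw one epoch of indices with replacement, as a weighted sampler does.
rng = random.Random(0)
epoch_indices = rng.choices(range(len(labels)), weights=weights, k=len(labels))
drawn = Counter(labels[i] for i in epoch_indices)
print(drawn)
```

Since the weights of each class sum to the same total, both classes are drawn with roughly equal probability regardless of their original frequency.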

Training

cd voice-training
python train_hubert.py

6. Multimodal Fusion

We adopt late fusion at the logit level for reproducibility.

Fusion Strategies

Average fusion
Concatenation + MLP
Uncertainty-aware late fusion (main method)

Each modality's confidence is estimated from the entropy or the maximum probability of its predictive distribution, and the fusion weights are adjusted dynamically to down-weight the less reliable modality.
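A minimal sketch of the entropy-based variant follows, assuming both logit vectors are already in the shared 6-label space. The specific confidence score (one minus normalized entropy), the weight normalization, and the function names are illustrative assumptions; the paper's exact formulation may differ.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy_confidence(probs):
    """1 - normalized entropy: near 1 for a peaked distribution, 0 for uniform."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - h / math.log(len(probs))

def fuse(text_logits, speech_logits, eps=1e-8):
    """Weight each modality's logits by its relative confidence."""
    c_text = entropy_confidence(softmax(text_logits))
    c_speech = entropy_confidence(softmax(speech_logits))
    # eps keeps the weights defined when both modalities are fully uncertain.
    w_text = (c_text + eps) / (c_text + c_speech + 2 * eps)
    w_speech = 1.0 - w_text
    fused = [w_text * t + w_speech * s for t, s in zip(text_logits, speech_logits)]
    return fused, (w_text, w_speech)

# Made-up logits: text is confident, speech is uninformative.
fused, (w_text, w_speech) = fuse([5.0, 0, 0, 0, 0, 0], [0.0] * 6)
prediction = max(range(len(fused)), key=fused.__getitem__)
print(prediction, round(w_text, 3), round(w_speech, 3))
```

Replacing `entropy_confidence` with `max(probs)` gives the max-probability variant mentioned above.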

7. Low-Resource Evaluation Protocol

Label budgets: 1%, 5%, 10%, 25%, 50%, 100%
Fixed test set
Only training data is subsampled
3–5 random seeds per setting
Report mean ± std

This protocol is designed to reflect realistic variance under limited supervision.
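The protocol can be sketched as follows. Whether the released scripts subsample per class (stratified) or uniformly is not stated here, so the stratified draw, the one-sample-per-class floor, and the helper names are assumptions; the scores in the usage line are placeholders, not results.

```python
import random
import statistics
from collections import defaultdict

BUDGETS = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]  # label budgets from the protocol

def subsample_train(labels, budget, seed):
    """Return sorted training indices for one label budget, stratified by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    keep = []
    for idxs in by_class.values():
        k = max(1, round(len(idxs) * budget))  # keep at least one sample per class
        keep.extend(rng.sample(idxs, k))
    return sorted(keep)

def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

# Made-up training labels; the test set is never touched.
train_labels = ["a"] * 200 + ["b"] * 100
subset = subsample_train(train_labels, budget=0.10, seed=0)
mean, std = summarize([0.5, 0.6, 0.7])  # placeholder per-seed scores
print(len(subset), f"{mean:.2f} ± {std:.2f}")
```

Fixing the seed makes each budget's subset reproducible, while varying it across runs exposes the variance the protocol is meant to report.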

8. Ethics and Licensing

Text Data

Collected from publicly available social media
All user-identifying information removed
Research-only use
Takedown requests supported

Speech Data

Based on VietSpeech
Speakers provided research consent
We release labels and derived artifacts only

Users must comply with original dataset licenses.

9. Access Policy

This repository is released via Hugging Face with access control enabled.

Users must request access
Access is granted manually for research purposes
Redistribution without permission is not allowed

10. Citation

If you use SentiV, please cite our paper:

@article{sentiv2026,
  title     = {SentiV: A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding},
  author    = {Anonymous},
  year      = {2026}
}
