SentiV
A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding
This repository releases datasets, code, and pretrained checkpoints for SentiV, a benchmark for Vietnamese emotion understanding across text, speech, and multimodal settings, as described in our paper.
📄 Paper: SentiV: A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding
1. Overview
SentiV focuses on realistic low-resource evaluation for Vietnamese emotion recognition under:
- Label imbalance
- Limited supervision (1–100% label budgets)
- Cross-dataset and cross-modal generalization
- Explicit label-space alignment between text and speech
We release:
- Text emotion dataset (data + code + checkpoints)
- Speech emotion annotations (labels + code + checkpoints)
- Reproducible training and evaluation scripts
2. Repository Structure
sentiv/
├── text-training/
│ ├── model/ # Text model checkpoints
│ ├── train_PhoBERT.py # Training script (PhoBERT)
│ ├── train.xlsx # Labeled text data
│ └── readme.MD
│
├── voice-training/
│ ├── hubert-large-ls960/ # Speech model checkpoints
│ ├── label/ # Emotion labels and split manifests
│ ├── train_hubert.py # HuBERT fine-tuning script
│ └── readme.MD
│
└── README.md # This file
3. Tasks and Label Space
Task A: Text Emotion Classification
- Labels (7): Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise
- Dataset: social media text (comments, posts)
- Evaluation: Macro-F1, Accuracy
Task B: Speech Emotion Classification
- Labels (6): Anger (includes Disgust), Enjoyment, Fear, Neutral, Sadness, Surprise
- Disgust is merged into Anger due to its extreme scarcity in speech data
Task C: Multimodal Speech–Text Classification
- Same 6-label space as speech
- Late fusion over text and speech logits
4. Text Modality (text-training)
Data
- Source: public Vietnamese social media posts
- Size: 265,011 labeled samples
- Average length: ~20 words
- Labels: 7 emotions
- Anonymized and released strictly for research use
Model
- Backbone: PhoBERT (vinai/phobert-base)
- Loss: Focal Loss with class reweighting
- Max sequence length: 256
- Metric: Macro-F1
Training
python train_PhoBERT.py
The script supports:
- Class imbalance handling
- Oversampling
- Low-resource label budgets
- Fixed train/dev/test splits
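As a sketch of the class-reweighted focal loss used for the text model, here is a minimal single-example version (the per-class `alpha` weighting shown here is illustrative; the released script defines the actual weights):

```python
import math

def focal_loss(probs, target, alpha, gamma=2.0):
    """Focal loss for one example: -alpha_t * (1 - p_t)^gamma * log(p_t).

    probs  : list of class probabilities (softmax output)
    target : gold class index
    alpha  : per-class reweighting factors, same length as probs
    gamma  : focusing parameter; gamma=0 reduces to weighted cross-entropy
    """
    p_t = probs[target]
    return -alpha[target] * (1.0 - p_t) ** gamma * math.log(p_t)
```

The `(1 - p_t)^gamma` factor down-weights easy, already-confident examples, which helps under the heavy label imbalance described above.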
5. Speech Modality (voice-training)
Data
Source audio: VietSpeech dataset (batches 0–10)
We release:
- Emotion labels
- Split manifests
- Training code
Raw audio must be obtained from the original VietSpeech source under its license.
Label Mapping
- Disgust is merged into Anger for training stability
- Final label space: 6 emotions
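The 7-to-6 mapping can be expressed as a one-line rule (a minimal sketch; label names follow the lists in Section 3):

```python
# Text label space (7 emotions) and speech label space (6 emotions).
TEXT_LABELS = ["Anger", "Disgust", "Enjoyment", "Fear", "Neutral", "Sadness", "Surprise"]
SPEECH_LABELS = ["Anger", "Enjoyment", "Fear", "Neutral", "Sadness", "Surprise"]

def to_speech_label(text_label):
    # Disgust folds into Anger; every other label passes through unchanged.
    return "Anger" if text_label == "Disgust" else text_label
```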
Model
- Backbone: HuBERT Large (ls960)
- Input: 16 kHz audio, max 8 seconds
- Loss: Weighted Cross-Entropy
- Sampler: WeightedRandomSampler
- Metric: Macro-F1
Training
python train_hubert.py
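The per-sample weights fed to `WeightedRandomSampler` can be derived from inverse class frequency; a minimal sketch (the released script may normalize or clip these differently):

```python
from collections import Counter

# Audio is resampled to 16 kHz and truncated to 8 seconds before feature extraction.
MAX_SAMPLES = 16_000 * 8

def sample_weights(labels):
    """Inverse-frequency weight per training sample.

    Rare emotion classes receive larger weights, so the sampler draws
    them more often and each class is seen at a comparable rate.
    """
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]
```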
6. Multimodal Fusion
We adopt late fusion at the logit level for reproducibility.
Fusion Strategies
- Average fusion
- Concatenation + MLP
- Uncertainty-aware late fusion (main method)
Confidence is estimated from prediction entropy or the maximum class probability, and fusion weights are adjusted dynamically to down-weight the less reliable modality.
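The entropy-based variant can be sketched as follows. This is an illustrative implementation, not the exact formula from the paper: each modality's weight is one minus its normalized prediction entropy, and the weighted probabilities are renormalized.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def fuse(text_logits, speech_logits):
    """Uncertainty-aware late fusion over two logit vectors.

    A confident modality (low entropy) gets weight near 1; a modality
    predicting near-uniformly (high entropy) gets weight near 0.
    """
    k = len(text_logits)
    max_h = math.log(k)  # entropy of the uniform distribution over k classes
    pt, ps = softmax(text_logits), softmax(speech_logits)
    wt = 1.0 - entropy(pt) / max_h
    ws = 1.0 - entropy(ps) / max_h
    z = wt + ws
    if z == 0:  # both modalities maximally uncertain: fall back to plain averaging
        return [(a + b) / 2 for a, b in zip(pt, ps)]
    return [(wt * a + ws * b) / z for a, b in zip(pt, ps)]
```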
7. Low-Resource Evaluation Protocol
- Label budgets: 1%, 5%, 10%, 25%, 50%, 100%
- Fixed test set
- Only training data is subsampled
- 3–5 random seeds per setting
- Report mean ± std
This protocol is designed to reflect realistic variance under limited supervision.
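A seed-controlled, class-stratified subsampler in the spirit of this protocol might look like the sketch below (the exact subsampling routine in the released scripts may differ):

```python
import random
from collections import defaultdict

def subsample(labels, budget, seed):
    """Return sorted indices of a class-stratified subsample of the training set.

    labels : list of emotion labels for the full training set
    budget : fraction of labels kept (e.g. 0.01 for the 1% setting)
    seed   : RNG seed; the protocol repeats each budget over 3-5 seeds
    """
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    picked = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = max(1, round(budget * len(idxs)))  # keep at least one sample per class
        picked.extend(idxs[:k])
    return sorted(picked)
```

Only the training set is subsampled; the test set stays fixed, and results are aggregated as mean ± std over seeds.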
8. Ethics and Licensing
Text Data
- Collected from publicly available social media
- All user-identifying information removed
- Research-only use
- Takedown requests supported
Speech Data
- Based on VietSpeech
- Speakers provided research consent
- We release labels and derived artifacts only
Users must comply with original dataset licenses.
9. Access Policy
This repository is released via Hugging Face with access control enabled.
- Users must request access
- Access is granted manually for research purposes
- Redistribution without permission is not allowed
10. Citation
If you use SentiV, please cite our paper:
@article{sentiv2026,
title = {SentiV: A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding},
author = {Anonymous},
year = {2026}
}