SentiV

A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding

This repository releases datasets, code, and pretrained checkpoints for SentiV, a benchmark for Vietnamese emotion understanding across text, speech, and multimodal settings, as described in our paper.

📄 Paper: SentiV: A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding


1. Overview

SentiV focuses on realistic low-resource evaluation for Vietnamese emotion recognition under:

  • Label imbalance
  • Limited supervision (1–100% label budgets)
  • Cross-dataset and cross-modal generalization
  • Explicit label-space alignment between text and speech

We release:

  • Text emotion dataset (data + code + checkpoints)
  • Speech emotion annotations (labels + code + checkpoints)
  • Reproducible training and evaluation scripts

2. Repository Structure

sentiv/
├── text-training/
│   ├── model/                # Text model checkpoints
│   ├── train_PhoBERT.py      # Training script (PhoBERT)
│   ├── train.xlsx            # Labeled text data
│   └── readme.MD
│
├── voice-training/
│   ├── hubert-large-ls960/   # Speech model checkpoints
│   ├── label/                # Emotion labels and split manifests
│   ├── train_hubert.py       # HuBERT fine-tuning script
│   └── readme.MD
│
└── README.md                 # This file

3. Tasks and Label Space

Task A: Text Emotion Classification

  • Labels (7): Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise
  • Dataset: social media text (comments, posts)
  • Evaluation: Macro-F1, Accuracy
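Macro-F1 weights every class equally, which is why it is the headline metric under label imbalance. The following is a minimal sketch of both metrics in plain Python; the function names are illustrative and are not taken from the released scripts.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over `labels`."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never true and never predicted contributes F1 = 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

A classifier that always predicts the majority class can score high accuracy while its Macro-F1 stays low, because each missed minority class contributes an F1 of 0 to the average.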

Task B: Speech Emotion Classification

  • Labels (6): Anger (includes Disgust), Enjoyment, Fear, Neutral, Sadness, Surprise
  • Disgust is merged into Anger due to extreme scarcity in speech data

Task C: Multimodal Speech–Text Classification

  • Same 6-label space as speech
  • Late fusion over text and speech logits

4. Text Modality (text-training)

Data

  • Source: public Vietnamese social media posts
  • Size: 265,011 labeled samples
  • Average length: ~20 words
  • Labels: 7 emotions
  • Anonymized and released strictly for research use

Model

  • Backbone: PhoBERT (vinai/phobert-base)
  • Loss: Focal Loss with class reweighting
  • Max sequence length: 256
  • Metric: Macro-F1
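As a rough illustration of the loss, the sketch below computes a class-weighted focal loss for a single example in plain Python. The focusing parameter `gamma=2.0` and the function name are assumptions made for this example; the values used in `train_PhoBERT.py` may differ.

```python
import math


def focal_loss(logits, target, class_weights, gamma=2.0):
    """Class-weighted focal loss for one example.

    loss = -w[target] * (1 - p_target) ** gamma * log(p_target)
    `gamma` is an assumed default, not a value taken from the repository.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))  # stable log-sum-exp
    log_p = logits[target] - log_z
    p = math.exp(log_p)
    modulator = max(0.0, 1.0 - p) ** gamma  # shrinks the loss of easy examples
    return -class_weights[target] * modulator * log_p
```

With `gamma=0` and unit weights this reduces to ordinary cross-entropy; larger `gamma` values shift the training signal toward examples the model still gets wrong.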

Training

python train_PhoBERT.py

The script supports:

  • Class imbalance handling
  • Oversampling
  • Low-resource label budgets
  • Fixed train/dev/test splits

5. Speech Modality (voice-training)

Data

  • Source audio: VietSpeech dataset (batches 0–10)

  • We release:

    • Emotion labels
    • Split manifests
    • Training code
  • Raw audio must be obtained from the original VietSpeech source under its license

Label Mapping

  • Disgust is merged into Anger because it is extremely scarce in the speech data; the merge also stabilizes training
  • Final label space: 6 emotions
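The mapping from the 7-label annotation space to the 6-label speech space can be sketched as follows. The label names come from the task definitions above; the integer ids and the helper name are assumptions for illustration and may not match the manifests in `label/`.

```python
# Six-label space used for speech and multimodal tasks.
SPEECH_LABELS = ["Anger", "Enjoyment", "Fear", "Neutral", "Sadness", "Surprise"]

# Assumed id assignment (alphabetical order); check the released manifests.
LABEL2ID = {label: idx for idx, label in enumerate(SPEECH_LABELS)}


def to_speech_label(label):
    """Map a 7-way emotion label onto the 6-way speech label space."""
    return "Anger" if label == "Disgust" else label
```

For example, `LABEL2ID[to_speech_label("Disgust")]` resolves to the id of Anger, while all other labels pass through unchanged.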

Model

  • Backbone: HuBERT Large (ls960)
  • Input: 16 kHz audio, max 8 seconds
  • Loss: Weighted Cross-Entropy
  • Sampler: WeightedRandomSampler
  • Metric: Macro-F1
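Both the weighted loss and the sampler depend on class frequencies. The sketch below derives inverse-frequency weights in plain Python; the exact weighting formula is an assumption, since the formula used in `train_hubert.py` is not stated here.

```python
from collections import Counter


def inverse_frequency_class_weights(labels, classes):
    """Class weights w_c = N / (K * n_c); a balanced dataset gives all ones."""
    counts = Counter(labels)
    n, k = len(labels), len(classes)
    return {c: n / (k * counts[c]) if counts[c] else 0.0 for c in classes}


def per_sample_weights(labels):
    """Sampling weight 1 / n_c for each example, so classes are drawn about equally."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]
```

In a PyTorch training loop, the class weights would typically be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, and the per-sample weights as the `weights` argument of `torch.utils.data.WeightedRandomSampler` with `replacement=True`.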

Training

python train_hubert.py

6. Multimodal Fusion

We adopt late fusion at the logit level for reproducibility.

Fusion Strategy

  • Average fusion
  • Concatenation + MLP
  • Uncertainty-aware late fusion (main method)

Confidence is estimated per modality from the entropy or the maximum probability of its predictive distribution, and the fusion weights are adjusted dynamically to down-weight the less reliable modality.
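One way to realize entropy-based weighting is sketched below. It assumes both models already emit logits over the same 6-label space, and it uses one minus the normalized entropy as the confidence score. The normalization, the `eps` smoothing, and all function names are assumptions for illustration, not the paper's exact formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_confidence(probs):
    """1 - normalized entropy: near 1 for peaked outputs, 0 for uniform ones."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return max(0.0, 1.0 - h / math.log(len(probs)))


def fuse_logits(text_logits, speech_logits, eps=1e-8):
    """Weight each modality's logits by its relative confidence."""
    c_text = entropy_confidence(softmax(text_logits)) + eps
    c_speech = entropy_confidence(softmax(speech_logits)) + eps
    w_text = c_text / (c_text + c_speech)
    w_speech = 1.0 - w_text
    fused = [w_text * t + w_speech * s for t, s in zip(text_logits, speech_logits)]
    return fused, (w_text, w_speech)
```

When the speech model outputs a near-uniform distribution, its weight approaches zero and the fused prediction follows the text model, and vice versa. Swapping `entropy_confidence` for the maximum softmax probability gives the other confidence estimate mentioned above.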


7. Low-Resource Evaluation Protocol

  • Label budgets: 1%, 5%, 10%, 25%, 50%, 100%
  • Fixed test set
  • Only training data is subsampled
  • 3–5 random seeds per setting
  • Results reported as mean ± standard deviation across seeds

This protocol is designed to reflect realistic variance under limited supervision.
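The protocol can be sketched as below. Stratifying by class and keeping at least one example per class are assumptions made for this example; the released scripts may subsample differently. The budget values match the list above.

```python
import random
import statistics
from collections import defaultdict

BUDGETS = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]


def subsample_train(labels, budget, seed):
    """Return sorted indices of a class-stratified subsample of the training set."""
    rng = random.Random(seed)  # local RNG so each seed is reproducible
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    chosen = []
    for y in sorted(by_class):
        indices = by_class[y]
        k = max(1, round(budget * len(indices)))  # assumed: keep >= 1 per class
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


def summarize(scores):
    """Mean and sample standard deviation of a metric across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)
```

A full run would loop over `BUDGETS` and seeds, train on `subsample_train(...)`, evaluate on the fixed test set, and pass the per-seed Macro-F1 scores to `summarize`.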


8. Ethics and Licensing

Text Data

  • Collected from publicly available social media
  • All user-identifying information removed
  • Research-only use
  • Takedown requests supported

Speech Data

  • Based on VietSpeech
  • Speakers provided research consent
  • We release labels and derived artifacts only

Users must comply with the licenses of the original datasets.


9. Access Policy

This repository is released via Hugging Face with access control enabled.

  • Users must request access
  • Access is granted manually for research purposes
  • Redistribution without permission is not allowed

10. Citation

If you use SentiV, please cite our paper:

@article{sentiv2026,
  title     = {SentiV: A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding},
  author    = {Anonymous},
  year      = {2026}
}