Hi!
This is great work. I saw you mentioned that the fine-tuning used 2.5M samples, totaling around 7k hours.
I know that more data is probably better, but would fine-tuning work with around 600 hours in another language?
In addition, can you tell us about the limitations of the fine-tuned/original model?
Does it have hallucination problems like those seen in other TTS models such as F5-TTS?