Cool, looks like this is trained on roughly 15 million hours of audio (500B tokens at ~0.11 seconds per token).
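The conversion from token count to hours is just multiplication; a quick sanity check of that estimate (using the 500B-token and ~0.11 s/token figures from above):

```python
# Back-of-the-envelope check: 500 billion tokens at ~0.11 seconds
# of audio per token, converted to total hours.
tokens = 500e9
seconds_per_token = 0.11  # approximate audio duration per token

total_seconds = tokens * seconds_per_token
total_hours = total_seconds / 3600

print(f"{total_hours / 1e6:.1f} million hours")  # → 15.3 million hours
```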

Even the large open-source TTS models (see F5-TTS, MaskGCT) are mostly trained on very small audio datasets (on the order of 100k hours) relative to the amount of audio available on the internet, so it's cool to see an open-source effort to scale up training significantly.
