NVIDIA has introduced a new open dataset and AI models designed to significantly advance multilingual speech recognition and translation. Supporting 25 European languages — including those with limited data resources such as Croatian, Estonian, and Maltese — the release aims to make production-ready speech AI more accessible to developers worldwide.
The resources include:
- Granary, a large-scale open-source multilingual speech dataset featuring around one million hours of audio — with nearly 650,000 hours dedicated to speech recognition and over 350,000 hours for speech translation.
- NVIDIA Canary-1b-v2, a billion-parameter model trained on Granary, delivering state-of-the-art transcription accuracy for European languages and translation between English and the two dozen other supported languages. It currently ranks at the top of Hugging Face’s leaderboard for multilingual speech recognition.
- NVIDIA Parakeet-tdt-0.6b-v3, a 600-million-parameter model optimized for speed and scalability, delivering the highest throughput of multilingual models on Hugging Face — making it well-suited for real-time or high-volume transcription.
These new assets are designed to help developers scale speech AI applications across industries, from multilingual chatbots and customer support voice agents to near-real-time translation services.
The Granary paper will be presented at Interspeech 2025 in the Netherlands, August 17–21. The dataset and both the Canary and Parakeet models are now publicly available on Hugging Face.
Tackling Data Scarcity With Granary
To build Granary, the NVIDIA speech AI team partnered with Carnegie Mellon University and Fondazione Bruno Kessler. Using the NVIDIA NeMo Speech Data Processor, the team transformed vast amounts of unlabeled audio into high-quality, structured data without relying on costly manual annotation. The entire pipeline is open source and accessible on GitHub.
By offering clean, ready-to-use data, Granary gives developers a foundation for building transcription and translation models for 25 European languages: nearly all of the European Union's 24 official languages, plus Russian and Ukrainian.
For underrepresented languages, Granary enables more inclusive speech technologies that reflect Europe’s rich linguistic diversity while requiring significantly less training data. Findings in the Interspeech paper show that, compared with other popular datasets, about half as much Granary training data is needed to reach a given accuracy in automatic speech recognition (ASR) and automatic speech translation (AST).
Accelerating Innovation With NeMo, Canary, and Parakeet
The launch of Canary and Parakeet demonstrates the potential of building specialized models with Granary.
- Canary-1b-v2 is tailored for high-accuracy, complex transcription and translation tasks.
- Parakeet-tdt-0.6b-v3 is engineered for real-time performance and large-scale deployments, offering superior speed and efficiency.
By sharing not only the models but also the methodology behind Granary, NVIDIA is equipping the global developer community to extend these advancements to new languages and domains. This open approach lowers barriers for speech AI development, accelerating innovation across industries.
SOURCE: NVIDIA