News: 0178839788


Nvidia Releases Massive AI-Ready Open European Language Dataset and Tools (siliconangle.com)

(Saturday August 23, 2025 @05:34PM (EditorDavid) from the machine-language dept.)


"Only a tiny fraction of the more than 7,000 languages on Earth are supported by artificial intelligence models," [1]reported SiliconANGLE this week. So Nvidia announced "[2]a massive new AI-ready dataset and models to support the development of high-quality AI translation for European languages."

> The [3]new [4]dataset, named Granary, is a massive open-source corpus of multilingual audio, including more than a million hours of audio, plus [5]650,000 hours of speech recognition and 350,000 hours of speech translation. Nvidia's speech AI team collaborated with researchers from Carnegie Mellon University and Fondazione Bruno Kessler to process unlabeled audio and public speech data into information usable for AI training... Granary includes 25 European languages, representing nearly all of the European Union's 24 official languages, plus Russian and Ukrainian. The dataset also contains languages with limited available data, such as Croatian, Estonian and Maltese. This is critically important because providing these underrepresented human-annotated datasets will enable developers to create more inclusive speech technologies for audiences who speak those languages, while using less training data in their AI applications and models... The team demonstrated in [6]their research paper that, compared to other popular datasets, it takes around half as much Granary training data to achieve high accuracy for automatic speech recognition and automatic speech translation.

>

> Alongside Granary, Nvidia also released new Canary and Parakeet models to demonstrate what can be created with the dataset... The new Canary is available under a fairly permissive license for commercial and research use, expanding Canary's current languages from four to 25. It offers transcription and translation quality comparable to models three times larger while running inference up to 10 times faster. At 1 billion parameters, it can run completely on-device on most next-gen flagship smartphones for speech translation on the fly.



[1] https://siliconangle.com/2025/08/15/nvidia-releases-massive-high-quality-ai-ready-european-language-dataset-tools/

[2] https://blogs.nvidia.com/blog/speech-ai-dataset-models/

[3] https://huggingface.co/datasets/nvidia/Granary

[4] https://github.com/NVIDIA/NeMo-speech-data-processor/tree/main/dataset_configs/multilingual/granary

[5] https://nvidia-nemo.github.io/blog/2025/08/13/granary-data-for-fine-tune/

[6] https://arxiv.org/pdf/2505.13404



English as an intermediary? (Score:2)

by david.emery ( 127135 )

I wonder if they're translating directly from, say, French to German or Spanish to Polish, or using English as an intermediate representation. And, of course, there's always the idempotent test (translate from A to B, then send the result through B-to-A translation).

dave
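The A-to-B-and-back check mentioned above (more commonly called a round-trip test) can be sketched roughly as follows. This is a minimal illustration only: `round_trip_score`, the toy word tables, and the two dictionary-based "translators" are hypothetical stand-ins, not anything from Nvidia's models or paper, and word-level similarity is just one simple way to compare the result with the original.

```python
from difflib import SequenceMatcher
from typing import Callable

Translator = Callable[[str], str]


def round_trip_score(text: str, a_to_b: Translator, b_to_a: Translator) -> float:
    """Translate A -> B -> A and return a word-level similarity in [0, 1]."""
    back = b_to_a(a_to_b(text))
    # Compare word sequences, ignoring case; 1.0 means the text survived intact.
    return SequenceMatcher(None, text.lower().split(), back.lower().split()).ratio()


# Toy word-for-word tables standing in for real translation models (hypothetical).
FR_TO_DE = {"bonjour": "hallo", "monde": "welt"}
DE_TO_FR = {de: fr for fr, de in FR_TO_DE.items()}


def fr_to_de(sentence: str) -> str:
    return " ".join(FR_TO_DE.get(word, word) for word in sentence.lower().split())


def de_to_fr(sentence: str) -> str:
    return " ".join(DE_TO_FR.get(word, word) for word in sentence.lower().split())


score = round_trip_score("bonjour monde", fr_to_de, de_to_fr)
print(f"round-trip similarity: {score:.2f}")
```

A high score only shows the two directions are consistent with each other; it does not by itself show that the intermediate translation is good, nor whether English was used as a pivot.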

Re:English as an intermediary? (Score:4, Informative)

by test321 ( 8891681 )

Maybe the answer is in their paper: [1]https://arxiv.org/html/2505.13... [arxiv.org]

[1] https://arxiv.org/html/2505.13404v2

Re: (Score:2)

by david.emery ( 127135 )

I looked through the paper, and frankly I couldn't tell. But I'm not an AI/translation person, so I might have missed something.

But I do note their example is English-Croatian. So that just reinforces my question about translation between two arbitrary European languages.

Too stupid to release a FOSS graphics driver (Score:1)

by gavron ( 1300111 )

NVIDIA lost its way years ago. This doesn't change anything.

May their business go down the tubes just like their GPUs on FOSS.

This may help unfuck the EU (Score:2)

by Luckyo ( 1726890 )

This may actually help unfuck the EU as a structure in one of the fundamental ways it's fucked. Comprehension across languages.

Just the bureaucratic translation apparatus between all languages in Brussels is a money black hole on its own, and this has a good chance of removing it. Beyond that, ability to actually communicate in main European languages across the board would be a very welcome thing, as a lot of written and spoken assets are just not available in most European languages at all due to fairly s
