This open text-to-speech model needs just seconds of audio to clone your voice
- Reference: 1739732289
- News link: https://www.theregister.co.uk/2025/02/16/ai_voice_clone/
- Source link:
Founded in 2021 by Danny Martinelli and Krithik Puthalath, Zyphra aims to build a multimodal agent system called MaiaOS. To date, these efforts have produced its Zamba family of small language models, optimizations such as tree attention, and now its Zonos TTS models.
Weighing in at 1.6 billion parameters each, the models were trained on more than 200,000 hours of speech data, which includes both neutral-toned speech such as audiobook narration, and "highly expressive" speech. According to the upstart's [1]release notes for Zonos, the majority of its data was in English, but there were "substantial" quantities of Chinese, Japanese, French, Spanish, and German. Zyphra tells El Reg this data was acquired from the web and was not obtained from data brokers.
The result is actually two Zonos models: one that uses a fully transformer-based architecture, and a hybrid that combines transformer and [3]Mamba state space model (SSM) architectures. The latter, Zyphra claims, is the first TTS model to use this architecture. While transformer-based models are without a doubt the most commonly used in generative AI today, alternative architectures like Mamba are gaining traction.
From a practical standpoint, both models behave similarly to other text-to-speech models. But unlike those developed by ElevenLabs and others, Zyphra has elected to release its model weights on [6]Hugging Face under a permissive Apache 2.0 license.
Testing it out
Zyphra offers a demo environment where you can play with its Zonos models, along with paid API access and subscription plans on its website. But if you're hesitant to upload your voice to a random startup's servers, getting the model running locally is relatively easy.
We'll go into more detail on how to set that up in a bit, but first, let's take a look at how well it actually works in the wild.
To test it out, we spun up Zyphra's Zonos demo locally on an Nvidia RTX 6000 Ada Generation graphics card. We then uploaded 20- to 30-second clips of ourselves reading a random passage of text, and fed that into the Zonos-v0.1 transformer and hybrid models along with a 50 or so word text prompt, leaving all hyperparameters to their defaults. The goal is to have the trained model predict your voice, and output it as an audio file, from the provided sample recordings and prompt.
Using a 24-second sample clip, we were able to achieve a voice clone good enough to fool close friends and family, at least at first blush. After we revealed that the clip was AI generated, they noted that the pacing and speed of the speech felt a little off, and that they believed they would have caught on to the fact the audio wasn't authentic given a longer clip.
You can listen for yourself; here are two clips. The first sample is a recording of a real-life human, your humble vulture, reading from H.G. Wells' The Time Machine, while the second is an AI-generated clone reading from Jules Verne's 20,000 Leagues Under the Sea.
Human sample:
[9]MP3 Audio
AI generated audio using the non-hybrid model:
[10]MP3 Audio
Both pacing and speed are parameters that can be controlled, and Zonos supports audio prefixing, which allows for more dynamic ranges such as whispering.
In its documentation, Zyphra claims its hybrid transformer-Mamba model performed about 20 percent faster than the pure transformer model. This speedup wasn't as noticeable for shorter prompts, but we can say there was a notable difference in how the two models sounded.
At least to our ears, the hybrid model generated slightly more polished-sounding audio, which ironically took away somewhat from the authenticity of the cloned voice. Listening to yourself talk is always kind of a strange experience, however, so we'll let you be the judge.
AI generated audio using the hybrid model:
[11]MP3 Audio
The model's performance was also in line with Zyphra's claims of it producing about two seconds of audio for every second of runtime, when running on an RTX 4090. The RTX 6000 Ada — which isn't too far off from an RTX 4090 in terms of compute — required 9 to 10 seconds to convert roughly 50 words into an 18 to 20 second audio clip. We will note that on the first run, we did observe a warm-up period lasting about a minute while the model was loaded in GPU memory, so it won't start outputting right off the bat.
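For the curious, the arithmetic behind that comparison can be sketched in a few lines of Python. This is our own back-of-the-envelope check rather than anything from Zyphra: the function name is made up for illustration, and the inputs are simply the ranges quoted above (18 to 20 seconds of audio in 9 to 10 seconds of runtime).

```python
def audio_seconds_per_runtime_second(audio_s: float, runtime_s: float) -> float:
    """Return how many seconds of audio are produced per second of compute.

    Hypothetical helper for illustration; not part of Zonos.
    """
    if runtime_s <= 0:
        raise ValueError("runtime_s must be positive")
    return audio_s / runtime_s


# Bounds taken from the figures in the text above.
worst_case = audio_seconds_per_runtime_second(audio_s=18, runtime_s=10)
best_case = audio_seconds_per_runtime_second(audio_s=20, runtime_s=9)

print(f"worst case: {worst_case:.2f}x real time")
print(f"best case:  {best_case:.2f}x real time")
```

Both bounds land close to the claimed two-to-one ratio, which is why we describe the results as in line with Zyphra's figure rather than an exact match.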
Try it for yourself
If you'd like to use Zonos to clone your own voice, deploying the model is relatively easy, assuming you've got a compatible GPU and some familiarity with Linux and containerization.
What you'll need:
A Linux box with a reasonably modern Nvidia graphics card with at least 8 GB of VRAM. You may be able to get this running on as little as 6 GB, but your mileage may vary. For the operating system, we're using Ubuntu 24.04 LTS.
This guide also assumes you've installed the latest version of Docker Engine and the latest release of Nvidia's Container Runtime. For more information on getting this set up, check out our guide on GPU-accelerated Docker containers [12]here. We also assume you're comfortable with the Linux command line.
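If you're not sure how much memory your card has, something like the following Python sketch can check it against the 8 GB figure above. It assumes nvidia-smi is installed and on your PATH. The helper names and the 256 MiB of slack (some cards report slightly under their nominal capacity) are our own assumptions for illustration, not anything Zyphra specifies.

```python
import subprocess

MIN_VRAM_GIB = 8.0  # minimum suggested in the text above


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse one memory total (in MiB) per non-empty line of nvidia-smi output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)


def enough_vram(mib: int, min_gib: float = MIN_VRAM_GIB, slack_mib: int = 256) -> bool:
    """True if the card meets the minimum, allowing a little slack (our assumption)."""
    return mib + slack_mib >= min_gib * 1024
```

On a machine with an Nvidia driver installed, `[enough_vram(m) for m in query_vram_mib()]` gives one verdict per GPU.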
To get started, we'll use git to pull down the Zonos repo:
    git clone https://github.com/Zyphra/Zonos.git
From there, we'll navigate into the folder and spin up the container using Docker Compose:
    cd Zonos
    docker compose up
Note: Depending on your system, you'll probably need to run this docker command with elevated privileges using sudo or, in some cases, doas.
After a few seconds, you should be able to access the Gradio web GUI by navigating to http://localhost:7860 or, if you're running this remotely, you'll need to swap localhost for the machine's IP address or hostname. We highly recommend you don't leave this particular service facing the public internet.
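If you'd rather not sit refreshing your browser while the container starts, a small Python sketch like this one can poll the port until something answers. The function name, timeout, and polling interval are our own choices for illustration. It only checks that a TCP connection succeeds on port 7860, not that the Gradio app behind it has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int = 7860,
                  timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout_s elapses.

    Hypothetical helper; returns True once connected, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # Open and immediately close a connection as a liveness probe.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

Calling `wait_for_port("localhost")` returns True as soon as the web GUI's port is accepting connections; swap in the machine's IP address or hostname if you're running it remotely.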
[13]
Zyphra's Zonos demo comes packaged with an easy-to-use Gradio dashboard
From there, you'll be greeted with a Gradio dashboard. Here you'll want to select which version of the Zonos model you'd like to use, upload or record your sample audio, and input the text you'd like to convert.
Below this, you'll find a variety of hyperparameters that allow you to tweak aspects of the generation, including things like pitch and speaking rate. We won't pretend to fully understand all of these parameters, but, in our testing, we largely left these settings to their defaults.
Once you've got everything dialed in, click on Generate Audio. Depending on your hardware and the length of your input text, this could take anywhere from a few seconds to minutes. Once complete, the clip should begin playing automatically.
Broader implications
As we've previously seen with image generation and other AI tech, the voice cloning capabilities presented by Zonos are inherently controversial, from where the training data was mined to how they're actually used in practice.
Considering just how little sample audio is required to achieve a passable result, it's easy to see how this technology could be abused. Companies like Audible are [18]exploring text-to-speech AI to expand audiobook production, allowing narrators to create AI-generated voice clones of themselves. Meanwhile, [19]legal challenges surrounding AI voice cloning are already hitting similar businesses.
We can also see this technology being used to scam unsuspecting victims into believing that a loved one is in trouble, and that they just need a few hundred dollars' worth of gift cards to get them out of a bind. Or to ruin someone's career by using it to make an abusive call with their voice to their boss. Or to generate fake political messages, or... the examples are endless.
Having said that, there are also benevolent uses for these kinds of models. From an accessibility standpoint, voice cloning and text-to-speech could help someone who has suffered trauma to their vocal cords, or has conditions affecting speech, get their voice back. In fact, this is one of the reasons that Apple gave to [20]justify the inclusion of voice cloning tech in iOS in late 2023.
The fact that this technology is already widely available — whether on iDevices or through paid services or as open source models — is why we're even comfortable demonstrating how to deploy and run Zonos locally in the first place.
With that said, if you do choose to embrace AI text-to-voice capabilities, we encourage you to do so in the most respectful and responsible way possible. ®
Editor's Note: The Register was provided an RTX 6000 Ada Generation graphics card by Nvidia, an Arc A770 GPU by Intel, and a Radeon Pro W7900 DS by AMD to support stories like this. None of these vendors had any input as to the content of this or other articles.
[1] https://www.zyphra.com/post/beta-release-of-zonos-v0-1
[3] https://github.com/state-spaces/mamba
[6] https://huggingface.co/Zyphra/Zonos-v0.1-hybrid
[9] https://regmedia.co.uk/2025/02/12/source_audio_tobias_wells.mp3
[10] https://regmedia.co.uk/2025/02/12/gen_ai_tobias_verne.mp3
[11] https://regmedia.co.uk/2025/02/13/ai_generated_verne_tobias_hybrid.mp3
[12] https://www.theregister.com/2024/07/07/containerize_ai_apps/
[13] https://regmedia.co.uk/2025/02/12/zonos_demo_dashboard.png
[18] https://www.theverge.com/2024/9/9/24239903/amazon-audible-audiobook-narrators-ai-generated-voice-clones
[19] https://www.cbsnews.com/news/two-voice-actors-sue-ai-company-lovo/
[20] https://machinelearning.apple.com/research/personal-voice
Re: In Italy such technologies were used to impersonate the Minister of Defence...
Some fraudsters opened a bank account in my name, using stolen post as 'proof of address'. They then tried to empty one of my pension funds. I found out and contacted the pension fund company, and they called me back on some pretext. Having recorded the voice of the fraudster 'front man' and listened to mine, they concluded that we were two different people and that something naughty was going on. Had the fraudsters obtained enough recordings of my voice to fake it I don't know what would have happened. The issue is whether the AI voice can fool an AI voice recognition application well enough to transfer my pension.
Re: In Italy such technologies were used to impersonate the Minister of Defence...
The idea that a fund transfer can be authorized on the basis of someone recognizing someone's voice is... disturbing. It would be a problem regardless of the existence of this tech.
Magic
Because the one thing I love more than anything is hearing my own voice on recordings.
Cool.
Do they do software to create lifelike masks, 3D print keys and rob banks too?
Bonus points if you manage to phone Putin or Xi and declare war, bigly, in Trump's voice.
Re: Cool.
It would be deeply funny if someone just eroded party politics (not democracy) by generating extreme partisan AI candidates that get funding from the parties while ramping the rhetoric to an extreme. Then maybe we can move past this stupid era of party lines.
My bank...
...makes it extremely difficult to get a simple bank balance over the phone. They coerce me into saying a set phrase so that they can identify me by my voice. I never comply with this and I'm then stuck in a long queue to speak to someone. They then - somehow - after answering a few questions, say they've verified me by my voice. I want to be identified by *proper* security questions, not by the latest fad.
This thread should demonstrate to them that they are playing fast and loose with my money [overdraft]. (I'm not holding my breath). So if someone rings up and hacks my voice and siphons money out of my account, how can I prove that I did not make that call?
Speechless? Yes, probably the best policy under the circumstances.
Re: My bank...
"say they've verified me by my voice"
Good grief. Wasn't voice recognition demonstrated to be flawed thirty-three years ago in the movie Sneakers? If they could pull something like that off by splicing tape, imagine what modern tech could do, especially given the number of people likely to have sufficient voice samples available in social media posts.
The basic authentication rule is very simple: Something you have and something you know. If they want to use voice recognition as the "have" then fair enough. But it shouldn't ever be the only test made.
"This thread should demonstrate to them that they are playing fast and loose with my money"
Banks do that as a matter of course. The pay-by-bonk limit just keeps going up "because convenience" (so I've deactivated it on my cards). The PIN for the card is a mere four digits. The special prove-I'm-me PIN is only five digits, and rarely asked for. They still seem to think that possession of my phone means it is me, thus giving a possibility for somebody who stole my phone to access the necessary and required bank app [1], set up the passcode forgotten routine, and get a temporary code by text...to that same phone. Clever.
I think it says a lot that it is more complicated to set up K9 Mail to access a mailbox with the likes of Google [2] than it is to access my money.
1 - God only knows what happens for people who don't have (or want) a smartphone.
2 - The "terribly insecure" application specific password stuff.
Fortunately...
...there is no way anyone would use this with existing media to clone, say, a national leader who has access to top secret information or nuclear weapons.
Or every national leader who has similar access.
So that's a relief.
While there are a few genuinely good uses for this tech, as the article points out, I fear it's overwhelmingly going to be used for nefarious purposes more than any genuinely useful ones.
Those scams that try to con people into believing that their loved ones desperately need money will become way more convincing if all they need is 30 seconds of speech to create an AI voice they can then use to send voice messages to people.
Jeff Geerling, the YouTuber who does a lot of videos on Raspberry Pis, had his voice cloned by a Chinese company and used on one of their own videos without his permission last year. And he's a relatively small YouTuber.
Does this do (received) English accents properly?
Only asking.
Things have come a long way since Stepford…
When the wives had to read long lists of words.
The data was acquired from the web
and not purchased from a data broker. Or anyone else, like the rights holders?
Nah, it must be ok, because as we all know there is no copyright material on the web and if it just happened to be scraping every podcast, every bit of online radio playable in the browser, one or two million YouTube videos...
Re: The data was acquired from the web
That's the way it is now. If it's online in any form then it's fair picking.
Icon, because fair's fair, right?
Does listening to it sound weird to yourself, like listening to your actual voice on a recording, or come across as someone else's?
In Italy such technologies were used to impersonate the Minister of Defence...
... and ask rich entrepreneurs to send one million euro to a bank account, citing the need to pay secretly to free kidnapped journalists in the Middle East. At least one paid.
I wonder why so many people are investing resources in technologies that have few - if any - good uses but are a huge help to crooks.