Mastering Voice Cloning: A Step-by-Step Guide to Producing High-Quality Digital Replicas

Imagine capturing the essence of someone’s voice, not just the tone or pitch, but the unique characteristics that make it distinctly theirs.

Our journey begins with finding a good-quality voice sample. We will use several AI models to extract and clean up this sample. With it, we can teach another AI to speak in the same way, with the same timbre. All the AI tools used here can be found on Replicate, a website that offers lots of free AI models such as LLMs, image generators, and music generators.

You can surprise anyone with these generated voices. I can guarantee that your audience won't notice a thing!

Step 1: Find a good-quality voice

Sound quality matters: the better the input recording, the better the generated voice. A common saying in the AI world is "garbage in, garbage out": if you give the AI bad data, it will give you bad results. So choose a clear voice, recorded with a good microphone.

In this tutorial, I'll show you the tools and settings I use to get the best possible results. Feel free to experiment with the settings yourself, or to try equivalent tools that might better suit your application.

Downloading the Voice from YouTube

We need to find interviews or speeches by the person whose voice we want to clone. As noted above, pick the recording with the best sound quality you can find.

For this example, we will use Konbini's interview of Kendall Jenner: https://www.youtube.com/watch?v=9h8_FvH8xPQ

To download the audio of the YouTube video, I recommend using YTMP3.
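If you would rather script this step than use a web page, the sketch below builds a command line for yt-dlp, a separate open-source command-line tool (it is not YTMP3, and it is my substitution, not part of the original workflow). It assumes yt-dlp and ffmpeg are installed; the function names and the "interview" output name are hypothetical.

```python
import shutil
import subprocess


def build_download_command(video_url: str, output_stem: str = "interview") -> list[str]:
    """Return a yt-dlp command line that saves a video's audio track as MP3."""
    return [
        "yt-dlp",
        "--extract-audio",                      # keep only the audio track
        "--audio-format", "mp3",                # convert it to MP3 (needs ffmpeg)
        "--output", f"{output_stem}.%(ext)s",   # yt-dlp fills in the extension
        video_url,
    ]


def download_audio(video_url: str, output_stem: str = "interview") -> None:
    """Run yt-dlp; raises if the tool is missing or the download fails."""
    if shutil.which("yt-dlp") is None:
        raise RuntimeError("yt-dlp was not found on PATH")
    subprocess.run(build_download_command(video_url, output_stem), check=True)
```

Calling download_audio("https://www.youtube.com/watch?v=9h8_FvH8xPQ") should leave an interview.mp3 file in the current folder, provided both tools are available.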

YTMP3 webpage

Extracting a Short Sequence of the Voice

Next, we need to find a short sequence in which only the voice to clone can be heard. For this task, I recommend this Audio Cutter. One minute of extracted voice should be more than enough. Here, we will take the section between 0:08 and 1:03.
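The same cut can be sketched in a script. The example below is an alternative to the Audio Cutter web page, not the tool used in this tutorial: it assumes ffmpeg is installed, and the function and file names are hypothetical. It converts the two timestamps to seconds and asks ffmpeg to keep the 55 seconds between them.

```python
import subprocess


def to_seconds(timestamp: str) -> int:
    """Convert "SS", "MM:SS" or "HH:MM:SS" into a number of seconds."""
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def build_cut_command(src: str, dst: str, start: str, end: str) -> list[str]:
    """Return an ffmpeg command line that keeps the audio between start and end."""
    start_sec = to_seconds(start)
    duration = to_seconds(end) - start_sec
    if duration <= 0:
        raise ValueError("end must be later than start")
    return [
        "ffmpeg",
        "-ss", str(start_sec),   # seek to the start of the excerpt
        "-i", src,
        "-t", str(duration),     # keep this many seconds
        "-c", "copy",            # copy the audio stream without re-encoding
        dst,
    ]


def cut_audio(src: str, dst: str, start: str, end: str) -> None:
    """Run ffmpeg; raises if the command fails."""
    subprocess.run(build_cut_command(src, dst, start, end), check=True)
```

For the excerpt used here, the call would be cut_audio("interview.mp3", "excerpt.mp3", "0:08", "1:03").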

Audio cutter webpage

Removing Unwanted Sounds

Finally, we need to remove the background music. This step is optional, but in this case the original interview has music in the background, which could confuse the voice cloning AI. We will use an AI model called “Demucs” to separate the voice from the other sounds: https://replicate.com/cjwbw/demucs.

Replicate demucs
1. Upload your MP3 in "audio"
2. Select "mdx_extra" in "model_name"; it gives better results than the default model
3. Select "vocals" in "stem"
4. Click on the "Run" button
5. When the process has finished, click on "Download"
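The steps above can also be run from code. This is a minimal sketch, assuming the official replicate Python client is installed (pip install replicate) and the REPLICATE_API_TOKEN environment variable is set. The input names come from the form fields listed above; the function names are hypothetical, and model_ref must be the full "cjwbw/demucs:<version id>" string copied from the model page.

```python
def build_demucs_input(audio_file) -> dict:
    """Mirror the form fields used in the steps above."""
    return {
        "audio": audio_file,         # open file handle of the MP3
        "model_name": "mdx_extra",   # the model variant recommended above
        "stem": "vocals",            # isolate the voice
    }


def run_demucs(mp3_path: str, model_ref: str):
    """Send the MP3 to the Demucs model hosted on Replicate and return its output.

    model_ref is "cjwbw/demucs:<version id>", copied from the model page.
    """
    import replicate  # third-party client, imported lazily

    with open(mp3_path, "rb") as audio_file:
        return replicate.run(model_ref, input=build_demucs_input(audio_file))
```

I have not pinned a version id here on purpose: copy the current one from the model page rather than relying on a value that may be out of date.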

Once the download has completed, extract the ZIP archive, which contains all the separated tracks. Keep only the vocal track and delete the others.
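If you process many files, this clean-up can be scripted with Python's standard zipfile module. The sketch below extracts only the files whose name contains "vocals"; that naming is an assumption about the archive contents, so check the file names in your own download and adjust the keyword if needed. The function name is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_vocals(archive, out_dir, keyword: str = "vocals") -> list[Path]:
    """Extract only the tracks whose file name contains keyword.

    archive can be a path or a file-like object. The other tracks stay in
    the ZIP, so there is nothing to delete afterwards.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive) as zip_file:
        for info in zip_file.infolist():
            if info.is_dir():
                continue
            name = Path(info.filename).name   # drop any folder prefix
            if keyword in name.lower():
                target = out_dir / name
                target.write_bytes(zip_file.read(info))
                extracted.append(target)
    return extracted
```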

Here is the final audio:

Step 2: Clone the Voice

We will continue to use Replicate for the voice cloning. Generation can take several minutes with the free version, but paying speeds it up considerably.

Text transcription

The voice cloning AI needs the text transcript of the voice we extracted previously. Luckily, there is (another) AI to do that; we don’t need to do it manually. This AI is called Whisper, and we can use it here: https://replicate.com/vaibhavs10/incredibly-fast-whisper.

Whisper replicate
1. Upload your MP3 in "audio"
2. Click on the "Run" button
3. When the transcription has finished, select only the beginning of the output, the value after "text":

Here is our transcription:

Dogs. I love dogs. I think dogs are super, like dogs want to be with you. They want to cuddle with you. Sometimes cats are like too independent for me. Demon emoji. I don't know, it's sassier. Calvin Klein, I guess. Batman. I did really like the Joker movie though. Ooh, Coachella or Cannes Festival? It depends, I mean they're different. So, just depends what mood you're in. Are you in an LA desert mood or are you in a French beach side mood? Doberman, I have two Dobermans. My dogs are amazing, I just got one of them. He's a little puppy, so I don't know him very well yet, but he's awesome, he's super sweet and he loves to play. With Six, my older one, Six is awesome. She has such a personality. She's as big as Weirdo. I swear she's me and a dog.
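If you call the model from code instead of the web page, selecting the transcription amounts to reading the "text" field of the output. The sketch below assumes the output is a JSON object (or an already parsed dictionary) with a "text" key, as the web page suggests; the function name is hypothetical and the exact output shape should be checked against the model page.

```python
import json


def transcript_from_output(output) -> str:
    """Return the full transcription from a Whisper output.

    Accepts either the raw JSON string or the already parsed dictionary.
    """
    if isinstance(output, str):
        output = json.loads(output)
    return output["text"].strip()
```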

With Whisper, we can also separate the voices when there are several speakers, using the diarise_audio option. We won't go into the subject here, but perhaps a future tutorial will teach you how to transcribe video meetings into text and summarize them using ChatGPT.

💡 Interested in a tutorial on how to transcribe a video meeting to text and can't wait? Contact us to request a free demonstration!

Finally, Voice Cloning

The voice cloning AI is available here: https://replicate.com/cjwbw/voicecraft. We need to upload the MP3 and the transcription we created earlier. We also need to tell the model which opening portion of the audio it should learn from. In the cut_off_sec parameter, we give the number of seconds to use; in our case, we take Kendall's first response, about her preference between cats and dogs. In orig_transcript_until_cutoff_time, we put only the part of the transcript that corresponds to these first 8 seconds.

Voicecraft replicate
1. Upload your MP3 in "orig_audio"
2. Paste the transcription in "orig_transcript"
3. Write the text you want the AI to read in "target_transcript"
4. In "cut_off_sec", enter the number of seconds from the start of the audio that the AI should use to learn the voice
5. In "orig_transcript_until_cutoff_time", enter the transcription of those first seconds
6. Click on the "Run" button
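The form above can be sketched as a small helper that builds the model input. The field names are the ones listed in the steps. The function names are hypothetical, and the check that the cut-off transcript is the beginning of the full transcript is my own sanity check, not a documented requirement of the model. As before, running it assumes the replicate Python client, a REPLICATE_API_TOKEN, and a model_ref of the form "cjwbw/voicecraft:<version id>" copied from the model page.

```python
def build_voicecraft_input(orig_audio, orig_transcript, target_transcript,
                           cut_off_sec, transcript_until_cutoff) -> dict:
    """Build the VoiceCraft input from the fields described above."""
    orig_transcript = orig_transcript.strip()
    prefix = transcript_until_cutoff.strip()
    if cut_off_sec <= 0:
        raise ValueError("cut_off_sec must be positive")
    # Sanity check (an assumption): the cut-off text should open the transcript.
    if not prefix or not orig_transcript.startswith(prefix):
        raise ValueError("the cut-off transcript must be the start of orig_transcript")
    return {
        "orig_audio": orig_audio,
        "orig_transcript": orig_transcript,
        "target_transcript": target_transcript,
        "cut_off_sec": cut_off_sec,
        "orig_transcript_until_cutoff_time": prefix,
    }


def run_voicecraft(mp3_path: str, model_ref: str, **fields):
    """Send the request to Replicate and return the model output."""
    import replicate  # third-party client, imported lazily

    with open(mp3_path, "rb") as audio_file:
        payload = build_voicecraft_input(audio_file, **fields)
        return replicate.run(model_ref, input=payload)
```

The keyword arguments passed to run_voicecraft are orig_transcript, target_transcript, cut_off_sec and transcript_until_cutoff.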

And now, here is the final result:

It's stunning how closely Kendall Jenner's tone of voice and intonation have been mimicked by the AI. You can easily automate the process using applications like Zapier. Thanks to this AI, you can generate content for your videos in different languages, with very little effort.

Don't forget that you need the permission of the people whose voices you want to clone. I've used Kendall Jenner as an example here, but it's for educational purposes only.

Paul CHAUMEIL

CTO @Stackadoc