TTS example #249
Purus posted on GitHub
Would be great if there were support for converting text to speech automatically and including it as part of the video output.
I am adding a $60 bounty to whoever submits a text-to-speech example that works in preview mode as well as render mode and that uses a cloud TTS API. (We should be able to release the example as part of Remotion but of course you will maintain credit)
@jonnyburger has funded $60.00 to this issue.
To get the timing of the words spoken, you can reverse the process and use STT APIs (Google supports 125 languages and accents). Their API gives you a list of phrases spoken (the “alternatives” array) and, within each phrase, the timing of each individual word. You can use these as the basis of the text animation timing. Of course STT is less reliable than TTS, so the recognized words may not always match the original input text sent to TTS, and some manual or smart match-up editing may be required. I used this timing info just last Friday in a prototype video app I'm working on. This word/phrase timing is commonly used to generate video subtitles.
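For illustration, here is a rough TypeScript sketch of how the nested response shape described above (results → alternatives → words) could be flattened into a word-timing list and mapped to video frames. The field names are simplified assumptions modeled loosely on Google's Speech-to-Text response, not its exact schema:

```typescript
interface WordInfo {
  word: string;
  startTime: number; // seconds into the audio track
  endTime: number;   // seconds into the audio track
}

// Simplified assumption of an STT response: phrases ("results"),
// each with ranked "alternatives" containing per-word timings.
interface SttResponse {
  results: {
    alternatives: {
      transcript: string;
      words: WordInfo[];
    }[];
  }[];
}

// Collect every word with its timing, taking the top alternative per phrase.
function flattenWordTimings(response: SttResponse): WordInfo[] {
  return response.results.flatMap((result) =>
    result.alternatives.length > 0 ? result.alternatives[0].words : []
  );
}

// Convert a word's start time to a video frame, as a basis for animation timing.
function startFrame(word: WordInfo, fps: number): number {
  return Math.round(word.startTime * fps);
}
```

With timings flattened like this, each word's `startFrame` can drive when its text animation begins.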
@JonnyBurger What do you think about credentials for these TTS services? Should they be stored in .env settings and, when rendering a video that uses TTS, checked to make sure they are set?
Good question!
We don't currently have support for .env files (but this is a great idea, I will create another issue for this). So the best way for the moment I think is using input props: https://www.remotion.dev/docs/parametrized-rendering#input-props
Do you see any problem with using 2 different voices? One for preview (SpeechSynthesis) and another during render (Cloud Voice). I was thinking about it and found a problem with this: for rendering I need to download the audio file, but I can't do that during preview since I don't have access to the filesystem.
Cloud Voice audio output can be saved to the cloud file system as a cache audio file or downloaded locally. Audio file name could be a hash of input text and voice params and so only recomputed when text/params change. You can then stream the local or cloud audio file.
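A minimal sketch of that caching idea, using Node's built-in crypto module. The `VoiceParams` shape here is a made-up placeholder for whatever voice options your cloud TTS provider accepts, not any particular API:

```typescript
import { createHash } from "node:crypto";

// Hypothetical voice parameters; real cloud TTS APIs will differ.
interface VoiceParams {
  voiceName: string;
  languageCode: string;
  speakingRate: number;
}

// Derive a stable cache file name from the input text and voice params,
// so the audio is only re-synthesized when either of them changes.
function ttsCacheFileName(text: string, params: VoiceParams): string {
  const hash = createHash("sha256")
    .update(JSON.stringify({ text, params }))
    .digest("hex");
  return `tts-${hash.slice(0, 16)}.mp3`;
}
```

Before calling the cloud API, check whether a file with this name already exists locally or in cloud storage, and stream it directly if so.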
Swapping Chrome APIs with Cloud APIs
One thing to keep in mind as you plan the roadmap for TTS (and perhaps STT) features is that there are important differences between what you can achieve with SpeechSynthesis and a corresponding cloud service. For simple TTS they could be swapped. Caching audio files generated from the cloud service could improve the DX, which may be sufficient for a basic MVP solution. However, I recommend that you look further down the roadmap and consider where these cloud services might take you as you design the current Remotion speech architecture and APIs.
Cloud service audio events
Cloud TTS services not only generate audio but can also generate associated events timed to correlate with the audio. Note that different cloud services offer different features.
Word/Phrase events
Using STT you can emit a stream of words and phrases with associated audio track timing. This is typically used to generate subtitles but can be repurposed for animation, either in conjunction with TTS or independently on any audio source containing speech. By pre-capturing these timed events, a frame render might compute the time until the next word/phrase occurs, or since the last word/phrase occurred, and use this to animate word text or other related objects. This may be particularly important when handling text or audio translations, as the speech timing can vary widely between translations.
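A small sketch of that per-frame computation, assuming the timed events have already been captured and sorted by start time (the `TimedEvent` and `EventContext` shapes are hypothetical):

```typescript
interface TimedEvent {
  startTime: number; // seconds from the start of the audio track
}

interface EventContext {
  sincePrevious: number | null; // null before the first event
  untilNext: number | null;     // null after the last event
}

// At a given playback time, compute how long ago the previous event
// fired and how long until the next one. These deltas can drive
// easing curves for text or object animations in a frame render.
function eventContextAt(events: TimedEvent[], time: number): EventContext {
  let previous: TimedEvent | null = null;
  let next: TimedEvent | null = null;
  for (const event of events) {
    if (event.startTime <= time) {
      previous = event;
    } else {
      next = event;
      break;
    }
  }
  return {
    sincePrevious: previous ? time - previous.startTime : null,
    untilNext: next ? next.startTime - time : null,
  };
}
```

Because the event list is pre-captured, this lookup stays deterministic across preview and render, regardless of which speech service produced the events.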
Just discovered that IBM's TTS service can deliver word timing, removing the need for STT; however, they only support 16 languages.
Lip Sync events
Azure TTS can emit a stream of viseme lip-sync events that can be used to animate lips on a 2D or 3D avatar.
Audio / Video track events
Many existing audio and video files store timed events that could be very useful in animation.
In summary, rendering may need to support multiple tracks of timed events as input to the rendering process, with the ability to compute the time to the previous and next events along with associated event metadata.
I have been working on it. Here is the example: https://github.com/FelippeChemello/Remotion-TTS-Example. IssueHunt asks me to make a PR, but I don't think that was the purpose of this issue. How can I proceed, @JonnyBurger?
@FelippeChemello Wow thank you! This looks super awesome! I can only pay out if you submit a PR apparently, can you submit a bogus PR?
I will close it but pay it out immediately 🤝
Plus, it would be cool if I could become a collaborator of the repository, so that, should it be necessary, I can adjust the code, upgrade dependencies, or change the README. Our goal is to provide a streamlined set of templates for different use cases.
@jonnyburger has funded $55.00 to this issue.
@jonnyburger has rewarded $103.50 to @felippechemello. See it on IssueHunt
- :moneybag: Total deposit: $115.00
- :tada: Repository reward (0%): $0.00
- :wrench: Service fee (10%): $11.50