I’ve done my fair share of manually creating subtitles over the years. It can be quite a tedious and time-consuming process, so for a while now I’ve been on the lookout for a way to make that process faster.

Over the last few days, I’ve continued playing around with Whisper, which I wrote about before. While experimenting, I found a workflow I wish I’d had access to years ago: a process that allows for quick generation of subtitles from preexisting scripts, without setting a single timestamp manually. I’ve already got several projects I’d like to try this new workflow on, but first I wanted to share one completed project here!

The Song 🎶

So, a couple days ago a dear friend of mine released a new song on his YouTube channel.

I’ve been playing it on repeat since it came out.

It’s sooo good. 😭

Give it a listen! And if you like it, subscribe, like, comment, share it around, etc. 🙃 He’s got a small channel, but I think it should be bigger 😉

Anyways, late last night, after playing the song on my computer for the 200th time, I decided that I wanted to have the song on my phone too. So I used yt-dlp to download the video and audio to my computer.

Then, for the fun of it, I decided to run my Whisper script on it.

Whisper quickly zoomed through the song and created a transcript with its guesses at the lyrics. It also created a VTT subtitle file, with timestamps so that the words could be read in sync with the lyrics being sung in the song.

Now, the tiny.en model that I have configured with my main Whisper command has an implicit trade-off: it’s really fast, but less accurate. It’s normally pretty good, but there are errors.

For example, at one point, Aaron’s lyrics are:

Are we so different from one another?

Where Whisper outputted:

00:01:11.000 --> 00:01:26.000
 Dowing so different from what another.

Less than ideal.1 😬

At this point, I stopped to ponder for a moment…

“I have the timestamps for when words are said in this song, but tied to the wrong words…
I also have the right words from the description of the video, but they’re not yet linked to timestamps…
If only there were a way to take the lyrics out of the wrong subtitle file, and put the correct lyrics in instead.

I could try to write a script for that… but the script would basically need to have a grasp of human language to know what needs to be replaced where…

Wait a minute! Large Language Models are a thing!
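(For the record, the naive non-LLM script I was imagining would have been only a few lines. Here’s a purely hypothetical Python sketch — the names are made up, and the one-lyric-line-per-spoken-cue assumption it bakes in is exactly where it falls apart on real transcripts:)

```python
def swap_captions(cues, lyric_lines):
    """Replace each non-empty caption with the next original lyric line.

    cues: list of (start, end, caption) tuples pulled from a VTT file.
    Naive assumption: the lyric lines correspond one-to-one, in order,
    with the spoken cues — which messy real transcripts routinely
    violate, hence the need for something with a grasp of language.
    """
    lyrics = iter(lyric_lines)
    return [(start, end, next(lyrics) if caption else caption)
            for start, end, caption in cues]
```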

The Prompt 🦾

So, I pulled up a chat window of vim-ai, which allows me to use OpenAI’s GPT models (like ChatGPT) from my terminal, and gave it the following prompt:

Below is lyrics to a song, and a VTT subtitle file generated
by a STT engine guessing at the words for that song.
Please output a corrected VTT, replacing any variances in words or punctuation
in the VTT file with that of the Original Lyrics instead.

# Original Lyrics

We hear their cries and screams
We walk on by offering sympathies
Are we so different from one another?

# VTT Subtitle File


00:00:00.000 --> 00:00:17.000

00:00:17.000 --> 00:00:23.000
 We hear their cries and screams

00:00:23.000 --> 00:00:29.000
 We walk on by offering sympathies

00:01:11.000 --> 00:01:26.000
 Dowing so different from what another.

I copied the original lyrics that Aaron had shared in the description, and also copied the whole VTT subtitle file that Whisper generated. (VTT files are just text files with a specific syntax and a different file extension, so you can open them without problem in Notepad, TextEdit, or whatever plain-text editor your OS has.)
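Since it’s plain text, pulling the cues out programmatically is easy too. Here’s a minimal, illustrative Python sketch (not part of my actual workflow) that splits a VTT string into (start, end, caption) tuples:

```python
def parse_vtt(text):
    """Split a VTT string into (start, end, caption) tuples.

    Cue blocks are separated by blank lines; the "WEBVTT" header
    block has no "-->" and is skipped. Minimal sketch only — it
    ignores cue identifiers, cue settings, and styling.
    """
    cues = []
    for block in text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if lines and "-->" in lines[0]:
            start, _, end = lines[0].partition(" --> ")
            caption = "\n".join(line.strip() for line in lines[1:])
            cues.append((start.strip(), end.strip(), caption))
    return cues
```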

Then I ran the prompt, and let it chug.

And, it worked!!

A few seconds later, it finished outputting a new, corrected version of the VTT subtitle file for the song, using the words from the original lyrics instead of the misheard words that Whisper had generated.

00:01:11.000 --> 00:01:26.000
 Are we so different from one another?

This is very cool to me!

It means that, if you have any spoken audio (in a language that Whisper supports) and the original script for that audio, you can now quickly generate a (basically?) perfect subtitle file for that audio from that script, using just a couple commands! 😎

If you watch the song on YouTube now, the lyrics that show up as subtitles are the ones that were generated with the above process. 😊

Limitations, Further Ponderings, and Future Project Ideas 🤔

I used the GPT-4 8k-token model for this experiment, which means it would probably only work with content up to 2,000–3,000 words in length (two copies of the target text, plus timestamps and prompt). I have a couple projects I want to try this on which will probably be more than 10,000 words each, so the GPT-4 model currently available through the API won’t work for them.
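As a rough sanity check on that estimate (assuming ~0.75 English words per token, a common rule of thumb for GPT tokenizers — every constant here is a back-of-the-envelope guess):

```python
def estimate_max_words(context_tokens, words_per_token=0.75,
                       copies=2, overhead=0.25):
    """Rough upper bound on target-text length for this workflow.

    The prompt holds the text twice (original lyrics + VTT), plus
    timestamps and instructions (modeled here as 25% overhead).
    All constants are assumptions, not measurements.
    """
    usable_words = context_tokens * words_per_token * (1 - overhead)
    return int(usable_words / copies)

# An 8k context lands in the low thousands of words per copy,
# while a 100k context comfortably clears 10,000 words.
```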

My plan right now is to try Claude’s 100k-token model on those projects. I repeated this experiment with the song on Claude’s 9k-token model, and it worked perfectly. So, I’m hopeful about scaling it up with their larger model.

I’m also curious about the idea of using this with different languages. I have a couple Turkish, Spanish, and Arabic projects I want to try this on. Whisper is great in English, but less accurate in less-resourced languages. I’m curious to see how various LLMs (GPT, Claude, maybe LLaMa?) perform on this task in other languages, where the source subtitle file might be even less accurate.2

  1. NOTE: I later ended up running the song through Whisper again with one of their larger, more accurate, slower models, and it performed nearly perfectly, even getting the capitalization on “Father” and “He” correct when talking about God, which I found super impressive. ↩︎

  2. I recently came across poe.com, which allows you to use a number of different LLMs and machine learning models (including a number of OpenAI GPT models, a number of Claude models, and a number of Meta’s LLaMa models) from one single interface. I like to use vim-ai for most of my day-to-day LLM needs, as it’s super fast to access; but the ability that Poe offers, to be able to test a number of LLMs from one interface, to see how each of them perform on the same task, is quite impressive and useful to me! ↩︎