Leaving this for the people who may want the answer to this question later:
For my use case, I ended up not using ffmpeg.exe after all.
Instead, I used youtube-dl to download the video and convert it to an mp4 file, then used moviepy’s VideoFileClip to split the video into multiple segments and sent each segment to Whisper for transcription.
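In case it helps, here is a rough sketch of the splitting step, not my exact code. The names `segment_bounds` and `split_video`, the 600-second segment length, and the output file naming are all made up for illustration. It assumes moviepy is installed; the import path and the `subclip`/`subclipped` method name differ between moviepy 1.x and 2.x, so the sketch tries both.

```python
def segment_bounds(duration, segment_length):
    """Return (start, end) pairs in seconds that cover 0..duration."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    bounds = []
    start = 0.0
    while start < duration:
        # The last segment is shortened so it never runs past the end.
        end = min(start + segment_length, duration)
        bounds.append((start, end))
        start = end
    return bounds


def split_video(path, segment_length=600.0, prefix="segment"):
    """Write consecutive mp4 segments of `path` and return their file names.

    Hypothetical helper: the default length and naming are assumptions.
    """
    try:
        from moviepy.editor import VideoFileClip  # moviepy 1.x
    except ImportError:
        from moviepy import VideoFileClip  # moviepy 2.x

    outputs = []
    clip = VideoFileClip(path)
    try:
        # moviepy 2.x renamed subclip() to subclipped().
        cut = getattr(clip, "subclipped", None) or clip.subclip
        for i, (start, end) in enumerate(segment_bounds(clip.duration, segment_length)):
            out_name = f"{prefix}_{i:03d}.mp4"
            cut(start, end).write_videofile(out_name)
            outputs.append(out_name)
    finally:
        clip.close()
    return outputs
```

Calling `split_video("video.mp4")` would write `segment_000.mp4`, `segment_001.mp4`, and so on, and each of those files can then be sent to Whisper one at a time.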
I would’ve liked to use the audio only, but since I didn’t know how to do that at the time, I used the video. For the audio approach, I recently found this thread on the Forums that should help with that (the title is a bit misleading, so don’t let that put you off):