OpenAI Whisper

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

GitHub Repository


pip install git+

Fix CUDA not detecting GPU

Whisper falls back to the CPU when no GPU is detected, which is considerably slower. Reinstalling PyTorch from a CUDA-enabled package index usually resolves this:

pip uninstall torch
pip install torch torchvision torchaudio --extra-index-url
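To confirm whether the reinstall worked, it helps to ask PyTorch directly whether it can see a GPU. The following is a minimal sketch, not part of Whisper itself: `pick_device` is a hypothetical helper name, and it assumes only that `torch.cuda.is_available()` reports GPU visibility.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        # Imported lazily so the check still runs when torch is not installed.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Whisper would run on: {pick_device()}")
```

If this still prints `cpu` on a machine with an NVIDIA GPU, the installed torch build is probably a CPU-only one.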

Example usage

# Transcribe
whisper input.mp3 --model medium.en --language en --task transcribe
# Translate
whisper japanese.wav --model large --language Japanese --task translate
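The same two tasks can also be run from Python instead of the command line. This is a rough sketch assuming the `whisper` package's `load_model` and `transcribe` functions; the wrapper `run_whisper` and the `VALID_TASKS` tuple are hypothetical names introduced here for illustration.

```python
VALID_TASKS = ("transcribe", "translate")


def run_whisper(audio_path, model_name="medium.en", language="en", task="transcribe"):
    """Transcribe or translate one audio file and return the resulting text."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    # Imported lazily: loading whisper pulls in torch and is slow.
    import whisper

    model = whisper.load_model(model_name)
    # language and task are passed through as decoding options.
    result = model.transcribe(audio_path, language=language, task=task)
    return result["text"]
```

Calling `run_whisper("input.mp3")` mirrors the first command above, and `run_whisper("japanese.wav", model_name="large", language="ja", task="translate")` mirrors the second, using the language code `ja` for Japanese.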

Available models and languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.

Size    Parameters  English-only model  Multilingual model  Required VRAM  Relative speed
tiny    39 M        tiny.en             tiny                ~1 GB          ~32x
base    74 M        base.en             base                ~1 GB          ~16x
small   244 M       small.en            small               ~2 GB          ~6x
medium  769 M       medium.en           medium              ~5 GB          ~2x
large   1550 M      N/A                 large               ~10 GB         1x

For English-only applications, the .en models tend to perform better, especially tiny.en and base.en.
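One way to use the table is to pick the largest model that fits in the available GPU memory. The sketch below hard-codes the approximate VRAM figures listed above; `choose_model` is a hypothetical helper, and preferring `large` over `medium.en` for English audio when memory allows is an assumption rather than a documented recommendation.

```python
# (size, has an English-only variant, approximate VRAM in GB), largest first.
MODELS = [
    ("large", False, 10),
    ("medium", True, 5),
    ("small", True, 2),
    ("base", True, 1),
    ("tiny", True, 1),
]


def choose_model(vram_gb: float, english_only: bool = False) -> str:
    """Return the name of the largest model whose VRAM estimate fits."""
    for size, has_en, needed_gb in MODELS:
        if needed_gb <= vram_gb:
            # Use the .en variant only when requested and when it exists.
            return f"{size}.en" if english_only and has_en else size
    raise ValueError(f"no model fits in {vram_gb} GB of VRAM")


print(choose_model(6, english_only=True))
```

The returned name can be passed straight to the `--model` flag.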