OpenAI Whisper
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
GitHub repository: https://github.com/openai/whisper
Installation
pip install git+https://github.com/openai/whisper.git
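Once installed, one quick way to verify the package imports correctly is to list the model checkpoints it knows about; whisper.available_models() is part of the library's public API, a minimal check:

import whisper

# Prints the names of all available checkpoints,
# e.g. ['tiny.en', 'tiny', ..., 'medium', 'large']
print(whisper.available_models())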
Fix CUDA not detecting the GPU
If PyTorch cannot detect a GPU, Whisper falls back to the CPU, which is considerably slower. This usually means a CPU-only build of PyTorch is installed; reinstalling it from the CUDA-enabled wheel index fixes it:
pip uninstall torch
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
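After reinstalling, you can confirm that PyTorch sees the GPU before running Whisper. This is a standard torch check, not anything Whisper-specific:

import torch

# True means CUDA is available and Whisper will run on the GPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # Name of the first visible CUDA device
    print(torch.cuda.get_device_name(0))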
Example usage
# Transcribe
whisper input.mp3 --model medium.en --language en --task transcribe
# Translate
whisper japanese.wav --model large --language Japanese --task translate
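The same tasks are available from Python via the library's load_model/transcribe API. A minimal transcription sketch (input.mp3 is a placeholder file name):

import whisper

# Load the English-only medium checkpoint (downloaded on first use)
model = whisper.load_model("medium.en")

# transcribe() runs the full pipeline: audio loading, 30-second windowing,
# and decoding; fp16=False avoids a half-precision warning on CPU-only machines.
# Pass task="translate" instead to translate non-English speech into English.
result = model.transcribe("input.mp3", language="en", fp16=False)
print(result["text"])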
Available models and languages
There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.
| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|--------|------------|--------------------|--------------------|---------------|----------------|
| tiny   | 39 M       | tiny.en            | tiny               | ~1 GB         | ~32x           |
| base   | 74 M       | base.en            | base               | ~1 GB         | ~16x           |
| small  | 244 M      | small.en           | small              | ~2 GB         | ~6x            |
| medium | 769 M      | medium.en          | medium             | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | large              | ~10 GB        | 1x             |
For English-only applications, the .en models tend to perform better, especially for the tiny.en and base.en models.
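Language identification, mentioned in the introduction, is exposed through the model's detect_language method. The snippet below follows the pattern from the upstream README (audio.mp3 is a placeholder):

import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window the model expects
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and move it to the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect_language returns (detected tokens, {language code: probability})
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")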