
Whisper is OpenAI's open-source speech recognition model; it transcribes and translates audio across 99 languages and is free to self-host or available at $0.006/minute via the managed API.
Research · 9.0
Word-level timestamp output and multilingual transcription covering 99 languages make Whisper the standard transcription layer for qualitative research involving recorded interviews, focus groups, and field recordings in non-English languages where proprietary ASR services lack coverage.
Content Creation · 8.8
Podcast transcription, video subtitle generation with word-level timestamps, and lecture-to-text conversion at $0.006/minute via the managed API (or zero cost self-hosted) cover the primary async transcription workflows for content creators without real-time requirements.
Coding · 9.2
The MIT-licensed open-source model with a documented Python API, the whisper.cpp C++ implementation (38,000 GitHub stars), and the faster-whisper optimised production library enable developers to embed transcription into applications without proprietary model lock-in; self-hosted deployment eliminates per-call billing for high-volume voice-enabled products.
Automation · 8.5
Whisper integrates directly into LLM pipelines as the speech-to-text layer: audio input is transcribed to text and passed to downstream models for summarisation, classification, or command execution without intermediate tooling. The 25MB file limit requires pre-processing for long-form audio automation workflows.
Data Analysis · 8.0
Some 652 fine-tuned Whisper derivatives on Hugging Face cover specialised transcription domains including healthcare (clinical notes), legal (deposition transcription), and call-centre audio; word-level timestamps enable speaker-turn analysis and silence detection in meeting analytics pipelines.
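The transcribe-then-process pattern described under Automation can be sketched with injected components; the transcriber and summariser below are hypothetical stand-ins for a Whisper call and a downstream LLM call, not real API bindings:

```python
from typing import Callable


def speech_to_action(audio_path: str,
                     transcribe: Callable[[str], str],
                     process: Callable[[str], str]) -> str:
    """Run audio through a speech-to-text layer, then hand the
    transcript to a downstream model (summarisation, classification,
    or command execution) with no intermediate tooling."""
    transcript = transcribe(audio_path)
    return process(transcript)


# Stub components standing in for Whisper and an LLM call.
fake_whisper = lambda path: "ship the quarterly report by friday"
fake_llm = lambda text: f"TASK: {text}"

print(speech_to_action("meeting.wav", fake_whisper, fake_llm))
```

Because the two stages are plain callables, the same pipeline function works whether transcription runs against the managed whisper-1 endpoint or a self-hosted model.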
Whisper is an open-source automatic speech recognition (ASR) model developed by OpenAI, released under the MIT license in September 2022 and trained on 5 million hours of audio across 99 languages. A single model supports multilingual transcription, translation into English from any supported language, language identification, and word-level timestamp generation. Size variants from tiny (39M parameters, 1GB VRAM) through large-v3 (1.55B parameters, 10GB VRAM) trade memory requirements against accuracy. The managed API (whisper-1) charges $0.006/minute with no setup cost; self-hosting eliminates per-minute fees but requires GPU infrastructure. The API enforces a hard 25MB file size limit, so longer recordings must be chunked before upload; it does not support real-time streaming, and accuracy drops materially on telephony-quality call-centre audio (17.7% WER). Speaker diarization is not available on whisper-1; it requires the separate GPT-4o Transcribe with Diarization model.
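Because the API rejects uploads over 25MB, long recordings need to be split before upload. A minimal sketch of the chunking arithmetic, assuming constant-bitrate audio; the 128 kbps figure and the 5% safety margin for container overhead are illustrative assumptions, not API constants:

```python
import math

API_LIMIT_BYTES = 25 * 1024 * 1024  # whisper-1 hard file-size limit


def max_chunk_seconds(bitrate_kbps: float, safety: float = 0.95) -> float:
    """Longest audio chunk (seconds) that fits under the 25MB limit
    for a given constant bitrate, with a margin for container overhead."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return API_LIMIT_BYTES * safety / bytes_per_second


def chunks_needed(duration_s: float, bitrate_kbps: float) -> int:
    """Number of chunks required for a recording of given length."""
    return math.ceil(duration_s / max_chunk_seconds(bitrate_kbps))


# A two-hour recording at 128 kbps must be split into several chunks:
print(chunks_needed(7200, 128))
```

At 128 kbps a chunk can run roughly 26 minutes, so a two-hour file needs five uploads; lower bitrates allow proportionally longer chunks.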
Pricing
| Plan | Model | Usage Limits | Price |
|---|---|---|---|
| Self-hosted | whisper-tiny (39M params, 1GB VRAM, 32x real-time), whisper-base, whisper-small, whisper-medium, whisper-large-v2, whisper-large-v3 (1.55B params, 10GB VRAM), whisper-large-v3-turbo (216x real-time); all model weights available on GitHub and Hugging Face | No usage limits; hardware-bound throughput; tiny model processes 32x faster than real-time on consumer CPU; large-v3 requires 10GB VRAM; large-v3-turbo transcribes 60-minute file in approximately 17 seconds on optimised hardware | free |
| Managed API | Runs whisper-large-v2 internally; no speaker diarization; 25MB file limit; JSON output with timestamps; no streaming | 25MB file size limit per API request; no real-time streaming support; rate limits based on tier (Tier 1–5 RPM/RPD limits via OpenAI platform); charged per minute rounded to nearest second; no minimum charge | whisper-1, $0.006/min |
| OpenAI API tier system | Higher API tiers unlock higher RPM and RPD limits; no model version difference between tiers for Whisper | Tier 1–5 RPM/RPD limits set via the OpenAI platform | — |
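The managed-API billing rule in the table (duration rounded to the nearest second, billed at $0.006 per minute, no minimum charge) works out as follows; the helper is an illustrative sketch, not an official SDK function:

```python
WHISPER_1_PRICE_PER_MIN = 0.006  # USD, managed whisper-1 API


def api_cost_usd(duration_seconds: float) -> float:
    """Estimated whisper-1 charge: duration rounded to the nearest
    second, billed at $0.006/minute, with no minimum charge."""
    billed_seconds = round(duration_seconds)
    return billed_seconds / 60 * WHISPER_1_PRICE_PER_MIN


# A one-hour recording costs $0.36:
print(f"${api_cost_usd(3600):.2f}")
```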
MIT-licensed model weights on GitHub (75,000 stars), Python package installation via pip, whisper.cpp for C++ and mobile deployment, and faster-whisper for 4x speedup via INT8 quantization provide a complete self-hostable production stack; managed API at $0.006/min eliminates infrastructure overhead for <500 hours/month workloads where per-minute cost is less than GPU amortisation.
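The <500 hours/month break-even claim can be checked with simple arithmetic; the $180/month GPU figure below is a hypothetical placeholder for your actual amortised infrastructure cost, not a number from the source:

```python
PRICE_PER_MIN = 0.006  # whisper-1 managed API, USD


def breakeven_hours(gpu_monthly_usd: float) -> float:
    """Monthly transcription volume above which self-hosting is
    cheaper than the managed API, given amortised GPU cost."""
    return gpu_monthly_usd / (PRICE_PER_MIN * 60)


# With a hypothetical $180/month amortised GPU, break-even is ~500 h:
print(f"{breakeven_hours(180.0):.0f} hours/month")
```

Below the break-even volume the per-minute API fee is cheaper than running your own GPU; above it, self-hosting wins.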
99-language coverage including low-resource languages trained on 5 million hours of diverse audio, plus English translation capability from any supported language in a single model, covers cross-lingual qualitative research workflows that proprietary ASR services do not support; self-hosted deployment keeps recorded interview data on institutional infrastructure.
Free self-hosted deployment on consumer-grade hardware (tiny model runs on CPU; medium on 5GB VRAM GPU) covers podcast and video subtitle workflows at zero per-minute cost; managed API at $0.006/minute covers one-hour podcast transcription for $0.36; no real-time streaming means Whisper is unsuitable for live captioning workflows.
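Segment timestamps from a Whisper transcription map directly onto subtitle formats. A minimal SRT writer, assuming segments shaped like the `start`/`end`/`text` dicts Whisper returns:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm layout SRT requires."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render Whisper-style segments as an SRT subtitle file."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> "
            f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the show."},
    {"start": 2.5, "end": 5.0, "text": " Today we talk about ASR."},
]
print(to_srt(segments))
```

The same pattern extends to WebVTT by swapping the comma for a period in the timestamp and adding a `WEBVTT` header.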
652 fine-tuned Whisper derivative models on Hugging Face cover specialised transcription domains; faster-whisper with CTranslate2 INT8 quantization achieves 4x additional throughput for high-volume pipelines; word-level timestamps enable downstream analytics (talk-time ratio, silence detection) without additional tooling; call-centre telephony quality audio produces 17.7% WER requiring post-correction in call analytics applications.
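The talk-time and silence metrics mentioned above fall out of word-level timestamps directly. A sketch over Whisper-style word entries (`word`/`start`/`end`); the one-second silence threshold is an illustrative assumption:

```python
def talk_time_ratio(words, total_duration_s: float) -> float:
    """Fraction of the recording covered by spoken words."""
    spoken = sum(w["end"] - w["start"] for w in words)
    return spoken / total_duration_s


def silences(words, min_gap_s: float = 1.0):
    """Gaps between consecutive words longer than min_gap_s,
    returned as (start, end) pairs."""
    gaps = []
    for prev, nxt in zip(words, words[1:]):
        if nxt["start"] - prev["end"] >= min_gap_s:
            gaps.append((prev["end"], nxt["start"]))
    return gaps


words = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "there", "start": 0.5, "end": 0.9},
    {"word": "yes",   "start": 3.0, "end": 3.2},
]
print(talk_time_ratio(words, 4.0))  # roughly 0.25
print(silences(words))              # [(0.9, 3.0)]
```

Speaker turn analysis works the same way once a diarization layer labels each word with a speaker: group consecutive same-speaker words, then apply the gap logic per speaker.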
Consider These Instead
- Choose Deepgram Nova-3 when real-time streaming transcription with sub-second latency, native speaker diarization, or telephony-optimised audio processing is required. Deepgram is approximately 15x faster than the managed Whisper API for streaming applications, and its Nova-2 model is priced at $0.0043/minute for production volume.
- Choose AssemblyAI when a managed API with built-in speaker diarization, sentiment analysis, entity detection, and content moderation is needed without building post-processing pipelines; AssemblyAI provides these features as first-class API outputs not requiring separate model calls.
- Choose Google Cloud Speech-to-Text when integration with Google Cloud infrastructure, long-form streaming audio, and enterprise SLAs backed by Google's support tier are the organisational requirements; Google's API supports telephony-optimised models with streaming that Whisper does not.