Whisper

Open Source · AI Audio · Last updated: April 16, 2026

Whisper is OpenAI's open-source speech recognition model that transcribes and translates audio across 99 languages, free to self-host or at $0.006/minute via API.

Our General Score

8.4/10
Functionality: 9.2
Features: 8.5
Usability: 7.5
Value: 9.5
Integrations: 7.5
Reliability: 8.0

Plans & Pricing

Use Cases

Research

9.0

Word-level timestamp output and multilingual transcription covering 99 languages make Whisper the standard transcription layer for qualitative research involving recorded interviews, focus groups, and field recordings in non-English languages where proprietary ASR services lack coverage.

Content Creation

8.8

Podcast transcription, video subtitle generation with word-level timestamps, and lecture-to-text conversion (at $0.006/minute via the managed API, or free self-hosted) cover the primary async transcription workflows for content creators without real-time requirements.

Coding

9.2

An MIT-licensed open-source model with a documented Python API, the whisper.cpp C++ implementation (38,000 GitHub stars), and the faster-whisper optimised production library lets developers embed transcription into applications without proprietary model lock-in; self-hosted deployment eliminates per-call billing for high-volume voice-enabled products.

Automation

8.5

Whisper integrates directly into LLM pipelines as the speech-to-text layer — audio input is transcribed to text and passed to downstream models for summarisation, classification, or command execution without intermediate tooling; 25MB file limit requires pre-processing for long-form audio automation workflows.

Data Analysis

8.0

The 652 fine-tuned Whisper derivatives on Hugging Face cover specialised transcription domains including healthcare (clinical notes), legal (deposition transcription), and call-centre audio; word-level timestamps enable speaker turn analysis and silence detection in meeting analytics pipelines.

Platforms

Desktop, API

Capabilities

Context Window: N/A
API Pricing: Varies
Image Generation: ✗ No
Memory Persistence: ✗ No
Computer Use: ✗ No
API Available: ✓ Yes
Multimodal: ◑ Partial
Open Source: ✓ Yes
Browser Extension: ✗ No

Overview

Whisper is an open-source automatic speech recognition (ASR) model developed by OpenAI, released under the MIT license in September 2022 and trained on 5 million hours of audio across 99 languages. It supports multilingual transcription, English translation from any supported language, language identification, and word-level timestamp generation from a single model. Five size variants — tiny through large-v3 — trade VRAM requirements (1GB to 10GB) against accuracy. The managed API (whisper-1) charges $0.006/minute with no setup cost; self-hosting eliminates per-minute fees but requires GPU infrastructure. The API has a hard 25MB file size limit requiring audio chunking for longer recordings, does not support real-time streaming, and loses accuracy materially on telephony-quality audio (17.7% WER on call-centre recordings). Speaker diarization is not available on whisper-1 — it requires the separate GPT-4o Transcribe with Diarization model.
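The variant-to-VRAM trade-off above can be sketched as a small selection helper. The per-variant VRAM figures are approximations within the 1GB–10GB range quoted here, and `transcribe_local` assumes the `openai-whisper` pip package plus ffmpeg on PATH — a sketch, not a definitive deployment recipe.

```python
# Approximate VRAM requirements (GB) per Whisper variant, spanning the
# 1GB-10GB range described above; treat exact numbers as approximate.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large-v3": 10}

def pick_model(available_vram_gb: float) -> str:
    """Return the largest variant whose VRAM need fits the given budget."""
    fitting = [m for m, gb in VRAM_GB.items() if gb <= available_vram_gb]
    if not fitting:
        raise ValueError("no Whisper variant fits in the available VRAM")
    return max(fitting, key=VRAM_GB.get)

def transcribe_local(path: str, vram_gb: float) -> str:
    """Transcribe an audio file with the largest variant that fits."""
    # Requires `pip install openai-whisper` and ffmpeg; imported lazily
    # so pick_model() stays dependency-free.
    import whisper
    model = whisper.load_model(pick_model(vram_gb))
    return model.transcribe(path)["text"]
```

On a 5GB consumer GPU this selects `medium`, matching the self-hosting guidance later in this review.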

Key Features

  • Multilingual transcription across 99 languages including low-resource languages trained on 5 million hours of diverse audio
  • English translation capability converting audio in any supported language directly to English text without intermediate transcription
  • Word-level timestamp generation enabling subtitle creation, speaker turn detection, and audio segment alignment in downstream pipelines
  • Five model size variants — tiny (39M params, CPU-compatible) through large-v3-turbo (216x real-time) — covering consumer hardware to GPU production deployment
  • MIT-licensed open-source model weights with 75,000+ GitHub stars and 652 fine-tuned derivative models on Hugging Face for specialised domains
  • Managed API (whisper-1) at $0.006/minute with JSON, verbose JSON, text, SRT, and VTT output formats
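A minimal sketch of the managed-API call requesting SRT subtitles, assuming the official `openai` Python SDK and an `OPENAI_API_KEY` in the environment; the format set mirrors the response formats listed above, and the string-vs-object handling of the response is a defensive assumption rather than documented behaviour.

```python
# Output formats offered by whisper-1, per the feature list above.
SUPPORTED_FORMATS = {"json", "verbose_json", "text", "srt", "vtt"}

def transcribe_to_subtitles(path: str, fmt: str = "srt") -> str:
    """Send one audio file to whisper-1 and return the response body."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response_format: {fmt}")
    # Lazy import keeps the format check usable without the SDK installed.
    from openai import OpenAI
    client = OpenAI()
    with open(path, "rb") as audio:
        resp = client.audio.transcriptions.create(
            model="whisper-1", file=audio, response_format=fmt
        )
    # For srt/vtt/text the SDK returns plain text; hedge for object forms.
    return resp if isinstance(resp, str) else resp.text
```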

Pros & Cons

Pros

  • MIT license enables commercial deployment at zero licensing cost — self-hosted Whisper eliminates per-minute fees entirely for applications processing consistent high audio volume where GPU amortisation beats $0.006/minute API costs at 100,000+ minutes/month
  • 99-language multilingual transcription in a single model eliminates the need to pre-identify input language or maintain separate models per language — a capability that proprietary ASR services require separate API calls or language configuration to achieve
  • Word-level timestamp output in SRT and VTT formats reduces subtitle and captioning pipeline complexity to a single API call; competing services charge extra or require post-processing for word-level timestamp precision
  • The 652 fine-tuned derivative models on Hugging Face — covering healthcare, legal, and multilingual specialisations — extend base Whisper accuracy for domain-specific vocabulary that general-purpose ASR training data underrepresents
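The self-hosting break-even claim above reduces to one line of arithmetic. The $600/month GPU figure below is a hypothetical placeholder for amortised hardware plus operations cost, not a benchmarked number; substitute your own.

```python
API_RATE = 0.006      # $/minute for managed whisper-1, per this review
GPU_MONTHLY = 600.0   # hypothetical amortised GPU + ops cost per month

def break_even_minutes(gpu_monthly: float = GPU_MONTHLY) -> float:
    """Monthly minutes above which self-hosting beats the managed API."""
    return gpu_monthly / API_RATE

minutes = break_even_minutes()  # 100,000 min/month at the $600 assumption
```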

Cons

  • Managed API (whisper-1) has a hard 25MB file size limit per request — a one-hour MP3 at standard 128kbps bitrate exceeds this limit, requiring audio chunking and stitching code before any API call for long-form content
  • No real-time streaming support on the managed whisper-1 API — applications requiring live captions or sub-second transcription latency must use the GPT-4o Realtime API or a competing service such as Deepgram, which is approximately 15x faster for streaming use cases
  • Speaker diarization (identifying who is speaking) is not available in whisper-1 — it requires the separate GPT-4o Transcribe with Diarization model or a third-party service like Pyannote, adding $0.002–$0.005/minute to effective cost
  • Accuracy drops significantly for call-centre telephony quality audio (17.7% WER versus 2.7% on clean English audio) — teams processing phone recordings require fine-tuned variants or alternative models optimised for telephony-grade audio
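The 25MB chunking constraint mentioned above reduces to simple arithmetic for constant-bitrate audio; a sketch, assuming CBR input (variable-bitrate files need a real splitter such as ffmpeg's segment muxer rather than arithmetic):

```python
import math

API_LIMIT_BYTES = 25 * 1024 * 1024  # hard 25MB limit per whisper-1 request

def max_chunk_seconds(bitrate_kbps: int, safety: float = 0.95) -> float:
    """Longest CBR chunk (seconds) that stays under the limit, with headroom."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return API_LIMIT_BYTES * safety / bytes_per_second

# A one-hour 128 kbps MP3 (~57.6 MB) exceeds the limit and needs 3 chunks:
chunks = math.ceil(3600 / max_chunk_seconds(128))
```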

Who It's For

Best For

  • Developers building voice-enabled applications, podcast tools, or video subtitle generators who need a royalty-free, self-hostable transcription model without per-minute API costs at scale
  • Researchers conducting multilingual qualitative interviews needing 99-language transcription and English translation in a single model that can run on institutional GPU infrastructure without data leaving local infrastructure
  • Data science teams building audio analytics pipelines requiring word-level timestamps, domain-specific fine-tuned variants, and integration with downstream LLM summarisation or classification models
  • Content creators processing podcasts, lectures, and video recordings asynchronously where real-time streaming is not required and per-minute API costs at $0.006 are acceptable for variable workloads

Not Ideal For

  • Applications requiring real-time or streaming transcription with sub-second latency — the managed whisper-1 API is batch-only and does not support WebSocket streaming
  • Teams needing automatic speaker identification and labelling (diarization) without additional tooling — whisper-1 does not output speaker labels
  • Call-centre and telephony operations where audio quality is consistently degraded — the 17.7% WER on telephony audio makes Whisper unreliable without fine-tuning for this audio type
  • Non-technical users needing a consumer-facing transcription product with a graphical interface — Whisper is a model and API with no native consumer UI; end-user tools like Otter.ai and Fireflies.ai are built on top of models like Whisper

Audience Scores

MIT-licensed model weights on GitHub (75,000 stars), Python package installation via pip, whisper.cpp for C++ and mobile deployment, and faster-whisper for 4x speedup via INT8 quantization provide a complete self-hostable production stack; managed API at $0.006/min eliminates infrastructure overhead for <500 hours/month workloads where per-minute cost is less than GPU amortisation.

99-language coverage including low-resource languages trained on 5 million hours of diverse audio, plus English translation capability from any supported language in a single model, covers cross-lingual qualitative research workflows that proprietary ASR services do not support; self-hosted deployment keeps recorded interview data on institutional infrastructure.

Free self-hosted deployment on consumer-grade hardware (tiny model runs on CPU; medium on 5GB VRAM GPU) covers podcast and video subtitle workflows at zero per-minute cost; managed API at $0.006/minute covers one-hour podcast transcription for $0.36; no real-time streaming means Whisper is unsuitable for live captioning workflows.

652 fine-tuned Whisper derivative models on Hugging Face cover specialised transcription domains; faster-whisper with CTranslate2 INT8 quantization achieves 4x additional throughput for high-volume pipelines; word-level timestamps enable downstream analytics (talk-time ratio, silence detection) without additional tooling; call-centre telephony quality audio produces 17.7% WER requiring post-correction in call analytics applications.
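The silence-detection and talk-time analytics described above reduce to a few lines once word-level timestamps are in hand; a sketch over hypothetical (start, end) pairs in seconds, the shape Whisper's word-timestamp output boils down to:

```python
def silences(words, min_gap=1.0):
    """Return (start, end) gaps between consecutive words longer than min_gap."""
    gaps = []
    for (s0, e0), (s1, e1) in zip(words, words[1:]):
        if s1 - e0 >= min_gap:
            gaps.append((e0, s1))
    return gaps

# Hypothetical word timestamps: four words with one long pause.
words = [(0.0, 0.4), (0.5, 0.9), (3.2, 3.6), (3.7, 4.1)]
pauses = silences(words)                  # one pause: (0.9, 3.2)
talk_time = sum(e - s for s, e in words)  # total speech duration
```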

Consider These Instead

When Not To Choose Whisper

Choose Deepgram Nova-3 when real-time streaming transcription with sub-second latency, native speaker diarization, or telephony-optimised audio processing is required — Deepgram is approximately 15x faster than the managed Whisper API for streaming applications, and its Nova-2 model is priced at $0.0043/minute for production volume.

Choose AssemblyAI when a managed API with built-in speaker diarization, sentiment analysis, entity detection, and content moderation is needed without building post-processing pipelines — AssemblyAI provides these features as first-class API outputs not requiring separate model calls.

Choose Google Cloud Speech-to-Text when integration with Google Cloud infrastructure, long-form streaming audio, and enterprise SLAs backed by Google's support tier are the organisational requirement — Google's API supports telephony-optimised models with streaming that Whisper does not.

Integrations

Python (pip) · whisper.cpp · FFmpeg · OpenAI API

Known Limitations

  • Feature gap
  • Accuracy variability
  • Ecosystem weakness
  • Learning curve