OpenAI Whisper Speech-to-Text has become popular among developers. Whisper

  • is accurate, matching the accuracy of cloud speech-to-text APIs,
  • runs on-device, offering privacy,
  • is an open model, allowing even commercial use for free.

Some startups, like Deepgram, have started offering hosted Whisper alongside their own speech-to-text offerings, and other startups and open-source projects were born to ride the wave. Later, as expected, OpenAI and Microsoft Azure started offering Whisper in the cloud. Despite its popularity, Whisper Speech-to-Text still has some limitations, such as

  • No built-in diarization,
  • No word-level timestamps,
  • No real-time transcription,
  • Limited platform support,
  • No (or limited) enterprise support,
  • No (or limited) off-the-shelf solution for customization.

Did you know you can use Falcon Speaker Diarization with Whisper to differentiate speakers in transcripts and get timestamps?

Projects and companies modifying and wrapping Whisper Speech-to-Text aim to address some of these challenges. However, some inherent challenges exist due to how Whisper is designed and trained. One of them is real-time transcription.

Can Whisper be used for streaming speech-to-text?

Whisper does not have streaming speech-to-text capability. It is not designed for real-time transcription. Whisper processes audio in 30-second segments, whereas live transcription requires transcribing much shorter segments of audio as they are received.

Although Whisper processes audio in segments of 30 seconds, there is no limit on the audio input length. Developers have found clever ways to process brief snippets of audio to get transcripts faster, trying to imitate live transcription. There are a few challenges with these workarounds.

  • Latency, which is crucial for live transcription, depends on the size of the Whisper model. The larger the model used, the slower it returns the transcripts.

Compare resource requirements of Whisper models vs. Picovoice Speech-to-Text engines.

  • When an audio chunk is shorter than 30 seconds, Whisper pads it with trailing zeros to fill a 30-second window, causing two issues:
    • The zeros that separate audio chunks cause Whisper to interpret each chunk as a separate utterance rather than continuous speech, breaking up words that span multiple audio chunks.
    • Whisper's hallucination problem may occur more frequently on padded chunks, producing repetitive text because of its sequence-to-sequence architecture.
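
To make the padding concrete, here is a minimal sketch in plain Python of what filling a 30-second window looks like. It assumes 16 kHz mono audio represented as a list of float samples. The helper names `pad_to_window` and `padding_ratio` are hypothetical and only illustrate the idea; they are not Whisper's actual code.

```python
SAMPLE_RATE = 16_000                      # assumed: 16 kHz mono audio
WINDOW_SECONDS = 30                       # fixed window length described above
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


def pad_to_window(chunk):
    """Return the chunk zero-padded (or trimmed) to exactly one 30-second window."""
    if len(chunk) >= WINDOW_SAMPLES:
        return list(chunk[:WINDOW_SAMPLES])
    return list(chunk) + [0.0] * (WINDOW_SAMPLES - len(chunk))


def padding_ratio(chunk):
    """Fraction of the padded window that is artificial silence."""
    return max(0, WINDOW_SAMPLES - len(chunk)) / WINDOW_SAMPLES


two_seconds = [0.1] * (2 * SAMPLE_RATE)   # hypothetical 2-second chunk
window = pad_to_window(two_seconds)
print(len(window), round(padding_ratio(two_seconds), 3))
```

With a 2-second chunk, most of the window the model sees is zeros, which is why very short chunks are the worst case for the two issues listed above.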

To overcome these issues, developers re-transcribe the audio and finalize the transcription once the accumulated audio reaches 30 seconds. The live output may contain errors, but the transcript converges to the accuracy of asynchronous transcription as time passes.
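
The re-transcribe-and-finalize workaround can be sketched roughly as follows. This is an illustration under stated assumptions, not any particular project's implementation: `transcribe` is a hypothetical callable that takes a list of samples and returns text (for example, a thin wrapper around a Whisper model), and `RollingTranscriber` is a name invented for this sketch.

```python
SAMPLE_RATE = 16_000                      # assumed: 16 kHz mono audio
WINDOW_SAMPLES = 30 * SAMPLE_RATE         # one full 30-second window


class RollingTranscriber:
    def __init__(self, transcribe):
        self.transcribe = transcribe      # hypothetical: list of samples -> str
        self.buffer = []                  # audio not yet finalized
        self.finalized = []               # text committed from full windows

    def feed(self, chunk):
        """Add new audio and return (finalized_text, provisional_text)."""
        self.buffer.extend(chunk)
        # Once a full window has accumulated, transcribe it a last time and commit it.
        while len(self.buffer) >= WINDOW_SAMPLES:
            window = self.buffer[:WINDOW_SAMPLES]
            self.buffer = self.buffer[WINDOW_SAMPLES:]
            self.finalized.append(self.transcribe(window))
        # Everything still in the buffer is re-transcribed on every call.
        provisional = self.transcribe(self.buffer) if self.buffer else ""
        return " ".join(self.finalized), provisional
```

Note the cost of this design: every call to `feed` runs the model again over the whole unfinalized buffer, so compute grows with how often partial results are requested, and the provisional text can change until the window is finalized.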

These workarounds cause performance issues in enterprise applications that aim to improve productivity and/or user experience. Picovoice’s Cheetah Streaming Speech-to-Text is designed for real-time on-device transcription.

Cheetah Streaming Speech-to-Text vs. Whisper Speech-to-Text

Cheetah is designed and optimized for streaming.

Whisper is designed for asynchronous transcription, whereas Cheetah Streaming Speech-to-Text is designed for real-time transcription.
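
For contrast with the chunk-and-pad workarounds, a streaming engine is typically driven frame by frame. The sketch below assumes an engine object with a `process(frame)` method returning `(partial_text, is_endpoint)` and a `flush()` method returning any remaining text. That shape is modeled on how Cheetah's Python SDK is generally used, but treat the exact interface as an assumption and check the SDK documentation. `FakeEngine` is a stand-in invented here so the sketch runs without any speech library.

```python
def stream_transcribe(engine, frames):
    """Yield text as soon as the engine emits it, one audio frame at a time."""
    for frame in frames:
        partial, is_endpoint = engine.process(frame)
        if partial:
            yield partial
        if is_endpoint:
            # The speaker paused: collect whatever the engine is still holding.
            remainder = engine.flush()
            if remainder:
                yield remainder
            yield "\n"
    tail = engine.flush()                 # end of stream
    if tail:
        yield tail


class FakeEngine:
    """Stand-in for illustration only: each 'frame' is a word, '' marks a pause."""

    def __init__(self):
        self.pending = ""

    def process(self, frame):
        if frame == "":
            return "", True               # endpoint detected, nothing new emitted
        out, self.pending = self.pending, frame + " "
        return out, False                 # emit the previously held word

    def flush(self):
        out, self.pending = self.pending, ""
        return out


text = "".join(stream_transcribe(FakeEngine(), ["hello", "world", "", "bye"]))
print(repr(text))
```

The point of the shape is that no audio is buffered into fixed 30-second windows: each frame is handed to the engine once, and text is emitted incrementally.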

Cheetah achieves higher accuracy with fewer resources.

Cheetah is 20% more accurate than Whisper Tiny while requiring almost half the resources.

Cheetah can be fine-tuned easily to achieve even higher accuracy.

The no-code Picovoice Console allows developers to customize speech-to-text models and adapt them to their domains without coding, let alone machine-learning experience.

Cheetah returns the fastest responses with minimal latency.

Cheetah processes voice data locally, with no network latency and minimal compute latency thanks to its low resource requirements.

Cheetah runs across platforms.

Cheetah offers cross-platform support off-the-shelf: embedded, web, mobile, desktop, on-prem, serverless, and public cloud.

The Picovoice Team maintains and supports Cheetah.

The Picovoice team maintains Cheetah, offers enterprise support, and continuously improves it based on customer feedback.

Custom Cheetah models are available through Picovoice Consulting engagements.

The Picovoice Consulting team creates domain-specific (medical, finance, legal) and enterprise-specific private Cheetah libraries for enterprise customers in the language of their choice.
