Direct Speech Indexing: Accurate, Fast, and Scalable Voice Search

  • Speech-to-Index
  • Voice Search
  • Speech Monitoring
  • Speech-to-Text
  • Acoustic-Only Speech Recognition
August 18, 2020

Can you imagine modern life without Google search? In the mid-1990s, the World Wide Web’s popularity created a deluge of novel text data. Google unlocked the power of this data with the first truly effective web search engine. In the 2010s, multimedia content on the Internet saw a similar growth pattern, driven by services like YouTube and SoundCloud. Call centers and Zoom meetings are generating additional audio data at a huge rate. The ability to index this new wave of Internet data effectively unlocks opportunities including search, media creation, compliance, and real-time sentiment analysis.

The naive solution to the voice search problem is to combine a Speech-to-Text (STT) engine with classic text indexing techniques, as shown below. This approach has subtle but significant drawbacks.

Voice search based on Speech-to-Text and text indexing.

Naive Approach

This method’s drawbacks stem from the language model used by STT engines. The language model defines the set of valid words and how these words can be combined to build full sentences. This limits the usability of STT engines for voice search, as they struggle to find out-of-vocabulary queries such as technical jargon and proper nouns. Furthermore, transcription mistakes caused by competing hypotheses result in search misses. Homophones like “wear” and “where” are a classic example.

Picovoice Approach

Picovoice Speech-to-Index takes a different approach: indexing speech directly without relying on a text representation. This acoustic-only approach boosts accuracy by removing the out-of-vocabulary limitation and eliminating the problem of competing hypotheses (e.g., homophones).
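The idea of searching on sound rather than on text can be illustrated with a toy example. The sketch below is not Picovoice's algorithm, which is not described here. It assumes that each clip has already been reduced to a sequence of phoneme symbols and that a query can be converted to phonemes through a small, hypothetical pronunciation table. A practical system would presumably use probabilistic or fuzzy matching rather than the exact matching shown.

```python
# Hypothetical pronunciation table using ARPAbet-style symbols (an assumption
# for illustration; a real system would derive pronunciations differently).
PRONUNCIATIONS = {
    "wear": ["W", "EH", "R"],
    "where": ["W", "EH", "R"],
}


def find_phoneme_matches(clip_phonemes, query_phonemes):
    """Return (clip_id, start_offset) for every exact occurrence of the query phonemes."""
    hits = []
    n = len(query_phonemes)
    for clip_id, phonemes in clip_phonemes.items():
        for start in range(len(phonemes) - n + 1):
            if phonemes[start:start + n] == query_phonemes:
                hits.append((clip_id, start))
    return hits


# Assumed phoneme sequence for a clip in which the speaker says
# "what should I wear": W AH T | SH UH D | AY | W EH R
clip_phonemes = {
    "call_1": ["W", "AH", "T", "SH", "UH", "D", "AY", "W", "EH", "R"],
}

wear_matches = find_phoneme_matches(clip_phonemes, PRONUNCIATIONS["wear"])
where_matches = find_phoneme_matches(clip_phonemes, PRONUNCIATIONS["where"])
```

Because "wear" and "where" share a pronunciation in this table, both queries locate the same span of audio, so no choice between competing text hypotheses is ever needed. Any query that can be expressed as phonemes is searchable, whether or not it belongs to a fixed vocabulary.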

Because Picovoice’s acoustic indexing technology is compute- and memory-efficient and does not need a language model, we can index massive audio sets multiple orders of magnitude faster than alternative solutions.

Once the voice data is indexed, search is lightning fast: you can scan millions of hours of indexed spoken data in under a second on commodity computing infrastructure.

Voice search based on direct speech indexing (Speech-to-Index).

We have made a live web demo to get you started with Speech-to-Index: