Search engines, especially Google, have fundamentally changed how we access information, so much so that “google” has entered prominent dictionaries as a verb. However, Google still indexes text, not the spoken content of audio. Thus, many developers look for workarounds to “Google search audio files.”

The common workaround for the “Google search audio files” problem is to transcribe the audio with Speech to Text and then search the transcript for the keyword. The Google News Initiative also takes this approach to make news searchable, despite the known limitations of generic Speech to Text models, such as out-of-vocabulary words and competing hypotheses (homophones). Customizing a generic Speech to Text model to the domain is usually the best way to overcome these limitations. However, news is not confined to a single domain and is full of proper nouns, which makes training a domain-specific Speech to Text model impractical. Proper nouns also make it harder to find a Speech to Text engine that performs well. Hence, this problem requires a different approach than an optimized, custom Speech to Text model.

Speech-to-Index is an alternative to Speech to Text that approaches the search problem differently. Like Speech to Text, it breaks the audio input into sounds (phonemes). Unlike Speech to Text, it does not convert those phonemes into text: it keeps them as they are and builds a phonetic index. Speech-to-Index therefore avoids the limitations of a text-based approach, such as unrecognized words, homophones, and spelling errors, which makes it a better choice for audio search engines. For example, an open-source benchmark shows that an audio search engine built with Picovoice’s Octopus Speech-to-Index finds keywords in audio files ten times more accurately than one built with Google Speech-to-Text.
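The difference between searching a transcript and searching a phonetic index can be sketched in a few lines of Python. This is a toy illustration only, not Octopus’s implementation or API: the pronunciation lexicon, the function names, and the exact-match search are all assumptions made for the example. A real engine works on acoustic scores rather than clean phoneme symbols.

```python
from collections import defaultdict

# Hypothetical toy lexicon with ARPAbet-like symbols (assumption, illustration only).
LEXICON = {
    "write": ["R", "AY", "T"],
    "right": ["R", "AY", "T"],  # homophone of "write"
    "the": ["DH", "AH"],
    "report": ["R", "IH", "P", "AO", "R", "T"],
    "now": ["N", "AW"],
}


def to_phonemes(words):
    """Flatten a word sequence into a single phoneme sequence."""
    return [ph for word in words for ph in LEXICON[word]]


def build_index(phonemes):
    """Map each phoneme to every position where it occurs."""
    index = defaultdict(list)
    for pos, ph in enumerate(phonemes):
        index[ph].append(pos)
    return index


def search(index, phonemes, query):
    """Return start positions where the query phoneme sequence occurs."""
    candidates = index.get(query[0], [])
    return [s for s in candidates if phonemes[s:s + len(query)] == query]


# What was spoken, represented as phonemes (stands in for the audio).
audio_phonemes = to_phonemes(["write", "the", "report", "now"])
index = build_index(audio_phonemes)

# A transcript in which the recognizer picked the competing hypothesis "right".
transcript = "right the report now"

# Text search misses the keyword because of the homophone.
print("write" in transcript.split())  # False

# Phonetic search finds it, since the spelling is never decided.
print(search(index, audio_phonemes, LEXICON["write"]))  # [0]
```

The phonetic search never has to choose between “write” and “right”, so the homophone cannot cause a miss. The same reasoning applies to proper nouns: a query only needs a pronunciation, not an entry in the recognizer’s vocabulary.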

Figure: Speech-to-Index, a Speech to Text alternative for building content-based audio search engines. It breaks the audio into sounds (phonemes) and creates a phonetic index.

Speech-to-Index is not a substitute for Speech to Text but a complement to it. Speech to Text is the tool for transcription, just as Speech-to-Index is the tool for search. Use cases like Dialogue Search, Social Listening, Legal E-Discovery, or Media Asset Management benefit from both. If you’re unsure how Speech-to-Index can improve the performance of your product, start building with Picovoice’s Free Plan. Integrating Octopus Speech-to-Index into your product takes only a few lines of code and a few minutes. If you’re unsure where or how to start, you can work with experts through Picovoice’s Consulting Services.
