“I have hundreds, even thousands, of audio files from meetings, lectures, news broadcasts, call centre recordings, and podcasts. Is there software to search for particular words in these audio files?” The common answer to this question is to transcribe the audio files via automatic speech recognition and then search for words or phrases within the text output. However, readers who have tried this approach may have experienced some drawbacks. Automatic speech recognition (ASR) engines struggle with proper nouns such as brand, product, or individual names. For example, an ASR engine might transcribe Toronto’s famous Yonge Street as Young Street.

If you’re looking for technology similar to the Google search engine, one that enables keyword search by crawling audio files instead of websites, you know it cannot be achieved with automatic speech recognition alone. Whether you’d call it “Google search for audio,” an “audio search engine,” “voice search by text,” or simply “voice search,” meet Octopus, Picovoice’s Speech-to-Index engine that enables voice search.

Why acoustic-only speech indexing?

[Figure: new data created globally in 2020, new data projected to be created in 2025, and the percentage of that data that is unstructured]

Structuring unstructured audio and video data for monitoring, compliance, and analysis helps enterprises minimize risk and monetize this data at scale. The Picovoice team has gathered voice search use cases where acoustic processing outperforms text-based search for audio files.

Why speech indexing now?

The common approach to improving speech recognition accuracy has been feeding models ever larger amounts of data, resulting in a need for computational power that can only be met by cloud computing. Processing voice data in the cloud comes with inherent costs. Connectivity overhead and the real-time factor (RTF, the ratio of processing time to audio duration) of large models limit the speed of voice data processing. This might be acceptable for some speech recognition applications, such as dictation. However, it’s unacceptable when searching for a keyword takes nearly as long as the audio file itself.

Imagine searching for something within your voicemails. If finding a phrase within a 10-minute-long recording takes 10 minutes, nobody will use it. To understand the impact of connectivity and RTF, think about the last time you waited almost a minute for Alexa to play the next song, even though the song itself is just a few minutes long.
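To make the arithmetic concrete, here is a minimal sketch of how RTF translates into waiting time. The RTF values below are illustrative assumptions, not measured figures for any particular engine:

```python
def processing_time_sec(audio_duration_sec, rtf):
    """Real-time factor (RTF) = processing time / audio duration,
    so the time to process a file is simply duration * RTF."""
    return audio_duration_sec * rtf

# A 10-minute voicemail:
duration = 10 * 60  # seconds

# An RTF near 1.0 means searching takes about as long as listening:
print(processing_time_sec(duration, 1.0))   # 600.0 seconds

# A hypothetical engine with RTF 0.01 finishes the same file in seconds:
print(processing_time_sec(duration, 0.01))  # about 6 seconds
```

The takeaway is that for search to feel instant, the RTF must be far below 1, or the audio must be processed ahead of time rather than at query time.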

Another reason is the poor performance of speech-to-text when it comes to proper nouns, as mentioned above. Going back to the voicemail example, if a speech-to-text engine does not correctly transcribe the name mentioned in a message, searching the transcript for that name will not work either.

Considering these issues, Picovoice leveraged its unique expertise in building lightweight, efficient on-device voice recognition technology to build something specific for this use case: Octopus Speech-to-Index. On a Linux machine running Ubuntu 20.04 with 16 GB of RAM and an Intel i7-10710U CPU running at 4.7 GHz, it takes Octopus less than a second to index a one-minute-long audio file. Even better, once audio files are indexed, an unlimited number of queries can be run instantly, since no request is sent to the cloud.
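Octopus’s internals are proprietary, but the index-once, query-many pattern it relies on can be sketched with a toy example. Here, the frame tokens and the `build_index`/`search` helpers are hypothetical stand-ins for Octopus’s phonetic representation, not Picovoice’s actual API:

```python
from collections import defaultdict

def build_index(frames):
    """Index a token stream once: map each acoustic token (a stand-in
    for a phonetic unit) to the frame positions where it occurs.
    This single pass over the audio is the only expensive step."""
    index = defaultdict(list)
    for position, token in frames:
        index[token].append(position)
    return index

def search(index, query_tokens):
    """Look up a phrase in the prebuilt index: return start positions
    where the query tokens occur consecutively. No audio is reprocessed,
    so each query costs only dictionary lookups."""
    hits = []
    for start in index.get(query_tokens[0], []):
        if all(start + i in index.get(token, [])
               for i, token in enumerate(query_tokens[1:], 1)):
            hits.append(start)
    return hits

# Hypothetical token stream for "Yonge Street": (frame position, token)
frames = [(0, "Y"), (1, "AA"), (2, "NG"),
          (3, "S"), (4, "T"), (5, "R"), (6, "IY"), (7, "T")]
idx = build_index(frames)
print(search(idx, ["Y", "AA", "NG"]))  # -> [0]
print(search(idx, ["S", "T"]))         # -> [3]
```

Because the pass over the audio happens only once at indexing time, every subsequent query is a handful of lookups, which is why repeated searches over already-indexed files cost essentially nothing.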

Check out the Octopus web demo or start building now!