Voice Search Benchmark
One of the main reasons we transcribe audio is to make it searchable. Searching text is fast because mature indexing algorithms can search large volumes of it almost instantly. Searching voice by proxy (i.e. through transcription), however, has inherent limitations. What if there is a transcription error? What about homophones (e.g. two, too, and to)? What if the words you want to search for are uncommon or made-up? Most search-worthy phrases are specialized or made-up: brands, companies, products, etc.
Picovoice Octopus Speech-to-Index addresses these issues by removing the reliance on a text representation and indexing audio directly in the acoustic domain. It gives a significant accuracy boost compared to even the best ASR engines available. The benchmark below is an effort to track these claims and hold us accountable.
Methodology
Speech Corpus
We use TED talks for this task. In particular, we use the latest release of TED-LIUM. We search for keywords such as Beethoven, Blizzard, Flickr, Nathaniel, Warcraft, etc.
Metrics
We consider two metrics: top accuracy (i.e. the lowest miss rate achievable, regardless of the false alarm rate) and equal-false-alarm-rate accuracy (i.e. the accuracy when every engine is held to the same fixed number of false alarms per hour).
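To make the two metrics concrete, here is a minimal sketch of how they could be computed from per-run counts. It is not the benchmark's actual scoring code. The `OperatingPoint` class, its field names, and the two helper functions are hypothetical, and the sketch assumes each engine is run at several sensitivity settings, with hits and false alarms counted against reference keyword occurrences at each setting.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class OperatingPoint:
    """Hypothetical summary of one engine run at one sensitivity setting."""
    true_occurrences: int  # keyword occurrences in the reference annotations
    hits: int              # occurrences the engine detected
    false_alarms: int      # detections with no matching occurrence
    audio_hours: float     # total duration of the searched audio

    @property
    def accuracy(self) -> float:
        # Detection rate, i.e. 1 - miss rate.
        return self.hits / self.true_occurrences

    @property
    def false_alarms_per_hour(self) -> float:
        return self.false_alarms / self.audio_hours


def top_accuracy(points: Iterable[OperatingPoint]) -> float:
    """Best accuracy (lowest miss rate), ignoring the false alarm rate."""
    return max(p.accuracy for p in points)


def accuracy_at_false_alarm_rate(
    points: Iterable[OperatingPoint], max_fa_per_hour: float
) -> Optional[float]:
    """Best accuracy among settings within the false alarm budget.

    Returns None if no setting satisfies the budget.
    """
    eligible = [p for p in points if p.false_alarms_per_hour <= max_fa_per_hour]
    if not eligible:
        return None
    return max(p.accuracy for p in eligible)


# Illustrative numbers only, not benchmark results.
example_points = [
    OperatingPoint(true_occurrences=100, hits=60, false_alarms=1, audio_hours=10.0),
    OperatingPoint(true_occurrences=100, hits=80, false_alarms=5, audio_hours=10.0),
    OperatingPoint(true_occurrences=100, hits=90, false_alarms=50, audio_hours=10.0),
]
print(top_accuracy(example_points))
print(accuracy_at_false_alarm_rate(example_points, max_fa_per_hour=1.0))
```

With these made-up counts, the most sensitive setting gives the highest accuracy but produces five false alarms per hour, so it is excluded once the budget is set to one false alarm per hour. Comparing engines at the same budget keeps an engine from looking better simply because it fires more often.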