Making media searchable and discoverable has been a hot topic for a while. Google has worked on making podcasts discoverable. Microsoft released Azure Media Indexer and Azure Video Indexer. In the US, the Library of Congress and WGBH started the American Archive of Public Broadcasting to make public radio and television programs easier to search and access. All three have different value propositions (adding value to an existing product, creating a new product, and contributing to society), yet they all rely on speech-to-text. Speech-to-text has limitations, though: its transcripts contain errors. For example, the American Archive of Public Broadcasting crowdsources corrections to its transcripts. Many enterprises still use speech-to-text despite these limitations because indexing is so valuable. The benefits of media indexing are listed below:

Allows integration of an agenda

Listeners or viewers can go directly to the exact portion of the program they want and enjoy a rich media experience.

Makes lengthy content accessible

Listeners or viewers can navigate between sections of the media rather than playing through the entire content.

Creates new revenue opportunities

Content owners can dive deeper into archives to find assets that can generate revenue.

Picovoice’s Octopus Speech-to-Index makes audio and video files discoverable and searchable, with a caveat: it relies on a lesser-known speech technology, phonetic indexing. Phonetic indexing and timing enables broadcasters and other media businesses to search speech data, such as dialogue, directly. Media companies can therefore search for any phrase without first logging or transcribing the content.
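To make the idea of phonetic indexing concrete, here is a minimal toy sketch in Python. It is not Octopus's implementation or API: the `TimedPhoneme` type, the `PRONUNCIATIONS` table, and the `build_index` and `search` helpers are all hypothetical names made up for illustration. It assumes the audio has already been reduced to a sequence of phonemes with timestamps and matches a query by exact phoneme sequence, whereas a real engine would score acoustic similarity and return match probabilities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedPhoneme:
    """One phoneme with its position in the audio (hypothetical type)."""
    symbol: str
    start_ms: int
    end_ms: int


# Hypothetical pronunciation table (ARPAbet-like symbols), for illustration only.
PRONUNCIATIONS = {
    "the": ["DH", "AH"],
    "covid": ["K", "OW", "V", "IH", "D"],
    "news": ["N", "UW", "Z"],
}


def to_phonemes(phrase):
    """Convert a phrase to a flat list of phoneme symbols."""
    symbols = []
    for word in phrase.lower().split():
        symbols.extend(PRONUNCIATIONS[word])
    return symbols


def build_index(symbols, phoneme_ms=100):
    """Attach timestamps to phonemes, assuming a fixed duration per phoneme."""
    return [
        TimedPhoneme(s, i * phoneme_ms, (i + 1) * phoneme_ms)
        for i, s in enumerate(symbols)
    ]


def search(index, phrase):
    """Return (start_ms, end_ms) for every exact phoneme-sequence match."""
    target = to_phonemes(phrase)
    if not target:
        return []
    n = len(target)
    hits = []
    for i in range(len(index) - n + 1):
        window = index[i:i + n]
        if [p.symbol for p in window] == target:
            hits.append((window[0].start_ms, window[-1].end_ms))
    return hits


# Stand-in for the phonemes an engine would extract from real audio.
index = build_index(to_phonemes("the covid news"))
print(search(index, "covid"))       # [(200, 700)]
print(search(index, "covid news"))  # [(200, 1000)]
```

Note that the index stores sounds and timestamps, not words. In this sketch, supporting a new term only requires a pronunciation for it at query time; the index itself does not need to be rebuilt.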

Speech-to-Index vs. Speech-to-Text

Speech-to-Index and Speech-to-Text each have advantages over the other. Choosing the right engine depends on what an enterprise wants to achieve. Speech-to-Index is easier to maintain because it works out of the box, even for proper nouns. Speech-to-Text, by contrast, may require adding or boosting keywords as trends change. An ASR engine trained three years ago may not know “COVID,” whereas Speech-to-Index can find it directly. Think about a news agency: it has to add new words constantly to keep speech-to-text accurate. Speech-to-Index returns the actual, verbatim content within the media. Speech-to-Text, on the other hand, produces linkable text for the media file.

