Speech Recognition in Python Tutorial

In this article, we will learn how to perform Speech Recognition in Python.

Speech Recognition and Speech-to-Text are often used interchangeably, but Speech-to-Text is only one subfield of Speech Recognition. Other forms of Speech Recognition include Wake Word Detection, Voice Command Recognition, and Voice Activity Detection (VAD).

Below is the cheat sheet I use when deciding which Speech Recognition algorithm to use:

Do you need to detect if a person is talking and when? Then use Cobra Voice Activity Detection.
Do you need to detect the occurrence of a single phrase? Or one of a few phrases? Then Porcupine Wake Word is the right engine.
Do you need to understand voice commands? Rhino Speech-to-Intent is the correct tool here. Rhino can infer users' intent and accurately extract the request's details (i.e., slot values) with minimal runtime resources.
Do you need to transcribe speech to text in real time? Use Cheetah Streaming Speech-to-Text.
Do you need to transcribe large volumes of speech to text in batch mode? Leopard Speech-to-Text is the right tool.
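
The cheat sheet can also be written down as a small lookup table. The sketch below is only an illustration: the task labels (`voice_activity`, `wake_word`, and so on) and the `pick_engine` helper are hypothetical names made up for this example and are not part of any Picovoice SDK.

```python
# Hypothetical task labels mapped to the engines named in the cheat sheet.
ENGINE_FOR_TASK = {
    "voice_activity": "Cobra Voice Activity Detection",
    "wake_word": "Porcupine Wake Word",
    "voice_command": "Rhino Speech-to-Intent",
    "streaming_transcription": "Cheetah Streaming Speech-to-Text",
    "batch_transcription": "Leopard Speech-to-Text",
}


def pick_engine(task: str) -> str:
    """Return the engine suggested by the cheat sheet for a task label."""
    try:
        return ENGINE_FOR_TASK[task]
    except KeyError:
        known = ", ".join(sorted(ENGINE_FOR_TASK))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None
```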

The SDKs in this tutorial can run on Linux, macOS, Windows, Raspberry Pi, NVIDIA Jetson, and BeagleBone.

Cobra Voice Activity Detection

1- Install the Cobra Voice Activity Detection SDK using pip:

pip3 install pvcobra

2- Sign up for a free Picovoice Console account and copy your AccessKey. The AccessKey handles authentication and authorization.

3- Create an instance of the Voice Activity Detection engine:

import pvcobra

cobra = pvcobra.create(access_key='${ACCESS_KEY}')

4- Pass in frames of audio to the .process method:

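while True:
    # get_next_audio_frame() is a placeholder for your own audio source. Each frame
    # holds cobra.frame_length 16-bit samples recorded at cobra.sample_rate.
    audio_frame = get_next_audio_frame()
    # voice_probability is a float between 0 and 1
    voice_probability = cobra.process(audio_frame)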
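
The loop above does not say where frames come from or what to do with the probability. As a rough sketch, assuming the audio is already in memory as a sequence of 16-bit samples, the helpers below cut it into fixed-size frames and turn each probability into a speech / no-speech decision. `iter_frames`, `voiced_flags`, and the 0.5 threshold are assumptions made for this example, not part of the SDK; `score_frame` stands in for `cobra.process`.

```python
from typing import Callable, Iterator, List, Sequence


def iter_frames(pcm: Sequence[int], frame_length: int) -> Iterator[Sequence[int]]:
    """Yield consecutive, non-overlapping frames of exactly frame_length samples.

    A trailing partial frame is dropped, because the engine expects full frames.
    """
    if frame_length <= 0:
        raise ValueError("frame_length must be positive")
    for start in range(0, len(pcm) - frame_length + 1, frame_length):
        yield pcm[start:start + frame_length]


def voiced_flags(
    pcm: Sequence[int],
    frame_length: int,
    score_frame: Callable[[Sequence[int]], float],
    threshold: float = 0.5,  # assumed cut-off; tune for your environment
) -> List[bool]:
    """Return one True/False flag per frame: True when the score reaches threshold."""
    return [score_frame(frame) >= threshold for frame in iter_frames(pcm, frame_length)]
```

With a real engine, this could be called as `voiced_flags(pcm, cobra.frame_length, cobra.process)`, where `pcm` holds single-channel 16-bit samples recorded at `cobra.sample_rate`.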

For more information, check Cobra Voice Activity Detection's product page or refer to Cobra's Python SDK quick start guide.

Porcupine Wake Word

1- Install the Porcupine Wake Word SDK using pip:

pip3 install pvporcupine

2- Sign up for a free Picovoice Console account and copy your AccessKey. The AccessKey handles authentication and authorization.

3- Create your custom wake word model using Picovoice Console.

4- Create an instance of the Wake Word engine:

import pvporcupine

porcupine = pvporcupine.create(
    access_key='${ACCESS_KEY}',
    keyword_paths=['${KEYWORD_FILE_PATH}']
)

5- Pass in frames of audio to the .process method:

while True:
    audio_frame = get_next_audio_frame()
    keyword_index = porcupine.process(audio_frame)
    if keyword_index > -1:
        # keyword detected
        pass
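
When several keyword files are passed in `keyword_paths`, the index returned by `.process` identifies which one was detected, in the same order as the list. A minimal sketch of turning that index into a readable label follows; `keyword_label` and the label strings are hypothetical names chosen for this example.

```python
from typing import Optional, Sequence


def keyword_label(keyword_index: int, labels: Sequence[str]) -> Optional[str]:
    """Map the index returned by .process to a label, or None when nothing was detected.

    labels must be listed in the same order as keyword_paths.
    """
    if keyword_index < 0:
        # -1 means no keyword was detected in this frame
        return None
    return labels[keyword_index]
```

Inside the loop, `keyword_label(keyword_index, ["lights on", "lights off"])` would then return the phrase to act on, or `None` for frames without a detection.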

For more information, check Porcupine Wake Word's product page or refer to Porcupine's Python SDK quick start guide.

Rhino Speech-to-Intent

1- Install the Rhino Speech-to-Intent SDK using pip:

pip3 install pvrhino

2- Sign up for a free Picovoice Console account and copy your AccessKey. The AccessKey handles authentication and authorization.

3- Create your Context using Picovoice Console.

4- Create an instance of Rhino Speech-to-Intent to start recognizing voice commands within the domain of the provided context:

import pvrhino

rhino = pvrhino.create(
    access_key='${ACCESS_KEY}',
    context_path='${CONTEXT_FILE_PATH}'
)

5- Pass in frames of audio to the .process method and use the .get_inference method to determine the user's intent:

while True:
    audio_frame = get_next_audio_frame()
    is_finalized = rhino.process(audio_frame)
    if is_finalized:
        # get inference if is_finalized is true
        inference = rhino.get_inference()
        if inference.is_understood:
            # use intent and slots if inference was understood
            intent = inference.intent
            slots = inference.slots
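
Once an inference is available, the intent and slots still have to be routed to application code. One possible way to do that, sketched under assumptions: a dictionary of handler functions keyed by intent name. The `Inference` tuple below is only a stand-in that mirrors the three fields used above, and `handle_inference`, the `changeColor` intent, and the `color` slot are hypothetical names for illustration.

```python
from collections import namedtuple
from typing import Callable, Dict, Optional

# Stand-in with the same fields the loop above reads from rhino.get_inference().
Inference = namedtuple("Inference", ["is_understood", "intent", "slots"])


def handle_inference(
    inference,
    handlers: Dict[str, Callable[[Dict[str, str]], str]],
) -> Optional[str]:
    """Call the handler registered for the intent; return None if there is nothing to do."""
    if not inference.is_understood:
        # The command was outside the context, so there is no intent to act on
        return None
    handler = handlers.get(inference.intent)
    if handler is None:
        # Understood, but the application has no handler for this intent
        return None
    return handler(inference.slots)


# Hypothetical handler table for a context with a "changeColor" intent.
HANDLERS = {
    "changeColor": lambda slots: "color=" + slots.get("color", "default"),
}
```

Keeping the routing in a table means new intents can be supported by adding an entry rather than growing an if/elif chain.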

For more information, check Rhino Speech-to-Intent's product page or refer to Rhino's Python SDK quick start guide.

Cheetah Streaming Speech-to-Text

1- Install the Cheetah Streaming Speech-to-Text SDK using pip:

pip3 install pvcheetah

2- Sign up for a free Picovoice Console account and copy your AccessKey. The AccessKey handles authentication and authorization.

3- Create an instance of Cheetah to transcribe speech to text in real time:

import pvcheetah

cheetah = pvcheetah.create(access_key='${ACCESS_KEY}')

4- Pass audio frames to the .process method as they become available:

while True:
    partial_transcript, is_endpoint = cheetah.process(get_next_audio_frame())
    if is_endpoint:
        final_transcript = cheetah.flush()
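
The loop above runs forever and does not keep the partial transcripts. A minimal sketch of collecting the full text from a finite stream of frames might look like the following. `transcribe_frames` is a hypothetical helper; it assumes only that the engine exposes `.process` and `.flush` as used above. `EchoEngine` is a stand-in written for this example so the helper can be tried without an AccessKey, and it is not part of pvcheetah.

```python
from typing import Iterable, List, Tuple


def transcribe_frames(engine, frames: Iterable) -> str:
    """Feed frames to a Cheetah-like engine and return the concatenated transcript."""
    parts: List[str] = []
    for frame in frames:
        partial_transcript, is_endpoint = engine.process(frame)
        parts.append(partial_transcript)
        if is_endpoint:
            # At an endpoint, flush() returns the text that is still buffered
            parts.append(engine.flush())
    # Flush once more after the last frame so trailing text is not lost
    parts.append(engine.flush())
    return "".join(parts)


class EchoEngine:
    """Stand-in for demonstration only: treats each 'frame' as a piece of text."""

    def __init__(self) -> None:
        self._buffer = ""

    def process(self, frame: str) -> Tuple[str, bool]:
        # Emit the previously buffered text and hold back the newest frame
        emitted, self._buffer = self._buffer, frame
        return emitted, frame.endswith(".")

    def flush(self) -> str:
        remaining, self._buffer = self._buffer, ""
        return remaining
```

With a real engine, the call would be `transcribe_frames(cheetah, frames)`, where `frames` yields audio frames of the length the engine expects.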

For more information, check Cheetah Streaming Speech-to-Text's product page or refer to Cheetah's Python SDK quick start guide.

Leopard Speech-to-Text

1- Install the Leopard Speech-to-Text SDK using pip:

pip3 install pvleopard

2- Sign up for a free Picovoice Console account and copy your AccessKey. The AccessKey handles authentication and authorization.

3- Create an instance of Leopard to transcribe speech to text:

import pvleopard

leopard = pvleopard.create(access_key='${ACCESS_KEY}')

4- Pass in an audio file to Leopard and inspect the result:

transcript, words = leopard.process_file('${AUDIO_PATH}')
print(transcript)
for word in words:
    print(
        '{word="%s" start_sec=%.2f end_sec=%.2f confidence=%.2f}'
        % (word.word, word.start_sec, word.end_sec, word.confidence)
    )
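
The per-word confidence can be used to flag parts of a transcript that may need review. The sketch below assumes only the four word fields printed above. `Word` is a stand-in tuple with those fields, and `low_confidence_words` and its 0.5 threshold are hypothetical choices for this example.

```python
from collections import namedtuple
from typing import Iterable, List

# Stand-in with the same fields that the loop above prints for each word.
Word = namedtuple("Word", ["word", "start_sec", "end_sec", "confidence"])


def low_confidence_words(words: Iterable, threshold: float = 0.5) -> List:
    """Return the words whose confidence is below threshold, in their original order."""
    return [w for w in words if w.confidence < threshold]
```

Calling `low_confidence_words(words)` on the list returned by `process_file` would give the words to double-check, together with their timestamps.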

For more information, check Leopard Speech-to-Text's product page or refer to Leopard's Python SDK quick start guide.
