Spanish Speech-to-Text with Python

Speech-to-Text, also known as Automatic Speech Recognition, is a technology that converts spoken audio into text. The technology has a wide range of applications, from video transcription to hands-free user interfaces.

Many cloud Speech-to-Text APIs are available on the market, but not all of them support languages other than English. Picovoice's Leopard Speech-to-Text engine supports 8 languages and achieves state-of-the-art performance, all while running locally on-device.

In this tutorial, we will walk through the process of using the Leopard Speech-to-Text Python SDK to transcribe Spanish audio in just a few lines of code.

Prerequisites

Sign up for a free Picovoice Console account. Once you've created an account, copy your AccessKey from the main dashboard.

Install Python (version 3.7 or higher) and verify the installation:

python --version

Install the pvleopard Python SDK package:

pip install pvleopard

Leopard Speech-to-Text Model File

To initialize Leopard Speech-to-Text, we need a model file for the target language. The model files for all supported languages are publicly available on GitHub. For Spanish, download the leopard_params_es.pv model file.

Implementation

After completing the setup, the actual implementation of the Speech-to-Text system can be written in just a few lines of code.

Import the pvleopard package:

import pvleopard

Set the paths for all the required files. Make sure to replace ${ACCESS_KEY} with your actual AccessKey from the Picovoice Console, ${MODEL_FILE} with the Spanish Leopard Speech-to-Text model file and ${AUDIO_FILE} with the audio file you want to transcribe:

access_key = "${ACCESS_KEY}"
model_file = "${MODEL_FILE}"
audio_file = "${AUDIO_FILE}"
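Hard-coding the AccessKey in a script makes it easy to leak by accident. As a minimal sketch, the values above could instead be loaded and checked before Leopard is initialized. The environment variable name PICOVOICE_ACCESS_KEY and the helper load_settings are hypothetical names chosen for this example; they are not part of the pvleopard SDK.

```python
import os
from pathlib import Path


def load_settings(model_file, audio_file, env=os.environ):
    """Return (access_key, model_file, audio_file) after basic sanity checks.

    PICOVOICE_ACCESS_KEY is a hypothetical variable name used for this sketch.
    """
    access_key = env.get("PICOVOICE_ACCESS_KEY", "").strip()
    if not access_key:
        raise RuntimeError("Set PICOVOICE_ACCESS_KEY to your Picovoice AccessKey")

    # Fail early with a clear message if either path is wrong.
    for label, path in (("model", model_file), ("audio", audio_file)):
        if not Path(path).is_file():
            raise FileNotFoundError(f"{label} file not found: {path}")

    return access_key, str(model_file), str(audio_file)
```

Calling load_settings("leopard_params_es.pv", "audio.wav") would then return the three values used in the next step, or raise before any engine is created.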

Initialize Leopard Speech-to-Text and transcribe the audio file:

leopard = pvleopard.create(access_key=access_key, model_path=model_file)
transcript, words = leopard.process_file(audio_file)
leopard.delete()

print(transcript)

Ve por esta calle durante unos cinco minutos
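In the snippet above, leopard.delete() is skipped if process_file raises an error, for example on an unreadable audio file. One possible way to guard against that is a small wrapper that always releases the engine. The helper transcribe and its create_engine parameter are hypothetical names for this sketch; the only SDK calls it relies on are process_file and delete, as used above.

```python
def transcribe(create_engine, audio_path):
    """Create an engine, transcribe one file, and always release the engine.

    `create_engine` is any zero-argument callable that returns an object with
    process_file() and delete() methods, such as a Leopard instance.
    """
    engine = create_engine()
    try:
        return engine.process_file(audio_path)
    finally:
        # Runs whether process_file succeeded or raised.
        engine.delete()


# Possible usage with the variables defined earlier (requires pvleopard):
#
# import pvleopard
# transcript, words = transcribe(
#     lambda: pvleopard.create(access_key=access_key, model_path=model_file),
#     audio_file,
# )
```

Passing a factory rather than an engine keeps creation and cleanup in one place, so the caller never holds an engine that has already been deleted.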

Leopard Speech-to-Text also provides start and end timestamps, as well as a confidence score, for each word:

for word in words:
    print(
        'word="%s" start_sec=%.2f end_sec=%.2f confidence=%.2f'
        % (word.word, word.start_sec, word.end_sec, word.confidence)
    )

word="ve" start_sec=0.83 end_sec=0.90 confidence=0.90
word="por" start_sec=1.02 end_sec=1.12 confidence=0.97
word="esta" start_sec=1.18 end_sec=1.44 confidence=0.94
word="calle" start_sec=1.54 end_sec=1.76 confidence=0.97
word="durante" start_sec=1.86 end_sec=2.27 confidence=0.98
word="unos" start_sec=2.37 end_sec=2.66 confidence=0.94
word="cinco" start_sec=2.72 end_sec=3.01 confidence=0.91
word="minutos" start_sec=3.10 end_sec=3.65 confidence=0.98
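The confidence scores can be used to highlight words that may need a manual review. The sketch below is one way to do this, not a feature of the SDK: the Word namedtuple is a stand-in that only mirrors the four fields printed above, and the 0.92 threshold is an arbitrary value chosen for illustration.

```python
from collections import namedtuple

# Stand-in for the word objects returned by process_file. Only the four
# fields printed above are assumed to exist.
Word = namedtuple("Word", ["word", "start_sec", "end_sec", "confidence"])


def annotate_uncertain(words, threshold=0.92):
    """Return the transcript with words below `threshold` shown as [word?]."""
    parts = []
    for w in words:
        parts.append(f"[{w.word}?]" if w.confidence < threshold else w.word)
    return " ".join(parts)


# The word metadata from the output above.
sample = [
    Word("ve", 0.83, 0.90, 0.90),
    Word("por", 1.02, 1.12, 0.97),
    Word("esta", 1.18, 1.44, 0.94),
    Word("calle", 1.54, 1.76, 0.97),
    Word("durante", 1.86, 2.27, 0.98),
    Word("unos", 2.37, 2.66, 0.94),
    Word("cinco", 2.72, 3.01, 0.91),
    Word("minutos", 3.10, 3.65, 0.98),
]

print(annotate_uncertain(sample))
# [ve?] por esta calle durante unos [cinco?] minutos
```

The same function should work on the words list returned by process_file, since it only reads the word and confidence attributes.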

Additional Languages

Leopard Speech-to-Text supports 8 different languages, all of which are equally straightforward to use. Simply download the corresponding model file from GitHub, initialize Leopard Speech-to-Text with the file, and begin transcribing.
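Switching languages then comes down to choosing a different model file. A small lookup like the one below is one way to keep that choice in a single place. Only the Spanish file name comes from this tutorial; the French and German entries are assumptions that follow the same naming pattern and should be checked against the files published on GitHub. The helper model_path_for is a hypothetical name.

```python
from pathlib import Path

MODEL_FILES = {
    "es": "leopard_params_es.pv",  # from this tutorial
    "fr": "leopard_params_fr.pv",  # assumed name, verify on GitHub
    "de": "leopard_params_de.pv",  # assumed name, verify on GitHub
}


def model_path_for(language, model_dir="."):
    """Return the path of the model file registered for a language code."""
    try:
        filename = MODEL_FILES[language]
    except KeyError:
        raise ValueError(
            f"No model file registered for language {language!r}"
        ) from None
    return Path(model_dir) / filename
```

The result can be passed as model_path when initializing Leopard, for example model_path=str(model_path_for("es", "models")).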
