TL;DR: Private, Efficient, Fast, Ready-to-Use Text-to-Speech: Orca Text-to-Speech, the on-device voice generator that converts written text into spoken audio output without network latency or jeopardizing user privacy, is now in public beta!

Today, Picovoice is pleased to announce the public beta release of its Text-to-Speech engine: Orca. Text-to-Speech is a technology that converts text into spoken audio. It is also known as Speech Synthesis, Speech Synthesizer, Voice Generator, or Generative Voice AI. It’s one of the best-known and most widely used voice AI technologies, enabling use cases across industries.

Text-to-Speech users have two distinct expectations: quality and response time.

1. High-quality Text-to-Speech
The quality of a Text-to-Speech engine refers to how similar the produced speech is to a “real” human voice. Media, entertainment, and publishing industries leverage high-quality Text-to-Speech to narrate books, movies, games, broadcasts, or podcasts. Text-to-Speech can reduce voice-over costs while offering more flexibility on speed, pitch, intonation, etc.

2. Fast and Responsive Text-to-Speech
Response time is crucial for voicebots, specifically those that aim to increase productivity. A timely response while picking up items in a warehouse or ordering food at a drive-thru is critical for the continuity of human-like interactions.

Challenges in Text-to-Speech

High-quality Text-to-Speech is generally used for pre- or post-production. Quality is prioritized when there is a trade-off between Text-to-Speech quality and response time. Since the volume for these use cases is typically low, the average cost per minute can be negligible.

On the other hand, a balance between speed and quality of Text-to-Speech is needed for use cases such as voicebots and virtual customer service agents. These high-volume use cases also require enterprises to be mindful of costs.

Recent advances in AI, such as large transformer models, have made high-quality machine-generated voices and voice cloning more accessible. However, these large models require significant compute resources, which makes them dependent on the cloud. Despite the efforts of Text-to-Speech API providers, the inherent constraints of cloud computing restrict how far they can improve response time and reduce costs.

Why another Text-to-Speech engine?

Orca is not just another Text-to-Speech engine. On-device Text-to-Speech, unlike cloud Text-to-Speech APIs, eliminates network latency and offers a guaranteed response time. Running decentralized models minimizes infrastructure costs. Converting text to speech locally, without sending anything to remote servers, is the only way to ensure privacy. Yet training an on-device Text-to-Speech model that enables human-like interactions, just like cloud APIs, while being resource-efficient enough to run across platforms is challenging. That’s why we built Orca, leveraging Picovoice’s expertise in training efficient on-device voice AI models without sacrificing performance.

Meet Orca Text-to-Speech

Orca Text-to-Speech is a lightweight Text-to-Speech engine that converts text to speech locally, offering fast and private experiences without sacrificing human-like quality.

Orca Text-to-Speech is:

  • compact and computationally efficient, running across platforms
  • private and fast, powered by on-device processing
  • easy to use with a few lines of code

Start Building!

Any developer can use Orca Text-to-Speech under Picovoice’s Forever-Free Plan. No credit card is required. No machine learning expertise is needed. It’s ready in just a few lines of code.

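import pvorca

o = pvorca.create(access_key)  # access_key: your Picovoice AccessKey string
speech = o.synthesize(text)  # text: the string to convert into speech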
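
To listen to the result, the synthesized audio can be saved to a file. The following is a minimal sketch, not part of the Orca SDK: `write_pcm_to_wav` is a hypothetical helper that uses only the Python standard library, and it assumes the audio is available as a sequence of mono 16-bit integer samples together with a known sample rate. Whether `synthesize` returns data in exactly that shape may depend on the SDK version, so treat that as an assumption to verify.

```python
import struct
import wave


def write_pcm_to_wav(pcm, sample_rate, path):
    """Write mono 16-bit PCM samples to a WAV file.

    Hypothetical helper (not part of pvorca). Assumes `pcm` is a sequence
    of ints in the signed 16-bit range and `sample_rate` is in Hz.
    Returns the audio duration in seconds.
    """
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        # Pack the samples as little-endian signed 16-bit integers.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm) / sample_rate
```

If `speech` holds raw samples, it could be passed to this helper along with the sample rate reported by the engine and an output path such as "speech.wav".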