Orca Streaming Text-to-Speech BETA

Let LLMs talk as they type
Streaming TTS for LLMs

On-device streaming text-to-speech that reads dynamic LLM responses out loud as they emerge - with no latency or pause

Trusted by thousands of enterprises - from startups to Fortune 500s
Loved by 200,000+ developers

What is Orca Streaming Text-to-Speech?

Orca Streaming Text-to-Speech is a voice generator developed for LLM applications. It concurrently synthesizes speech as LLMs compose their responses.

Orca Streaming Text-to-Speech eliminates the latency between LLMs' text output and TTS's audio output, enabling humanlike interactions with no awkward pauses.

Why Orca Streaming Text-to-Speech?

Orca Streaming Text-to-Speech can handle streaming text input by synthesizing audio while an LLM is still producing the response.

OpenAI TTS, ElevenLabs TTS Streaming, IBM Real-time Speech Synthesis, Amazon Polly, and others start converting text to audio only after receiving the entire LLM output. Because Orca Streaming TTS starts converting text to audio much earlier, it finishes in tandem with the LLM, before other TTS engines can even begin.

Build conversational AI agents that respond 10x faster with just a few lines of code

import pvorca

orca = pvorca.create(access_key=access_key)
stream = orca.stream_open()
# Feed the next chunk of LLM output to the open stream
speech = stream.synthesize(get_next_text_chunk())
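
In a real application, a call like the one above usually sits inside a loop that feeds every chunk to the stream and drains any remaining audio at the end. The sketch below shows that loop against a stand-in object rather than the Orca SDK: `FakeStream`, `speak`, and the tagged strings it returns are hypothetical names for illustration only. It assumes a stream whose `synthesize` takes a text chunk and may return nothing until enough text has arrived, plus a `flush` method that returns whatever is still buffered; check the Orca docs for the actual behavior.

```python
from typing import Callable, Iterable, List, Optional


class FakeStream:
    """Hypothetical stand-in for a streaming TTS session (not the Orca SDK)."""

    def __init__(self, min_chars: int = 8) -> None:
        self._buffer = ""
        self._min_chars = min_chars

    def synthesize(self, text: str) -> Optional[str]:
        # Buffer text and emit "audio" only once enough has accumulated
        self._buffer += text
        if len(self._buffer) < self._min_chars:
            return None
        audio, self._buffer = f"<audio:{self._buffer}>", ""
        return audio

    def flush(self) -> Optional[str]:
        # Emit whatever is still buffered at the end of the response
        if not self._buffer:
            return None
        audio, self._buffer = f"<audio:{self._buffer}>", ""
        return audio


def speak(stream, chunks: Iterable[str], play: Callable[[str], None]) -> None:
    """Feed text chunks to the stream as they arrive and play audio when ready."""
    for chunk in chunks:
        audio = stream.synthesize(chunk)
        if audio is not None:  # not every chunk yields audio
            play(audio)
    tail = stream.flush()  # drain text still held by the stream
    if tail is not None:
        play(tail)


played: List[str] = []
speak(FakeStream(), ["Hello, ", "how can ", "I help?"], played.append)
print(played)
```

Passing the stream and the `play` callback in as arguments keeps the loop independent of any particular engine or audio device, so the same function can be exercised with a fake stream in tests.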

Humanlike experiences without inhuman latency

Don’t ruin user experience with awkward silences

Orca Streaming Text-to-Speech

  • 🤖 Built for LLMs
  • Input and output streaming
  • Guaranteed response time
  • 🔒 Private-by-design
  • 🤸 Cloud, on-prem, on-device

Other Text-to-Speech

  • 🔨 Built for the pre-LLM era
  • ▶️ Output streaming only
  • 📶 High variance in latency
  • 👂 Requires third-party data sharing
  • 🌩️ Cloud-dependent

Built for LLMs

Ship AI agents that respond 10x faster than ChatGPT

Build real-time, engaging, and interactive humanlike voice experiences using TTS developed for LLM-powered AI agents, IVRs, and more. Produce real-time streaming audio output from streaming text input.

Guaranteed response time without unnatural pauses

Running fast doesn’t help when you start late

Does being “lightning-fast” or even the “fastest” help you win a race if you start after it ends, or in this case, after the customer has disengaged? The Open-Source Text-to-Speech Latency Benchmark demonstrates the responsiveness of Orca Streaming Text-to-Speech.

Cloud, on-prem, on-device

Scale Confidently

Expand without worrying about platform support. Orca Streaming Text-to-Speech runs anywhere - including embedded, web, mobile, desktop, on-premise, and private or public cloud.

Get started with Orca Streaming Text-to-Speech

Embed Orca Streaming Text-to-Speech into your product in less than 10 minutes.

Start Now
Forever Free Plan
  • Real-time
  • Production-ready
  • Cross-platform SDKs

Learn more about Orca Streaming Text-to-Speech

What can I build with Orca Streaming Text-to-Speech?

Streaming Text-to-Speech shines most in AI agents and assistants, where it enables human-like interactions. AI agents can work in several industries:

  • Finance - AI agents help customers with day-to-day banking, including providing account details and information about new or personalized offerings.
  • Retail - AI agents help customers check order status, provide information on warranty policies, or let them make returns and exchanges.
  • Restaurants - AI agents offer suggestions and help customers make reservations.
  • SaaS - AI agents assist customers with user guides and FAQs, so they can resolve issues on their own.
  • Transportation - AI agents help passengers check their flight status, make flight changes, or upgrade their purchases.
  • Telecom operators - AI agents help customers look up account details, track usage, and reach a human agent when needed.

What's Streaming Text-to-Speech?

Streaming text-to-speech (TTS) converts written text into spoken words in real time, as the text is generated. Traditional TTS systems process pre-defined text; in other words, they need the full text before they can start processing. Once the text is processed, they either generate audio as a complete file or stream it by playing it back incrementally. The latter is called “audio output streaming”. Streaming Text-to-Speech does not just stream audio output; it also processes streaming text input. Unlike traditional TTS, it doesn’t need the full text to start playing.
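
To make the distinction concrete, here is a toy sketch of the two flows. It is not Orca's implementation: `TOKENS`, `fake_tts`, `full_text_tts`, and `streaming_input_tts` are hypothetical names, no audio is produced, and treating a word as complete once the next token begins with a space is an assumption chosen to keep the example small. Each function reports how many tokens had been received before the first piece of "audio" existed.

```python
from typing import Iterable, List, Tuple

# Hypothetical LLM output, one token at a time
TOKENS = ["Your", " order", " ships", " on", " Monday", "."]


def fake_tts(text: str) -> str:
    """Hypothetical stand-in for a TTS engine: tags the text instead of producing audio."""
    return f"<audio:{text.strip()}>"


def full_text_tts(tokens: Iterable[str]) -> Tuple[int, List[str]]:
    """Output streaming only: collect the whole reply, then synthesize it."""
    received = list(tokens)  # waits for the last token
    return len(received), [fake_tts("".join(received))]


def streaming_input_tts(tokens: Iterable[str]) -> Tuple[int, List[str]]:
    """Input streaming: synthesize each word once the next token shows it is complete."""
    audio: List[str] = []
    pending = ""
    seen = 0
    first_audio_at = 0
    for token in tokens:
        seen += 1
        if token.startswith(" ") and pending.strip():
            audio.append(fake_tts(pending))  # previous word is complete
            first_audio_at = first_audio_at or seen
            pending = ""
        pending += token
    if pending.strip():
        audio.append(fake_tts(pending))  # synthesize the final word
        first_audio_at = first_audio_at or seen
    return first_audio_at, audio


print(full_text_tts(TOKENS)[0])        # tokens received before any audio exists
print(streaming_input_tts(TOKENS)[0])
```

With this sample, the full-text flow has received all six tokens before any audio exists, while the streaming flow has its first audio after the second token.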

How does Orca Streaming Text-to-Speech differ from other Text-to-Speech engines offered for real-time interactions?

The terms “streaming” and “real-time” Text-to-Speech (TTS) have been used excessively and often inappropriately, leading to confusion about their true meaning and capabilities. Much like a person reading aloud as the words appear, Orca Streaming Text-to-Speech reads streaming text input out loud as it arrives.

Current “real-time” TTS solutions can stream audio as it is generated, before the full audio file has been created. They were designed for legacy Natural Language Processing (NLP) techniques that generate output all at once. In the pre-LLM era, this was sufficient. Today's Large Language Models (LLMs) work differently: they produce text incrementally, token by token. Traditional TTS solutions have not caught up with token-by-token generation; they still wait for the whole text to be generated. Orca Streaming Text-to-Speech, besides streaming audio output like traditional TTS solutions, can process streaming text input that is generated token by token.

Imagine attending an event with two interpreters: one translates as the speaker speaks (simultaneous translation) and the other waits for the speaker to pause (consecutive translation). Orca Streaming Text-to-Speech works like the former and processes data simultaneously, whereas other “real-time” Text-to-Speech engines work like the latter.

Which LLMs does Orca Streaming Text-to-Speech support?

Orca Streaming Text-to-Speech works with any large language model, whether closed-source or free and open.

Some examples of closed-source large language models Orca Streaming TTS supports:

  • OpenAI GPT-4, OpenAI GPT-3.5, OpenAI GPT-3.5 Turbo, OpenAI GPT-3
  • Anthropic Claude, Anthropic Claude 2, Anthropic Claude 3 Sonnet, Anthropic Claude 3 Opus, Anthropic Claude 3 Haiku
  • Cohere Coral
  • Google Gemini

Some examples of open large language models Orca Streaming TTS supports:

  • LLaMA
  • Falcon
  • Gemma
  • Grok
  • Mistral
  • Mixtral
  • Phi-2
  • DBRX

Does Orca Streaming Text-to-Speech support async processing?

Yes. Orca also supports async processing; in other words, it can convert pre-defined (static) text into audio, either as a complete file or as incrementally streamed audio. Please visit our docs for more information.

Which platforms does Orca Streaming Text-to-Speech support?

Orca Streaming Text-to-Speech runs across platforms, including embedded, web, mobile, desktop, on-premise, and private or public cloud.

Can I use Orca Streaming Text-to-Speech for free?

Orca Streaming Text-to-Speech is free to use within the limits of Picovoice’s Free Plan.

Which languages does Orca Streaming Text-to-Speech support?

Orca Streaming Text-to-Speech supports English, with many more languages on the roadmap, including French, German, Hindi, Italian, Japanese, Korean, Portuguese, and Spanish. Reach out to the Picovoice Consulting team with the details of your project if you have an immediate need.

How can I generate custom voices using Orca Streaming Text-to-Speech?

Picovoice Consulting customizes Orca Streaming Text-to-Speech for brands that want to represent their “voice” via unique, custom voices.

Does Orca Streaming Text-to-Speech allow voice tuning?

The base Orca Streaming Text-to-Speech model lets developers adjust the speed of the selected voice. Custom Orca Streaming Text-to-Speech models can be used for further voice tuning. Contact Picovoice Consulting with your project requirements to get a custom Text-to-Speech model that fits your needs.

Can I add custom industry jargon and terminology to Orca Streaming Text-to-Speech?

Orca Streaming Text-to-Speech can be tailored to a specific application, company, domain, or industry with custom vocabulary.

Does Orca Streaming Text-to-Speech generate voices with different emotions?

Custom Orca Streaming Text-to-Speech models generate voices with emotions and styles, including joy, anger, whispering, and shouting. Contact Picovoice Consulting with your project requirements if you don’t want to wait!

Why does Orca Streaming Text-to-Speech have Beta in the name?

Finding an on-device Text-to-Speech solution that is resource-efficient, ready-to-use, and on par with cloud alternatives is very challenging, and even impossible depending on the requirements. Our customers and users, especially those building voicebots with Porcupine Wake Word, Rhino Speech-to-Intent, Leopard Speech-to-Text, or Cheetah Streaming Speech-to-Text, asked for an on-device Text-to-Speech that could cut their cloud dependency. We released the initial version of Orca Text-to-Speech to address the immediate needs of certain use cases. However, we acknowledge there is room for improvement before it meets Picovoice standards, hence “Beta” in the name.

How do I get technical support for Orca Streaming Text-to-Speech?

Picovoice docs, blog, Medium posts, and GitHub are great resources to learn about voice AI, Picovoice technology, and how to add AI-generated voice to your product. You can report bugs and issues on GitHub. If you need help with developing your product, you can purchase the optional Support Add-on or upgrade your account to the Developer Plan.

How can I get informed about updates and upgrades?

Version changes are announced on LinkedIn. Subscribing to GitHub is the best way to get notified of patch releases. If you enjoy building with Orca Streaming Text-to-Speech, show it by giving a GitHub star!