One of the questions "the internet" keeps asking is why Alexa is not as good as ChatGPT. Why doesn't Apple simply integrate ChatGPT into Siri? It's not just Reddit or Hacker News; the NY Times has also declared that Siri, Alexa, and Google Assistant Lost the AI Race.

Large Language Models (LLMs), such as ChatGPT, enable a wide range of applications and can be used in voice products. We have seen tens, maybe hundreds, of community projects integrating Leopard Speech-to-Text, Cheetah Streaming Speech-to-Text, Rhino Speech-to-Intent, Porcupine Wake Word, and Cobra Voice Activity Detection with LLMs. If individual developers can do it, why not tech giants?
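
As an illustration, here is a minimal sketch of the kind of integration those community projects build: Porcupine listens for a wake word on-device, Cheetah transcribes the follow-up command as the user speaks, and only the final text is sent to an LLM. It is a sketch under assumptions, not a reference implementation: the model name, the access-key placeholder, and the prompt handling are illustrative, and the respective SDK docs should be consulted before relying on it.

```python
# A minimal sketch (not production code): Porcupine wake word -> Cheetah
# streaming speech-to-text -> LLM. Assumes `pvporcupine`, `pvcheetah`,
# `pvrecorder`, and `openai` are installed, and that the Picovoice AccessKey
# and OPENAI_API_KEY are configured.
import pvcheetah
import pvporcupine
from openai import OpenAI
from pvrecorder import PvRecorder

ACCESS_KEY = "${PICOVOICE_ACCESS_KEY}"  # placeholder; obtained from the Picovoice Console

porcupine = pvporcupine.create(access_key=ACCESS_KEY, keywords=["picovoice"])
cheetah = pvcheetah.create(access_key=ACCESS_KEY, endpoint_duration_sec=1.0)
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

try:
    while True:
        # 1. Wait for the wake word; this runs entirely on-device.
        while porcupine.process(recorder.read()) == -1:
            pass

        # 2. Transcribe the follow-up command in real time until the user stops speaking.
        transcript = ""
        while True:
            partial, is_endpoint = cheetah.process(recorder.read())
            transcript += partial
            if is_endpoint:
                transcript += cheetah.flush()
                break

        # 3. Only the final text (never the raw audio) leaves the device.
        response = llm.chat.completions.create(
            model="gpt-3.5-turbo",  # placeholder model name
            messages=[{"role": "user", "content": transcript}],
        )
        print(response.choices[0].message.content)
finally:
    recorder.delete()
    porcupine.delete()
    cheetah.delete()
```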

There can be several reasons why Amazon, Apple, or other enterprises haven't integrated ChatGPT or another LLM into their voice assistants yet. Although we can never know unless they explain, their decision can have a strategic, financial, UX, or legal basis:

  • Strategic: Protect user data by not sharing it with 3rd parties.
  • Financial: Not willing to pay millions of dollars to their rival, Microsoft, through OpenAI, or to add new cost items while already losing billions of dollars.
  • UX: Not willing to share false information with users or send their data to a 3rd-party server, causing delays and a poor user experience.
  • Legal: Minimize the risk of lawsuits due to fictitious answers or copyright.

They can start integrating LLMs after addressing these issues or accepting the risks. It's their business choice. They have the expertise and knowledge to make an informed decision, but not every enterprise does. Let's enable informed decisions for all enterprises:

1. Enterprises do not have control over user data: Enterprises share their user data with 3rd parties when using cloud services, making them vulnerable to breaches and regulatory pressure. For example, a ChatGPT bug revealed other users' chat titles. Apple, Samsung, JPMorgan, Citigroup, Bank of America, Deutsche Bank, Goldman Sachs, Wells Fargo, and Italy banned ChatGPT due to privacy and security concerns.

Students getting their code reviewed before submitting assignments may not care about privacy. However, enterprises must protect confidential information, maintain user trust, and stay compliant.

2. Enterprises do not have control over training data: Enterprises cannot know whether the models contain PII or copyrighted content unless vendors disclose the training data. However, models can later reveal their training data to users. For example, OpenAI discloses that personal information is used in training, and ChatGPT can be biased and discriminatory. There are copyright lawsuits filed against LLM vendors, including OpenAI and Microsoft.

Publishing ethical or responsible AI guidelines does not necessarily mean vendors follow them. For example, OpenAI's four principles include "Build trust" and "Be a pioneer in trust and safety." Yet, it allows users to object to the use of their personal data only under certain laws, not everywhere, and only after EU regulations forced it to. Microsoft laid off its ethical AI team while investing billions of dollars in OpenAI.

3. LLMs are unreliable: LLMs are like Swiss Army knives. (In terms of functionality, not size.) They're versatile and useful but not the best knives. Ted Chiang calls ChatGPT a "blurry JPEG of the web." Researchers from the University of Maryland School of Medicine asked ChatGPT 25 questions three times and found that 88% of its answers were "appropriate." The remaining 12% were wrong, outdated, or even fictitious, even though a simple Google search provides correct responses. Even Google's LLM, Bard, shared false information about the James Webb Space Telescope in content prepared for its own launch event.

4. Hallucinations make LLMs even more unreliable: LLMs have a hallucination problem, meaning they respond confidently even when they don't know the answer. Hallucinations result in inappropriate advice, e.g., encouraging suicide, or wrong information backed by seemingly reliable citations, such as a false sexual assault claim citing the Washington Post. LLMs such as ChatGPT can even confidently describe a book that doesn't exist, complete with details about its content.

Allowing LLMs to design user experiences or provide legal or medical guidance can be risky when enterprises do not have control over the models.

5. LLMs showcase unruly behaviors: After Microsoft added an LLM to Bing Search, the first news that hit the media was Bing insulting users, emotionally manipulating them, and spying on them.

If and when LLMs go awry, they hurt enterprises financially and cause reputational damage. Bard's mistake about the James Webb Space Telescope caused Alphabet to lose $100 billion in market value.

6. LLMs do not take action: LLMs generate text; they do not trigger actions, such as setting a timer, which is one of the most-used Alexa commands. Another model has to parse the generated text and turn it into actions, and this cascaded Spoken Language Understanding approach has accuracy disadvantages compared to the modern end-to-end approach.

Rhino Speech-to-Intent is ideal for creating application-specific custom voice commands to control your software.
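
For example, here is a minimal sketch of inferring an intent directly from speech with Rhino, skipping the generate-text-then-parse step. The context file path and the intent and slot names are hypothetical; they depend on the context you design in the Picovoice Console.

```python
# A minimal sketch: infer intents directly from voice with Rhino Speech-to-Intent.
# Assumes `pvrhino` and `pvrecorder` are installed; the context file and the
# intent/slot names shown in the comment are hypothetical.
import pvrhino
from pvrecorder import PvRecorder

rhino = pvrhino.create(
    access_key="${PICOVOICE_ACCESS_KEY}",         # placeholder; from the Picovoice Console
    context_path="./smart_lighting_context.rhn",  # hypothetical context file
)

recorder = PvRecorder(frame_length=rhino.frame_length)
recorder.start()

try:
    while True:
        if rhino.process(recorder.read()):  # returns True once an utterance is finalized
            inference = rhino.get_inference()
            if inference.is_understood:
                # e.g., intent="changeColor", slots={"location": "kitchen", "color": "blue"}
                print(inference.intent, inference.slots)
            else:
                print("Didn't understand the command.")
finally:
    recorder.delete()
    rhino.delete()
```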

7. LLMs are expensive to train: As the name suggests, LLMs are large. Training a model costs millions of dollars, and the larger the model, the more expensive it gets. It's not surprising that OpenAI received half of Microsoft's investment in the form of Azure credits.

8. LLMs are expensive to run: Large models perform well only when they run on powerful infrastructure. Although OpenAI doesn't disclose any figures, estimates of ChatGPT's operational cost range from $100,000 to $700,000 per day.

9. LLMs have large carbon footprints: When Clive Humby said data is the new oil, he probably didn't consider its carbon footprint. The larger the models get, the higher the carbon footprint. Training GPT-3 once is equivalent to the lifetime carbon emissions of eight cars. Another study estimates that Microsoft used approximately 700,000 liters of fresh water during GPT-3's training. These numbers do not include consumption, i.e., what happens after deployment.

10. Size does not necessarily mean better performance: In a recently "leaked" document, a Google researcher admits that focusing on the largest models on the planet puts Google at a disadvantage. Alternatives like Vicuna achieve better results with limited resources. Similarly, Picovoice proved that 20 MB speech-to-text models can achieve higher accuracy than Google's large ASR models.

Should you integrate LLMs into your voice products?

The answer is: it depends. Before deciding which language model to use, work backward from the customer and the problem. Other powerful language models that are easier to customize and more efficient to run can address your domain- and application-specific needs more accurately. Imagine you want to add a search bar to your e-commerce website. Do you need an engine that indexes the whole internet, including your competitors, or one that indexes just your products and content? The same logic applies to language models.

If you have questions about where and how to start, request an AI Exploration Workshop from Picovoice experts.

Get Expert Help