Monica Lam has been a Professor in the Computer Science Department at Stanford University since 1988. She holds a B.Sc. from the University of British Columbia and a Ph.D. from Carnegie Mellon University. Monica is a member of the National Academy of Engineering and a co-author of one of the most popular computer science textbooks, Compilers: Principles, Techniques, and Tools (2nd Edition), known as the Dragon Book. Professor Lam’s current research centres on virtual assistants with an emphasis on privacy protection. She is the faculty director of the Open Virtual Assistant Lab (OVAL), which builds Genie, the open-source virtual assistant. [1]

Professor Lam and Picovoice met through a shared passion: enabling private voice interactions and making voice technology available to all, not exclusive to big tech.

Q: How did voice assistants capture your interest? Why is voice important?

Voice is the human-to-human interface. Now that computers can finally communicate with humans using human language, digital information becomes accessible to everybody – including the blind, the preliterate, and the illiterate. Voice assistants today are a gateway to the many available voice interfaces. They will continue to grow more personal and will be able to give advice and recommendations to the user. This is an extremely rich research topic, and the results can make a huge impact on society. I want to advance and democratize voice technology, which includes covering low-resource languages.

Q: Speech recognition is one of the AI-complete problems, although as humans we’re good at it without realizing what we’re capable of. Thinking of all the tough problems you’ve tackled throughout your career, how has your experience been so far?

The problem of recognizing speech is by and large solved. The challenging problem now is natural language understanding – how do we understand the meaning of what users are saying or typing? This requires machine learning, which normally means we need tons of training data. The typical process is to find out what users would say or ask for, then annotate the meaning of those sentences, which needs to be done by an expert. There is so much variety in what people ask for and how they express it that this takes a lot of data and is therefore prohibitively expensive to acquire. Because the process is so expensive, not everybody can afford it. However, our research shows that it is possible to reduce the amount of training data needed by two orders of magnitude by synthesizing most of the data with the help of large language models. Yet it still requires domain expertise, which, again, not everybody has. This is why we’ve initiated the WWvW (World Wide Voice Web), and I appreciate Picovoice’s support and work to make it available to anyone.

Q: Where do you see voice technology going? What types of applications excite you the most?

I think that all applications in the future will be multimodal with a voice component. Today there are over 20 million web developers; I think we will have 20 million voice interface developers in the future. This interface is useful across the board for many different applications. What excites me the most is the idea that in the future we will have an assistant in our ears at all times via earbuds. We won’t have to pull out a phone to get access to digital information.

Q: The voice market is dominated by big tech. Companies like Amazon and Google, which can afford thousands of deep learning researchers and engineers, now have access to the majority of the voice data. What are the risks of such dominance?

The worldwide web is an open platform where companies can freely put up information for all to see. If the voice market ends up being controlled by just a couple of companies, many businesses will be hurt, which will eventually hurt consumers. Moreover, virtual assistants are in a position to collect a massive amount of personal information about their users. Leaving all personal information in the hands of a couple of companies would give them too much power over consumers. It is thus a huge risk to privacy as well.

Q: For voice products, privacy is one of the major barriers to greater adoption. Some analysts believe that users will become progressively more comfortable and sacrifice privacy for convenience. Like you, we at Picovoice believe that users can have both privacy and convenience. Why should users have both?

Consumers today assume that privacy is a price that must be paid for convenience. They do not know that it is technically possible to have both, because the big tech companies do not offer both. People want convenience; tech has made so many things convenient. But privacy is important too: it is the huge aggregation of personal profiles in social media that has made many misinformation campaigns possible.

Q: Can you tell us a bit about Genie, the voice assistant? Why did you start it?

Genie is an open-source, privacy-preserving virtual assistant that we have developed in our Stanford Open Virtual Assistant Lab. It is the first assistant built to support conversations rather than just simple commands. It performs many of the most popular skills, such as playing songs, news, and podcasts, giving restaurant recommendations, reporting the weather, etc. It can also be used to control IoT devices through our collaboration with Home Assistant, an open-source home gateway company. Unlike with commercial assistants, the IoT data never leave your house and are therefore private. What is more significant is that the assistant demonstrates how our open-source Genie toolset makes it possible for a small team to create an assistant. I started this project because we want to advance voice technology and make it publicly available so every company can create voice interfaces easily.

Q: “Hey Genie”, the wake word that activates the assistant is powered by Picovoice Porcupine. How did you come across Picovoice? What stood out?

Picovoice is a leader in the field of wake words. We are very impressed with how easy it is to get a wake word and how well it performs. We have tried other alternatives, but they do not perform as well.

Q: “Hey Genie”, powered by Porcupine, can run on various platforms, not just the current hardware. People who are not very familiar with this space generally assume wake word detection is easy to build, but in fact, it’s not. Porcupine is the first and still the only wake word engine that can run across platforms. Why is it difficult to build a wake word engine, and why is it important to make this technology accessible to everyone?

It is hard because we need to be able to detect how different people would pronounce the wake word, and it has to be processed on-device. This normally requires a lot of data, and the model needs to be small enough to fit on the device. Not every organization can afford to record hundreds of people to train a wake word and then optimize the trained model for the target platform. Picovoice solves this problem by enabling instant training on its web console. It’s impressive how Picovoice makes this data-heavy and expensive wake word training process invisible and provides optimized wake word models to everyone.

Q: When you think of developing a graphical user interface, let’s say a website, it’s accessible from every browser, and if you want to change the colour, it’s easy. However, what the market pushes for building voice user interfaces is different: no cross-platform support, and it’s not easy to iterate. What was your reaction when you saw you could train a voice model through a simple type-and-click interface on Picovoice Console?

I’m excited that Picovoice is one of those rare enterprises with a vision that we share. There will be so many voice interfaces on so many platforms that we need an easy-to-use tool like Picovoice Console that works across a large variety of platforms. It is this shared vision that makes Picovoice a great partner for our project.

Q: Lastly, what do you think Picovoice should do or focus on next?

The first is echo cancellation; it is an important component of a voice interface.