Speech Recognition and Natural Language Processing are difficult technical problems. Yet solving technical problems is only one part of the equation. Products cannot succeed unless users adopt them, which means solving user problems. One of the most common reasons voice AI projects fail is that teams do not work backward from customers. When choosing a user interface and designing Voice User Interfaces (VUIs), people focus on technical details rather than user needs. Vendors push their own agendas as well, explaining what voice user interfaces mean, how VUIs differ from (or improve on) other types of user interfaces, or why VUIs should have wake words or multimodality. However, as with any user interface design, voice user interfaces should focus on users, as the name suggests.

Erika Hall addresses this in the very first pages of her book Conversational Design:

Conversational design is truly human-centered design, every step of the way. There is no next big thing, only the next step in an unfolding story of how people use technology to be more themselves.

Erika Hall, the co-founder of Mule Design, a strategic design consultancy in San Francisco, and the author of Conversational Design and Just Enough Research, has kindly agreed to share her expertise.

1. You have been working as a designer for over two decades now. How did voice interfaces capture your interest?

I’ve always been interested in the human side of human-computer interaction, particularly the importance of natural language in interfaces. Even in a windows-based GUI, the words are the most important part. (Look at how much of any app is menus.) But when we talk about “design,” visual design is often the first thing that comes to mind.

If you consider how humans have interacted with each other for hundreds of thousands, if not millions, of years, voice has been central. So conversation is close to the original interface. But because digital systems are not actually human beings, they lack the sensitivity to context and cognitive style that makes conversation work.

A user naturally expects a system that talks like a person to behave like a person. The chasm between this expectation and the reality is huge. And unlike other interfaces, it’s a lot harder to provide cues to the limitations. This makes voice interfaces really challenging. The opportunity is for systems to be more accessible and human-centred, but getting there is much harder than it might seem initially.

2. One of the pitfalls that you mention in your book is the device-centred perspective. For example, asking “how do we get our customers to make more purchases using the Amazon Echo” instead of asking “how can we make the presence of an Amazon Echo in the home provide more value to both Amazon and the customer?” Do you think questions asked by enterprises are evolving?

I think assumptions are evolving more than the questions. The basic questions of “What interface is actually the most human-centred (for a given human in a given place and time)?” and “How do we use technology to deliver both customer and business value?” are still valid and still asked too infrequently.

In 2018, there was a lot of hope and hype. Now we’ve had a few more years of living with voice technology, and we’re finding that getting it right is hard. It’s always tempting to start from ever-progressing technical capabilities rather than stubborn human nature. Now that we are a few years down the line, it is possible to examine the track record of successes and failures, see where the introduction of voice interfaces has made a real difference, and proceed more in that direction.

3. The lack of context awareness is another issue you bring up in your book. Could you elaborate on that? Why is context-aware design important?

What each of us needs and does is totally dependent on what’s going on around us at a given time.

Think about going shopping in a store versus online. In the best-case scenario, a salesperson might observe from context cues that you look like you need help, or you know exactly what you want, or that you’re texting, or carrying a sleeping baby. They can adjust their behaviour and how they communicate with you based on their observations. They might gesture, whisper, point, or raise their voice. An e-commerce site or app can’t do that.

A critical category of context for voice interactions is shared spaces: Who else might be speaking or listening at the same time besides the anticipated user?

For example, think of calling a bank’s customer service line. A context-unaware assumption is that because a customer is calling rather than using, say, a mobile app, they want to use voice for every part of the interaction. But what if the customer is on public transportation on the way to the airport and wants to quickly check a balance or make a payment? They probably won’t want to speak their account number or other personally identifying information aloud in a public place. However, a customer who is in the middle of cooking dinner might want to check on their account in a completely hands-free manner.

A voice interface that works well when one person is alone at home in private might be terrible when there are two children and two adults in the same space, as in many households during the pandemic.

Bad assumptions make worse interactions. And the only way you can design a system that works in all likely contexts is to understand as much about the messy lived experiences of your potential users as possible.

4. The current market for building Voice User Interfaces does not offer cross-platform support and makes it hard to iterate. What’s the impact of this lack of support on you as a design strategist and on your clients?

I think it’s actually too easy to make graphical changes to the surface of an interface. Often the level of visual polish can lead designers and decision-makers to think that the concepts underneath are stronger than they are. And the concepts underneath are the most important part.

Designing and building are two different things, and a lot of the most important work of designing can be accomplished with very unsophisticated tools. Talking things through, sketching, role-playing, making rough prototypes to test concepts. Things like that.

If you are very thoughtful about your research and design process—including being very clear on your organization’s capacity and goals—by the time you get to building out your system you will be less limited by the tools and more able to iterate in a strategic, proactive, and efficient manner.

The cliché that “change is the only constant” is half true. It is very useful to go through the exercise of identifying which aspects of your system are likely to remain the same for a given time period while creating space for continuous listening, learning, adaptation, and growth. Finding this flexible middle ground is the essence of system design. There is no silver bullet or substitute for diligent research and cross-discipline collaboration.

5. Lastly, what’d be your advice for organizations interested in adding voice to their products?

Look at the system holistically to understand how, why, and in what contexts voice interactions will actually make products easier to use and more valuable to both the customers and the business. Because you’re looking at a significant investment and a workflow change... Think across devices and modes, because that’s what humans do. And hire some poets and playwrights.