How to Design Voice-Enabled Apps for Optimal User Experience

Ed Sarfeld | Steve Szoczei | March 4, 2019 | 10 Min Read

Learn how to design for context of use, a critical component of voice-enabled applications, in order to provide users with a personalized, seamless multimodal experience in today’s world full of Internet-connected products.

Consumers enjoy the convenience that voice offers when it comes to simplifying tasks and increasing product usability. In only a few short years, we’ve advanced beyond the days of audio-only devices like the Amazon Echo and Google Home and reached a point where voice user interfaces are built into a range of devices (mobile, television, vehicle, computer) as well as a range of products (e.g., sound systems, thermostats, house lights, security systems).

With this evolution of the voice technology experience comes richer, more complex use cases that involve multiple access points, multiple users and increased integration with numerous commercial products. Given the situational complexity of voice interface access today, how do you design voice-enabled applications that are actually useful to consumers and provide them with the best personalized user experience possible? We’ll get to that, but before we do, let’s quickly examine how the landscape of voice technology has evolved to make “context of use” core to the design of voice user interfaces today.

Examining the Current Landscape of Voice Technology

Deloitte Global predicts that the industry for smart speakers—internet-connected speakers with integrated digital voice assistants—will be worth US$7 billion in 2019, selling 164 million units. When we consider this prediction along with the fact that the Google Assistant is available on nearly a billion mobile devices, the impact of proper conversational interface design becomes critical.

Voice access points are becoming ubiquitous throughout our household, mobile and work environments. The Amazon Echo started the trend, with the Google Home following closely behind. These audio-only devices provided a means to communicate with their respective services as well as control other hardware or equipment. These pioneers have since spawned sibling hardware that provides richer or more tailored experiences through touch-screen interfaces, such as the Echo Show and Echo Spot as well as the Google Home Hub. With their ability to connect to other devices, these multi-sensory systems have opened the door to easily deployed home automation. With these voice-first products, users can control their music systems, adjust their home heating and operate kitchen appliances across not one, but several devices and platforms. Through the expansion of third-party integrations, both Alexa and the Google Assistant can also be found in many different devices such as lamps, televisions, smart mirrors and even patio sun shades.

It’s evident that in the near future, voice interaction will play a much more valuable role for users than merely acting as their local weather person or DJ. Voice technology provides convenience, can save time, can increase the accessibility of technology and offers other valuable benefits. However, if we do not evaluate the possible contexts of use to accurately understand (1) what the user is requesting and (2) where and how they’re requesting it, and then (3) provide the intended response in the most appropriate way, the technology is not being used to its full potential.

Designing Voice-Enabled Apps for Context of Use

Now that voice is being merged with touch display experiences across multiple products, there is more room to cater to the unique environmental contexts of the user and personalize their experience. The voice-enabled applications that have been designed for context of use will be the ones that become adopted in the market and add societal value. When approaching development, we can frame context of use using three aspects:

1. The User’s Physical Context

Understanding the physical context of the user while accessing a voice-enabled application will help designers identify what the ‘right’ interaction is for the user in any possible space and time.

Is the user in a private or public setting? If a user is in a public space, they should be able to interact with the application in a private manner (e.g., via a screen) as allowed by the device, and then prompt the voice interface once their privacy is restored, in order to create a seamless, fluid experience.

Is there surrounding noise that could affect the accuracy of voice in detecting commands and other speech? For example, if someone were at a concert, voice would not be an effective modality to access an application to order a cab. A screen interaction would be more appropriate given the ambient noise. However, if a few friends were ordering a cab from the confines of their home, voice would be a viable, convenient mode of accessing the application.

How many users exist in the environment? Is one person using the application, or can you expect multiple people to use it? Accurately recognizing a specific user’s voice in multi-user environments may pose challenges: for example, distinguishing patient from doctor from visitor in a clinical setting, between family members at home, or between co-workers in the office. In a multi-user environment where those using the voice-enabled application have different roles or authority, varying levels of access to information may exist as well. So how do we properly identify who is who so as not to breach information security? Currently, identity recognition is established through Alexa or Google Assistant by training the system to recognize a specific voice. In instances where this becomes impractical or unachievable, additional means of user validation are required. Ideally, simple biometric identifiers such as a fingerprint scan can be used, or hardware-based controls such as a press-to-speak button. If a user is speaking to their phone, the source is evident; if they are in a multi-person hospital room, some means of speaker identification will be required.
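To make this concrete, here is a minimal sketch of how a role-aware access check might sit behind a voice response in a multi-user setting. The roles, resources and validation methods are illustrative assumptions for this example, not a feature of the Alexa or Google Assistant platforms.

```typescript
// A minimal sketch of role-aware access control for a multi-user voice environment.
// Roles, resources and validation methods are illustrative assumptions.

type Role = "physician" | "patient" | "visitor";
type ValidationMethod = "voiceProfile" | "fingerprint" | "pressToSpeak";

interface SpeakerIdentity {
  userId: string;
  role: Role;
  validatedBy: ValidationMethod;
}

// Map each role to the data it is allowed to hear read aloud.
const allowedResources: Record<Role, Set<string>> = {
  physician: new Set(["vitals", "medications", "labResults"]),
  patient: new Set(["vitals", "visitSchedule"]),
  visitor: new Set(["visitSchedule"]),
};

// Only respond with sensitive content when the speaker has been positively
// identified and their role grants access to the requested resource.
function canSpeakAloud(speaker: SpeakerIdentity | null, resource: string): boolean {
  if (speaker === null) {
    // Unidentified speaker in a shared room: fall back to a non-sensitive prompt.
    return false;
  }
  return allowedResources[speaker.role].has(resource);
}

// Example: a validated physician asking for lab results in a multi-bed room.
const speaker: SpeakerIdentity = {
  userId: "dr-lee",
  role: "physician",
  validatedBy: "fingerprint",
};
console.log(canSpeakAloud(speaker, "labResults")); // true
console.log(canSpeakAloud(null, "labResults"));    // false: route to a screen or deny
```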

Designers will be required to develop a logic structure within the application in order for it to understand ‘where’ it is being used and present the ‘right’ interface based on the devices available. Which brings us to our next point…

2. The Context of the Devices Available

Not only is it important to consider which modalities make sense for the user’s interaction in a specific physical environment, but also how those interactions are presented based on the user’s preferred device, as well as other connected devices that are available. Increased connectivity of devices presents an opportunity to augment and enhance the user’s engagement by continuing their experience with the application across various devices that provide unique benefits. For example, someone with an Amazon Echo device could book a flight by prompting the assistant and initiating the booking process via voice, then receive a visual confirmation of the information on their mobile device or another available connected device with a screen (TV, Echo Show, laptop, tablet) before completing the booking. Rather than having all of their details repeated to them by the voice assistant, the user can quickly skim over the information, ensure it’s correct and complete the task faster. Or if someone is using voice to follow a recipe while cooking, their experience would be enhanced if they could see visuals or videos of what each step of preparation should look like, instead of relying solely on voice direction.
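As a rough illustration of the flight-booking example, the sketch below routes the confirmation step to the largest connected screen when one is available and falls back to a short spoken summary when it isn’t. The device and plan shapes are assumptions made for the example, not any platform’s API.

```typescript
// A sketch of choosing where to send a booking confirmation: a connected screen
// if one is available, otherwise a spoken summary. Device shapes are illustrative.

interface ConnectedDevice {
  id: string;
  hasScreen: boolean;
  screenSize: "small" | "medium" | "large";
}

interface ConfirmationPlan {
  channel: "screen" | "voice";
  deviceId?: string;
  prompt: string;
}

function sizeRank(size: ConnectedDevice["screenSize"]): number {
  return { small: 1, medium: 2, large: 3 }[size];
}

function planConfirmation(devices: ConnectedDevice[]): ConfirmationPlan {
  // Prefer the largest available screen so the user can skim the details
  // instead of listening to them read back line by line.
  const screens = devices
    .filter((d) => d.hasScreen)
    .sort((a, b) => sizeRank(b.screenSize) - sizeRank(a.screenSize));

  if (screens.length > 0) {
    return {
      channel: "screen",
      deviceId: screens[0].id,
      prompt: "I've sent the flight details to your screen. Say 'confirm' when you're ready.",
    };
  }
  // No screen available: read back only the essentials by voice.
  return {
    channel: "voice",
    prompt: "Here are your flight details. Should I confirm the booking?",
  };
}

// Example: a voice-only speaker plus a phone and a TV in the same household.
console.log(
  planConfirmation([
    { id: "kitchen-speaker", hasScreen: false, screenSize: "small" },
    { id: "living-room-tv", hasScreen: true, screenSize: "large" },
    { id: "phone", hasScreen: true, screenSize: "small" },
  ])
);
```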

Identifying which devices are available at the time also helps to provide insight into where the user is and their physical context. If a device picks up the connection of a car’s internal system, it becomes clear that the user is now in a vehicle and they have limited access to their mobile device. If a device picks up the connection of a Google Home, it becomes aware that the user is now at home in a more private environment. If a device picks up a connected microwave, it becomes aware that the user is now in the kitchen and their hands may be busy cooking or grabbing a snack.
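The kind of logic structure this implies can be sketched as a simple mapping from connected devices to an inferred physical context. The device types and context attributes below are illustrative assumptions, intended only to show the shape of such a rule set.

```typescript
// A sketch of inferring the user's physical context from which devices are
// currently connected. Device types and inferred contexts are illustrative.

type DeviceType = "carHeadUnit" | "homeSpeaker" | "kitchenAppliance" | "phone";

interface PhysicalContext {
  location: "vehicle" | "home" | "kitchen" | "unknown";
  handsLikelyBusy: boolean;
  screenAppropriate: boolean;
}

function inferContext(connected: DeviceType[]): PhysicalContext {
  if (connected.includes("carHeadUnit")) {
    // Driving: keep interactions voice-only and brief.
    return { location: "vehicle", handsLikelyBusy: true, screenAppropriate: false };
  }
  if (connected.includes("kitchenAppliance")) {
    // Cooking: hands busy, but a nearby screen can still show recipe steps.
    return { location: "kitchen", handsLikelyBusy: true, screenAppropriate: true };
  }
  if (connected.includes("homeSpeaker")) {
    // At home: a relatively private environment where voice is convenient.
    return { location: "home", handsLikelyBusy: false, screenAppropriate: true };
  }
  return { location: "unknown", handsLikelyBusy: false, screenAppropriate: true };
}

console.log(inferContext(["phone", "carHeadUnit"])); // vehicle: voice-only, brief responses
```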

After we establish which devices are best included in augmenting the voice-first user experience, we then have to think about how to design for those specific technologies: for example, vibrations or small-screen interactions for a smartwatch, large visuals and audio for a TV, or pre-canned responses for in-car experiences (e.g., “I can’t answer right now, I’m driving”). Going a level deeper, we need to consider platform-specific issues, which may require developing variations for each device platform if support is desired. This could include variations for voice-only experiences or combined voice and screen experiences, as well as application variations for Amazon’s platform versus Google’s platform.

With many products and devices accessing disparate data sources for different uses, there is a challenge in saving the right data in the right place so that the user’s actions in the current context are saved back to the system. This data should be saved invisibly by the system, without the need for user engagement. And not only does data need to be saved automatically across the user’s experience with a voice-enabled application, it also needs to be kept secure yet accessible across different platforms. This allows multiple users to perform tasks through different modalities and to switch between modalities seamlessly.
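One way to picture this invisible, cross-device persistence is a shared task-state store that every modality reads from and writes to after each turn. The sketch below uses an in-memory map as a stand-in for whatever secure, account-keyed cloud store an actual product would use.

```typescript
// A sketch of carrying task state across modalities so the user can start by voice
// and finish on a screen. The store is an illustrative in-memory stand-in.

interface TaskState {
  userId: string;
  taskId: string;
  step: string;
  data: Record<string, string>;
  updatedAt: number;
}

// Stand-in for a cross-device session store keyed to the user's account.
const sessionStore = new Map<string, TaskState>();

function saveProgress(state: Omit<TaskState, "updatedAt">): void {
  // Saved automatically after every turn, with no explicit user action required.
  sessionStore.set(`${state.userId}:${state.taskId}`, { ...state, updatedAt: Date.now() });
}

function resumeProgress(userId: string, taskId: string): TaskState | undefined {
  // Any device the user picks up next can resume from the same point.
  return sessionStore.get(`${userId}:${taskId}`);
}

// Example: a booking started by voice on a speaker, resumed on a phone screen.
saveProgress({
  userId: "user-42",
  taskId: "flight-booking",
  step: "awaiting-confirmation",
  data: { origin: "YYZ", destination: "MCO" },
});
console.log(resumeProgress("user-42", "flight-booking")?.step); // "awaiting-confirmation"
```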

3. The Context of the Conversation

By examining the context of the possible conversations at hand, designers can also identify when the user is able to engage in longer interactions or what their available attention span may be. This requires prompting for clearly defined, suitable responses and ensuring that users are given an appropriate level of detail based on the context of both the usage environment and the device used. For example, if a user asks their voice assistant what the weather is today, it would be unnecessary, and likely irritating to the user, to provide a response with details of what the weather is in various parts of the city or what it will be over the following week. They are likely looking for a quick response that tells them what the temperature is now, what it will be later in the day and whether it will rain. If they are traveling to Florida for a week, then they would include “Florida” and “week” in their question to receive specific forecast information for that time period in Florida and not locally.
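The weather example can be expressed as a simple rule: scale the detail of the response to the scope of the question. The slot names, values and response templates below are illustrative assumptions rather than any specific assistant SDK.

```typescript
// A sketch of scaling response detail to the question that was actually asked,
// using the weather example above. Slots and templates are illustrative.

interface WeatherQuery {
  location?: string; // e.g. "Florida"; defaults to the user's current location
  rangeDays: number; // 1 for "today", 7 for "this week"
}

function buildWeatherResponse(query: WeatherQuery): string {
  const place = query.location ?? "your area";
  if (query.rangeDays <= 1) {
    // A quick, local question deserves a quick, local answer.
    return `In ${place} it's 18 degrees right now, 22 later today, with no rain expected.`;
  }
  // A broader question ("next week in Florida") justifies a longer summary.
  return `Here's the ${query.rangeDays}-day outlook for ${place}: mostly sunny, highs near 27, with showers possible midweek.`;
}

console.log(buildWeatherResponse({ rangeDays: 1 }));
console.log(buildWeatherResponse({ location: "Florida", rangeDays: 7 }));
```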

One critical element that cannot be overlooked when designing voice-enabled applications is whether the content of the conversation is sensitive in nature. Is there information stored within the application that needs to be kept private or accessible only to certain users? Depending on the nature of the voice conversation, greater attention to security and privacy may be required. Interactions within a healthcare environment may require stricter data security to maintain compliance with HIPAA or other government or institutional regulations. In some instances, you may wish to share data across all user groups; in others, privacy needs to be maintained. Examples include an office environment where management exchanges sensitive business information, project teams required to keep confidential client information private to their team, or physicians who need access to patient information that should be inaccessible to other staff or visitors in a hospital.

Lastly, visual or vibrational cues need to be designed into the conversational architecture to let users know when voice-enabled devices are listening, have stopped listening, when they’re ready for a response, or when the task is complete and the conversation is over. These cues also need to be in place to indicate when other devices have been connected to the experience and are available for use. For example, playing a sound or prompt when a mobile device is connected to a vehicle lets a user know they can carry on their text conversation via voice.
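One way to keep these cues consistent across an ecosystem is a single mapping from conversation states to visual, audio and haptic signals. The state names and cue channels below are illustrative assumptions, not a particular device’s specification.

```typescript
// A sketch of mapping conversation states to the visual, audio or haptic cues
// described above. State names and cue channels are illustrative.

type ConversationState =
  | "listening"
  | "stoppedListening"
  | "awaitingResponse"
  | "taskComplete"
  | "deviceConnected";

interface Cue {
  visual?: string;  // e.g. a light-ring color or on-screen indicator
  sound?: string;   // a short earcon
  haptic?: boolean; // vibration on wearables or phones
}

const cues: Record<ConversationState, Cue> = {
  listening: { visual: "ring-blue" },
  stoppedListening: { visual: "ring-off" },
  awaitingResponse: { visual: "ring-pulse", sound: "prompt-chime" },
  taskComplete: { visual: "ring-green", sound: "success-chime" },
  deviceConnected: { sound: "connect-chime", haptic: true },
};

function cueFor(state: ConversationState): Cue {
  // A single lookup keeps cues consistent across every device in the ecosystem.
  return cues[state];
}

console.log(cueFor("deviceConnected")); // e.g. play a chime when the car connects
```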

Leverage Voice to Craft Great Multimodal Experiences

Developing a well-rounded understanding of the user’s environment in the moment they use the application is the voice UI design practice that optimizes the value of voice-enabled applications. Artificial intelligence will play a part in developing that understanding of context of use (AI allows a system to be cognizant of events, proactive and dynamic), but it’s interaction design that plays the critical role in getting the user’s experience right. By following the guidelines outlined above and being aware of the challenges of these digital ecosystems, voice-enabled solutions can provide a superior multimodal experience.

If you’d like to dig deeper into designing stellar voice-enabled applications, we’d recommend taking a look at our webinar on voice UI design best practices to help execute on delivering exceptional conversational user experiences.

