Speechly Guidelines for Creating Productive Voice-Enabled Apps

For the past 4 years at Speechly, we have been experimenting with and developing ways to make using voice-enabled touch screen apps fast and straightforward: in other words, more productive. In this article, we’ll introduce the concepts and guidelines we’ve found effective in creating voice-enabled apps that are robust and enable users to complete tasks faster and with less attention.

Our approach takes advantage of customizable voice actions and the availability of a (touch) display for providing real-time visual feedback and options. As a result, the app can be controlled with both the voice user interface (VUI) and the graphical user interface (GUI), allowing the user to choose the best input method for the occasion. A voice interface can be thought of as a controller for app actions, which makes it possible to retrofit it to an existing application.

We contrast this approach with now-popular voice assistants like Apple’s Siri, Google Home and Amazon Alexa, which are conversational in nature and are typically optimized for generic hands-free use with voice.

Setting the right context

1. Don’t try to build a voice assistant

Voice assistants are digital assistants that react to voice commands, most often by using voice themselves, too. While there are good use cases for voice assistants, their way of using voice is not suitable for touch screen devices.

Instead of question-answer based dialogues, touch screen voice experiences should be based on real-time visual feedback. As the user speaks, the user interface should be instantaneously updated.

2. Users don’t want to converse

When we humans talk with each other, we do more than transmit information by using words. We might greet, persuade, declare, ask or apologize, and even the same words can have a different meaning depending on how we say them and in which situation. This is very human-like, but not the way we want to communicate with a computer.

With a voice user interface, speech has only one function: to command the system to do what the user wants. Be clear that the user is talking with a computer and don’t try to imitate a human. In most cases, the application should not answer in natural language. It should react by updating the user interface, just like when clicking a button.

3. Give visual guidance on what the user can say

An issue commonly described by users of voice user interfaces (VUIs) is uncertainty about which commands are supported.

The problem arises from the fact that the typical voice assistant experience begins from a blank slate, where the assistant starts listening and is expected to be able to help the user with pretty much anything. This is of course not really true, as anybody who has tried these services knows.

Understanding the supported functionality is less of a problem with traditional graphical user interfaces (GUIs). Placing a button that reads “proceed to checkout” in the user’s shopping cart is a very strong signal that checkout is supported and that pressing the button will indeed take the user to the checkout process. This aspect is missing from voice-only solutions, which causes uncertainty about the supported features.

This is why a good voice user interface should, if possible, be supported by a graphical user interface.

Booking air flights — Graphical user interface supporting the voice user interface

4. Use voice for the tasks it is good for

Good design is about providing the user with the best tools for their task.

Voice works great for tasks such as search filtering (“Show me the nearest seafood restaurants with three or more stars”), accessing items from a known inventory (“Add milk, bread, chicken and potatoes”), inputting information (“Book a double room for two in Los Angeles next Friday”) and unambiguous commands (“Show sports news”).

On the other hand, touch is often the better option for selecting from a couple of options, typing things such as email addresses and passwords, and browsing a large unknown inventory by scrolling.

There’s no need to replace your current user interface with a voice user interface. Rather, you should evaluate which tasks in your application are the most tedious to do by touch and the easiest to do by voice, and add voice as a modality to those features.

Receiving commands from the user

5. Onboard the user

When your users first see your voice UI, they will need some guidance on how to use it. Show them a few examples of what they can say.

These examples should be placed close to where the visual feedback will appear. You can hide the examples after the user has tried the voice user interface.

6. Avoid using a wake word

While voice assistants use a wake word so that they can be activated from a distance, your touch screen application doesn’t need one. Repeating the wake word every time makes the experience jarring, adds latency and decreases reliability.

The hands-free scenario is far less relevant than you might initially think, as the user is already holding the device. There are also privacy risks involved with a wake word.

7. Prefer a push-to-talk button mechanism

Push-to-talk is the best way to operate a microphone in a multimodal touch screen application. When the user is required to press a button while talking, it’s completely clear when the application is listening. This also decreases latency by making endpointing very explicit, eliminating the possibility of endpoint false positives (the system stops listening prematurely) and false negatives (the system does not finalize the request after the user has finished the command).

On the desktop you can use the spacebar for activating the microphone.

You can also add a slide as an optional gesture to lock the microphone for a longer period of time. WhatsApp has a good implementation of the design in their app.
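The push-to-talk behavior described above, including the optional slide-to-lock gesture and the desktop spacebar, can be sketched as a small state machine. This is a minimal illustration under stated assumptions: the state names, event names and the `nextMicState` and `keyToMicEvent` helpers are hypothetical and do not come from any particular SDK.

```typescript
// All names here are hypothetical; this is not a real SDK API.
type MicState = "idle" | "listening" | "locked";
type MicEvent = "press" | "release" | "slideToLock" | "tapStop";

// Pure transition function: the button, not silence detection,
// decides when listening starts and stops.
function nextMicState(state: MicState, event: MicEvent): MicState {
  if (state === "idle") {
    return event === "press" ? "listening" : "idle";
  }
  if (state === "listening") {
    if (event === "release") return "idle"; // explicit endpoint
    if (event === "slideToLock") return "locked"; // keep listening without holding
    return "listening"; // repeated presses (e.g. key repeat) change nothing
  }
  // Locked: lifting the finger is ignored until the user taps stop.
  return event === "tapStop" ? "idle" : "locked";
}

// On the desktop, the spacebar can feed the same state machine.
function keyToMicEvent(key: string, phase: "keydown" | "keyup"): MicEvent | null {
  if (key !== " ") return null;
  return phase === "keydown" ? "press" : "release";
}

// Replay a gesture: press, slide to lock, lift the finger, then tap stop.
const gesture: MicEvent[] = ["press", "slideToLock", "release", "tapStop"];
const trace: MicState[] = [];
let current: MicState = "idle";
for (const event of gesture) {
  current = nextMicState(current, event);
  trace.push(current);
}
console.log(trace.join(" -> ")); // listening -> locked -> locked -> idle
```

Keeping the transitions in one pure function means the touch button and the keyboard shortcut cannot drift apart, and the "is the app listening" signal can be derived from a single state value.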

8. Signal clearly when the microphone button is pushed down

To make sure the user knows that the application is listening, signal clearly when the microphone button is pushed down. This is especially important if using the push-to-talk pattern.

You can use sound, animation, tactile feedback (vibration) or a combination to signal the activation. On a handheld touch screen device, make sure that the activated microphone icon is visible from behind the thumb when push-to-talk is activated.

Giving feedback to the user

9. Use non-interruptive modalities for feedback

Non-interruptive modalities include haptic, non-linguistic auditory and, perhaps most importantly, visual feedback. Using these modalities, the application can react fast without interrupting the user. For instance, in the case of “I’m interested in t-shirts,” the UI would swiftly show the most popular t-shirt products, instantly enabling the user to continue with a refining utterance, “do you have Boss.” This further narrows down the displayed products to show only the Boss-branded t-shirts.

On the other hand, voice synthesis is a bad idea for feedback, as any ongoing user utterance will be abruptly interrupted. Voice is also a pretty slow channel for transmitting information, and for returning users, hearing the same synthesized speech every time gets annoying very fast.

10. Minimize latency with streaming natural language understanding

One important part of user experience is the perceived responsiveness of the application. Designers use tricks such as lazy loading, doing tasks in the background, visual illusions and preloading of content to make their applications seem faster, and this should be done with voice, too.

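In voice applications, immediate UI reaction is even more important. Streaming natural language understanding makes it possible by returning results while the user is still speaking, instead of waiting for the end of the utterance. Immediate UI reaction encourages the user to use longer expressions and to continue the voice experience. In case of an error, it enables the user to recover fast.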
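One way to get this immediate reaction is to update the view on every partial result instead of waiting for the utterance to end. The sketch below assumes a hypothetical `SegmentUpdate` shape carrying the transcript so far, the filters extracted so far and an `isFinal` flag; real SDKs use their own field names, so treat every identifier here as an assumption.

```typescript
// Hypothetical shape of one streaming update; real SDKs use their own field names.
interface SegmentUpdate {
  transcript: string; // words recognized so far
  filters: Record<string, string>; // entities extracted so far
  isFinal: boolean; // true once the utterance is complete
}

interface ViewModel {
  transcript: string;
  shownFilters: Record<string, string>; // rendered immediately, may still change
  committedFilters: Record<string, string>; // changes only on final updates
}

// Called for every update, tentative or final, so the UI reacts while the user speaks.
function applyUpdate(view: ViewModel, update: SegmentUpdate): ViewModel {
  // Always rebuild from the committed state so that revised tentative
  // results replace earlier guesses instead of piling up.
  const merged = { ...view.committedFilters, ...update.filters };
  return {
    transcript: update.transcript,
    shownFilters: merged,
    committedFilters: update.isFinal ? merged : view.committedFilters,
  };
}

const updates: SegmentUpdate[] = [
  { transcript: "show me", filters: {}, isFinal: false },
  { transcript: "show me red", filters: { color: "red" }, isFinal: false },
  {
    transcript: "show me red t-shirts",
    filters: { color: "red", product: "t-shirts" },
    isFinal: true,
  },
];

let view: ViewModel = { transcript: "", shownFilters: {}, committedFilters: {} };
for (const update of updates) {
  view = applyUpdate(view, update);
  console.log(view.transcript, JSON.stringify(view.shownFilters));
}
```

Separating what is shown from what is committed lets the UI be forward-leaning with tentative results while keeping a stable state to fall back on if the recognizer revises its guess.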

11. Steer user’s gaze and visual attention

When using voice effectively, the user can control the UI an order of magnitude faster than by tapping and clicking. This means that a lot of things might be happening in the UI at once. It is important that the user can keep up with these UI reactions.

Typically, UI reactions manifest themselves as some sort of visual cues, micro-animations and transitions. There is an instinctive inclination in the human visual cognition system to move visual focus to where movement is happening.

Therefore, it is an antipattern to scatter visual UI reactions all over the user’s visual field, e.g. a streaming transcription animation at the top of the screen and other UI reactions at the bottom. This will result in the user’s gaze bouncing back and forth on the screen, making it very hard to understand what is happening in the user interface and inflicting unnecessary cognitive load and annoyance on the user.

For this reason, there are two good options. One is to centralize all visual UI reactions near one focal point, meaning that both the transcript and the visual transitions resulting from the user’s commands are shown very close to each other. The other is to steer the user’s gaze linearly across the screen with a cascade of animations happening, for example, top-down or left to right.

12. Minimize visual unrest in triggered events

While a voice user interface needs to be as close to real-time as possible, minimize flicker and visual unrest. You can use placeholder images and elements to make sure the application looks smooth and reacts fast.

Recovering from misinterpretation

13. Show the text transcript

Text transcription of users’ voice input is the most important part of the feedback in case of an error. Lack of action tells the user their input was not correctly understood, but in case of an error in the speech recognition, the transcript can enable them to understand why that happened.

The transcript can also be valuable for the user when everything goes right. It tells the user they are being understood and encourages them to continue.

The transcript should always appear in the same central place in the user’s field of vision. If you are using Speechly, you can use the tentative transcript to minimize feedback latency.

14. Fail fast: be forward leaning in producing results but offer opportunity to correct

Natural language processing is hard for many reasons. In addition to the speech recognition failing, the user might hesitate or mix up their words. This can lead to errors, just like a misclick leads to errors in a graphical user interface.

While there are multiple ways to reduce the number of errors, the most important thing is to offer the user an opportunity to correct themselves quickly. Produce the best guess for the correct action as quickly as possible and let the user refine that selection by either voice or touch.

15. Have an intent for verbal corrections

The longer and more complex the sentences your users speak, the more likely the users are to hesitate or make mistakes. This is not a problem if they get real-time feedback and can correct themselves naturally.

Multimodality enables users to use the graphical user interface to correct themselves, but make sure to include an intent for verbal corrections, too. This makes it possible for users to say something like “Show me green, sorry I mean red t-shirts”.
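A correction intent can be handled by replacing the value the user just gave instead of adding another one. The sketch below assumes a hypothetical NLU output in which an utterance is split into segments, each tagged with either a `filter` or a `correct` intent and a list of typed entities; these names and shapes are illustrative, not a real API.

```typescript
// Hypothetical NLU output shapes, for illustration only.
interface Entity {
  type: string;
  value: string;
}

interface IntentSegment {
  intent: "filter" | "correct";
  entities: Entity[];
}

// Builds the list of selected filter values from the recognized segments.
function applySegments(segments: IntentSegment[]): Entity[] {
  const selected: Entity[] = [];
  for (const segment of segments) {
    for (const entity of segment.entities) {
      // Position of the most recent value of the same type, or -1 if none.
      const index = selected.map((e) => e.type).lastIndexOf(entity.type);
      if (segment.intent === "correct" && index >= 0) {
        selected[index] = entity; // replace what the user just said
      } else {
        selected.push(entity); // normal case: add a new filter value
      }
    }
  }
  return selected;
}

// "Show me green, sorry I mean red t-shirts"
const utterance: IntentSegment[] = [
  { intent: "filter", entities: [{ type: "color", value: "green" }] },
  {
    intent: "correct",
    entities: [
      { type: "color", value: "red" },
      { type: "product", value: "t-shirts" },
    ],
  },
];

console.log(JSON.stringify(applySegments(utterance)));
// [{"type":"color","value":"red"},{"type":"product","value":"t-shirts"}]
```

Without the separate intent, the same words would read as two color filters, green and red, which is exactly the result the user was trying to undo.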

16. Use touch for corrections

Another way to make corrections is touch. Touch corrections are typically best done by offering the user a short list of viable options based on what they have said or done earlier.

If your user is filling in a form by using voice commands, for example, they might only need to correct one field. It can be most intuitive to tap that field and make the correction by touch. Make sure you support both ways of making corrections!

17. Offer an alternative way to get the task done without voice

The big issue with voice assistants is that they are hard to use with touch. While voice is a great user interface for many use cases, sometimes it’s not feasible. This is why all features in your application should be usable with both voice and touch. For example, you can use traditional search filters with dropdown menus and include a microphone button for setting the same filters by voice. This enables users to choose the modality they need.
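One way to keep every feature usable with both modalities is to route touch and voice through the same app actions, in line with treating the voice interface as a controller. The sketch below is a minimal example under stated assumptions: `setFilter`, `fromDropdown`, `fromVoice` and the entity shape are hypothetical names chosen for illustration.

```typescript
type Filters = Record<string, string>;

interface SetFilter {
  name: string;
  value: string;
}

// The single place where filter state changes, regardless of input modality.
function setFilter(filters: Filters, action: SetFilter): Filters {
  return { ...filters, [action.name]: action.value };
}

// Touch path: a dropdown change produces one action.
function fromDropdown(name: string, selectedOption: string): SetFilter[] {
  return [{ name, value: selectedOption }];
}

// Voice path: each recognized entity produces one action.
// The { type, value } entity shape is an assumption.
function fromVoice(entities: { type: string; value: string }[]): SetFilter[] {
  return entities.map((entity) => ({ name: entity.type, value: entity.value }));
}

function applyAll(filters: Filters, actions: SetFilter[]): Filters {
  return actions.reduce(setFilter, filters);
}

// The user speaks a search, then fixes one filter from the dropdown.
let filters: Filters = {};
filters = applyAll(
  filters,
  fromVoice([
    { type: "cuisine", value: "seafood" },
    { type: "minStars", value: "3" },
  ])
);
filters = applyAll(filters, fromDropdown("cuisine", "italian"));
console.log(JSON.stringify(filters)); // {"cuisine":"italian","minStars":"3"}
```

Because both paths end in the same action, the dropdowns always reflect what was said and a voice command can pick up where a tap left off.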

Originally published at https://www.speechly.com on November 27, 2020.
