Using voice commands in an Android App

Building a voice-controlled Ping Pong scoreboard

Published in ITNEXT, Aug 1, 2017

The idea is to build an Android app that is controlled entirely by voice, without touching the screen. We first look at Google Now and Google Assistant, Voice Actions, SpeechRecognizer, and TextToSpeech, and decide which of them are applicable to our app.

Use Case App: Ping Pong Board

The use case for this idea is a Ping Pong board app. The app has two screens: the first shows the leaderboard with the available players and their scores, and the second shows the two players who are currently playing and their game points.

You can find the source code on GitHub:

LINKIT-Group/ping-pong: a ping pong scoreboard app with voice recognition

Google Now vs Google Assistant:

Google Now is a voice-activated personal assistant that can search the web for you and perform some system-defined tasks. It is similar to Apple's Siri, but Google Now is platform agnostic and can run on iOS devices as well.

Google Assistant is like an upgraded sibling of Google Now. It uses machine learning and artificial intelligence to learn about you over time, and you can hold a conversation with it. It is gradually taking the place of Google Now.

There are two ways to activate Google Assistant: by long-pressing the home button or by saying "OK, Google". Starting with Android 6.0 Marshmallow, the system may open an overlay window for the assistant on top of our app (called the "source app"). This makes it possible to gather some information from our app and use it to take an action outside of our app. However, this doesn't help us achieve what we want: a voice command that takes an action inside our app (e.g. "OK, Google, point for Efe"). Further information about using the assistant is available in the Android documentation.

Voice Actions:

Google defines two types of voice actions: system voice actions and custom voice actions. At first, it sounds like custom voice actions might help us. Unfortunately, Google no longer accepts requests for custom voice actions. System voice actions, on the other hand (such as “search”, “set alarm”, “initiate a phone call”, “take a picture”, or “open url”), do not cover what our use case app needs.

Google's documentation also classifies voice actions as System-provided Voice Actions and App-provided Voice Actions. You can start your app by saying "OK Google, Start MyApp" after defining a label attribute in your manifest file for the activity that you want to start. This is good to know, but App-provided Voice Actions are not what we are looking for. The same documentation page also describes free-form speech input, which looks promising.

Free-form Speech Input:

According to Google's documentation, a common way of getting free-form speech from the user is to call startActivityForResult with the ACTION_RECOGNIZE_SPEECH action and receive the result in onActivityResult. This approach displays a default view to collect the user's speech input. But if it fails to understand what the user said, you have to touch the screen to try again, which conflicts with our goal of verbal-only communication. Therefore we decided to implement an interaction similar to Google Assistant, using the SpeechRecognizer and TextToSpeech classes.

Here are the steps we take to choose two players from the user's voice input:

1- Ask the user “Who is the first player?” using text to speech

2- Listen for the user's answer to choose the first player

3- Ask the user “Who is the second player?” (same as Step 1)

4- Listen for the user's answer to choose the second player (same as Step 2)
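The four steps above can be sketched as a small ask-and-listen loop. This is only an illustration, not the code from the repository: `Speaker` and `Listener` are hypothetical interfaces standing in for Android's TextToSpeech and SpeechRecognizer so the flow can run as plain Java, and the retry limit, the apology prompt, and the assumption of single-word player names are ours.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for TextToSpeech: speaks a prompt to the user. */
interface Speaker {
    void speak(String text);
}

/** Hypothetical stand-in for SpeechRecognizer: returns candidate transcriptions, best first. */
interface Listener {
    List<String> listen();
}

class PlayerSelection {
    private static final int MAX_ATTEMPTS = 3; // assumed retry limit per question

    private final Speaker speaker;
    private final Listener listener;
    private final List<String> knownPlayers;

    PlayerSelection(Speaker speaker, Listener listener, List<String> knownPlayers) {
        this.speaker = speaker;
        this.listener = listener;
        this.knownPlayers = knownPlayers;
    }

    /** Steps 1-4: ask and listen twice, returning the two chosen players in order. */
    List<String> selectPlayers() {
        List<String> chosen = new ArrayList<>();
        chosen.add(askFor("Who is the first player?", chosen));
        chosen.add(askFor("Who is the second player?", chosen));
        return chosen;
    }

    /** Asks by voice and re-asks (without any touch input) until a known player is heard. */
    private String askFor(String question, List<String> alreadyChosen) {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            speaker.speak(question);
            String match = matchPlayer(listener.listen(), alreadyChosen);
            if (match != null) {
                return match;
            }
            if (attempt < MAX_ATTEMPTS - 1) {
                speaker.speak("Sorry, I did not catch that.");
            }
        }
        throw new IllegalStateException("No player recognized for: " + question);
    }

    /** Returns the first known, not yet chosen player whose name appears as a word in a candidate. */
    private String matchPlayer(List<String> candidates, List<String> alreadyChosen) {
        for (String candidate : candidates) {
            for (String word : candidate.split("[^\\p{L}]+")) {
                for (String player : knownPlayers) {
                    if (player.equalsIgnoreCase(word) && !alreadyChosen.contains(player)) {
                        return player;
                    }
                }
            }
        }
        return null;
    }
}
```

In a real app, the `Speaker` would wrap a TextToSpeech instance and the `Listener` would be driven by SpeechRecognizer callbacks, which are asynchronous. The blocking `listen()` call here is a simplification to keep the sketch short.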

The biggest challenge: Continuous Voice Recognition

After the two players are selected, the app decides who is going to start the game and announces it (e.g. “Efe is going to start the game”). The challenge starts at this point. We need to listen continuously for a phrase like “Point for Efe”. However, Google's documentation clearly states that this is not what the SpeechRecognizer API is intended for:

The implementation of this API is likely to stream audio to remote servers to perform speech recognition. As such this API is not intended to be used for continuous recognition, which would consume a significant amount of battery and bandwidth.
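Continuous listening aside, each recognized utterance would still have to be mapped to a scoring action. The sketch below shows one way a phrase like “Point for Efe” could be matched against the known players. The class name `ScoreCommandParser` and the exact phrase format are assumptions made for illustration. The input is a list of candidate transcriptions, since recognizers typically return several hypotheses for one utterance.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

class ScoreCommandParser {
    private static final String PREFIX = "point for "; // assumed command format

    /** Returns the player who scored if any candidate reads "point for <known player>". */
    static Optional<String> parse(List<String> candidates, List<String> players) {
        for (String candidate : candidates) {
            String text = candidate.trim().toLowerCase(Locale.ROOT);
            if (!text.startsWith(PREFIX)) {
                continue; // not a scoring command, try the next hypothesis
            }
            String spokenName = text.substring(PREFIX.length()).trim();
            for (String player : players) {
                if (player.toLowerCase(Locale.ROOT).equals(spokenName)) {
                    return Optional.of(player);
                }
            }
        }
        return Optional.empty();
    }
}
```

Checking every hypothesis rather than only the first one helps when the top transcription is a mishearing but a lower-ranked one contains the expected phrase.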

Several bugs have already been reported against the SpeechRecognizer API by developers who tried an approach similar to ours. In addition, the time gaps between restarting recognition cycles interrupt the flow of a continuous voice recognition experience. We learned that working against the system would not help us achieve our goals, and stopped pursuing this approach.

To finish our app, we decided to use touch gestures (instead of voice recognition) to start the game and to add points for players. However, we kept voice recognition for selecting players, to present the approach described in this tutorial.

Conclusion

If you want to continue working on this approach, try building your own implementation that captures audio directly from the device’s microphone and streams it to a voice recognition service. There are also some libraries that look promising for achieving what we want.
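As a rough idea of what the streaming part of such an implementation might look like, here is a minimal sketch that reads audio in fixed-size chunks and forwards each chunk to a sink. Everything in it is an assumption for illustration: `AudioSink` is a hypothetical interface representing the client of whichever recognition service you choose, and a plain `InputStream` stands in for the microphone (on Android the bytes would come from something like AudioRecord instead).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical sink for audio chunks, e.g. a client of a streaming recognition service. */
interface AudioSink {
    void send(byte[] chunk);
}

class AudioStreamer {
    /** Reads until the end of the stream, forwarding each chunk; returns the total bytes sent. */
    static long stream(InputStream audio, AudioSink sink, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] buffer = new byte[chunkSize];
        long total = 0;
        try {
            int read;
            while ((read = audio.read(buffer)) != -1) {
                // Copy only the bytes actually read; the last chunk may be shorter.
                sink.send(Arrays.copyOf(buffer, read));
                total += read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

Sending small chunks as they arrive, rather than one complete recording, is what would let a service start recognizing while the user is still speaking.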