Android and Speech Recognition (1)

Speech recognition, the translation of spoken language into text, is an interdisciplinary subfield of computational linguistics. Its first steps date back to 1952 with Audrey, a device created at Bell Labs that could recognize numerical digits, and to the 1960s, when IBM developed the Shoebox, a machine that could recognize spoken arithmetic commands and send the result to an attached printer (I highly recommend the video in the IBM archive). IBM, DARPA, and CMU did a lot of research in the field, but it would take more than two decades before products hit the shelves for a wider audience. In 1981 it took up to 100 minutes to decode 30 seconds of spoken text (see the Sarasota Business Journal article).

I first worked with commercial products in this field in the mid-to-late 1990s, with Dragon Dictate and IBM ViaVoice. The engine had to be trained for a specific speaker in a training session of 30 minutes or more. Once you had completed the training, you could achieve decent results when dictating to Word through a plugin. The experience was not quite real-time: you saw the text magically being typed a few seconds after you said it. The products also accepted spoken commands to control Windows, such as opening or closing windows, starting applications, and similar simple tasks.

Fast forward to today, and you find speech recognition in quite a few consumer applications, most prominently in the assistant space, where products like Amazon Alexa, Apple Siri, and Google Assistant dominate the market. According to the Gartner Hype Cycle 2019, speech recognition is reaching the Plateau of Productivity: “Speech Recognition is less than two years to mainstream adoption and is predicted to deliver the most significant transformational benefits of all technologies on the Hype Cycle.” [Quote].

For many simple use cases or applications, neither a conversational model nor semantic interpretation is required; we can focus on recognizing keywords. I will discuss the pros and cons later in this series.
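As a rough illustration of what keyword-focused handling can look like, the following sketch scans a recognized transcript for a small set of keywords. The class name `KeywordSpotter` and the keyword list are assumptions made up for this example; they are not part of any Android API.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: finds known keywords in a recognized transcript.
public class KeywordSpotter {
    // Assumed example keywords for a flight information app
    private static final List<String> KEYWORDS =
            Arrays.asList("flight", "departure", "arrival", "gate");

    // Returns the keywords found in the transcript, in order of first appearance
    public static Set<String> spot(String transcript) {
        Set<String> found = new LinkedHashSet<>();
        // Lowercase the text and split on anything that is not a letter or digit
        String[] tokens = transcript.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (KEYWORDS.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(spot("Departure information for flight LH 778"));
    }
}
```

Exact token matching keeps the sketch simple; it would miss plurals or misrecognized words, which is one of the trade-offs of a keyword-only approach.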

In this first part I demonstrate how simple it is to integrate speech recognition into Android apps. To put this into an aviation context, let's pretend we search for a flight by saying the flight number.

We can start with an empty activity skeleton application in Android Studio.
The app needs the permission to record audio.

String requiredPermission = Manifest.permission.RECORD_AUDIO;

We invoke the Android speech recognizer by launching the respective intent.

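import android.content.Intent;
import android.speech.RecognizerIntent;
import java.util.Locale;
..
// Launch the system speech recognizer with a free-form language model
Intent i = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
i.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
// EXTRA_LANGUAGE expects a BCP 47 language tag string such as "en-US", not a Locale object
i.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.getDefault().toLanguageTag());
i.putExtra(RecognizerIntent.EXTRA_PROMPT, "Say something");
..
// 100 is the request code we check again in onActivityResult
startActivityForResult(i, 100);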

Then we handle the result returned by the recognizer activity.

@Override
protected void onActivityResult(int requestCode, int resultCode, Intent data) {
	super.onActivityResult(requestCode, resultCode, data);
	if (requestCode == 100) {
		if (resultCode == RESULT_OK && null != data) {
			// Candidate transcriptions returned by the recognizer
			ArrayList<String> res = data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
			Log.d("TTS", res.toString());
		}
	}
}

The default Google-style audio input dialog.

Result for “Flight LH 778”

The longer the statement we speak, the more candidate results we get back in the text array.

Result for “Departure Information for flight LH 778”
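Since the recognizer returns several candidate transcriptions, the app still has to pick out the flight number. Here is a minimal sketch, assuming a flight number is a two-letter airline code followed by one to four digits (as in “LH 778”). The class `FlightNumberExtractor`, the regular expression, and the sample candidates are assumptions for illustration, not actual recognizer output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts a flight number from recognition candidates.
public class FlightNumberExtractor {
    // Assumption: two-letter airline code, optional whitespace, 1 to 4 digits
    private static final Pattern FLIGHT =
            Pattern.compile("\\b([A-Za-z]{2})\\s*(\\d{1,4})\\b");

    // Returns the first flight number found, normalized to a form like "LH778"
    public static Optional<String> extract(List<String> candidates) {
        for (String candidate : candidates) {
            Matcher m = FLIGHT.matcher(candidate);
            if (m.find()) {
                return Optional.of(m.group(1).toUpperCase(Locale.ROOT) + m.group(2));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up candidates standing in for the list from onActivityResult
        List<String> candidates = Arrays.asList(
                "departure information for flight LH 778",
                "departure information for flight lh778");
        System.out.println(extract(candidates).orElse("no flight number found"));
    }
}
```

In the activity, the `res` list from `onActivityResult` would be passed to `extract`. Note that this simple pattern also matches any other two-letter word directly followed by a number, so a real app would likely validate the airline code against a known list.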

We could hand the search over directly to Google Search by changing the intent to another action. In that case I have to ask a complete question and cannot just say the flight number. The request is diverted to the web search and won't return to the application. I mention this for completeness; it is not helpful for our use case.

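Intent i = new Intent(RecognizerIntent.ACTION_WEB_SEARCH);
i.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_WEB_SEARCH);
// EXTRA_LANGUAGE expects a BCP 47 language tag string such as "en-US"
i.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.getDefault().toLanguageTag());
// Assumption: no result comes back to the app, so a plain startActivity is enough
startActivity(i);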

In the next part we will skip the standard Google dialog and read the audio input directly. We will also look at the further processing challenges, as well as add speech synthesis.

Stay tuned!
