Android and Speech Recognition (2)

In part 2 about speech recognition we do the reverse: instead of openly calling Google Search or integrating the speech recognition intent (which prominently shows the Google logo/splash screen), we call the same recognition method in a headless way, making the experience more seamless. To make it meaningful in the aviation context, we will request BA flight information (through the IAG webservices) via audio.

(Please note, the code snippets below are incomplete and only highlight the key methods; you find the complete source code on Github, see the link at the end of the post.)

Calling the speech recognizer

// Create the recognizer and attach our listener implementation.
sr = SpeechRecognizer.createSpeechRecognizer(this);
sr.setRecognitionListener(new listener());

// Same intent as in part 1, but handed to the headless recognizer.
Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
intent.putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 10);
sr.startListening(intent);

Implementation of the recognition listener class

class listener implements RecognitionListener
{
	public void onResults(Bundle results)
	{
		// The recognizer returns an n-best list of transcription alternatives.
		ArrayList<String> data = results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);
		StringBuilder result = new StringBuilder();
		for (int i = 0; i < data.size(); i++)
		{
			Log.d(TAG, "result " + data.get(i));
			result.append(i + ":" + data.get(i) + "\n");
		}
		textViewResult.setText(result);
	}
	// The other RecognitionListener callbacks (onReadyForSpeech, onError, ..) are omitted here.
}

With this approach we still get the two usual acoustic signals marking the beginning of the speech recognition phase and its end (after a short timeout).

If we want to create a hands-free user experience where the user never has to touch the screen, we can make use of the play or call button usually found on headsets. The event fired when one of these hardware buttons is pressed can be captured as well.

Capture hardware button events

@Override
public boolean onKeyDown(int keyCode, KeyEvent event) {
	Log.v(TAG, event.toString());
	// Headset hook (call button) or media play button triggers headless listening.
	if (keyCode == KeyEvent.KEYCODE_HEADSETHOOK || keyCode == KeyEvent.KEYCODE_MEDIA_PLAY) {
		Log.i(TAG, event.toString());
		listenHeadless();
		return true;
	}
	return super.onKeyDown(keyCode, event);
}
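
The listenHeadless() method is not part of this snippet; a minimal sketch of what it could look like, assuming it simply re-creates the intent and restarts the recognizer from the first snippet (the complete version is in the Github sources):

private void listenHeadless() {
	// Sketch only: re-create the recognition intent and start listening again.
	Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
	intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
	intent.putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 10);
	sr.startListening(intent);
}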

Text-To-Speech

The missing link to a hands-in-the-pocket experience is an audio response from the app through the headset. For this we add the standard Android Text-To-Speech (TTS) implementation.

// Note: in production pass an OnInitListener instead of null and only call
// speak() once the engine reports TextToSpeech.SUCCESS.
textToSpeech = new TextToSpeech(getApplicationContext(), null);
...
textToSpeech.setPitch(1.0f);
textToSpeech.setSpeechRate(0.80f);
int speechStatus = textToSpeech.speak(textToSpeak, TextToSpeech.QUEUE_FLUSH, null, null);

if (speechStatus == TextToSpeech.ERROR) {
	Log.e("TTS", "Error in converting Text to Speech!");
}

A remark about text-to-speech engines: the default engines are not very pleasant to listen to. For an application speaking to users repeatedly or over a long time I recommend looking at commercial TTS engines (e.g. Nuance, Cereproc, ..); check out the sound samples at the end of the post. TTS engines are usually produced per language, so a multi-lingual application needs to cater for every language.
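
To give an idea how that looks in code: a minimal sketch supplying an OnInitListener and selecting the language (Locale.UK here is just an example; a third-party engine could additionally be passed via the three-argument TextToSpeech constructor):

textToSpeech = new TextToSpeech(getApplicationContext(), new TextToSpeech.OnInitListener() {
	@Override
	public void onInit(int status) {
		if (status == TextToSpeech.SUCCESS) {
			// Pick the language; fall back gracefully if voice data is missing.
			int result = textToSpeech.setLanguage(Locale.UK);
			if (result == TextToSpeech.LANG_MISSING_DATA || result == TextToSpeech.LANG_NOT_SUPPORTED) {
				Log.e("TTS", "Language is not supported or voice data is missing");
			}
		}
	}
});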

Adding a Webservice

To make this sample app more business relevant (airport related), we use the spoken text to request flight data. For this purpose I will reuse the BA webservice call from an earlier post. Do not forget to add the permission for internet access to the app.
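
The manifest permission is <uses-permission android:name="android.permission.INTERNET" />. As a rough sketch (not the actual IAG call; the URL, API key header and the handleFlightResponse() helper are placeholders), the request could be fired off the main thread like this, with flightNumber extracted from the spoken text as described below:

new Thread(() -> {
	try {
		// Placeholder URL - the real IAG/BA endpoint and API key are in the earlier post.
		URL url = new URL("https://api.example.com/flights/" + flightNumber);
		HttpURLConnection con = (HttpURLConnection) url.openConnection();
		con.setRequestProperty("client-key", "YOUR_API_KEY"); // placeholder header
		BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
		StringBuilder response = new StringBuilder();
		String line;
		while ((line = in.readLine()) != null) {
			response.append(line);
		}
		in.close();
		// Hand the JSON back to the UI thread for parsing and the TTS response.
		runOnUiThread(() -> handleFlightResponse(response.toString()));
	} catch (IOException e) {
		Log.e(TAG, "Webservice call failed", e);
	}
}).start();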

Here we immediately hit the first challenge: we only receive a text version of the spoken request. As we won't implement semantic or contextual speech recognition, we have to apply regular expressions to find the key elements, such as flight number and date. Humans phrase the request in dozens of possible variations, multiplied by the varying interpretations of the speech-to-text engine; some are listed in the regex test screen below (link).
To keep it simple, we allow the user to request flight info for one particular British Airways flight number between yesterday and tomorrow. We look for a two-character, one-to-four-digit combination in the text, plus the keyword yesterday/tomorrow/today (no keyword meaning today).

Regular expression to identify the flight number
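
A minimal sketch of such an extraction, assuming the carrier code and digits may be separated by whitespace; requestFlightInfo() is a hypothetical helper wrapping the webservice call:

// Needs java.util.regex.Pattern / Matcher.
// Match a two-letter carrier code followed by 1..4 digits, e.g. "BA 117" or "ba2490".
Pattern flightPattern = Pattern.compile("([A-Za-z]{2})\\s*(\\d{1,4})");
Matcher m = flightPattern.matcher(spokenText);
if (m.find()) {
	String flightNumber = (m.group(1) + m.group(2)).toUpperCase();
	// Default to today; shift by one day for the keywords yesterday/tomorrow.
	int dayOffset = 0;
	String lower = spokenText.toLowerCase();
	if (lower.contains("yesterday")) dayOffset = -1;
	else if (lower.contains("tomorrow")) dayOffset = 1;
	requestFlightInfo(flightNumber, dayOffset); // hypothetical helper wrapping the webservice call
}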

To push the scenario further, we let the TTS engine read the response to the user after selecting the relevant information from the JSON formatted webservice response.
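
A minimal sketch of that last step, using the org.json classes bundled with Android; the field names are hypothetical placeholders, the actual structure is defined by the IAG webservice:

private void handleFlightResponse(String json) {
	try {
		JSONObject flight = new JSONObject(json);
		// Field names are placeholders - see the webservice documentation for the real ones.
		String textToSpeak = "Flight " + flight.optString("FlightNumber") + " is " + flight.optString("Status");
		textToSpeech.speak(textToSpeak, TextToSpeech.QUEUE_FLUSH, null, null);
	} catch (JSONException e) {
		textToSpeech.speak("Sorry, I could not read the flight information.", TextToSpeech.QUEUE_FLUSH, null, null);
	}
}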

Sound samples

Source code

References

Android and Speech Recognition (1)

Speech recognition, the translation of spoken language to text, is an interdisciplinary subfield of computational linguistics. It finds its roots and first steps as early as 1952 with Audrey, a device created at Bell Labs that could recognize numerical digits, and the 1960s, when IBM developed the Shoebox, a machine that could recognize spoken arithmetic commands and send the result to an attached printer (I highly recommend the video in the IBM archive). A lot of research followed at IBM, DARPA and CMU, but it would take more than two decades before products hit the shelf for a wider audience. In 1981 it still took up to 100 minutes to decode 30 seconds of spoken text (see the Sarasota Business Journal article).

The first time I worked with commercial products in this field was in the mid-to-late 1990s, with Dragon Dictate and IBM Via Voice. The engine had to be trained for a specific speaker in a 30-minute-plus training session. Once you had passed the training you could achieve decent results dictating into Word via a plugin, though the experience was not quite real-time: you saw the text magically being typed a few seconds after you said it. The product also allowed voice commands to control Windows, to open or close windows, start applications and perform similar simple tasks.

Fast forward to today and you find speech recognition in quite a few consumer applications, most prominently in the assistant space, with products like Amazon Alexa, Apple Siri and Google Assistant dominating the market. Looking at the Gartner Hype Cycle 2019, the technology is reaching the Plateau of Productivity: “Speech Recognition is less than two years to mainstream adoption and is predicted to deliver the most significant transformational benefits of all technologies on the Hype Cycle.” [Quote]

For many simple use-cases or applications, neither a conversational model nor semantic interpretation is required; we can focus on recognizing keywords. I will discuss the pros and cons later in this series.

In this first part I demonstrate how simple it is to integrate speech recognition into an Android app. To put this into an aviation context, let's pretend we search for a flight by saying the flight number.

We can start with an empty-activity skeleton application in Android Studio.
All that is required is the permission to record audio.

String requiredPermission = Manifest.permission.RECORD_AUDIO;
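
Since Android 6.0 RECORD_AUDIO is a dangerous permission that additionally has to be requested at runtime; a minimal sketch using the support library helpers (200 is an arbitrary request code):

// Declare the permission in the manifest and also request it at runtime.
if (ContextCompat.checkSelfPermission(this, requiredPermission) != PackageManager.PERMISSION_GRANTED) {
	ActivityCompat.requestPermissions(this, new String[]{requiredPermission}, 200);
}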

We invoke the Android speech recognition by calling the respective intent.

import android.speech.RecognizerIntent;
..
Intent i = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
i.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
i.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.getDefault());
i.putExtra(RecognizerIntent.EXTRA_PROMPT, "Say something");
..
// 100 is our arbitrary request code, checked again in onActivityResult().
startActivityForResult(i, 100);

Handling the result of the activity

@Override
protected void onActivityResult(int requestCode, int resultCode, Intent data) {
	super.onActivityResult(requestCode, resultCode, data);
	if (requestCode == 100) {
		if (resultCode == RESULT_OK && data != null) {
			// The recognizer returns a list of transcription alternatives.
			ArrayList<String> res = data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
			Log.d("TTS", res.toString());
		}
	}
}

The default Google-style audio input.

Result for “Flight LH 778”

The longer the statement we speak, the more alternative results we get back in the result array.

Result for “Departure Information for flight LH 778”
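
Depending on the engine, a confidence score per alternative may be returned along with the results via EXTRA_CONFIDENCE_SCORES; a minimal sketch extending the onActivityResult() handler above:

// Optional: confidence scores lining up with the result list, if the engine supplies them.
float[] scores = data.getFloatArrayExtra(RecognizerIntent.EXTRA_CONFIDENCE_SCORES);
ArrayList<String> res = data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
for (int i = 0; i < res.size(); i++) {
	// Scores range from 0.0 to 1.0.
	Log.d("TTS", res.get(i) + (scores != null ? " (" + scores[i] + ")" : ""));
}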

For completeness: we could also defer the search directly to Google Search by changing the action. In that case I have to ask a complete question and cannot just say the flight number; the request is diverted to the web search and won't return to the application, so this is not helpful for our use-case.

Intent i = new Intent(RecognizerIntent.ACTION_WEB_SEARCH);
i.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_WEB_SEARCH);
i.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.getDefault());

In the next part we will omit the Google standard screen and read the audio input directly. We will also look at the further processing challenges, as well as add speech synthesis.

Stay tuned!