On the use of Google’s Speech Recognition API Version 2

I find that Google’s Speech Recognition service is just as great as many of its other services and tools like Google Translate, Gmail etc. (though I’m not a big fan of Google’s privacy policy).

In recent years, there have been some posts about the use of Google’s speech recognition API version 1. To be more accurate, what is called a speech recognition API here is just a “hack” that sends audio data to Google as if it were sent by a user browsing with Chrome or Chromium.

Update (22-02-2016): if you’re using the wget command you found on this page, please consider using https instead of http for better privacy.

Understanding the reverse engineering behind the use of Google’s Speech Recognition API

Now that version 1 has been replaced by version 2, this hack had to evolve, since Google now requires a developer key in order to use the service. In this article I will briefly explain how to use the current version and supply a bash script to record an utterance, send it to Google, and print the N-best hypotheses (the N most likely transcriptions of the audio data).

To get a reply from Google, we have to send an audio file in an HTTP POST request to this page:

https://www.google.com/speech-api/v2/recognize

with the following key=value pairs as URL parameters:

  • client=chromium
  • lang=language (where language is en_US for American English, fr_FR for French, de_DE for German, es_ES for Spanish etc.).
  • key=a_developer_key

and the following HTTP header:

  • Content-Type: audio/x-flac; rate=file_sampling_rate (where file_sampling_rate is the sampling rate of the file; 8000, 16000, 32000 and 44100 are all valid values, but not the only possible ones).
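
As a minimal sketch of how the parameters and the header above fit together, the following bash functions assemble them from their parts. The function names build_url and build_content_type, and the placeholder a_developer_key, are illustrative only and are not part of any Google tool.

```shell
#!/usr/bin/env bash

# Assemble the request URL from a language code and a developer key.
# (build_url is a hypothetical helper name used for illustration.)
build_url() {
    local lang="$1" key="$2"
    printf 'https://www.google.com/speech-api/v2/recognize?client=chromium&lang=%s&key=%s\n' \
        "$lang" "$key"
}

# Assemble the Content-Type header for a FLAC file with the given sampling rate.
build_content_type() {
    local rate="$1"
    printf 'Content-Type: audio/x-flac; rate=%s\n' "$rate"
}

# Print both pieces for American English and a 16000 Hz file.
build_url en_US a_developer_key
build_content_type 16000
```

Nothing is sent over the network here; the two strings are only printed so that they can be checked before being passed to an HTTP client.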

The final URL should look like this:

https://www.google.com/speech-api/v2/recognize?client=chromium&lang=en_US&key=a_developer_key

Before we try our first example, there are several things that should be kept in mind:

  • This code is only for personal use and for test purposes.
  • The developer key is granted by Google and may be revoked any time.
  • The use of the free key is limited by Google to 50 requests a day for a given user.
  • Google only accepts audio files in the FLAC format.

A first example with wget

Record an utterance (preferably in English, unless you replace en_US in the wget command with a language of your choice) with your favorite audio recorder and save it as a 16000 Hz FLAC file. To do so on a Unix-like system, you can either use parecord:

parecord record.flac --file-format=flac --rate=16000

or rec (requires SoX):

rec -r 16000 -c 1 record.flac

In either case, press Ctrl+C to stop recording. If you don’t know what to use, Audacity is a good choice. Then type the following command, replacing record.flac with the name of your file (and probably the values of rate and key too):

wget -q --post-file record.flac --header="Content-Type: audio/x-flac; rate=16000" -O - "https://www.google.com/speech-api/v2/recognize?client=chromium&lang=en_US&key=AIzaSyAcalCzUvPmmJ7CZBFOEWx2Z1ZSn4Vs1gg"
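
If you send requests often, the command above can be wrapped in a small bash function so that the file, sampling rate, language, and key become arguments. This is only a sketch: the function name recognize is made up for this example, and the readability check is an addition that is not part of the original command.

```shell
#!/usr/bin/env bash

# Post a FLAC file to the recognizer and print the raw reply on standard output.
# Usage: recognize FILE RATE LANG KEY   (recognize is a hypothetical name)
recognize() {
    local file="$1" rate="$2" lang="$3" key="$4"

    # Fail early instead of letting wget complain about a missing file.
    if [ ! -r "$file" ]; then
        echo "cannot read audio file: $file" >&2
        return 1
    fi

    # Same options as the one-line wget command, with the values parameterized.
    wget -q --post-file "$file" \
        --header="Content-Type: audio/x-flac; rate=$rate" \
        -O - \
        "https://www.google.com/speech-api/v2/recognize?client=chromium&lang=$lang&key=$key"
}
```

A call would look like: recognize record.flac 16000 en_US your_developer_key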

If Google was able to recognize any speech within the file, you should get a JSON answer with the best hypotheses (i.e. possible transcriptions of the speech data you sent). Otherwise, it returns an empty result array. In case of success, the best hypothesis appears first in the array, usually along with its confidence.
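
To pull the hypotheses out of such an answer without extra tools, grep and cut are enough for a quick test. The sketch below assumes that each hypothesis appears as a "transcript":"..." pair and that transcripts contain no escaped quotes; the sample reply is made up for illustration and only imitates the assumed shape of a real answer.

```shell
#!/usr/bin/env bash

# Print one transcript per line from a reply read on standard input.
# Assumption: transcripts look like "transcript":"..." with no escaped quotes.
extract_transcripts() {
    grep -o '"transcript":"[^"]*"' | cut -d '"' -f 4
}

# Hypothetical reply: an empty result followed by one with two hypotheses.
sample_reply='{"result":[]}
{"result":[{"alternative":[{"transcript":"hello world","confidence":0.92},{"transcript":"hello word"}],"final":true}],"result_index":0}'

# Prints the two transcripts of the sample, best hypothesis first.
printf '%s\n' "$sample_reply" | extract_transcripts
```

For anything beyond quick experiments, a real JSON parser is a safer choice than pattern matching.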

Note that Google nicely returns a reply even if your user agent is wget. If you run into trouble with this, try using -U Chromium/39.0.2171.71 as a wget option.

Getting a developer key

The key we are going to use was generated a short time before this article was posted. If it doesn’t work, follow this link for key generation instructions.

Here is the key used above: AIzaSyAcalCzUvPmmJ7CZBFOEWx2Z1ZSn4Vs1gg

An All-in-One script

I packaged everything into one bash script that lets the user interact with Google’s SR. The script records a couple of seconds of audio data, sends the recorded stream to Google for transcription, and filters the result so that the sorted best hypotheses are printed one per line. The first two lines should contain the best hypothesis and its confidence respectively. The following lines contain alternative hypotheses. If you are not familiar with speech recognition systems, you might be a bit surprised to find out that the perfect output (if there is any) is not always what the system considers the best hypothesis, but one of the alternatives!
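
The recording step can be sketched with timeout and parecord, the two tools the script depends on. This is not necessarily how speech-rec.sh does it: the function name record_utterance is invented for this example, and the defaults (3 seconds, record.flac, 16000 Hz) simply mirror the values used in this article.

```shell
#!/usr/bin/env bash

# Record from the default PulseAudio source for a fixed number of seconds.
# Usage: record_utterance [SECONDS] [FILE] [RATE]   (hypothetical helper)
record_utterance() {
    local duration="${1:-3}" file="${2:-record.flac}" rate="${3:-16000}"
    local status=0

    # timeout stops parecord once the duration has elapsed.
    timeout "$duration" parecord "$file" --file-format=flac --rate="$rate" || status=$?

    # timeout exits with 124 when it had to stop the command, which is the
    # expected outcome here, so it is reported as success.
    if [ "$status" -eq 124 ]; then
        return 0
    fi
    return "$status"
}
```

Calling record_utterance 7 would record for seven seconds into record.flac.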

You can get hold of the script from this repository (clone the repository or download it as a Zip file, or simply copy and paste the code from the speech-rec.sh file).

Requirements

To run the wget command, you need a Unix-like system such as Linux. To run the script, you also need parecord and timeout installed on your system.

Some examples

These are also available on https://github.com/amsehili/gspeech-rec

Send a flac audio file and get its transcription

./speech-rec.sh -i record.flac --rate 16000


Use a language other than en_US (default, use --language or -l option)

French

./speech-rec.sh -i record.flac --rate 16000 --language fr_FR

Spanish

./speech-rec.sh -i record.flac -r 16000 -l es_ES


Record your voice for 3 seconds and get what you’ve said

./speech-rec.sh


Talk for 7 seconds (default=3, use --duration or -d option)

./speech-rec.sh -d 7

Note that if you don’t supply an audio file as an argument, the script writes the recorded audio data into a file named record.flac and doesn’t delete it afterwards so that you can listen to what you said. To play the audio file, type:

paplay record.flac

or, with SoX:

play record.flac

Resources

A Python API that does nearly the same thing: https://github.com/Uberi/speech_recognition
