In recent years, there have been some posts about the use of Google’s speech recognition API version 1. Well, to be more accurate, what is called a speech recognition API here is just a “hack” that sends audio data to Google as if it were sent by a user browsing with Chrome or Chromium.
Update (22-02-2016): if you’re using the wget command you found on this page, please consider using https instead of http for better privacy.
- Understanding the reverse engineering behind the use of Google’s Speech Recognition API
- A first example with wget
- Getting a developer key
- An All-in-One script
- Some examples
Understanding the reverse engineering behind the use of Google’s Speech Recognition API
Now that version 1 has been replaced by version 2, this hack had to evolve, since Google now requires a developer key in order to use the service. In this article I will briefly explain how to use the current version and supply a bash script to record an utterance, send it to Google, and print the N-best hypotheses (the N most likely utterances carried by the audio data).
So to get a reply from Google, we have to send an audio file in an HTTP POST request to this page:
https://www.google.com/speech-api/v2/recognize
with the following key=value pairs as URL parameters:
- client=chromium
- lang=language (where language is en_US for American English, fr_FR for French, de_DE for German, es_ES for Spanish etc.).
- key=developer_key (where developer_key is your developer key).
and the following HTTP header:
- Content-Type: audio/x-flac; rate=file_sampling_rate (where file_sampling_rate is the sampling rate of the file; 8000, 16000, 32000 and 44100 are all valid values but not the only possible ones).
The final URL should look like this:
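As a rough illustration, the URL and the header described above can be assembled in the shell as follows. The base URL and the client=chromium parameter are copied from the wget command used later in this article; build_url and content_type_header are hypothetical helper names, and YOUR_DEVELOPER_KEY is a placeholder.

```shell
# Base URL as it appears in the wget command of this article.
API_URL="https://www.google.com/speech-api/v2/recognize"

# Print the request URL for a language code ($1) and a developer key ($2).
# Hypothetical helper, for illustration only.
build_url() {
    printf '%s?client=chromium&lang=%s&key=%s\n' "$API_URL" "$1" "$2"
}

# Print the Content-Type header for a sampling rate in Hz ($1).
content_type_header() {
    printf 'Content-Type: audio/x-flac; rate=%s\n' "$1"
}

build_url en_US YOUR_DEVELOPER_KEY
content_type_header 16000
```

The two helpers only format strings, so they can be tried without sending anything to Google.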
Before we try our first example, there are several things that should be kept in mind:
- This code is only for personal use and for test purposes.
- The developer key is granted by Google and may be revoked at any time.
- The use of the free key is limited by Google to 50 requests a day for a given user.
- Google only accepts audio files in the flac format.
A first example with wget
Record an utterance (preferably in English, unless you replace en_US in the wget command with a language of your choice) with your favorite audio recorder and save the file as a 16000 Hz flac. To do so on a Unix-like system, you can either use parecord:
parecord record.flac --file-format=flac --rate=16000
or rec (requires SoX):
rec -r 16000 -c 1 record.flac
In either case, press Ctrl+C to stop recording. If you don’t know what to use, Audacity is a good choice. Then type the following command, replacing record.flac with the name of your file (and probably the values of rate and key too):
wget -q --post-file record.flac --header="Content-Type: audio/x-flac; rate=16000" -O - "https://www.google.com/speech-api/v2/recognize?client=chromium&lang=en_US&key=AIzaSyAcalCzUvPmmJ7CZBFOEWx2Z1ZSn4Vs1gg"
If Google was able to recognize any speech in the file, you should get a JSON answer with the best hypotheses (i.e. possible transcriptions of the speech data you sent). Otherwise, it returns an empty result array. In case of success, the best hypothesis appears first in the array, usually together with its confidence.
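The hypotheses can be pulled out of the reply with standard text tools. The sketch below assumes that each hypothesis is stored under a "transcript" field and that the best one may carry a "confidence" field; the sample reply is a made-up illustration of that assumed layout, not real output, and extract_transcripts is a hypothetical helper name.

```shell
# Print one transcript per line, in the order they appear in the reply.
# Assumption: hypotheses are stored as "transcript":"..." pairs.
# Limitation: transcripts containing escaped quotes are not handled.
extract_transcripts() {
    grep -o '"transcript":"[^"]*"' | sed 's/^"transcript":"//; s/"$//'
}

# Made-up reply in the assumed layout: an empty result first,
# then a result with two alternatives.
sample_reply='{"result":[]}
{"result":[{"alternative":[{"transcript":"hello world","confidence":0.93},{"transcript":"hello word"}],"final":true}],"result_index":0}'

printf '%s\n' "$sample_reply" | extract_transcripts
```

With a real request, the output of the wget command can be piped into extract_transcripts instead of the sample. A JSON-aware tool would be more robust than grep and sed if one is available.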
Note that Google returns a reply even if your user agent is wget. If you run into trouble with this, try adding -U Chromium/39.0.2171.71 as a wget option.
Getting a developer key
The key we are going to use was generated a short time before this article was posted. If it doesn’t work, follow this link for key generation instructions.
Here is the key used above: AIzaSyAcalCzUvPmmJ7CZBFOEWx2Z1ZSn4Vs1gg
An All-in-One script
I packaged everything into one bash script that lets the user interact with Google’s speech recognition (SR) service from the command line. The script records a few seconds of audio, sends the recording to Google for transcription, and filters the result so that the best hypotheses are printed in ranked order, one per line. The first two lines should contain the best hypothesis and its confidence respectively. The following lines contain alternative hypotheses. If you are not familiar with speech recognition systems, you might be a bit surprised to find that the perfect output (if there is one) is not always what the system considers the best hypothesis, but one of the alternatives!
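The record-and-send steps described above can be sketched with two small functions. This is not the author's script, only a minimal illustration built from the parecord and wget commands shown earlier; record_utterance and send_utterance are hypothetical names.

```shell
API_URL="https://www.google.com/speech-api/v2/recognize"

# Record $1 seconds of audio into file $2 at sampling rate $3 (Hz).
# timeout stops parecord after the given duration and then exits
# with status 124, which is the expected outcome here.
record_utterance() {
    duration=$1 file=$2 rate=$3
    status=0
    timeout "$duration" parecord "$file" --file-format=flac --rate="$rate" || status=$?
    [ "$status" -eq 0 ] || [ "$status" -eq 124 ]
}

# POST file $1 (sampling rate $2, language $3, developer key $4)
# and print the raw reply on standard output.
send_utterance() {
    file=$1 rate=$2 lang=$3 key=$4
    wget -q --post-file "$file" \
         --header="Content-Type: audio/x-flac; rate=$rate" \
         -O - "$API_URL?client=chromium&lang=$lang&key=$key"
}
```

A typical call would be record_utterance 3 record.flac 16000 followed by send_utterance record.flac 16000 en_US YOUR_DEVELOPER_KEY, where YOUR_DEVELOPER_KEY is a placeholder.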
To run the wget command, you need a Unix-like system such as a Linux distribution. To run the script, you also need parecord and timeout installed on your system.
These are also available on https://github.com/amsehili/gspeech-rec
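Before running either the command or the script, it may help to check that the required tools are installed. The helper below is a small sketch using the standard command -v lookup; check_deps is a hypothetical name.

```shell
# Return 0 if every named command is found in PATH, 1 otherwise.
# Each missing command is reported on standard error.
check_deps() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing dependency: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the tools named in this article, the call would be check_deps wget parecord timeout.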
Some examples
Send a flac audio file and get its transcription
./speech-rec.sh -i record.flac --rate 16000
Use a language other than en_US (the default; use -l or --language)
./speech-rec.sh -i record.flac --rate 16000 --language fr_FR
./speech-rec.sh -i record.flac -r 16000 -l es_ES
Record your voice for 3 seconds and get what you’ve said
Talk for 7 seconds (the default is 3; use -d)
./speech-rec.sh -d 7
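For readers curious how such options might be handled, here is a minimal option-parsing sketch covering only the flags used in the examples above. It is not taken from speech-rec.sh: parse_args is a hypothetical name, and the default rate of 16000 is an assumption (the defaults for language and duration come from this article).

```shell
# Defaults: en_US and 3 seconds are stated in this article;
# 16000 Hz is an assumption made for this sketch.
infile="" rate=16000 lang="en_US" duration=3

# Parse the flags used in the examples; every flag takes one value.
parse_args() {
    while [ $# -gt 0 ]; do
        opt=$1
        [ $# -ge 2 ] || { echo "missing value for $opt" >&2; return 1; }
        value=$2
        shift 2
        case $opt in
            -i)            infile=$value ;;
            -r|--rate)     rate=$value ;;
            -l|--language) lang=$value ;;
            -d)            duration=$value ;;
            *) echo "unknown option: $opt" >&2; return 1 ;;
        esac
    done
}
```

Calling parse_args "$@" at the top of a script would fill the four variables from the command line, leaving the defaults in place for any flag that is not given.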
Note that if you don’t supply an audio file as an argument, the script writes the recorded audio data into a file named record.flac and doesn’t delete it afterwards so that you can listen to what you said. To play the audio file, type:
A Python API that does nearly the same thing: https://github.com/Uberi/speech_recognition