On the use of Google’s Speech Recognition API Version 2

I find Google’s Speech Recognition service just as good as many of its other services and tools, such as Google Translate and Gmail (though I’m not a big fan of Google’s privacy policy).

In recent years, there have been some posts about the use of version 1 of Google’s speech recognition API. To be more accurate, what is called a speech recognition API here is just a “hack” that sends audio data to Google as if it were sent by a user browsing with Chrome or Chromium.

Update (22-02-2016): if you’re using the wget command you found on this page, please consider using https instead of http for better privacy.

Understanding the reverse engineering behind the use of Google’s Speech Recognition API

Now that version 1 has been replaced by version 2, this hack had to evolve, since Google now requires a developer key in order to use the service. In this article I will briefly explain how to use the current version and supply a bash script to record an utterance, send it to Google, and print the N-best hypotheses (the N most likely utterances carried by the audio data).

So to get a reply from Google, we have to send an audio file in an HTTP POST request to this page:

https://www.google.com/speech-api/v2/recognize

with the following key=value pairs as URL parameters:

  • client=chromium
  • lang=language (where language is en_US for American English, fr_FR for French, de_DE for German, es_ES for Spanish etc.).
  • key=a_developer_key

and the following HTTP header:

  • Content-Type: audio/x-flac; rate=file_sampling_rate (where file_sampling_rate is the sampling rate of the file; 8000, 16000, 32000 and 44100 are all valid values, but not the only possible ones).

The final URL should look like this:

https://www.google.com/speech-api/v2/recognize?client=chromium&lang=en_US&key=a_developer_key
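
The pieces above can be put together in a small shell sketch. The helper names build_url and content_type_header are hypothetical (they are not part of the script presented later), and a_developer_key is a placeholder for your own key.

```shell
# Hypothetical helper: assemble the request URL from the language and the key.
build_url() {
    local lang="$1" key="$2"
    printf 'https://www.google.com/speech-api/v2/recognize?client=chromium&lang=%s&key=%s\n' "$lang" "$key"
}

# Hypothetical helper: build the Content-Type header for a given sampling rate.
content_type_header() {
    printf 'Content-Type: audio/x-flac; rate=%s\n' "$1"
}

# Print the URL and header for an American English, 16000 Hz recording.
build_url en_US a_developer_key
content_type_header 16000
```

Keeping the language, key and rate as arguments makes it harder to send a header whose rate does not match the file.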

Before we try our first example, there are several things that should be kept in mind:

  • This code is only for personal use and for test purposes.
  • The developer key is granted by Google and may be revoked any time.
  • The use of the free key is limited by Google to 50 requests a day for a given user.
  • Google only accepts audio files in the flac format, and they must be mono (single-channel).

A first example with wget

Record an utterance (preferably in English, unless you replace en_US in the wget command with a language of your choice) with your favorite audio recorder and save the file as a 16000 Hz mono flac file. To do so on a Unix-like system, you can either use parecord:

parecord record.flac --file-format=flac --rate=16000 --channels=1

or rec (requires SoX):

rec -r 16000 -c 1 record.flac

In either case, press Ctrl+C to stop recording. If you don’t know what to use, Audacity is a good choice. Then type the following command, replacing record.flac with the name of your file (and probably the values of rate and key too):

wget -q --post-file record.flac --header="Content-Type: audio/x-flac; rate=16000" -O - "https://www.google.com/speech-api/v2/recognize?client=chromium&lang=en_US&key=AIzaSyAcalCzUvPmmJ7CZBFOEWx2Z1ZSn4Vs1gg"
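
For readers who prefer curl, which several commenters also tried, the same request can probably be written as follows. This is a minimal sketch rather than a tested replacement: the function name recognize_curl is made up for illustration, and KEY stands for your own developer key.

```shell
# Hypothetical helper: POST a flac file to the recognizer with curl.
# Arguments: file, sampling rate, language, developer key.
recognize_curl() {
    local file="$1" rate="$2" lang="$3" key="$4"
    curl -s -X POST \
         --data-binary "@$file" \
         -H "Content-Type: audio/x-flac; rate=$rate" \
         "https://www.google.com/speech-api/v2/recognize?client=chromium&lang=$lang&key=$key"
}

# Usage (KEY is a placeholder for your own developer key):
#   recognize_curl record.flac 16000 en_US KEY
```

The --data-binary option matters here: it sends the file bytes untouched, which is what a compressed audio file needs.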

If Google was able to recognize any speech in the file, you should get a JSON answer with the best hypotheses (i.e. possible transcriptions of the speech data you sent). Otherwise, it returns an empty result array. In case of success, the best hypothesis appears first in the array, usually together with its confidence.
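
To pull the transcriptions out of the reply without a JSON parser, something like the following may be enough. The sample reply is invented for illustration and its exact layout is an assumption (only the result array, the transcript entries and the confidence value are described above); the helper names are hypothetical too.

```shell
# Made-up sample reply, for illustration only; the real layout may differ.
sample='{"result":[]}
{"result":[{"alternative":[{"transcript":"hello world","confidence":0.93},{"transcript":"hello word"}],"final":true}],"result_index":0}'

# Print one transcript per line, in the order they appear in the reply.
extract_transcripts() {
    grep -o '"transcript":"[^"]*"' | sed -e 's/^"transcript":"//' -e 's/"$//'
}

# Print the first confidence value found in the reply, if any.
best_confidence() {
    grep -o '"confidence":[0-9.]*' | head -n 1 | cut -d: -f2
}

printf '%s\n' "$sample" | extract_transcripts
printf '%s\n' "$sample" | best_confidence
```

This pattern matching breaks on transcripts that contain escaped quotes, so a real JSON tool is a safer choice for anything beyond quick tests.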

Note that Google returns a reply even if your user agent is wget. If you run into any trouble with this, try using -U Chromium/39.0.2171.71 as a wget option.

Getting a developer key

The key we are going to use was generated a short time before this article was posted. If it doesn’t work, follow this link for key generation instructions.

Here is the key used above: AIzaSyAcalCzUvPmmJ7CZBFOEWx2Z1ZSn4Vs1gg

An All-in-One script

I packaged everything into one bash script that lets the user interact with Google’s speech recognition service. The script records a couple of seconds of audio data, sends the recorded stream to Google for transcription, and filters the result so that the best hypotheses are printed in order, one per line. The first two lines should contain the best hypothesis and its confidence respectively. The following lines contain alternative hypotheses. If you are not familiar with speech recognition systems, you might be a bit surprised to find out that the perfect output (if there is one) is not always what the system considers the best hypothesis, but one of the alternatives!

You can get hold of the script from this repository (clone the repository, download it as a Zip file, or simply copy/paste the code into a speech-rec.sh file).

Requirements

To run the wget command, you need a Unix-like system such as Linux. To run the script, you also need parecord and timeout installed on your system.
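
A quick way to check these requirements before running anything is sketched below. missing_tools is a hypothetical helper, not something the script provides; it relies only on the shell’s command -v builtin.

```shell
# Hypothetical helper: print the name of every listed tool that is not installed.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Prints nothing when wget, parecord and timeout are all available.
missing_tools wget parecord timeout
```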

Some examples

These are also available on https://github.com/amsehili/gspeech-rec

Send a flac audio file and get its transcription

./speech-rec.sh -i record.flac --rate 16000


Use a language other than en_US (default, use --language or -l option)

French

./speech-rec.sh -i record.flac --rate 16000 --language fr_FR

Spanish

./speech-rec.sh -i record.flac -r 16000 -l es_ES


Record your voice for 3 seconds and get what you’ve said

./speech-rec.sh


Talk for 7 seconds (default=3, use --duration or -d option)

./speech-rec.sh -d 7

Note that if you don’t supply an audio file as an argument, the script writes the recorded audio data into a file named record.flac and doesn’t delete it afterwards so that you can listen to what you said. To play the audio file, type:

paplay record.flac

or with sox

play record.flac

Resources

A Python API that nearly does the same thing: https://github.com/Uberi/speech_recognition

50 thoughts on “On the use of Google’s Speech Recognition API Version 2”

  1. Hi!
    wget -q --post-file record.flac --header="Content-Type: audio/x-flac; rate=16000" -O - "http://www.google.com/speech-api/v2/recognize?client=chromium&lang=en_US&key=AIzaSyAcalCzUvPmmJ7CZBFOEWx2Z1ZSn4Vs1gg"

    doesn’t work for me, I only get an empty response: {"result":[]}. I’ve tried different keys, and curl instead of wget… a few weeks ago everything worked 😦
    Any ideas?

  2. hello,
    Maybe you’ve reached your key’s limit of 50 requests per day, or your file is too long (over 15 sec).
    Also try adding a user-agent option to your HTTP request (e.g. -U Chromium/39.0.2171.71).

    Cheers

  3. Try with a file you’re sure worked in the past, or ask a friend to try it for you (it’s possible that Google is blocking your address for a while).
    Don’t forget the language parameter “lang=en_US”; it mostly won’t work if your utterance is in a language other than the one in “lang”.

  4. I also had problems running the script from github. The problem was that by default parecord will record 2 channels. Adding --channels=1 to the option list fixed the issue for me.

    Here is the new line 85 (at time of this post):
    timeout $DURATION parecord $INFILE --file-format=flac --rate=$SRATE --channels=1

    • @slip, thanks for your fix. I confirm that stereo audio isn’t accepted by Google anymore (it seems the API is changing dynamically).

      The script on GitHub is now up to date and supports sox.

      Cheers

    • This is a language-independent process; all you need to do is send an HTTP request as explained above.
      I don’t know if somebody has implemented it in C#; ask Google, or have a look on GitHub.

  5. Pingback: Getting Speech Recognition to work on Mac

  6. Pingback: Recognise This…? A Quick Tour of Some *-Recognition Service APIs | OUseful.Info, the blog...

  7. Pingback: Using Google ASR – sunilkumarkopparapu

  8. hi,
    I’m not using the wget method. I just tried the Speech API URL using an online HTTP request sender tool. I used your example “english.wav” file as the HTTP message body (raw data), and added the appropriate header values. But I always get {"result":[]}. Please give me your feedback.

    • Dear Prem,

      The API works only with flac audio data. Moreover, your file must be mono. If you have ffmpeg use this command to convert from wav to flac:

      ffmpeg -i english.wav -ac 1 -ar 16000 english.flac

      Note that I specified an audio sampling rate (option -ar) of 16000. You can use a different rate; in that case, specify the value you used in the HTTP header (e.g. rate=16000).

      If you don’t have ffmpeg, try Audacity.

  9. I was using the speech API with the following command for recording:
    arecord -q -f cd -t wav -d 4 -r 16000 | flac - -f --best --sample-rate 16000 -s -o test.flac;

    Somehow it stopped working a few days ago (it was working previously).
    But now I am using
    # try to record audio with sox
    rec -q -c 1 -r 16000 test.flac trim 0 5

    It helped me to solve my issues.

    Thanks

  10. Thanks for the tutorial and the shell script. It was very easy to use.
    Unfortunately, I cannot get this to work with my audio files. I have tested the shell script and the wget call with two different API keys, so I excluded the possibility that I am using the API incorrectly. I determined that the problem has to be the audio files. I had existing m4a files and I recorded a short file in Audacity. I converted the m4a file to flac with ffmpeg and saved the Audacity recording as flac; both files are at rate 16000. With the shell script I got the “Google was unable to recognize any speech in audio data” error, and with the wget command I got an empty response.

    I would like to use the Google speech API with existing audio for my project, so recording instructions or suggested tools like parecord and sox are not an option for me. Will I ever be able to use existing audio files with this API?

    Any information is greatly appreciated.

    • How many channels do your flac files have? You haven’t disclosed this.
      As you can see from other comments, a common problem is using the incorrect number of channels. Make sure you’re only using one channel files.

      • My files are stereo originally, but I convert them to mono. However, the file that I recorded in Audacity was mono from the beginning and I got the same error.

    • Hello,
      Just tried the script successfully with the key used in this article.

      flac files sent to Google must be mono (1-channel). To check whether a file is mono, type a command like:

      file audio.flac | grep mono
      # or
      ffmpeg -i rec.flac 2>&1 | grep mono

      To convert all m4a files in a directory into 16 kHz mono flac files, use something like:

      for f in *.m4a; do ffmpeg -i "$f" -ac 1 -ar 16000 "${f%.m4a}.flac"; done

      Cheers

      • This is what I get for the converted m4a file, for testing purposes, when I use the file command.
        > FLAC audio bitstream data, 16 bit, mono, 16 kHz, 963056 samples
        (audio is in Spanish and runs about a minute long)

        Here are the commands I ran:
        wget -q --post-file file-name.flac --header="Content-Type: audio/x-flac; rate=16000" -O - "http://www.google.com/speech-api/v2/recognize?client=chromium&lang=en_ES&key=mykey"

        ./speech-rec.sh -i file-name.flac --rate 16000 -l es_ES

        (I ran it with my actual key in place of mykey and the actual file name)

      • This is apparently due to the duration of your file: the API is meant for files up to about 15 seconds (according to my own tests; I’m not sure about the exact value).

        Try using part of the file instead of the whole file.
        If you think that splitting your long files into shorter files processed separately (e.g. each containing 2 or 3 sentences) will still be meaningful for what you want to do, have a look at auditok.
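
If fixed-length pieces are acceptable, the splitting could be sketched with ffmpeg as below. Everything here is an assumption made for illustration: the helper names, the 10-second chunk length in the usage line, and the chunk_N.flac output names. The caller passes the total duration in whole seconds. Unlike auditok, this cuts at fixed offsets and may split a word in two.

```shell
# Print the start offset (in seconds) of each chunk, one per line.
chunk_offsets() {
    local total="$1" step="$2" start=0
    while [ "$start" -lt "$total" ]; do
        printf '%s\n' "$start"
        start=$((start + step))
    done
}

# Hypothetical helper: cut FILE into mono 16 kHz flac chunks of STEP seconds.
# Arguments: file, total duration in whole seconds, chunk length in seconds.
split_flac() {
    local file="$1" total="$2" step="$3" start n=0
    for start in $(chunk_offsets "$total" "$step"); do
        n=$((n + 1))
        ffmpeg -loglevel error -y -ss "$start" -t "$step" -i "$file" \
               -ac 1 -ar 16000 "chunk_$n.flac" || return 1
    done
}

# Usage: split a 60-second file into 10-second chunks.
#   split_flac long.flac 60 10
```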

        Cheers

      • Sorry to report, but I have tried a 3 second file in one of my tests and just tried a 6 second audio file with the same error result. Does anyone have an idea why this is failing?

      • How strange! I don’t have a clue. If you can send me a sample that didn’t work, I can try it and maybe figure out the problem.

      • Just tested your file, good news, everything went well.

        Your file’s sampling rate is 44100 Hz (i.e. 44.1 kHz). You must indicate the right sampling rate in your command: if your file is not sampled at 16000 Hz, passing 16000 won’t work.
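
One way to avoid this kind of mismatch is to read the rate from the file instead of typing it. The following is a sketch, assuming ffprobe (shipped with ffmpeg) is installed; the helper names sample_rate_of and rate_matches are hypothetical.

```shell
# Print the sampling rate (in Hz) of the first audio stream of a file.
sample_rate_of() {
    ffprobe -v error -select_streams a:0 -show_entries stream=sample_rate \
            -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Succeed only if the file's real rate equals the rate you intend to declare.
rate_matches() {
    [ "$(sample_rate_of "$1")" = "$2" ]
}

# Usage: pass the detected rate straight to the script.
#   ./speech-rec.sh -i test-1-spanish.flac -r "$(sample_rate_of test-1-spanish.flac)" -l es_ES
```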

        Here is the command and its output:

        speech-rec.sh -i test-1-spanish.flac -r 44100 -l es_ES

        output:

        Recognition result:
        transcript: dónde puedo llevar a tu casa me dices y es peligroso
        transcript: no te puedo llevar a tu casa me dices y es peligroso
        transcript: dónde puedo llevar a tu casa me dices y si es peligroso
        transcript: no te puedo llevar a tu casa me dices y si es peligroso
        transcript: te puedo llevar a tu casa me dices y es peligroso

        Converting the file into a 16KHz flac and using 16000 instead of 44100 yielded the same result.

        ffmpeg -i test-1-spanish.flac -ar 16000 test-1-spanish-16KHz.flac

      • Thanks Amine, but it still didn’t work for me. Maybe it is our network here at the university.

        I wanted to share my temporary solution with others (temporary because I would like eventually to use the Google Speech API on our own website)
        If you want to transcribe longer audio files, you can use the Google Speech API Demo website: https://www.google.com/intl/en/chrome/demos/speech.html
        If you use Windows you can follow this tutorial: https://www.youtube.com/watch?v=XgPFOMQ2I64.
        On a Mac, download Soundflower instead of VAC: http://manual.audacityteam.org/man/tutorial_recording_computer_playback_on_mac.html
        Then (Mac) switch System Preferences > Sound Output and Input to Soundflower (2ch), play the audio and hit the microphone button on the Demo site.

      • I finally found out why I kept on getting error messages. As I had suspected, I had not enabled the correct API, but I couldn’t find the Speech API listed under the popular APIs on https://console.developers.google.com/. You have to search for the Speech API using the “Search all 100 + APIs” search field. Hopefully this will help someone else who is as confused as I was.

  11. Dear all,
    I had a tough time trying Google’s speech API. It is not working on my side, although I use
    – authorized account/API key
    – wget command: wget --post-file good-morning-google.flac --header="Content-Type: audio/x-flac; rate=16000" -v -O - "https://www.google.com/speech-api/v2/recognize?client=chromium&lang=en_US&key=KEY"
    – example flac file 16k mono (good-morning-google.flac) from https://github.com/gillesdemey/google-speech-v2/tree/master/audio

    The only (empty) result I get:
    ### code start
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [application/json]
    Saving to: `STDOUT'

    [ ] 0 --.-K/s "result":[]
    [ ] 14 --.-K/s in 0s

    2016-05-08 22:36:34 (174 KB/s) - written to stdout [14]
    ### code end

    also tried parsing with sed/awk with piping: ### | sed -e 's/[{}]/''/g' | awk -v k="transcript" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}' ###

    What do I miss? Was there a change (again)?

    // btw: i also tried curl, no response either :/

    Thanks for any suggestion on what I am missing!

    Best
    Chris

  12. (since my first post did not come through…)
    Hi there,
    I spent a busy weekend trying Google’s speech API. Unfortunately it is not working here, although I use
    – an authorized account/API key
    – wget command: wget --post-file good-morning-google.flac --header="Content-Type: audio/x-flac; rate=16000" -v -O - "https://www.google.com/speech-api/v2/recognize?client=chromium&lang=en_US&key=KEY"
    – an example flac file 16k mono (good-morning-google.flac) from https://github.com/gillesdemey/google-speech-v2/tree/master/audio

    The result I get is empty:
    ### start code
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [application/json]
    Saving to: `STDOUT'

    [ ] 0 --.-K/s "result":[]
    [ ] 14 --.-K/s in 0s

    2016-05-08 22:36:34 (174 KB/s) - written to stdout [14]
    ### end code

    also tried parsing with sed/awk (piping the output: | sed -e 's/[{}]/''/g' | awk -v k="transcript" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}' )

    What did I miss? Was there a change in the API (again)?

    // btw: also tried curl, no response either :/

    Thanks for any suggestion on what I am missing!

    Yours
    Chris

    • Hi Chris,

      1- A flac audio file sent to Google must be mono (one channel), not stereo. The file you’re using is stereo, with a sampling rate of 44100 Hz (i.e. 44.1 kHz).

      Here is an ffmpeg command you can use to convert the file to mono:

      ffmpeg -i good-morning-google.flac -ac 1 good-morning-google-mono.flac

      2- You have to declare the actual rate of your file in the Content-Type header, so use rate=44100 instead of rate=16000 (unless you convert the file to 16 kHz with the option -ar 16000 in the ffmpeg command).

      P.S. these issues have been discussed in previous comments

    • You can do this with the speech API. I implemented it a little while ago. I had to look at the Chromium source code to get a better idea on how the API should be used since it wasn’t publicly documented at the time. Unfortunately, the API must have changed a few months ago and my implementation doesn’t work now.

      What you’ll want to look into is the asynchronous API. https://cloud.google.com/speech/reference/rest/v1beta1/speech/asyncrecognize

      My implementation was in C++. Essentially, you open two connections – one is input and the other is for the results. When you open the connections you specify a unique key which pairs the connections at the remote end. I used libflac to compress the audio before sending it. The results would be received as JSON.

      I hope this information is a nice starting point for you.

  13. Hi ,
    Everything is working as expected for me, with only one issue: numbers. If I say “one two three four” the result is “1234”, and if I say “one thousand two hundred thirty four” the result is still “1234”. Another issue occurs with other languages, e.g. the German word “elf” means eleven. If you say “elf”, the result is 11 instead of elf. I know we have no control over the API, but are there any parameters or hacks we can add to force it to return only words? Nice tutorial, by the way.

    • Hello,

      To the best of my knowledge, there’s no such option/parameter to force the API to return words instead of digits. However, you can post-process the output using a third-party script. Here is an example script for English: http://www.unix.com/302935464-post6.html

      It’s for ordinal numbers but I think you can update it to handle cardinal numbers.
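
As a starting point for that kind of post-processing, here is a tiny sketch that spells out a digit string one digit at a time. It is an illustration only (spell_digits is a hypothetical helper, English only), and it cannot tell “one two three four” from “one thousand two hundred thirty four”, since both arrive as 1234.

```shell
# Hypothetical helper: replace every digit in the argument with its English name.
spell_digits() {
    printf '%s\n' "$1" | sed \
        -e 's/0/zero /g' -e 's/1/one /g' -e 's/2/two /g' -e 's/3/three /g' \
        -e 's/4/four /g' -e 's/5/five /g' -e 's/6/six /g' -e 's/7/seven /g' \
        -e 's/8/eight /g' -e 's/9/nine /g' -e 's/ *$//'
}

spell_digits 1234
```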

      Cheers

      • Thanks for the quick reply. I have already done this in my project; the only issue is that this method can’t distinguish between “one two three four” and “one thousand two hundred thirty four”, since in both cases it is trying to convert “1234” into words. This is an issue for me because this is a spelling app for kids, and what they usually want is the numbers from 1 to 10 in words. Anyway, thanks for the link. I have made a small workaround, not perfect, but it works to an extent.

  14. I have the same problem:
    ./speech-rec.sh -i file.flac -r 16000 -l de_DE

    I record my flac file with arecord:
    arecord -D plughw:1,0 -f cd -t wav -d 0 -q -r 16000 | flac - -s -f --best --sample-rate 16000 -o file.flac;
    But I get the error:
    “Google was unable to recognize any speech in audio data”

    Audio is mono (tested with aplay)
    Does anyone have an idea why that could be?

    • Hi,
      try using the script in record mode (don’t specify an input file):

      ./speech-rec.sh -d 4 -l de_DE

      You’ll find a .flac file in the current directory. This requires one of parecord or sox to run.

      • “Say something…

        rec FAIL formats: can’t open input `default’: snd_pcm_open error: No such file or directory
        Google was unable to recognize any speech in audio data”

      • What is wrong with my recording device? Is there somewhere to test an audio file for downloading?
        I have Audacity now installed. How can I start recording via ssh?

      • I have now made a recording with Audacity and it works 🙂
        But my question is still, how can I make a recording without GUI?
