Audio multi-class segmentation, a tutorial

Earlier this year, I published auditok, an audio segmentation tool, on GitHub. Audio segmentation, in its simplest form, lets us figure out where a sound starts and where it ends within an audio stream.

auditok uses the log energy of the raw audio signal to detect acoustic activities (Figure 1), but it cannot tell which class of sound (bird, speech, phone ring, etc.) a given acoustic activity belongs to.


Figure 1: example of auditok audio activity detection output
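To make the idea of energy-based detection concrete, here is a minimal sketch in Python 3. It is not auditok's actual code: the function names, the frame size of 160 samples and the 50 dB threshold are all assumptions chosen for illustration.

```python
import math


def log_energy(frame):
    """Return the log energy (in dB) of a frame of integer samples."""
    if not frame:
        return float("-inf")
    mean_square = sum(s * s for s in frame) / len(frame)
    # A small floor avoids log10(0) on pure digital silence.
    return 10.0 * math.log10(max(mean_square, 1e-10))


def detect_activities(samples, frame_size=160, threshold_db=50.0):
    """Return (start, end) sample indices of regions whose frames are
    all at or above the threshold (end is exclusive)."""
    regions = []
    start = None
    for pos in range(0, len(samples), frame_size):
        active = log_energy(samples[pos:pos + frame_size]) >= threshold_db
        if active and start is None:
            start = pos                    # an activity begins here
        elif not active and start is not None:
            regions.append((start, pos))   # the activity ended on the previous frame
            start = None
    if start is not None:                  # activity runs until the end of the stream
        regions.append((start, len(samples)))
    return regions
```

Note that such a detector only answers "is there sound here?". Two different sounds that follow each other without a quiet frame in between come out as a single region.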

Since recognizing the nature of the sounds that an audio stream contains is a very exciting idea and a very desirable feature, much research on this topic has been carried out over the past decade.

I initially wanted to publish several introductory articles on this subject before bringing out a practical application. However, after a recent experiment in which I used auditok to try segmentation by classification of audio streams, I ended up with an article in the form of an interactive Jupyter notebook. Segmentation by classification is a more advanced form of audio segmentation: not only does it detect the presence of audio activities, it also sorts them into audio classes (Figure 2).


Figure 2: example of an output when auditok is “tuned” with a GMM classifier

As you can see, segmentation by classification has several advantages over energy-based segmentation. Energy-based segmentation is simple, but it can’t recognize the class of a sound, systematically misses low-energy audio activities (e.g. breath), and treats adjacent audio activities as one single activity (see the end of the audio stream).

If you want to check out a static version (read only) of the notebook, you can find it here.

If you want to try out everything yourself, please visit this repository, where you will find the interactive notebook, installation instructions, as well as training data and models.


Smile detection with OpenCV, the “nose” trick

Object detection and recognition in pictures was the very first door I went through into the fascinating world of machine learning and artificial intelligence. That was many years ago. Back then, I worked on a face detection and recognition project and discovered what I consider one of the most cutting-edge algorithms in machine learning: the Viola and Jones object detection algorithm.

The algorithm, initially designed for face detection, has been used with a variety of objects, including buildings and cars.

Some time ago, I tried OpenCV’s integrated cascade model for smile detection, just for fun and to see how it works. I was a bit disappointed by the huge number of false detections (smile detections where there is no smile or no face at all). My first thought was naturally to detect faces in a picture and then feed the smile detector with the detected faces only.

This solution noticeably reduced the false alarms caused by the smile detector. Unfortunately, my glasses were still detected as two smiling mouths, not to mention the false alarms generated by the face detector itself.

I then had a look at OpenCV’s available cascade models and, guess what, there is a rather good model for nose detection. Nobody would dispute that the mouth is located under the nose. This is why my final recipe, which achieved the goal of greatly limiting the number of false alarms, is the following:

  1. Detect all faces on a picture
  2. Detect the nose on each face and discard faces with no nose
  3. If many noses are detected on one face, choose the lowest one as a split point
  4. Try detecting a smile on the lower half of the face starting from the center of the nose


Nothing illustrates the above recipe better than a piece of code:
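What follows is a minimal Python 3 sketch of the four steps, not the original listing. The helper names (lowest_nose, mouth_region, detect_smiles) and the detectMultiScale parameter values are assumptions. The three classifiers are passed in as arguments, and the nose cascade path shown in the trailing comment is a placeholder: depending on your OpenCV version, that file may not ship with the main package.

```python
def lowest_nose(noses):
    """Step 3: among several nose rectangles (x, y, w, h), keep the one
    whose center is lowest in the picture (largest y)."""
    return max(noses, key=lambda r: r[1] + r[3] / 2.0)


def mouth_region(face, nose):
    """Step 4: return (x, y, w, h) of the part of the face located below
    the nose center. `face` is in image coordinates, `nose` is relative
    to the top-left corner of the face."""
    fx, fy, fw, fh = face
    nx, ny, nw, nh = nose
    split = ny + nh // 2              # vertical position of the nose center
    return (fx, fy + split, fw, fh - split)


def detect_smiles(gray, face_cc, nose_cc, smile_cc):
    """Return a list of (face_rectangle, is_smiling) for a grayscale image.
    The classifiers are cv2.CascadeClassifier objects."""
    results = []
    # Step 1: detect all faces (scaleFactor=1.1, minNeighbors=5 are guesses)
    for (fx, fy, fw, fh) in face_cc.detectMultiScale(gray, 1.1, 5):
        face_roi = gray[fy:fy + fh, fx:fx + fw]
        # Step 2: detect noses and discard faces with no nose
        noses = nose_cc.detectMultiScale(face_roi, 1.1, 5)
        if len(noses) == 0:
            continue
        rx, ry, rw, rh = mouth_region((fx, fy, fw, fh), lowest_nose(noses))
        smiles = smile_cc.detectMultiScale(gray[ry:ry + rh, rx:rx + rw], 1.1, 20)
        results.append(((fx, fy, fw, fh), len(smiles) > 0))
    return results


# Possible usage (requires OpenCV; file locations are assumptions):
#   import cv2
#   face_cc = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
#   smile_cc = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_smile.xml")
#   nose_cc = cv2.CascadeClassifier("path/to/your/nose_cascade.xml")  # placeholder path
#   gray = cv2.cvtColor(cv2.imread("picture.jpg"), cv2.COLOR_BGR2GRAY)
#   print(detect_smiles(gray, face_cc, nose_cc, smile_cc))
```

Restricting the smile detector to the region below the nose is what keeps glasses and eyebrows out of its reach.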


On the use of Google’s Speech Recognition API Version 2

I find that Google’s Speech Recognition service is just as great as many of its other services and tools like Google Translate, Gmail etc. (though I’m not a big fan of Google’s privacy policy).

In recent years, there have been some posts about the use of Google’s speech recognition API version 1. Well, to be more accurate, what is called a speech recognition API here is just a “hack” that sends audio data to Google as if it were sent by a user surfing with Chrome or Chromium.

Update (22-02-2016): if you’re using the wget command you found on this page, please consider using https instead of http for better privacy.


Understanding the reverse engineering behind the use of Google’s Speech Recognition API

Now that version 1 has been replaced by version 2, this hack had to evolve, since Google now requires a developer key in order to use the service. In this article I will briefly explain how to use the current version and supply a bash script to record an utterance, send it to Google, and print the N-best hypotheses (the N most likely utterances carried by the audio data).

So to get a reply from Google, we have to send an audio file as an HTTP packet that requests this page:

with the following key=value pairs as URL parameters:

  • client=chromium
  • lang=language (where language is en_US for American English, fr_FR for French, de_DE for German, es_ES for Spanish etc.).
  • key=a_developer_key

and the following HTTP header:

  • Content-Type: audio/x-flac; rate=file_sampling_rate (where file_sampling_rate is the sampling rate of the file; 8000, 16000, 32000 and 44100 are all valid values, but not the only possible ones).

The final URL should look like this:

Before we try our first example, there are several things that should be kept in mind:

  • This code is only for personal use and for test purposes.
  • The developer key is granted by Google and may be revoked any time.
  • The use of the free key is limited by Google to 50 requests a day for a given user.
  • Google only accepts audio files in the flac format.

A first example with wget

Record an utterance (preferably in English, unless you replace en_US in the wget command with a language of your choice) with your favorite audio recorder and save the file as a 16000 Hz FLAC file. To do so on a Unix-like system, you can either use parecord:

parecord record.flac --file-format=flac --rate=16000

or rec (requires SoX):

rec -r 16000 -c 1 record.flac

In either case, use Ctrl+C to stop recording. If you don’t know what to use, Audacity is a good choice. Then type the following command, replacing record.flac with the name of your file (and probably the values of rate and key too):

wget -q --post-file record.flac --header="Content-Type: audio/x-flac; rate=16000" -O - ""

If Google was able to recognize any speech within the file, you should get a JSON answer with the best hypotheses (i.e. possible transcriptions of the speech data you sent). Otherwise, it returns an empty result array. In case of success, the best hypothesis appears first in the array, usually along with its confidence.

Note that Google kindly returns a reply even if your user agent is wget. If you run into trouble with this, try using -U Chromium/39.0.2171.71 as a wget option.
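For readers who prefer Python to wget, the same request can be sketched with the standard library (Python 3). This is only an illustration under stated assumptions: build_recognition_request is a hypothetical helper, and the endpoint argument is a placeholder for the page address mentioned above, which you have to supply yourself.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_recognition_request(endpoint, flac_bytes, key, lang="en_US", rate=16000):
    """Build (but do not send) the HTTP POST request described above."""
    # URL parameters: client, lang and key
    query = urlencode({"client": "chromium", "lang": lang, "key": key})
    # The sampling rate of the FLAC data goes into the Content-Type header
    headers = {"Content-Type": "audio/x-flac; rate=%d" % rate}
    return Request(endpoint + "?" + query, data=flac_bytes,
                   headers=headers, method="POST")


# Possible usage (endpoint and key are placeholders):
#   from urllib.request import urlopen
#   with open("record.flac", "rb") as f:
#       req = build_recognition_request(endpoint, f.read(), key="a_developer_key")
#   print(urlopen(req).read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to print req.full_url and check the parameters before any audio leaves your machine.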

Getting a developer key

The key we are going to use was generated a short time before this article was posted. If it doesn’t work, follow this link for key generation instructions.

Here is the key used above: AIzaSyAcalCzUvPmmJ7CZBFOEWx2Z1ZSn4Vs1gg

An All-in-One script

I packaged everything into one bash script that lets the user interact with Google’s SR. The script records a couple of seconds of audio data, sends the recorded stream to Google for transcription, and filters the result so that the sorted best hypotheses are printed one per line. The first two lines should contain the best hypothesis and its confidence, respectively. The following lines contain alternative hypotheses. If you are not familiar with speech recognition systems, you might be a bit surprised to find out that the perfect output (if there is any) is not always what the system considers the best hypothesis but one of the alternatives!
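The filtering step can also be sketched in Python 3. The shape of the reply used here is an assumption, not something documented in this article: each non-empty line of the reply is taken to be a JSON object with a "result" array, whose entries carry an "alternative" array of objects with a "transcript" and an optional "confidence". The function name n_best is hypothetical too.

```python
import json


def n_best(reply_text):
    """Return a list of (transcript, confidence) pairs, best hypothesis first.
    confidence is None when the reply does not provide one.
    ASSUMPTION: one JSON object per line, shaped as described above."""
    hypotheses = []
    for line in reply_text.splitlines():
        line = line.strip()
        if not line:
            continue
        for result in json.loads(line).get("result", []):
            for alt in result.get("alternative", []):
                hypotheses.append((alt.get("transcript", ""), alt.get("confidence")))
    return hypotheses
```

An empty result array simply yields an empty list, which is the "no speech recognized" case.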

You can get hold of the script from this repository (clone the repository or download it as a Zip file, or simply copy/paste the code into a file).


To run the wget command, you need a running Unix or Linux distribution. To run the script you also need parecord and timeout to be installed on your system.

Some examples

These are also available on

Send a flac audio file and get its transcription

./ -i record.flac --rate 16000

Use a language other than en_US (default, use --language or -l option)


./ -i record.flac --rate 16000 --language fr_FR


./ -i record.flac -r 16000 -l es_ES

Record your voice for 3 seconds and get what you’ve said


Talk for 7 seconds (default=3, use --duration or -d option)

./ -d 7

Note that if you don’t supply an audio file as an argument, the script writes the recorded audio data into a file named record.flac and doesn’t delete it afterwards so that you can listen to what you said. To play the audio file, type:

paplay record.flac

or, with SoX:

play record.flac


A Python API that nearly does the same thing:

Create a podcast from your media files

RSS and Atom feeds, originally meant for HTML content aggregation, have been widely used for multimedia content distribution. When a feed is used to broadcast multimedia files, it is referred to as a podcast. You might yourself be an avid consumer of podcasts, or you might have heard the term without knowing exactly what it is. In either case, you might be interested in creating your own podcast, be it to use your favorite podcast client to play your media files or to share content with others.

In this article, we’ll go beyond the “consumption” of podcasts and learn how to turn your media file library (or part of it) into an RSS feed (that is, a podcast) and use it within the scope of your home network.


What is a podcast?

Many TV and radio stations make their programs available on-line for a certain period of time after the first broadcast, so that their audience can watch, re-watch or listen to the favorite programs they missed any time they please. With smart-phones and tablets, the use of podcasts has become way more fun.

Well, this is actually (and fortunately) not the only reason for a podcast to exist. Podcasts can emerge from and cover a wide variety of uses and themes, from computer hacks to yoga, not to forget language courses and college lectures. In fact, you don’t need to own a radio station to broadcast your yummy recipes.

What are RSS and Atom?

Suppose you have a number of articles, videos or audio files that you want to make available to people who may be interested in them. One solution is to drop your content into a directory on your website and share the link to that directory. This works but is not very brilliant, because people will have to visit your page regularly in search of new content or risk missing your latest article. Moreover, any webmasters wishing to include your incredibly interesting articles (or links to them) in their own website(s) will be brassed off by your constantly changing content.

One way to overcome this is to present your content in a way that computers can understand and manage efficiently, without “human” intervention. The RSS (Really Simple Syndication) and Atom formats are used to create a computer-comprehensible interface to your content, so that an RSS/Atom client (also known as a feed reader) can automatically fetch your recent publications as you put them on-line.

Using RSS

In a nutshell, RSS and Atom use a standard XML file, the so-called feed, to represent the content. As they both use plain text in XML format, the feeds are also readable and editable by humans. There are no substantial differences between RSS and Atom. In this article (and code), we only use RSS 2.0. For more information about feed formats, you can read this article.

Now, here is an example of an RSS 2.0 feed containing two items (an item is an article, a web page, a media file, etc.) hosted on a web site called

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
   <channel>
      <title>Healthy Cooking</title>
      <description>This podcast is about healthy recipes</description>
      <item>
         <title>How to add olive oil to almost everything you cook</title>
         <description>Did you know that olive oil can be added to almost everything you cook...</description>
         <pubDate>Sun, 9 Nov 2014 12:00:00 -0000</pubDate>
         <enclosure url="" type="audio/mpeg" length="40946"/>
      </item>
      <item>
         <title>A salad with avocado and honey</title>
         <description>Have you ever tried a half avocado with a bit of pure bee honey</description>
         <pubDate>Tue, 23 Dec 2014 13:07:20 -0000</pubDate>
         <enclosure url="" type="audio/mpeg" length="61358"/>
      </item>
   </channel>
</rss>

As we can see, a feed contains a channel made up of a title, a description, a link and a sequence of item tags. Each item tag hosts the following elements:

  • guid: a unique identifier of the item. It can be the URL of the item.
  • link: URL of the item.
  • title: a short text to whet the appetite of the users (or to save their time!).
  • description: a longer text for the potentially interested users.
  • pubDate: date of publication of the item in RFC-822 format.
  • enclosure: some podcast clients won’t download the media file unless you explicitly set it as an enclosure.

Note that some of these elements are not mandatory, but they can greatly improve the experience of using the podcast. Take the pubDate tag for instance. If you don’t supply it, a podcast reader may organize the items in a way you don’t expect (or like).
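To see how these elements fit together, here is a minimal Python 3 sketch that builds one item with the standard xml.etree.ElementTree module. It is not the code used by genRSS: make_item and its parameters are hypothetical, and reusing the link as the guid is just one of the options mentioned above.

```python
import xml.etree.ElementTree as ET


def make_item(link, title, description, pub_date, mime_type=None, length=0):
    """Build an RSS 2.0 <item> element.
    An <enclosure> is only added when a MIME type is given."""
    item = ET.Element("item")
    ET.SubElement(item, "guid").text = link       # the URL doubles as unique id
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "description").text = description
    ET.SubElement(item, "pubDate").text = pub_date
    if mime_type is not None:
        ET.SubElement(item, "enclosure", url=link,
                      type=mime_type, length=str(length))
    return item


# Possible usage (the URL is a made-up example):
#   item = make_item("http://example.com/media/oil.mp3", "Olive oil", "A recipe",
#                    "Sun, 09 Nov 2014 12:00:00 -0000", "audio/mpeg", 40946)
#   print(ET.tostring(item, encoding="unicode"))
```

Letting an XML library write the tags has one practical benefit: characters such as & or < in titles and descriptions are escaped automatically.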

The core of this article is a tool (genRSS) that can do (almost) everything for you and generate a podcast feed from a directory that contains your media files. But before we go any further, I am going to try to answer the unavoidable question:

Why do I need/want to create my own podcast?

Here are a few reasons:

  • On your smart-phone, you’d rather use your nice podcast reader to play your audio files than a media player.
  • You have a bunch of FSI language courses with no mp3 tags and you want to use them correctly on your computer or mobile device (mp3 tags such as artist, album, year, etc., are used by media players to organize your media library so you can use it effectively).
  • You have very limited space on your mobile device and you seek a way to stream your videos from your computer to the device (like Video On Demand) using your home wifi.
  • You want to share some content with your co-tenants, friends or family members without juggling with a USB stick.
  • Use your imagination!

What do you need?

  • Python 2.7 or higher.
  • A home wifi connection. If you don’t have a wifi network, you can still use your computer to test everything locally.
  • A web server. Don’t worry if you don’t have an installed and configured web server on your machine. If you have Python, you almost have a basic web server.
  • A couple of media files.

Download the code and test it

Download the code from (click on Download ZIP): and extract the zip file into a directory on your file system. You can also clone the code from the repository:

mkdir genRSS
git clone genRSS

Try the python web server

In order to expose any content to other machines, you need a protocol. Podcasts are normally distributed over the web protocol, HTTP. So to make a podcast available to other devices, you will need to turn your machine into a web server. In the future, you may need to install a full-fledged web server such as Apache or Lighttpd on your machine. To keep things simple, everything we need in this article can already be fulfilled by the script.

To launch the server, type:


or simply run this command (found on

python -m SimpleHTTPServer 8080

Then open a new tab and type (or simply click on) http://localhost:8080

If everything went well, you should see a web page that lists links to the files and subdirectories of the current directory.
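The SimpleHTTPServer module belongs to Python 2. On Python 3 the equivalent command is python3 -m http.server 8080. If you would rather start the server from your own code, here is a minimal Python 3.7+ sketch; start_server is a hypothetical helper, not part of genRSS.

```python
import functools
import http.server
import threading


def start_server(port=8080, directory="."):
    """Serve `directory` over HTTP in a background thread and return the
    server object (call .shutdown() on it to stop serving)."""
    handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                                directory=directory)
    # An empty host name means "listen on all network interfaces", so that
    # other devices on the home network can reach the server.
    httpd = http.server.ThreadingHTTPServer(("", port), handler)
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd


# Possible usage:
#   httpd = start_server(8080, "test/media")
#   input("Serving on port 8080, press Enter to stop...")
#   httpd.shutdown()
```

This built-in server is meant for testing on a trusted home network, not for exposing files to the Internet.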

localhost here is the name of your machine and 8080 is the port that our server is listening at (you can modify the source if you want to use another port). In order for other devices sharing the same network to reach the server, they need to know the IP address of your machine. Go to your connection information panel and copy your IP address. If your IP address is you can reach your server with the following address:

Generate the first podcast

genRSS is delivered with a directory (test/media) that contains empty media files and subdirectories. We will use it here to test

Here is the simplest command to generate a podcast feed from files in test/media:

python -d test/media --host localhost:8080 -o feed.rss

If you open http://localhost:8080/feed.rss with your browser you will find the content of the newly generated feed (1.mp3, 1.mp4, 1.mp4, 1.ogg, 2.MP3).

This is a really basic feed. You may want your podcast to:

  • Have a title and a description
python -d test/media --host localhost:8080 --title "A simple podcast" --description "This is the description" -o feed.rss


python -d test/media -H localhost:8080 -t "A simple podcast" -p "This is the description" -o feed.rss

You can now refresh the page and see the change.

  •  Include all files in subdirectories
python -d test/media --recursive -H localhost:8080 -t "A simple podcast" -p "This is the description" -o feed.rss
  •  Only include files with a given extension(s)
python -d test/media -e mp3,ogg -H localhost:8080 -t "A simple podcast" -p "This is the description" -o feed.rss
  •  Use an IP address instead of localhost
python -d test/media -H -t "A simple podcast" -p "This is the description" -o feed.rss
  • Add an image
python -d test/media -H -i images/logo.jpg -t "A simple podcast" -p "This is the description" -o feed.rss

Obviously, images/logo.jpg must be visible to the web server (i.e. located in the same directory as You can also use an image from the web by supplying its full http or https URL to the -i option.

Check your feed with a validator

If you want to check the validity of your feed, you can use the W3C feed validator (copy the text of your feed and paste it into the text area). Don’t worry if you get a warning for a “missing atom:link”.

Test your podcast on a mobile device

Get some real media files, put them in a directory where is located and generate a podcast for the directory. On a tablet or a smart-phone connected to the same network as your local machine, open a web browser and type your machine’s IP address plus :8080. You should be able to see the content of the server. Copy the link of your feed, open a podcast reader and add the link as a new feed. If the reader has no trouble finding the podcast, you can start streaming or downloading your media files.

Comments on the code

I think the code is simple and fairly well documented. Here are, however, a couple of comments that can be helpful for the interested developer.

Test the code

The code is tested with doctest. If you wish to re-run the test, open, activate the test mode (TESTRUN = 1), and type:

python -v


  • getFiles(dirname, extensions=None, recursive=False): crawls a given directory looking for files and returns a list of relative paths. extensions is a list of strings used to restrict the files to the desired set of case-insensitive extensions.
  • buildItem(link, title, guid = None, description="", pubDate=None, indent = "   ", extraTags=None): builds an RSS 2.0 item and returns it as a string (newline characters included).
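To give an idea of what the file crawling involves, here is a hypothetical re-implementation in Python 3 based only on the description above. It is not the author's getFiles: the name list_media_files, the choice of returning paths relative to dirname, and the sorting by name are assumptions.

```python
import os


def list_media_files(dirname, extensions=None, recursive=False):
    """Return the sorted paths (relative to dirname) of the files found in
    dirname, optionally restricted to case-insensitive extensions."""
    wanted = None
    if extensions is not None:
        wanted = {e.lower().lstrip(".") for e in extensions}
    found = []
    for root, _dirs, files in os.walk(dirname):
        for name in files:
            ext = os.path.splitext(name)[1].lower().lstrip(".")
            if wanted is None or ext in wanted:
                full_path = os.path.join(root, name)
                found.append(os.path.relpath(full_path, dirname))
        if not recursive:
            break   # os.walk yields dirname itself first: stop after it
    return sorted(found)
```

Lower-casing both the requested extensions and the ones found on disk is what makes a file such as 2.MP3 match the extension mp3.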

How items are sorted in the feed

If the -C (or --sort-creation) command line option is given, items appear in the podcast sorted by their date of creation (newest item on top). Otherwise, we assume that items should be sorted by name.

This is however problematic because we need to supply a pubDate tag for each item so that a podcast reader sorts them the way we want. The code therefore pretends that the first item in a list sorted by name is published right now (at the time the feed is created), the second item n minutes and f1 seconds ago, the third item 2*n minutes and f2 seconds ago, and so on. In the code, n=1 and f1, f2, … fk are random numbers of seconds between 0 and 10.

import os
import random
import time

fileNames = getFiles(dirname.encode("utf-8"), extensions=opts.extensions, recursive=opts.recursive)
if opts.sort_creation:
    # sort by creation time, newest item first
    pubDates = [os.path.getctime(f) for f in fileNames]
    sortedFiles = sorted(zip(fileNames, pubDates), key=lambda f: -f[1])
else:
    # keep the order by name: item d is dated d minutes (plus 0 to 10 s) ago
    now = time.time()
    pubDates = [now - (60 * d + (random.random() * 10)) for d in xrange(len(fileNames))]
    sortedFiles = zip(fileNames, pubDates)

Enclosures for audio and video items

As mentioned above, we need to add an enclosure tag to an item so that a client downloads it. This is particularly useful for audio and video files. To check the type of a file we use the mimetypes package:

import mimetypes
mtype = mimetypes.guess_type(fname)[0]
# guess_type returns None as the type when the extension is unknown
if mtype and ("audio" in mtype or "video" in mtype):
    # generate an enclosure tag
    pass

RFC-822 dates

pubDates must follow the RFC-822 format. Here are the lines of code that do so:

import time
now = time.time()
# "-0000" is a zero UTC offset, so format the UTC time rather than the local time
time.strftime("%a, %d %b %Y %H:%M:%S -0000", time.gmtime(now))
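As an alternative sketch for Python 3, the standard library function email.utils.formatdate produces the same kind of date and does not depend on the current locale (the %a and %b directives of strftime do). The wrapper name rfc822_date is hypothetical.

```python
from email.utils import formatdate


def rfc822_date(timestamp):
    """Format a Unix timestamp as an RFC-822 style date expressed in GMT."""
    return formatdate(timestamp, usegmt=True)


# Example: the Unix epoch
print(rfc822_date(0))   # Thu, 01 Jan 1970 00:00:00 GMT
```

The only visible difference from the strftime version is the time zone, which is written as GMT instead of a numeric offset.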


Code’s repository:

RSS and Atom tutorials:

Feed validator:

RFC-822 date format: (see section 5)