Earlier this year, I published auditok, an audio segmentation tool, on GitHub. Audio segmentation, in its simplest form, lets us figure out where a sound starts and where it ends within an audio stream.
auditok uses the log energy of the raw audio signal to detect acoustic activities (Figure 1), but it cannot tell which class of sound (bird, speech, phone ring, etc.) a given acoustic activity corresponds to.
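To make the idea concrete, here is a minimal sketch of energy-based activity detection: compute the log energy of fixed-size frames and keep the runs of frames above a threshold. This is an illustration of the principle only, not auditok's actual implementation (auditok adds tolerances for short silences, minimum durations, and streaming input).

```python
import numpy as np

def log_energy(frame):
    """Log energy (dB) of one audio frame (array of samples)."""
    # small epsilon avoids log(0) on perfectly silent frames
    energy = np.sum(frame.astype(np.float64) ** 2) / len(frame)
    return 10 * np.log10(energy + 1e-10)

def detect_activity(signal, frame_len, threshold_db):
    """Return (start, end) sample indices of regions whose frames
    have log energy above threshold_db. A simplified sketch of
    energy-based segmentation."""
    regions = []
    active_start = None
    n_frames = len(signal) // frame_len
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        if log_energy(frame) >= threshold_db:
            if active_start is None:
                active_start = i * frame_len
        elif active_start is not None:
            regions.append((active_start, i * frame_len))
            active_start = None
    if active_start is not None:
        regions.append((active_start, n_frames * frame_len))
    return regions

# Synthetic signal: 1 s silence, 1 s of a 440 Hz tone, 1 s silence
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
signal = np.concatenate([np.zeros(sr), tone, np.zeros(sr)])
print(detect_activity(signal, frame_len=160, threshold_db=-30))
# → [(16000, 32000)]: only the tone is detected
```

An energy threshold like `-30` dB is the single knob of this approach, which is exactly why it cannot distinguish *what* is sounding, only *that* something is.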
Figure 1: example of auditok audio activity detection output
Recognizing the nature of the sounds an audio stream contains is an exciting and highly desirable capability, so a great deal of research has been devoted to this topic over the past decade.
I initially wanted to publish several introductory articles on this subject before presenting a practical application. However, after a recent experiment in which I used auditok to run a segmentation-by-classification test on audio streams, I ended up with an article in the form of an interactive Jupyter notebook. Segmentation by classification is a more advanced form of audio segmentation: not only does it detect the presence of audio activities, it also assigns them to audio classes (Figure 2).
Figure 2: example of an output when auditok is “tuned” with a GMM classifier
As you can see, segmentation by classification has clear advantages over energy-based segmentation. Despite its simplicity, energy-based segmentation cannot recognize the class of a sound, systematically misses low-energy audio activities (e.g. breathing), and treats adjacent audio activities as a single activity (see the end of the audio stream).
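The classification side can be sketched briefly as well. The figure above mentions a GMM classifier; a common scheme is to train one Gaussian mixture model per sound class on per-frame feature vectors, then label a segment with the class whose model gives the highest likelihood. The sketch below uses scikit-learn with synthetic 2-D "features" standing in for real audio features (the notebook uses actual acoustic features; the class names and cluster positions here are made up for illustration).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Synthetic per-frame feature vectors for two hypothetical classes.
# Real systems would use acoustic features (e.g. MFCCs) instead.
train = {
    "speech": rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    "whistle": rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2)),
}

# One GMM per class, fit on that class's training frames
models = {}
for label, feats in train.items():
    gmm = GaussianMixture(n_components=2, random_state=0)
    gmm.fit(feats)
    models[label] = gmm

def classify(segment_feats):
    """Label a segment by the class whose GMM yields the highest
    mean log-likelihood over the segment's frames."""
    scores = {label: gmm.score(segment_feats)
              for label, gmm in models.items()}
    return max(scores, key=scores.get)

# A test segment drawn near the "whistle" cluster
test_segment = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(20, 2))
print(classify(test_segment))
```

Scoring whole segments rather than single frames is what lets this approach merge frames into labeled activities, which is the "segmentation by classification" behavior shown in Figure 2.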
If you want to check out a static (read-only) version of the notebook, you can find it here.
If you want to try everything out yourself, visit this repository, where you will find the interactive notebook, installation instructions, and the training data and models.