Abstract
Recorded bird songs in various audio file formats are converted into spectrograms and preprocessed to identical frequency spectra and recording lengths (Fig. 1).
Fig. 1: Filtered magnitude spectrogram (dB scale) of a Parus major. The bird's frequencies are shown in Hertz (y-axis) over time (x-axis) and are colour-coded by dB. The frequency spectrum covers the range between 1024 and 8192 Hz.
Database: https://xeno-canto.org
The applied model, selected from several tested models (customised, simple Transformer, simple CNN, improved CNN), was trained, tested and validated on a training dataset (a certain percentage of the selected data). After 12 of 70 epochs, the model achieves an accuracy of 70.96 %.
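A spectrogram like the one in Fig. 1 can, for instance, be produced with librosa; the file name and parameter values below are illustrative, not the project's exact settings:

import librosa
import numpy as np
import matplotlib.pyplot as plt

# Illustrative input file; any xeno-canto recording converted to .wav works here.
y, sr = librosa.load("parus_major.wav", sr=44100, mono=True)

# Magnitude spectrogram via STFT, converted to the dB scale.
S = np.abs(librosa.stft(y, n_fft=512, hop_length=512))
S_db = librosa.amplitude_to_db(S, ref=np.max)

# Keep only the frequency bins between roughly 1 and 8 kHz (the relevant bird range).
freqs = librosa.fft_frequencies(sr=sr, n_fft=512)
S_db = S_db[(freqs >= 1024) & (freqs <= 8192), :]

plt.imshow(S_db, origin="lower", aspect="auto")
plt.xlabel("time frames")
plt.ylabel("frequency bins (approx. 1-8 kHz)")
plt.colorbar(label="dB")
plt.show()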
Data size: 2,722 audio samples encompass roughly 55 GB → the storage space in Google Colab is too small and the RAM is overloaded
Solution approach: upgrading to Google Pro and Google Colab Pro, using the cluster of Jan's employer, and limiting this project to 10 native bird species
HTTP timeouts while downloading data
Solution approach: the dataset was easily downloaded via the xeno-canto API
Conversion of undefined file formats to .wav audio files
Solution approach: librosa and other packages are available for converting .mp3/.mp4 to .wav files (see the sketch after this list)
Handling of 2-channel audio files
Solution approach: converting to a mono channel
Binning
Solution approach: an implemented function calculates the relevant bins based on the most relevant bird frequencies, which range from 1 to 8 kHz
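As referenced above, a minimal sketch of the format conversion and mono downmix with pydub (file names are illustrative):

from pydub import AudioSegment

# Illustrative paths; in the project the .mp3 files come from the xeno-canto download.
sound = AudioSegment.from_file("recording.mp3", format="mp3")

# 2-channel recordings are downmixed to a single (mono) channel.
sound = sound.set_channels(1)

# Export to .wav so that the preprocessing can handle all recordings uniformly.
sound.export("recording.wav", format="wav")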
Introduction
Climate change and changes in ecosystems are becoming increasingly widespread and are also affecting bird life. Spatial and temporal variation in bird diversity is often used as a benchmark for assessing environmental change. Traditionally, such data has been collected by specialised experts. Increasingly, this acoustic data is also being collected by laypeople, for example when the state calls on people to count the birds in their own gardens. Our deep learning algorithm is intended to make the exact determination of bird abundance from audio recordings (especially recordings made with non-professional equipment) reliable, simpler and more accurate. The Shiny app to be developed for this purpose will make these identifications directly retrievable in a user-friendly manner.
Methods
We implement a Residual Neural Network (ResNet), which has performed well in image detection and identification and has shown a comparatively low error rate. In addition, this model permits additional layers/parameters. However, this benefit can turn into a disadvantage in the form of saturation, which is caused not by over-fitting but by the network initialisation, the optimisation function, or by vanishing/exploding gradients (Deep Residual Networks (ResNet, ResNet50) 2024 Guide – viso.ai). This phenomenon is resolved by including so-called skip connections (Fig. 2).
Fig. 2: Skip connection
As the term implies, a layer is skipped and the one after it is activated; the skipped layers are stacked and the network then learns and adapts them as a residual mapping. The advantage of adding these skip connections is that any layer that impairs the performance of the network can be skipped by regularisation. As a consequence, a deep neural network can be trained easily without running into vanishing/exploding gradients. Another way of circumventing vanishing gradients is offered by the Transformer network, which is built entirely on the attention mechanism. Instead of sequential computation it applies parallelisation, using multi-head attention in the "encoder-decoder attention" layer, self-attention in the encoder layers and, similarly, in the decoder layers (Fig. 3, Vaswani et al., "Attention is all you need", 2017).
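Before turning to the Transformer (Fig. 3), here is a minimal PyTorch sketch of a block with such a skip connection; the channel sizes and layer choices are illustrative, not the project's actual architecture:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Two conv layers whose output is added back to the input (the skip connection).
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x                       # the shortcut carries the input forward
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity               # skip connection: add the input back
        return self.relu(out)

# e.g. a batch of 8 spectrogram feature maps with 16 channels of size 64 x 64
x = torch.randn(8, 16, 64, 64)
y = ResidualBlock(16)(x)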
Fig. 3: Model architecture of the Transformer. The Transformer follows this overall structure of stacked self-attention and point-wise, fully connected layers for both the encoder and decoder (left and right halves, respectively). The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization. Hence, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. All sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512. The decoder is also composed of a stack of N = 6 identical layers. In addition, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. The self-attention sub-layer in the decoder stack is modified to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for a position i can depend only on the known outputs at positions less than i (Vaswani et al., 2017).
Transformers were originally developed for machine translation, where the encoder encodes the source sentence and the decoder performs the translation. The Transformer processes whole sequences in parallel, making it faster and better than sequentially running operations.
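A minimal sketch of this parallel self-attention applied to spectrogram frames, using PyTorch's nn.MultiheadAttention; all dimensions are illustrative:

import torch
import torch.nn as nn

# Treat the spectrogram as a sequence: one "token" per time frame
# (batch of 8 spectrograms, 512 frames, 256 frequency features each).
frames = torch.randn(8, 512, 256)

attention = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

# Self-attention: queries, keys and values all come from the same sequence,
# and all frames are processed in parallel rather than step by step.
out, weights = attention(frames, frames, frames)
print(out.shape)      # torch.Size([8, 512, 256])
print(weights.shape)  # torch.Size([8, 512, 512]) - attention of every frame to every other frame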
Libraries and packages used for model implementation:
pytorch audio
sklearn
numpy
pytorch vision
seaborn
IPython
pandas
The following packages were used for format conversion, automatic download, and data preprocessing:
librosa
numpy
matplotlib
pydub
AudioSegment
tqdm
Steps for the underlying approach/procedure:
automatic download from the platform xeno-canto (see the sketch after this list)
conversion from .mp3 to .wav format and subsequently into spectrograms
visualisation of spectrograms
Model training & testing
Model validation
incorporating other, new datasets recorded with different audio devices
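As referenced in the first step, a sketch of the automatic download, assuming the public xeno-canto API v2 and its JSON fields "recordings", "id" and "file"; the species query and the limit of 10 files are illustrative:

import requests
from tqdm import tqdm

# Query the (assumed) xeno-canto API v2 for recordings of one species.
query = "Parus major"
response = requests.get("https://xeno-canto.org/api/2/recordings", params={"query": query})
recordings = response.json()["recordings"]

for rec in tqdm(recordings[:10]):              # limited to 10 files for the example
    audio = requests.get(rec["file"])          # direct download link of the recording
    with open(f"{rec['id']}.mp3", "wb") as f:  # xeno-canto serves .mp3 files
        f.write(audio.content)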
Results
The ResNet-based model predicts a bird species from its spectrogram (converted first from an .mp3 audio file to .wav format) with 52.39 % accuracy. Recognising only 52 % of species correctly falls short of the envisaged result. Possible reasons and corresponding suggestions for improvement are, first, shortening the spectrogram length; second, cutting sequences of noise or background instead of padding with "0"; and third, adding further features/layers/parameters. Switching the algorithm to the "custom" model and shortening the spectrogram length to 5.94 s raised the validation accuracy by roughly 20 percentage points to 70.96 % (Fig. 4).
Fig. 4: Training efficiency. With a sample length of 512, a hop length of 512 and a sample rate of 44,100 Hz (resulting in a recording length of 5.94 s), the validation accuracy peaks at 70.96 % after 12 of 70 epochs.
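For reference, the 5.94 s clip length follows from these parameters, assuming 512 spectrogram frames with a hop length of 512 samples: 512 × 512 / 44,100 Hz ≈ 5.94 s.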
The difference between the two deep learning architectures lies in the implementation of a transformer. The parallel operations and the self-attention architecture explained above enable, besides faster computation and a more efficient and effective use of memory capacity, better and more reliable predictions. In this way, training time as well as drops in performance caused by long dependencies can be reduced. Further, positional embeddings replace recurrence by using fixed or learned weights that encode information about the specific position of a token in a sentence (for translation) or, in this case, in a spectrogram.

Note that the model is trained with only 10 German domestic species. Very closely related species not considered here, such as members of the tit family (in Germany alone there are 30 different tit species), will probably not be identified or discriminated from the tits typically found in Germany.

From a technical perspective, the downsides are the strong computing power and storage capacity required for this kind of deep learning algorithm and this abundance of data. Only powerful computing systems that also offer a large amount of RAM enable reliable computation, and such systems are harder to access for private use and, if available, come at additional financial cost.
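A minimal sketch of such fixed (sinusoidal) positional encodings, following Vaswani et al. (2017); the number of frames and the embedding dimension below are illustrative:

import torch

def positional_encoding(num_positions, d_model):
    # Fixed sinusoidal positional encodings as described by Vaswani et al. (2017).
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # (num_positions, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)                 # even embedding dimensions
    angles = pos / torch.pow(10000.0, i / d_model)                       # (num_positions, d_model / 2)
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# e.g. one encoding per spectrogram time frame (512 frames, 256-dimensional embedding)
pe = positional_encoding(512, 256)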
GitHub Repository
https://github.com/KathNeus/Birdsongs_DeepLearning_TechLab_2024
Team & Roles
Jan Pleie
Code for the model (training, testing, network, dataset building), code commenting
Moana Ritterbecks
idea for the project, dataset acquisition
Laurin Strobel
Code for automatic download & format conversion from mp3 to .wav file format
Katharina Neuser
Code for data preprocessing, code commenting
Mentor
Thomas Viehmann