BIODIVERSITY
2025-03-15 · 6 min read

Fine-Tuning BirdNET on Custom Data: Tailoring AI for Local Bird Monitoring

Atchayapraba, Guneet Mutreja

Bird sound recognition has become a transformative tool for biodiversity monitoring, with BirdNET standing out as a popular model driving this change. This deep learning system is designed to identify hundreds, and now thousands, of bird species by sound. Initially, the 2018–2020 version recognized 984 species across North America and Europe, expanding to over 3,000 species worldwide by 2021, and currently covering more than 6,000 species globally.

Developed by Kahl et al. (2021), BirdNET works by analyzing audio spectrograms to automatically detect bird species from their songs and calls. This capability is profoundly significant for conservation biology and ornithology, enabling large-scale passive acoustic monitoring. Instead of manual listening, researchers can scan hundreds of hours of recordings for bird occurrences. BirdNET forms the backbone of popular tools like the BirdNET app from Cornell Lab and various research initiatives, significantly lowering the technical hurdle for bird sound identification.

While BirdNET’s broad coverage is incredibly useful, a single model can face challenges. Its performance may decline in regions or habitats that differ significantly from the data it was originally trained on. Factors like regional dialects in bird calls, specific background noise profiles, or the presence of species not included in the initial training data can cause a one-size-fits-all model to struggle. For example, a model primarily trained on North American and European sounds might misidentify calls in a tropical rainforest.

The Power of Fine-Tuning

The good news is that we can fine-tune BirdNET using site-specific data to significantly improve its accuracy for local species and even incorporate new species that weren’t part of the original model. Fine-tuning means retraining the model on new, relevant data, allowing it to adapt to a specific location or species list.

For ornithologists, this means a BirdNET model can be precisely tailored to the local avifauna, which is invaluable for monitoring rare species or analyzing bird communities within a particular reserve. If you’re working in a specific area, you can train BirdNET to focus on the species present there and learn their unique acoustic soundscape.

How to Fine-Tune BirdNET on Custom Data: A Step-by-Step Approach

Fine-tuning BirdNET on your custom audio data involves a structured process. The high-level steps are:

  1. Setting Up the Environment: Preparing your computer with necessary software.
  2. Collecting Site-Specific Audio: Gathering local bird sound recordings.
  3. Preprocessing Audio: Standardizing the format and length of recordings.
  4. Generating Synthetic Bird Calls: Using AI to create additional audio for underrepresented species.
  5. Preparing the Dataset: Organizing all audio clips for training.
  6. Training the BirdNET Model: Running the fine-tuning process.

Let’s dive into each step in detail.

Step 1: Setting Up Your Python Environment

Fine-tuning BirdNET requires installing several specialized Python libraries and tools. You’ll need Python 3.9+ (64-bit recommended).

Crucially, you need the BirdNET-Analyzer repository, which you can clone from GitHub at https://github.com/birdnet-team/BirdNET-Analyzer (keep the directory path handy, as training commands are run from within it).

The process also relies on PyTorch, preferably installed with CUDA support if you have a compatible GPU for faster training.

For generating synthetic audio, you’ll install AudioLDM2 via Hugging Face’s Diffusers:

pip install diffusers transformers

Finally, install essential audio and data processing libraries:

pip install librosa soundfile pydub pandas scipy

Once these libraries are installed, your environment will be ready for the fine-tuning process.
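As a quick sanity check, you can verify that everything imports before moving on. The sketch below (package names taken from the pip installs above) uses `importlib.util.find_spec`, which reports whether a package is importable without actually loading it:

```python
import importlib.util

# Packages the fine-tuning workflow relies on (from the installs above)
required = ["torch", "diffusers", "transformers", "librosa",
            "soundfile", "pydub", "pandas", "scipy"]

def check_installed(names):
    """Return a dict mapping each package name to True if it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

for name, ok in check_installed(required).items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Any package reported as MISSING can be installed with pip before continuing.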

Step 2: Collecting Location-Specific Bird Audio

To fine-tune BirdNET, you first need a collection of local bird audio recordings. Xeno-canto is an excellent resource for this, hosting thousands of bird recordings from around the world. The xenopy library allows you to automate downloads.

You can define the geographic area of interest:

from xenopy import Query

q = Query(cnt="India", box="21.5504,88.2518,22.2017,89.0905")

Then, you retrieve metadata and download the audio files:

q.retrieve_meta(verbose=True)
q.retrieve_recordings(outdir="output/xeno_canto", multiprocess=True, nproc=10)

After downloading, you’ll organize the audio files by species, typically using the metadata to rename folders into a clear format like Genus species_CommonName (e.g., Psittacula krameri_Rose-ringed Parakeet). This provides clean, location-specific audio, sorted by species, ready for the next stages.
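The reorganization step can be scripted. This is a minimal sketch that assumes you have the metadata in a CSV with xeno-canto-style fields (`id`, `gen`, `sp`, `en`); adjust the column names and file-matching pattern to however your metadata is actually stored:

```python
import shutil
from pathlib import Path

import pandas as pd

def organize_by_species(meta_csv, audio_dir, out_dir):
    """Sort downloaded recordings into 'Genus species_CommonName' folders.

    Assumes xeno-canto-style metadata columns:
    'id' (recording id), 'gen' (genus), 'sp' (species), 'en' (English name).
    """
    meta = pd.read_csv(meta_csv)
    audio_dir, out_dir = Path(audio_dir), Path(out_dir)
    for _, row in meta.iterrows():
        species_dir = out_dir / f"{row['gen']} {row['sp']}_{row['en']}"
        species_dir.mkdir(parents=True, exist_ok=True)
        # Xeno-canto files typically carry the recording id in the name (e.g. XC12345)
        for f in audio_dir.glob(f"*{row['id']}*"):
            shutil.move(str(f), species_dir / f.name)
```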

Step 3: Preprocessing Audio Clips

Audio recordings from sources like Xeno-canto vary widely in length and sample rate. BirdNET, however, requires uniform, short audio clips, typically 3 seconds long, in 48kHz mono WAV format, to function correctly.

Preprocessing involves two main steps:

  • Resampling: Convert all audio files to a standard 48kHz mono format.
  • Slicing: Divide each audio file into consecutive 3-second clips. Any leftover segments shorter than 3 seconds are discarded.

The result is a set of species folders, each containing uniform 3-second WAV clips. The 3-second duration is chosen because it typically captures most bird calls clearly and aligns well with the window size BirdNET uses for analyzing spectrograms.

Step 4: Generating Synthetic Bird Calls

A common challenge when training machine learning models on real-world data is class imbalance. Some bird species may have very few available recordings, leading to insufficient data for the model to learn from them effectively.

To address this, you can use AudioLDM2, a text-to-audio generative model, to create realistic synthetic bird calls. The approach is to augment your dataset by generating enough synthetic clips to match the number of clips in the most-represented species class.

You guide the audio generation using prompts like “Generate bird vocalizations of [species]”. Adding negative prompts (e.g., “bad quality”) helps avoid generating poor outputs. The generated audio is typically saved per species at 16kHz. These synthetic clips are then processed — resampled to 48kHz and chunked into 3-second segments — just like the real data. This process ensures that underrepresented species have sufficient training data, improving model balance and overall robustness.
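A minimal generation sketch with the Diffusers `AudioLDM2Pipeline` is shown below. The checkpoint name, inference-step count, and clip length are assumptions you may want to tune; the heavy pipeline call is kept under a `__main__` guard because it downloads a large model:

```python
import scipy.io.wavfile as wavfile

def make_prompts(species):
    """Build the positive/negative prompt pair used for generation."""
    return (f"Generate bird vocalizations of {species}", "bad quality")

if __name__ == "__main__":
    import torch
    from diffusers import AudioLDM2Pipeline

    # "cvssp/audioldm2" is an assumed checkpoint; any AudioLDM2 checkpoint works
    pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2")
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

    prompt, negative = make_prompts("Psittacula krameri")
    audio = pipe(prompt, negative_prompt=negative,
                 num_inference_steps=200, audio_length_in_s=5.0).audios[0]
    # AudioLDM2 outputs 16 kHz audio; resample to 48 kHz with the real clips later
    wavfile.write("generated_0.wav", 16_000, audio)
```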

Step 5: Preparing the Training Dataset

Now, you combine your preprocessed real audio clips and the processed synthetic clips. The crucial step here is organization. All 3-second, 48kHz clips must be placed into species-specific folders. The structure should look something like this:

preprocessed_audio/ 
├── Psittacula krameri_Rose-ringed Parakeet/ 
│   ├── XC12345_1.wav 
│   ├── XC12345_2.wav 
│   ├── ... (real clips) 
│   ├── generated_0.wav 
│   ├── generated_0_1.wav 
│   ├── ... (processed synthetic clips) 
├── Turdoides striata_Jungle Babbler/ 
│   ├── ... 
├── ...

The names of these folders are critical as they serve as the class labels during training. Use clear, unique names.

Before moving to training, it’s essential to perform a quick checklist:

  • ✅ Class balance: Verify that all species have a similar number of clips.
  • ✅ Audio quality: Manually check a sample of clips to filter out noise or mislabeled recordings.
  • ✅ Naming consistency: Ensure there are no duplicate or ambiguous folder names that could confuse the model.
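The class-balance check from the checklist can be done with a quick count of clips per species folder (the dataset path is an assumption matching the layout above):

```python
from pathlib import Path

def clip_counts(dataset_dir):
    """Count WAV clips in each species folder (folder name = class label)."""
    root = Path(dataset_dir)
    return {d.name: len(list(d.glob("*.wav")))
            for d in root.iterdir() if d.is_dir()}

if Path("preprocessed_audio").exists():
    for species, n in sorted(clip_counts("preprocessed_audio").items()):
        print(f"{species}: {n} clips")
```

Species with noticeably fewer clips are candidates for more synthetic augmentation (Step 4).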

With this step complete, you have a clean, balanced, site-specific dataset perfectly structured for training BirdNET.

Step 6: Training the BirdNET Model

The final step is to train the BirdNET model to recognize only the bird species present in your prepared dataset. You run the training process from within the BirdNET-Analyzer directory using a Python command:

python -m birdnet_analyzer.train --i "/path/to/preprocessed_audio" --o "/path/to/output_model_dir" --epochs 50

Here, --i points to your input folder containing the species subfolders of audio clips, --o specifies the directory where the trained model will be saved, and --epochs sets the number of training iterations (you may need to adjust this based on your dataset size and computational resources).

During training, several things happen:

  • Data Splitting: The dataset is automatically divided into training and validation sets (often based on a val_split parameter, like 20%).
  • Feature Extraction: Audio is converted into spectrograms or mel features, which the model can process.
  • Model Structure: BirdNET uses a ResNet-based Convolutional Neural Network (CNN). It employs transfer learning, fine-tuning either just the last layer or the full model depending on configuration. You can configure aspects like hidden_units, dropout, and learning_rate.
  • Augmentation: If mixup is enabled in the configuration, it applies mixup augmentation to improve the model's ability to generalize.

The output of the training process includes the model files, which can be saved in various formats (.tflite, SavedModel, or PyTorch) based on the specified model_format. A label map is also created, mapping species indices used by the model back to their names.
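When using the trained model downstream, you will need that label map to translate class indices back into species names. A minimal parser, assuming the common one-label-per-line text format (the exact file name depends on your training run):

```python
def load_label_map(path):
    """Parse a label file (one species per line) into index -> species name."""
    with open(path, encoding="utf-8") as f:
        names = [line.strip() for line in f if line.strip()]
    return dict(enumerate(names))
```

The index of the model's highest-scoring output then maps directly to a species name via this dictionary.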

Conclusion

By fine-tuning BirdNET on local data, you significantly enhance its capability to accurately recognize bird species specific to your region. This is particularly valuable for species that the base model might struggle with due to limited training exposure or unique local acoustic environments. This process improves accuracy by adapting the model to local bird calls, the specific background noise of the environment, and the regional ecological context.

Furthermore, fine-tuning enables the recognition of endemic or previously unsupported species, making BirdNET a more powerful tool for conservation efforts, such as monitoring rare birds through automated acoustic recordings. A localized model strengthens passive acoustic monitoring initiatives, allowing researchers and citizen scientists to reliably track bird activity, migration patterns, and overall biodiversity with higher confidence and greater regional relevance. You now have a fine-tuned BirdNET model specialized for your local bird community, ready for deployment.

References

  • Kahl, Stefan, Connor M. Wood, Maximilian Eibl, and Holger Klinck. “BirdNET: A deep learning solution for avian diversity monitoring.” Ecological Informatics 61 (2021): 101236.
  • Wood, Connor M., Stefan Kahl, Ashakur Rahaman, and Holger Klinck. “The machine learning–powered BirdNET App reduces barriers to global bird research by enabling citizen science participation.” PLoS Biology 20, no. 6 (2022): e3001670. https://www.birdnet.cornell.edu.
  • Márquez-Rodríguez, Alba, Miguel Ángel Mohedano-Munoz, Manuel J. Marín-Jiménez, Eduardo Santamaría-García, Giulia Bastianelli, Pedro Jordano, and Irene Mendoza. “A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana.” arXiv preprint arXiv:2503.15576 (2025).