Diffusion-based Audio-to-Visual Generation for High-Quality Bird Images 


Vol. 14,  No. 3, pp. 135-142, Mar.  2025
https://doi.org/10.3745/TKIPS.2025.14.3.135


  Abstract

Accurately identifying bird species from their vocalizations and generating corresponding bird images remains a challenging task due to limited training data and environmental noise in audio recordings. To address this limitation, this paper introduces a diffusion-based audio-to-image generation approach that both accurately identifies bird sounds and generates matching bird images. The main idea is to use a conditional diffusion model to handle the complexities of bird audio data, such as pitch variations and environmental noise, while establishing a robust connection between the auditory and visual domains. This enables the model to generate high-quality bird images from a given bird audio input. In addition, the proposed approach is integrated with deep audio processing to enhance its capabilities, meticulously aligning audio features with visual information and learning to map intricate acoustic patterns to their corresponding visual representations. Experimental results demonstrate that the proposed approach generates better images for bird classes than previous methods.
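The core mechanism the abstract describes, a diffusion model whose denoiser is conditioned on an audio embedding, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the linear noise schedule, the toy stand-in denoiser, and all names (`q_sample`, `denoiser`, `audio_emb`) are hypothetical.

```python
# Minimal sketch of audio-conditioned diffusion training (assumptions:
# linear noise schedule, epsilon-prediction loss; the denoiser below is
# a hypothetical stand-in for an audio-conditioned U-Net).
import numpy as np

rng = np.random.default_rng(0)

T = 100                                   # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)      # cumulative signal-retention factors

def q_sample(x0, t, noise):
    """Forward process: noise a clean image x0 to diffusion step t."""
    a = np.sqrt(alphas_bar[t])
    s = np.sqrt(1.0 - alphas_bar[t])
    return a * x0 + s * noise

def denoiser(x_t, t, audio_emb):
    """Stand-in for the audio-conditioned noise predictor: it mixes a
    scalar conditioning signal derived from the audio embedding into a
    placeholder prediction of the same shape as x_t."""
    cond = np.tanh(audio_emb).mean()      # hypothetical conditioning signal
    return np.full_like(x_t, cond)

x0 = rng.standard_normal((64, 64))        # toy "bird image"
audio_emb = rng.standard_normal(128)      # embedding of the bird vocalization
noise = rng.standard_normal(x0.shape)

t = 50
x_t = q_sample(x0, t, noise)              # noised image at step t
eps_hat = denoiser(x_t, t, audio_emb)     # predicted noise, given the audio
loss = np.mean((noise - eps_hat) ** 2)    # standard epsilon-prediction loss
```

In a real system the stand-in `denoiser` would be a trained network, and `audio_emb` would come from a deep audio encoder (e.g. over a mel spectrogram of the bird call), so that minimizing the loss ties acoustic patterns to visual structure.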



  Cite this article

[IEEE Style]

A. Toleubekova, J. Y. Shim, X. Piao, J. Kim, "Diffusion-based Audio-to-Visual Generation for High-Quality Bird Images," The Transactions of the Korea Information Processing Society, vol. 14, no. 3, pp. 135-142, 2025. DOI: https://doi.org/10.3745/TKIPS.2025.14.3.135.

[ACM Style]

Adel Toleubekova, Joo Yong Shim, XinYu Piao, and Jong-Kook Kim. 2025. Diffusion-based Audio-to-Visual Generation for High-Quality Bird Images. The Transactions of the Korea Information Processing Society, 14, 3, (2025), 135-142. DOI: https://doi.org/10.3745/TKIPS.2025.14.3.135.