Region of Interest Extraction and Bilinear Interpolation Application for Preprocessing of Lipreading Systems 


Vol. 13, No. 4, pp. 189-198, Apr. 2024
https://doi.org/10.3745/TKIPS.2024.13.4.189


  Abstract

Lipreading is an important part of speech recognition, and several studies have sought to improve the performance of lipreading systems. Recent studies have done so by modifying the model architecture of the lipreading system. In contrast, we aim to improve recognition performance without any change to the model architecture. To that end, we draw on the cues used in human lipreading and define additional regions of interest that include the chin and cheeks along with the lip region, which is the existing region of interest of lipreading systems. We compare the recognition rate obtained with each region of interest and propose the one that performs best. In addition, on the assumption that differences in the normalization result caused by the interpolation method used when normalizing the size of the region of interest affect recognition performance, we interpolate the same region of interest using nearest neighbor, bilinear, and bicubic interpolation, and compare the recognition rates to propose the best-performing interpolation method. Each region of interest was detected by a trained object detection neural network. Dynamic time warping templates were generated by normalizing each region of interest, extracting and combining features, and mapping the combined features into a low-dimensional space through dimensionality reduction. The recognition rate was evaluated by comparing the distance between the generated dynamic time warping templates and the data mapped to the low-dimensional space.
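The template-matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `dtw_distance` and `classify`, the Euclidean local cost, and the toy two-dimensional features are all assumptions made for illustration. It shows only the general idea of assigning a sample to the template with the smallest dynamic time warping distance.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two sequences of
    equal-dimension feature vectors (Euclidean local cost is an assumption)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]


def classify(sample, templates):
    """Return the label of the template closest to `sample` under DTW.
    `templates` is a hypothetical dict mapping label -> feature sequence."""
    return min(templates, key=lambda label: dtw_distance(sample, templates[label]))


# Toy usage: low-dimensional feature sequences of different lengths.
templates = {
    "word_a": [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    "word_b": [(5.0, 5.0), (5.0, 4.0)],
}
sample = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
predicted = classify(sample, templates)
```

Because DTW allows one frame to align with several frames of the other sequence, the sample above matches `word_a` even though its first frame is repeated.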
In the comparison of regions of interest, the region of interest containing only the lip region achieved an average recognition rate of 97.36%, which is 3.44% higher than the average recognition rate of 93.92% reported in the previous study. In the comparison of interpolation methods, bilinear interpolation achieved 97.36%, which is 14.65% higher than nearest neighbor interpolation and 5.55% higher than bicubic interpolation. The code used in this study can be found at https://github.com/haraisi2/Lipreading-Systems.
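To make the size-normalization step concrete, the following is a minimal sketch of bilinear interpolation on a single-channel image stored as a list of rows. It is not taken from the authors' repository. The function name `resize_bilinear`, the half-pixel coordinate mapping, and the edge clamping are assumptions, and real pipelines typically use an image library's resize routine, which may follow a different coordinate convention.

```python
def resize_bilinear(img, out_h, out_w):
    """Resize a 2-D grayscale image (list of rows) with bilinear interpolation.
    Assumes the half-pixel convention: output pixel centres are mapped back
    to input coordinates and clamped to the image border."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for y in range(out_h):
        # Source row coordinate for this output row, clamped to valid range.
        sy = min(max((y + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        fy = sy - y0  # vertical blend weight
        row = []
        for x in range(out_w):
            sx = min(max((x + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            fx = sx - x0  # horizontal blend weight
            # Blend horizontally on the two neighbouring rows, then vertically.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# Toy usage: upsample a 1x2 strip to 1x4.
upsampled = resize_bilinear([[0, 10]], 1, 4)
```

Each output value is a weighted average of up to four neighbouring input pixels. Nearest neighbor interpolation would instead copy a single input pixel, and bicubic interpolation would fit a cubic over a 4x4 neighbourhood.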

  Cite this article

[IEEE Style]

J. H. Han, Y. K. Kim, M. H. Kim, "Region of Interest Extraction and Bilinear Interpolation Application for Preprocessing of Lipreading Systems," The Transactions of the Korea Information Processing Society, vol. 13, no. 4, pp. 189-198, 2024. DOI: https://doi.org/10.3745/TKIPS.2024.13.4.189.

[ACM Style]

Jae Hyeok Han, Yong Ki Kim, and Mi Hye Kim. 2024. Region of Interest Extraction and Bilinear Interpolation Application for Preprocessing of Lipreading Systems. The Transactions of the Korea Information Processing Society, 13, 4, (2024), 189-198. DOI: https://doi.org/10.3745/TKIPS.2024.13.4.189.