Document Type: Original Article


1 Department of Computer Science and Systems Engineering, Ayandegan Institute of Higher Education, Tonekabon, Iran.

2 Department of Computer Science, Islamic Azad University Central Tehran Branch, Tehran, Iran.

3 Department of Industrial Engineering, Sadra University, Tehran, Iran.


Speech recognition has become an integral part of daily life, and the need for systems equipped with this technology has increased dramatically in the past few years. This research aims to locate two selected Persian words in any given audio file. For this purpose, two standard and native datasets were prepared: one for training the model and the other for testing it. Both datasets were converted into images of audio waveforms. Using an object detection technique, the model extracts candidate bounding boxes from each test audio image; each box image is then passed through a CNN classifier, which returns a corresponding label. Finally, a threshold is applied so that only boxes labeled with high confidence are displayed as output. The results showed 93% accuracy for the CNN classifier and 50% accuracy when the full model was tested with object detection.
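The propose, classify, and threshold flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: fixed-width sliding windows over raw samples stand in for the object detector's bounding boxes on waveform images, and `toy_classifier` is a hypothetical stand-in for the CNN. The names `propose_boxes`, `detect_keywords`, and `Detection`, as well as the window width, stride, and threshold values, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Detection:
    start: int    # index of the first sample inside the box
    end: int      # index one past the last sample inside the box
    label: str    # label returned by the classifier
    score: float  # classifier confidence in [0, 1]


def propose_boxes(n_samples: int, width: int, stride: int) -> List[Tuple[int, int]]:
    """Assumed stand-in for the object detector: fixed-width sliding windows."""
    if n_samples < width:
        return []
    return [(s, s + width) for s in range(0, n_samples - width + 1, stride)]


def detect_keywords(
    signal: Sequence[float],
    classify: Callable[[Sequence[float]], Tuple[str, float]],
    width: int,
    stride: int,
    threshold: float,
) -> List[Detection]:
    """Classify every candidate box and keep those at or above the threshold."""
    kept = []
    for start, end in propose_boxes(len(signal), width, stride):
        label, score = classify(signal[start:end])
        if score >= threshold:
            kept.append(Detection(start, end, label, score))
    return kept


def toy_classifier(segment: Sequence[float]) -> Tuple[str, float]:
    """Hypothetical stand-in for the CNN: scores a box by mean absolute amplitude."""
    energy = sum(abs(x) for x in segment) / len(segment)
    return ("keyword", min(energy, 1.0))


if __name__ == "__main__":
    # Silence, a short burst, then silence again.
    audio = [0.0] * 8 + [1.0] * 4 + [0.0] * 8
    for det in detect_keywords(audio, toy_classifier, width=4, stride=4, threshold=0.9):
        print(det)
```

Raising the threshold trades recall for precision: fewer boxes are reported, but those that remain carry higher classifier confidence.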

