Document Type: Original Article

Authors

1 Department of Computer Science and Systems Engineering, Ayandegan Institute of Higher Education, Tonekabon, Iran.

2 Department of Computer Science, Islamic Azad University, Central Tehran Branch, Tehran, Iran.

3 Department of Industrial Engineering, Sadra University, Tehran, Iran.

Abstract

In today's world, where speech recognition has become an integral part of daily life, the demand for systems equipped with this technology has grown dramatically in the past few years. This research aims to locate two selected Persian words in any given audio file. For this purpose, two datasets, one standard and one native, were prepared: one for training and the other for testing. Both datasets were converted into images of audio waveforms. Using an object detection technique, the model extracts candidate bounding boxes from each test audio image; each box image is then passed through a CNN classifier, which returns a corresponding label. Finally, a threshold is applied so that only boxes with high confidence are displayed as output. The results showed 93% accuracy for the CNN classifier and 50% accuracy when testing the full model with object detection.
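To make that pipeline concrete, the following is a minimal Python sketch of the same flow, assuming a NumPy image array and two pre-trained models; it is illustrative glue code, not the authors' implementation. The `detector` and `classifier` callables, the `waveform_to_image` helper, and the 0.9 threshold are hypothetical placeholders.

```python
# A minimal sketch of the pipeline described in the abstract, assuming a
# NumPy image and two pre-trained models; illustrative only.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render plots off-screen, no display needed
import matplotlib.pyplot as plt
from scipy.io import wavfile


def waveform_to_image(wav_path, out_path="waveform.png"):
    """Render an audio file as a waveform image (assumed plot settings)."""
    rate, samples = wavfile.read(wav_path)
    t = np.arange(len(samples)) / rate
    fig, ax = plt.subplots(figsize=(10, 2), dpi=100)
    ax.plot(t, samples, linewidth=0.3, color="black")
    ax.axis("off")  # keep only the waveform, no axes or ticks
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
    return out_path


def spot_keywords(image, detector, classifier, threshold=0.9):
    """Hypothetical models: `detector(image)` yields (x, y, w, h) boxes,
    `classifier(crop)` returns a (label, confidence) pair for a crop."""
    detections = []
    for x, y, w, h in detector(image):
        crop = image[y:y + h, x:x + w]      # cut the proposed box out
        label, confidence = classifier(crop)
        if confidence >= threshold:         # final confidence filter
            detections.append((label, (x, y, w, h), confidence))
    return detections
```

The threshold here mirrors the final filtering step described in the abstract: any box whose classifier confidence falls below it is discarded rather than reported.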
