Vineet Vashisht*
Aditya Kumar Pandey
Satya Prakash Yadav
(Department of Information Technology, ABES Institute of Technology (ABESIT), Ghaziabad-201009, India)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Speech recognition, Speech emotion recognition, Statistical classifiers, Dimensionality reduction techniques, Emotional speech databases, Vision processing, Computational intelligence, Machine learning, Computer vision
1. Introduction
In this project, we try to reduce the language barriers among people with a
communication technique built on speech-trained systems, which achieve better performance
than systems trained with normal speech. Speech emotion recognition is also used in
call center applications and mobile wireless communications. This encouraged us to
think of speech as a fast and powerful means of communicating with machines. Speech
recognition is the method of converting an acoustic signal, captured by a microphone
or other instrument, into a set of words [1]. We use linguistic analysis to achieve speech comprehension. Everybody needs to engage
with others in society, and we need to understand one another. It is also natural
for individuals to expect computers to have a speech interface. At present, however,
interaction with machines requires complex languages that are hard to
understand and use. A speech synthesizer converts written text into spoken language.
Speech synthesis is also referred to as text-to-speech (TTS) conversion, as shown
in Fig. 1.
$\textbf{Speech synthesis}$ is the artificial production of human speech. A computer
used for this purpose is called a $\textbf{speech computer,}$ or $\textbf{speech synthesizer}$,
and it can be implemented in software or hardware products. A $\textbf{text-to-speech}$
($\textbf{TTS}$) system converts normal language text into speech; other systems render
symbolic linguistic representations like phonetic transcriptions into speech.
Synthesized speech can be created by concatenating pieces of recorded speech that
are stored in a database. Systems differ in the size of the stored speech units; a
system that stores phones or diphones provides the largest output range, but may lack
clarity. For specific usages, the storage of entire words or sentences allows for
high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal
tract and other human voice characteristics to create completely synthetic voice output.
Text-to-speech (TTS) is a type of assistive technology that reads digital text aloud.
It is sometimes called read-aloud technology. With the click of a button or the touch
of a finger, TTS can take words on a computer or other digital device and convert
them into audio. A simple solution may be to ease this communication barrier with
spoken language that can be understood by a computer. Great progress has been made
in this area, but such systems still face the problems of limited vocabulary and complex
grammar, along with the problem of retraining the system under different circumstances
for different speakers. For applications that require natural human-machine interaction,
such as web movies and computer demonstration applications, detection of emotion in
speech is particularly useful, because the reaction of the system to the user depends
on the sensed emotion. Speech recognition-interface implementations include voice
dialing (e.g., "call home"), call routing (e.g., "I would like to make a collect call"),
home appliance control, keyword search (e.g., locating a podcast where particular
words are spoken), basic data entry (e.g., entering a credit card number), formal
document preparation (e.g., creating a radiology report), and direct voice input for
device control [8].
Visual processing is a term used to refer to the brain's ability to use and interpret
visual information. The process of converting light energy into a meaningful image
is a complex process that is facilitated by numerous brain structures and higher-level
cognitive processes. In the areas of human-computer interaction, biometric applications,
protection and surveillance, and most recently computational behavioral analysis,
advancements in speech- and visual-processing systems have facilitated considerable
research and growth. Although intelligent systems (IS) have been enriched for several
decades by conventional machine learning and evolutionary computation to solve complicated
pattern-recognition issues, these methods are limited in their ability to handle natural
data or images in raw formats. A variety of computational steps are therefore applied
before machine learning models are implemented, in order to derive representative
features from raw data or images.
$\textbf{Speech Recognition Terminology:}$ Speech recognition is a technology that
enables a device to capture the words spoken by a human into a microphone. These words
are later processed through speech recognition and, ultimately, the system outputs
the recognized words. The speech recognition process consists of different steps that
are discussed one by one in the following sections [6]. Speech translation is important because it allows speakers from around the world
to communicate in their own languages, erasing the language gap in global business
and cross-cultural exchanges. Achieving universal speech translation would be of immense
scientific, cultural, and economic importance for humanity. Our project breaks
down the language barrier so that individuals can interact with each other in their
preferred language. Speech recognition systems can be grouped into a number of categories
according to their ability to understand the terms and lists of words they support.
Ideally, the recognition engine would recognize every word spoken by a person, but
in practice the speech recognition engine's efficiency depends on a variety of factors.
The key variables on which a speech recognition engine depends are terminology,
concurrent users, and noisy settings.
$\textbf{Speech Recognition Process:}$ The communication of meaning from one language
(the source) to another language (the target) is translation. Basically, speech recognition
is used for two main reasons. First and foremost, dictation is the conversion of spoken
words into text as a form of speech processing; secondly, control of devices requires
software that enables a person to run various voice applications [3]. The PC sound card generates the corresponding digital representation of audio received
through microphone input. The method of translating the analog signal into digital
form is digitization. Sampling transforms a continuous signal into a discrete signal,
and quantization is the method of approximating a continuous range of values with a
finite set of levels, as sketched below.
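As a minimal illustration of these two steps (not part of the original system), the following Python sketch samples a synthetic 1 kHz tone at 8 kHz and quantizes each sample to 8 bits, roughly what a sound card's analog-to-digital converter does before the samples reach the recognizer; the sampling rate, tone frequency, and bit depth are chosen only for illustration.

import numpy as np

# Sampling: evaluate the "analog" signal only at discrete instants (fs = 8 kHz).
fs = 8000                                 # sampling rate in Hz (illustrative choice)
t = np.arange(0, 0.01, 1.0 / fs)          # 10 ms of discrete sample times
analog = np.sin(2 * np.pi * 1000 * t)     # stand-in for the microphone's analog waveform

# Quantization: approximate each continuous amplitude with one of 2**8 = 256 levels.
bits = 8
levels = 2 ** bits
quantized = np.round((analog + 1.0) / 2.0 * (levels - 1)).astype(np.uint8)

print(quantized[:10])                     # first few 8-bit sample values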
Attention models are input-processing techniques for neural networks that allow the
network to focus on specific aspects of a complex input, one at a time, until the entire
dataset is categorized. The goal is to break down complicated tasks into smaller areas
of attention that are processed sequentially. In broad strokes, attention is expressed
as a function that maps a query and a set of key-value pairs to an output, in which
the query, keys, values, and final output are all vectors. The output is then calculated
as a weighted sum of the values, with the weight assigned to each value given by a
compatibility function of the query with the corresponding key.
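As a small, self-contained sketch of this weighted-sum view (assuming a scaled dot product as the compatibility function, which is our choice here; the model used later in this paper uses an additive form instead), the Python/NumPy code below maps one query and a set of key-value pairs to an output vector.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    # query: (d,), keys: (n, d), values: (n, d_v)
    scores = keys @ query / np.sqrt(query.shape[-1])   # compatibility of the query with each key
    weights = softmax(scores)                          # one weight per key-value pair
    return weights @ values                            # output = weighted sum of the values

# Toy usage: three key-value pairs, four-dimensional vectors.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(attention(q, K, V))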
$\textbf{Neural Machine Translation:}$ This machine translation technique uses
artificial neural networks to predict the probability of a sequence of words, usually
modeling entire sentences in a single integrated model. In recent years, technology
using neural networks has been applied to problems in a variety of ways.
The use of neural machine translation (NMT) in the natural speech processing area
is an example of this. Missing translation is the phenomenon in which text that was
present in the source is missing from the output, in terms of context or word translation.
Neural machine translation is the use of a neural network to learn a mathematical model
for machine translation. The key benefit of the methodology is that a single framework
can be trained directly on the source and target text, which no longer requires the
pipeline of complex systems used in statistical machine translation [5].
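Formally (a standard formulation, not stated in the original paper), a single NMT model with parameters $\theta$ is trained directly on parallel source-target sentence pairs $(x, y)$ by maximizing the conditional log-likelihood of each target word given the source sentence and the preceding target words:

$$\theta^{*} = \arg\max_{\theta} \sum_{(x,\, y)} \sum_{t=1}^{|y|} \log p\left(y_{t} \mid y_{<t}, x; \theta\right),$$

so the whole translation pipeline reduces to one conditional language model over the target sequence given the source.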
$\textbf{· Connected Speech:}$ Connected words, or connected speech, is similar to
isolated-word speech, but allows separate utterances to be spoken together with only
brief pauses between them.
$\textbf{· Continuous Speech:}$ Continuous speech allows the user to speak almost
naturally; it is also called computer dictation.
$\textbf{· Spontaneous Speech:}$ At a basic level, this can be viewed as speech that
is natural-sounding and not rehearsed. An ASR device with spontaneous speech abilities
should be able to accommodate a variety of natural speech features, such as sentences
that run together and include "ums," "ahs," and even slight stutters.
$\textbf{Machine Translation:}$ Machine translation typically models entire sentences
in a single integrated model, using an artificial neural network to predict the sequence
of words. Initially, word sequence modeling was usually carried out using a recurrent
neural network (RNN). Unlike the traditional phrase-based translation method, which
consists of many small subcomponents that are tuned separately, neural machine translation
builds and trains a single, broad neural network that reads a phrase and outputs the
correct translation. Such systems are called end-to-end neural machine translation
because only one model is needed for translation. The transfer of scientific, metaphysical,
literary, commercial, political, and artistic knowledge across linguistic barriers
is an integral and essential component of human endeavor [4].
Translation is more prevalent and available today than ever before. Organizations
with larger budgets may choose to hire a translation company or independent professional
translators to manage all their translation needs; organizations with smaller budgets,
or that deal in subjects that are unknown to many translators, may choose to combine
the services of professional translators.
Fig. 1. Speech Synthesis [2].
Fig. 2. Recognition Process.
2. Literature Review
$\textbf{Mehmet Berkehan Akçay et al.}$ [1] explained that neural networks were mainly limited to industrial control and robotics
applications. However, recent advances in neural networks, through applications such
as intelligent travel, intelligent diagnosis and health monitoring for precision medicine,
robotics and home appliance automation, virtual online support, e-marketing, weather
forecasting, and natural disaster management, among others, have contributed to successful
IS implementations in almost every aspect of human life.
$\textbf{G. Tsontzos et al.}$ [2] clarified how feelings allow us to better understand each other, and a natural consequence
is to expand this understanding to computers. Thanks to smart mobile devices capable
of accepting and responding to voice commands with synthesized speech, speech recognition
is now part of our daily lives. To allow devices to detect our emotions, speech emotion
recognition (SER) could be used.
$\textbf{T. Taleb et al.}$ [7] said they were motivated by the understanding that these standards place upper bounds
on the improvement that can be achieved when using HMMs in speech recognition. In
an attempt to improve robustness, particularly under noisy conditions, new modeling
schemes that can explicitly model time are being explored; this work was partially
funded by the EU-IST FP6 HIWIRE research project. Spatial similarity approaches, including
linear dynamic models (LDMs), were initially proposed for use in speech recognition.
$\textbf{Vinícius Maran et al.}$ [6] explained that learning speech is a dynamic mechanism in which the processing of
phonemes is marked by continuities and discontinuities in the path of the infant towards
the advanced production of ambient language segments and structures.
$\textbf{Y. Wu et al.}$ [3] noted that discriminative training has been used in speech recognition for many years
now. The few organizations that have had the resources to apply discriminative training
to large-scale speech recognition tasks have, in recent years, mostly used the maximum
mutual information (MMI) criterion. Instead, extending the studies first presented,
they reflect on the minimum classification error (MCE) paradigm for discriminative
training.
$\textbf{Peng et al.}$ [4] stated that speaker identification refers to identifying people by their voice.
This technology is increasingly adopted and used as a kind of biometric for its ease
of use and non-interactivity, and it has quickly become a research hotspot in the field
of biometrics.
$\textbf{Shahnawazuddin and Sinha}$ [10] discussed how the work presented was an extension of the current quick adaptation
approaches based on acoustic model interpolation. The basis (model) weights are calculated
in these methods using an iterative process that uses the maximum likelihood (ML)
criterion.
$\textbf{Varghese et al.}$ [5] stated there are many ways to recognize emotions from expression. Many attempts
have been made to identify emotional states from vocal information. To recognize emotions,
some essential voice feature vectors have been selected, from which utterance-level
statistics are measured.
$\textbf{D.J. Atha et al.}$ [9] pointed out that a long-term target is the creation of an automated real-time translation
device where the voice is the source. Recent developments in the area of computational
translation science, however, boost the possibility of widespread adoption in the
near future.
3. Proposed Method
In this research, the work is based on the flowchart below, following the working model
of speech recognition. The models illustrated previously are made up of millions of
parameters, which need to be learned from the training corpus. We make use of additional
information where appropriate, such as text that is closely linked to the speech we
are about to translate [7]. It is possible to write this text in the source language, the target language, or
both.
Future development will bring the most complex intelligent systems based on deep learning
to billions of smartphone users. There is a lengthy list of vision and voice technologies
that can increasingly simplify and assist human visual and auditory processing at
greater scale and consistency, from sensation and emotion detection to the development
of self-driving autonomous transport systems. This paper serves scholars, clinicians,
technology creators, and consumers as an exemplary analysis of emerging technologies
in many fields, such as behavioral science, psychology, transportation, and medicine.
Fig. 3. Working Model of Speech Recognition.
Fig. 4. The Attention Model.
4. Results
Voice detection with a real-time predictive voice translation device, optimized using
multimodal vector sources of information and functionality, was presented. The key
contribution of this work is the manner in which external information input is used
to increase the system's accuracy, thereby allowing a notable improvement over comparable
processes. In addition, a new initiative, realistic from an analytical standpoint,
was launched and discussed. As per our discussion and planning, the desired system
converts Hindi to English and vice versa.
4.1 Initial Test
A limitation of the encoder-decoder model is that it encodes the input sequence into
one fixed-length vector, from which each output time step is decoded. This problem
is believed to be more of a concern with long sequences, and attention is proposed
as a strategy to both align and translate. Instead of encoding the input sequence
into a single fixed-context vector, the attention model produces a context vector
that is filtered independently for each output time step. As with the encoder-decoder
text, the approach is extended to a machine translation problem, and it uses GRU units
rather than LSTM memory cells [9]. In this case, bidirectional input is used: both forward and backward input sequences
are given, and they are concatenated before being passed to the decoder. The input
is fed into an encoder model that gives us the encoder output and the hidden state
of the encoder, as sketched below.
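The following is a minimal sketch of such a bidirectional GRU encoder (assuming TensorFlow 2.x; the vocabulary size, embedding size, and unit counts are illustrative choices, not values taken from the paper). The forward and backward outputs are concatenated per time step, and the two final states are concatenated into one hidden state.

import tensorflow as tf

vocab_size, embedding_dim, units = 5000, 256, 512   # illustrative sizes

encoder_inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(vocab_size, embedding_dim)(encoder_inputs)
# Forward and backward GRUs; their per-time-step outputs are concatenated.
encoder_outputs, fwd_state, bwd_state = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
)(x)
encoder_hidden = tf.keras.layers.Concatenate()([fwd_state, bwd_state])
encoder = tf.keras.Model(encoder_inputs, [encoder_outputs, encoder_hidden])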
Facilitated communication (FC), or supported typing, is a scientifically discredited
technique that attempts to aid communication by people with autism or other communication
disabilities, and who are non-verbal. The facilitator guides the disabled person's
arm or hand, and attempts to help them type on a keyboard or other device.
The calculations applied are:
FC = Fully connected (Dense) layer
EO = Encoder output
H = Hidden state
X = Input to the decoder
And the pseudo-code is:
score = FC(tanh(FC(EO) + FC(H))).
Attention weights = softmax(score, axis = 1). Softmax is applied on the last axis by
default, but here we want to apply it on the first axis, since the shape of score is
(batch size, max length, hidden size). Max length is the length of our input; since
we are attempting to assign a weight to each input, softmax must be applied on that
axis. Context vector = sum(attention weights * EO, axis = 1). The same reasoning as
above applies for choosing axis = 1. Embedding output = the decoder input X passed
through an embedding layer. Merged vector = concat(embedding output, context vector).
A runnable sketch of this attention layer is given below.
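To make the pseudo-code above concrete, the following sketch implements the same score, attention-weight, and context-vector computations as a Keras layer (a minimal sketch assuming TensorFlow 2.x; the names W1, W2, V, and units are our own and are not taken from the paper).

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # FC applied to the encoder output EO
        self.W2 = tf.keras.layers.Dense(units)   # FC applied to the hidden state H
        self.V = tf.keras.layers.Dense(1)         # final FC producing the score

    def call(self, query, values):
        # query (H): (batch, hidden); values (EO): (batch, max_length, hidden)
        query_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(query_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)   # softmax over max_length (axis 1)
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights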
Fig. 5. First Translation from English to Hindi.
Fig. 6. Second Translation from English to Hindi.
Fig. 7. Third Translation from English to Hindi.
4.2 Translate
The evaluation function is similar to the training loop, except that we do not use
teacher forcing here. At each time step, the input to the decoder is its previous
prediction, along with the hidden state and the encoder output. Prediction stops once
the model predicts the end token, and the attention weights are stored for every time
step, as in the sketch below.
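The loop below is a minimal sketch of such an evaluation step in TensorFlow; the encoder, decoder, tokenizers, and the <start>/<end> tokens are assumptions about how the trained model and its preprocessing are exposed, not interfaces defined in this paper.

import tensorflow as tf

def translate(sentence, encoder, decoder, inp_tokenizer, targ_tokenizer, max_length=40):
    # Encode the (already cleaned) input sentence.
    seq = inp_tokenizer.texts_to_sequences([sentence])
    enc_input = tf.convert_to_tensor(seq)
    enc_output, enc_hidden = encoder(enc_input)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_tokenizer.word_index['<start>']], 0)

    result, attention_weights = [], []
    for _ in range(max_length):
        predictions, dec_hidden, attn = decoder(dec_input, dec_hidden, enc_output)
        attention_weights.append(attn)                  # store the weights for this step
        predicted_id = int(tf.argmax(predictions[0]))
        word = targ_tokenizer.index_word[predicted_id]
        if word == '<end>':                             # stop at the end token
            break
        result.append(word)
        dec_input = tf.expand_dims([predicted_id], 0)   # previous prediction becomes next input
    return ' '.join(result), attention_weights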
In a broader sense, a translator is a programming language processor that converts
a computer program from one language to another. It takes a program written in source
code and converts it into machine code, discovering and identifying errors during
translation. It translates a high-level language program into a machine language
program that the central processing unit (CPU) can understand, and it also detects
errors in the program.
5. Conclusion
In the past few years, the complexity and precision of speech recognition applications
have evolved exponentially. This paper extensively explores the recent advancements
in intelligent vision and speech algorithms, their applications on the most popular
smartphones and embedded platforms, and their limitations. In spite of immense advances
in the success and efficacy of deep learning algorithms, training the machine with
other knowledge sources, such as the surrounding framework or context, also contributes
significantly to the task at hand.
6. Future Scopes
This work can be explored in greater depth in order to improve the project and incorporate
new functionality, and it can be worked on further. The current software does not
accommodate a broad vocabulary; in order to maximize productivity, a larger number
of samples needs to be accumulated [10]. Only a few parts of the notepad are covered by the current edition of the app,
but more areas can be covered, and efforts will be made in this respect.
ACKNOWLEDGMENTS
The authors would like to express their sincere thanks to the editor-in-chief for
his valuable suggestions to improve this article.
REFERENCES
Mehmet Berkehan Akçay, Kaya Oğuz, 2020, Speech emotion recognition: Emotional models,
databases, features, preprocessing methods, supporting modalities, and classifiers,
Speech Communication, Vol. 116, ISSN 0167-6393, pp. 56-76
Tsontzos G., Diakoloukas V., Koniaris C., Digalakis V., 2013, Estimation of General
Identifiable Linear Dynamic Models with an Application in Speech Characteristics
Vectors, Computer Standards & Interfaces, ISSN 0920-5489, Vol. 35, No. 5, pp. 490-506
Wu Y., et al., 2016, Google's Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation, arXiv preprint arXiv:1609.08144
Peng Shuping, Lv Tao, Han Xiyu, Wu Shisong, Yan Chunhui, Zhang Heyong, 2019, Remote
speaker recognition based on the enhanced LDV-captured speech, Applied Acoustics,
ISSN 0003-682X, Vol. 143, pp. 165-170
Varghese A. A., Cherian J. P., Kizhakkethottam J. J., 2015, Overview on emotion recognition
system, 2015 International Conference on Soft-Computing and Networks Security (ICSNS)
Coimbatore, pp. 1-5
Maran Vinícius, Keske-Soares Marcia, 2021, Towards a speech therapy support system
based on phonological processes early detection, Computer Speech & Language, ISSN
0885-2308, Vol. 65
Taleb T., Samdanis K., Mada B., Flinck H., Dutta S., Sabella D., 2017, On Multi-Access
Edge Computing: A Survey of the Emerging 5G Network Edge Cloud Architecture and Orchestration,
IEEE Communications Surveys & Tutorials, Vol. 19, No. 3, pp. 1657-1681
Amengual J. C., Castaño A., Castellanos A., Jiménez V.M., Llorens D., Marzal A., Prat
F., Vilar J.M., Benedi J.M., Casacuberta F., Pastor M., Vidal E., 2000, The EuTrans
spoken language translation system, Machine Translation, Vol. 15, pp. 75-103
Atha D. J., Jahanshahi M. R., 2018, Evaluation of deep learning approaches based on
convolutional neural networks for corrosion detection, Struct. Health Monit., Vol.
17, No. 5, pp. 1110-1128
Shahnawazuddin S., Sinha Rohit, 2017, Sparse coding over redundant dictionaries for
fast adaptation of speech recognition system, Computer Speech & Language, ISSN 0885-2308,
Vol. 43, pp. 1-17
Author
Satya Prakash Yadav is currently on the faculty of the Information Technology Department,
ABES Institute of Technology (ABESIT), Ghaziabad (India). A seasoned academician having
more than 13 years of experience, he has published three books (Programming in C,
Programming in C++, and Blockchain and Cryptocurrency) under I.K. International Publishing
House Pvt. Ltd. He has undergone industrial training programs during which he was
involved in live projects with companies in the areas of SAP, Railway Traffic Management
Systems, and Visual Vehicles Counter and Classification (used in the Metro rail network
design). He is an alumnus of Netaji Subhas Institute of Technology (NSIT), Delhi University.
A prolific writer, Mr. Yadav has filed two patents and authored many research papers
in the Web of Science indexed journals. Additionally, he has presented research papers
at many conferences in the areas of image processing and programming, such as Image
Processing, Feature Extraction, and Information Retrieval. He is also a lead editor
with CRC Press, Taylor and Francis Group (U.S.A.), Science Publishing Group (U.S.A.),
and Eureka Journals, Pune (India).
Vineet Vashisht is currently a research scholar in the Information Technology Department
at Dr. A.P.J. Abdul Kalam Technical University, Lucknow. Vineet Vashisht is supervised
by Asst. Prof. Satya Prakash Yadav of the Information Technology Department, ABES Institute
of Technology (ABESIT).
Aditya Kumar Pandey is currently a research scholar in the Information Technology
Department at Dr. A.P.J. Abdul Kalam Technical University, Lucknow. Aditya Kumar Pandey
is supervised by Asst. Prof. Satya Prakash Yadav of the Information Technology Department,
ABES Institute of Technology (ABESIT).