Vineet Vashisht*
Aditya Kumar Pandey
Satya Prakash Yadav
(Department of Information Technology, ABES Institute of Technology (ABESIT), Ghaziabad-201009, India)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Speech recognition, Speech emotion recognition, Statistical classifiers, Dimensionality reduction techniques, Emotional speech databases, Vision processing, Computational intelligence, Machine learning, Computer vision
1. Introduction
In this project, we try to reduce the language barriers among people with a
communication technique built on speech-trained systems, which achieve better performance
than systems trained with normal speech. Speech emotion recognition is also used in
call center applications and mobile wireless communications. This encouraged us to
think of speech as a fast and powerful means of communicating with machines. Speech
recognition is the method of converting an acoustic signal, captured by a microphone
or other instrument, into a set of words [1]. We use linguistic analysis to achieve speech comprehension. Everybody needs to engage
with others in society, and we need to understand one another. It is also natural
for individuals to expect computers to have a speech interface. At present, however,
interaction with machines requires complex languages that are hard to
understand and use. A speech synthesizer converts written text into spoken language.
Speech synthesis is also referred to as text-to-speech (TTS) conversion, as shown
in Fig. 1.
$\textbf{Speech synthesis}$ is the artificial production of human speech. A computer
used for this purpose is called a $\textbf{speech computer,}$ or $\textbf{speech synthesizer}$,
and it can be implemented in software or hardware products. A $\textbf{text-to-speech}$
($\textbf{TTS}$) system converts normal language text into speech; other systems render
symbolic linguistic representations like phonetic transcriptions into speech.
Synthesized speech can be created by concatenating pieces of recorded speech that
are stored in a database. Systems differ in the size of the stored speech units; a
system that stores phones or diphones provides the largest output range, but may lack
clarity. For specific usages, the storage of entire words or sentences allows for
high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal
tract and other human voice characteristics to create completely synthetic voice output.
Text-to-speech (TTS) is a type of assistive technology that reads digital text aloud.
It is sometimes called read-aloud technology. With the click of a button or the touch
of a finger, TTS can take words on a computer or other digital device and convert
them into audio. A simple solution may be to ease this communication barrier with
spoken language that can be understood by a computer. Great progress has been made
in this area, but such systems still face the problems of limited vocabulary and complex
grammar, along with the problem of retraining the system under different circumstances
for different speakers. For applications that require natural human-machine interaction,
such as web movies and computer demonstration applications, detection of emotion in
speech is particularly useful, because the reaction of the system to the user depends
on the sensed emotion. Speech recognition-interface implementations include voice
dialing (e.g., "call home"), call routing (e.g., "I would like to make a collect call"),
home appliance control, keyword search (e.g., locating a podcast where particular
words are spoken), basic data entry (e.g., entering a credit card number), formal
document preparation (e.g., creating a radiology report), and direct voice input for
device control [8].
Visual processing is a term used to refer to the brain's ability to use and interpret
visual information. The process of converting light energy into a meaningful image
is a complex process that is facilitated by numerous brain structures and higher-level
cognitive processes. In the areas of human-computer interaction, biometric applications,
protection and surveillance, and most recently computational behavioral analysis,
advancements in speech- and visual-processing systems have facilitated considerable
research and growth. Although intelligent systems (IS) have been enriched for several
decades by conventional machine learning and evolutionary computation to solve complicated
pattern-recognition issues, these methods are limited in their ability to handle natural
data or images in raw formats. A variety of computational steps are therefore applied
before machine learning models are implemented, in order to derive representative
features from raw data or images.
$\textbf{Speech Recognition Terminology:}$ Speech recognition is a technology that
enables a device to capture the words spoken by a human into a microphone. These words
are later processed through speech recognition and, ultimately, the system outputs
the recognized words. The speech recognition process consists of different steps that
are discussed one by one in the following sections [6]. Speech translation is important because it allows speakers from around the world
to communicate in their own languages, erasing the language gap in global business
and cross-cultural exchanges. Achieving universal speech translation would be of immense
scientific, cultural, and economic importance for humanity. Our project breaks
down the language barrier so that individuals can interact with each other in their
preferred language. Speech recognition systems can be grouped into a number of categories
according to their ability to understand the terms and lists of words they support.
Ideally, the recognition engine would recognize every word spoken by a person, but
in practice the speech recognition engine's efficiency depends on a variety of factors.
The key variables on which a speech recognition engine depends are terminology,
concurrent users, and noisy settings.
$\textbf{Speech Recognition Process:}$ The communication of meaning from one language
(the source) to another language (the target) is translation. Basically, speech recognition
is used for two main reasons. First and foremost, dictation is the conversion of spoken
words into text as a form of speech processing; secondly, control of devices requires
software that enables a person to run various voice applications [3]. The PC sound card generates the corresponding digital representation of audio received
through microphone input. The method of translating the analog signal into digital
form is digitization. Sampling transforms a continuous signal into a discrete signal,
and quantization is the method of approximating a continuous range of values with a
finite set of levels, as sketched below.
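As a minimal illustration of these two steps (not part of the original system), the following Python sketch samples a synthetic 1 kHz tone at 8 kHz and quantizes each sample to 8 bits, roughly what a sound card's analog-to-digital converter does before the samples reach the recognizer; the sampling rate, tone frequency, and bit depth are chosen only for illustration.

import numpy as np

# Sampling: evaluate the "analog" signal only at discrete instants (fs = 8 kHz).
fs = 8000                                 # sampling rate in Hz (illustrative choice)
t = np.arange(0, 0.01, 1.0 / fs)          # 10 ms of discrete sample times
analog = np.sin(2 * np.pi * 1000 * t)     # stand-in for the microphone's analog waveform

# Quantization: approximate each continuous amplitude with one of 2**8 = 256 levels.
bits = 8
levels = 2 ** bits
quantized = np.round((analog + 1.0) / 2.0 * (levels - 1)).astype(np.uint8)

print(quantized[:10])                     # first few 8-bit sample values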
Attention models are input-processing techniques for neural networks that allow the
network to focus on specific aspects of a complex input, one at a time, until the entire
dataset is categorized. The goal is to break down complicated tasks into smaller areas
of attention that are processed sequentially. In broad strokes, attention is expressed
as a function that maps a query and a set of key-value pairs to an output, in which
the query, keys, values, and final output are all vectors. The output is then calculated
as a weighted sum of the values, with the weight assigned to each value given by a
compatibility function of the query with the corresponding key.
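As a small, self-contained sketch of this weighted-sum view (assuming a scaled dot product as the compatibility function, which is our choice here; the model used later in this paper uses an additive form instead), the Python/NumPy code below maps one query and a set of key-value pairs to an output vector.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    # query: (d,), keys: (n, d), values: (n, d_v)
    scores = keys @ query / np.sqrt(query.shape[-1])   # compatibility of the query with each key
    weights = softmax(scores)                          # one weight per key-value pair
    return weights @ values                            # output = weighted sum of the values

# Toy usage: three key-value pairs, four-dimensional vectors.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(attention(q, K, V))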
$\textbf{Neural Machine Translation:}$ This machine translation technique uses
artificial neural networks to predict the probability of a sequence of words, usually
modeling entire sentences in a single integrated model. In recent years, technology
using neural networks has been applied to problems in a variety of ways.
The use of neural machine translation (NMT) in the natural speech processing area
is an example of this. Missing translation is the phenomenon in which text that was
present in the source is missing from the output, in terms of context or word translation.
Neural machine translation is the use of a neural network to learn a mathematical model
for machine translation. The key benefit of the methodology is that a single framework
can be trained directly on the source and target text, which no longer requires the
pipeline of complex systems used in statistical machine translation [5].
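Formally (a standard formulation, not stated in the original paper), a single NMT model with parameters $\theta$ is trained directly on parallel source-target sentence pairs $(x, y)$ by maximizing the conditional log-likelihood of each target word given the source sentence and the preceding target words:

$$\theta^{*} = \arg\max_{\theta} \sum_{(x,\, y)} \sum_{t=1}^{|y|} \log p\left(y_{t} \mid y_{<t}, x; \theta\right),$$

so the whole translation pipeline reduces to one conditional language model over the target sequence given the source.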
$\textbf{· Connected Speech:}$ Connected words, or connected speech, is similar to
isolated-word speech, but allows separate utterances to be spoken together with only
brief pauses between them.
$\textbf{· Continuous Speech:}$ Continuous speech allows the user to speak almost
naturally; it is also called computer dictation.
$\textbf{· Spontaneous Speech:}$ At a basic level, this can be viewed as speech that
is natural-sounding and not rehearsed. An ASR device with spontaneous speech abilities
should be able to accommodate a variety of natural speech features, such as sentences
that run together and include "ums," "ahs," and even slight stutters.
$\textbf{Machine Translation:}$ Machine translation typically models entire sentences
in a single integrated model, using an artificial neural network to predict the sequence
of words. Initially, word sequence modeling was usually carried out using a recurrent
neural network (RNN). Unlike the traditional phrase-based translation method, which
consists of many small subcomponents that are tuned separately, neural machine translation
builds and trains a single, broad neural network that reads a phrase and outputs the
correct translation. Such systems are called end-to-end neural machine translation
because only one model is needed for translation. The transfer of scientific, metaphysical,
literary, commercial, political, and artistic knowledge across linguistic barriers
is an integral and essential component of human endeavor [4].
Translation is more prevalent and available today than ever before. Organizations
with larger budgets may choose to hire a translation company or independent professional
translators to manage all their translation needs; organizations with smaller budgets,
or that deal in subjects that are unknown to many translators, may choose to combine
the services of professional translators.
Fig. 1. Speech Synthesis [2].
Fig. 2. Recognition Process.
2. Literature Review
$\textbf{Mehmet Berkehan Akçay et al.}$ [1] explained that neural networks were mainly limited to industrial control and robotics
applications. However, recent advances in neural networks, through applications such
as intelligent travel, intelligent diagnosis and health monitoring for precision medicine,
robotics and home appliance automation, virtual online support, e-marketing, weather
forecasting, and natural disaster management, among others, have contributed to successful
IS implementations in almost every aspect of human life.
$\textbf{G. Tsontzos et al.}$ [2] clarified how feelings allow us to better understand each other, and a natural consequence
is to expand this understanding to computers. Thanks to smart mobile devices capable
of accepting and responding to voice commands with synthesized speech, speech recognition
is now part of our daily lives. To allow devices to detect our emotions, speech emotion
recognition (SER) could be used.
$\textbf{T. Taleb et al.}$ [7] said they were motivated by the understanding that these standards place upper bounds
on the improvement that can be achieved when using HMMs in speech recognition. In
an attempt to improve robustness, particularly under noisy conditions, new modeling
schemes that can explicitly model time are being explored; this work was partially
funded by the EU-IST FP6 HIWIRE research project. Spatial similarity approaches, including
linear dynamic models (LDMs), were initially proposed for use in speech recognition.
$\textbf{Vinícius Maran et al.}$ [6] explained that learning speech is a dynamic mechanism in which the processing of
phonemes is marked by continuities and discontinuities in the path of the infant towards
the advanced production of ambient language segments and structures.
$\textbf{Y. Wu et al.}$ [3] noted that discriminative training has been used in speech recognition for many years
now. The few organizations that have had the resources to apply discriminative training
to large-scale speech recognition tasks have, in recent years, mostly used the maximum
mutual information (MMI) criterion. Instead, extending the studies first presented,
they reflect on the minimum classification error (MCE) paradigm for discriminative
training.
$\textbf{Peng et al.}$ [4] stated that speaker identification refers to identifying people by their voice.
This technology is increasingly adopted and used as a kind of biometric for its ease
of use and non-interactivity, and it has quickly become a research hotspot in the field
of biometrics.
$\textbf{Shahnawazuddin and Sinha}$ [10] discussed how the work presented was an extension of the current quick adaptation
approaches based on acoustic model interpolation. The basis (model) weights are calculated
in these methods using an iterative process that uses the maximum likelihood (ML)
criterion.
$\textbf{Varghese et al.}$ [5] stated there are many ways to recognize emotions from expression. Many attempts
have been made to identify emotional states from vocal information. To recognize emotions,
some essential voice feature vectors have been selected, from which utterance-level
statistics are measured.
$\textbf{D.J. Atha et al.}$ [9] pointed out that a long-term target is the creation of an automated real-time translation
device where the voice is the source. Recent developments in the area of computational
translation science, however, boost the possibility of widespread adoption in the
near future.
3. Proposed Method
In this research, the work is based on the flowchart below, following the working model
of speech recognition. The models illustrated previously are made up of millions of
parameters, which need to be learned from the training corpus. We make use of additional
information where appropriate, such as text that is closely linked to the speech we
are about to translate [7]. It is possible to write this text in the source language, the target language, or
both.
Future development will bring the most complex intelligent systems based on deep learning
to billions of smartphone users. There is a lengthy list of vision and voice technologies
that can increasingly simplify and assist human visual and auditory processing at
greater scale and consistency, from sensation and emotion detection to the development
of self-driving autonomous transport systems. This paper serves scholars, clinicians,
technology creators, and consumers as an exemplary analysis of emerging technologies
in many fields, such as behavioral science, psychology, transportation, and medicine.
Fig. 3. Working Model of Speech Recognition.
Fig. 4. The Attention Model.
4. Results
Voice detection with a real-time predictive voice translation device, optimized using
multimodal vector sources of information and functionality, was presented. The key
contribution of this work is the manner in which external information input is used
to increase the system's accuracy, thereby allowing a notable improvement over comparable
processes. In addition, a new initiative, realistic from an analytical standpoint,
was launched and discussed. As per our discussion and planning, the desired system
converts Hindi to English and vice versa.
4.1 Initial Test
A limitation of the encoder-decoder model is that it encodes the input sequence into
one fixed-length vector, from which each output time step is decoded. This problem
is believed to be more of a concern with long sequences, and attention is proposed
as a strategy to both align and translate. Instead of encoding the input sequence
into a single fixed-context vector, the attention model produces a context vector
that is filtered independently for each output time step. As with the encoder-decoder
text, the approach is extended to a machine translation problem, and it uses GRU units
rather than LSTM memory cells [9]. In this case, bidirectional input is used: both forward and backward input sequences
are given, and they are concatenated before being passed to the decoder. The input
is fed into an encoder model that gives us the encoder output and the hidden state
of the encoder, as sketched below.
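The following is a minimal sketch of such a bidirectional GRU encoder (assuming TensorFlow 2.x; the vocabulary size, embedding size, and unit counts are illustrative choices, not values taken from the paper). The forward and backward outputs are concatenated per time step, and the two final states are concatenated into one hidden state.

import tensorflow as tf

vocab_size, embedding_dim, units = 5000, 256, 512   # illustrative sizes

encoder_inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(vocab_size, embedding_dim)(encoder_inputs)
# Forward and backward GRUs; their per-time-step outputs are concatenated.
encoder_outputs, fwd_state, bwd_state = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
)(x)
encoder_hidden = tf.keras.layers.Concatenate()([fwd_state, bwd_state])
encoder = tf.keras.Model(encoder_inputs, [encoder_outputs, encoder_hidden])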
Facilitated communication (FC), or supported typing, is a scientifically discredited
technique that attempts to aid communication by people with autism or other communication
disabilities, and who are non-verbal. The facilitator guides the disabled person's
arm or hand, and attempts to help them type on a keyboard or other device.
The calculations applied are:
FC = Fully connected (Dense) layer
EO = Encoder output
H = Hidden state
X = Input to the decoder
And the pseudo-code is:
score = FC(tanh(FC(EO) + FC(H))).
Attention weights = softmax(score, axis = 1). Softmax is applied on the last axis by
default, but here we want to apply it on the first axis, since the shape of score is
(batch size, max length, hidden size). Max length is the length of our input; since
we are attempting to assign a weight to each input, softmax must be applied on that
axis. Context vector = sum(attention weights * EO, axis = 1). The same reasoning as
above applies for choosing axis = 1. Embedding output = the decoder input X passed
through an embedding layer. Merged vector = concat(embedding output, context vector).
A runnable sketch of this attention layer is given below.
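To make the pseudo-code above concrete, the following sketch implements the same score, attention-weight, and context-vector computations as a Keras layer (a minimal sketch assuming TensorFlow 2.x; the names W1, W2, V, and units are our own and are not taken from the paper).

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # FC applied to the encoder output EO
        self.W2 = tf.keras.layers.Dense(units)   # FC applied to the hidden state H
        self.V = tf.keras.layers.Dense(1)         # final FC producing the score

    def call(self, query, values):
        # query (H): (batch, hidden); values (EO): (batch, max_length, hidden)
        query_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(query_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)   # softmax over max_length (axis 1)
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights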
Fig. 5. First Translation from English to Hindi.
Fig. 6. Second Translation from English to Hindi.
Fig. 7. Third Translation from English to Hindi.
4.2 Translate
The evaluation function is similar to the training loop, except that we do not use
teacher forcing here. At each time step, the input to the decoder is its previous
prediction, along with the hidden state and the encoder output. Prediction stops once
the model predicts the end token, and the attention weights are stored for every time
step, as in the sketch below.
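The loop below is a minimal sketch of such an evaluation step in TensorFlow; the encoder, decoder, tokenizers, and the <start>/<end> tokens are assumptions about how the trained model and its preprocessing are exposed, not interfaces defined in this paper.

import tensorflow as tf

def translate(sentence, encoder, decoder, inp_tokenizer, targ_tokenizer, max_length=40):
    # Encode the (already cleaned) input sentence.
    seq = inp_tokenizer.texts_to_sequences([sentence])
    enc_input = tf.convert_to_tensor(seq)
    enc_output, enc_hidden = encoder(enc_input)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_tokenizer.word_index['<start>']], 0)

    result, attention_weights = [], []
    for _ in range(max_length):
        predictions, dec_hidden, attn = decoder(dec_input, dec_hidden, enc_output)
        attention_weights.append(attn)                  # store the weights for this step
        predicted_id = int(tf.argmax(predictions[0]))
        word = targ_tokenizer.index_word[predicted_id]
        if word == '<end>':                             # stop at the end token
            break
        result.append(word)
        dec_input = tf.expand_dims([predicted_id], 0)   # previous prediction becomes next input
    return ' '.join(result), attention_weights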
In a broader sense, a translator is a programming language processor that converts
a computer program from one language to another. It takes a program written in source
code and converts it into machine code, discovering and identifying errors during
translation. It translates a high-level language program into a machine language
program that the central processing unit (CPU) can understand, and it also detects
errors in the program.
5. Conclusion
In the past few years, the complexity and precision of speech recognition applications
have evolved exponentially. This paper extensively explores the recent advancements
in intelligent vision and speech algorithms, their applications on the most popular
smartphones and embedded platforms, and their limitations. In spite of immense advances
in the success and efficacy of deep learning algorithms, training the machine with
other knowledge sources, such as the surrounding framework or context, also contributes
significantly to the task at hand.
6. Future Scopes
This work can be explored in greater depth in order to improve the project and incorporate
new functionality, and it can be worked on further. The current software does not
accommodate a broad vocabulary; in order to maximize productivity, a larger number
of samples needs to be accumulated [10]. Only a few parts of the notepad are covered by the current edition of the app,
but more areas can be covered, and efforts will be made in this respect.
ACKNOWLEDGMENTS
The authors would like to express their sincere thanks to the editor-in-chief for
his valuable suggestions to improve this article.
REFERENCES
Mehmet Berkehan Akçay, Kaya Oğuz, 2020, Speech emotion recognition: Emotional models,
databases, features, preprocessing methods, supporting modalities, and classifiers,
Speech Communication, Vol. 116, ISSN 0167-6393, pp. 56-76
Tsontzos G., Diakoloukas V., Koniaris C., Digalakis V., 2013, Estimation of General
Identifiable Linear Dynamic Models with an Application in Speech Characteristics
Vectors, Computer Standards & Interfaces, ISSN 0920-5489, Vol. 35, No. 5, pp. 490-506
Wu Y., et al., 2016, Google's Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation, arXiv preprint arXiv:1609.08144
Peng Shuping, Lv Tao, Han Xiyu, Wu Shisong, Yan Chunhui, Zhang Heyong, 2019, Remote
speaker recognition based on the enhanced LDV-captured speech, Applied Acoustics,
ISSN 0003-682X, Vol. 143, pp. 165-170
Varghese A. A., Cherian J. P., Kizhakkethottam J. J., 2015, Overview on emotion recognition
system, 2015 International Conference on Soft-Computing and Networks Security (ICSNS)
Coimbatore, pp. 1-5
Maran Vinícius, Keske-Soares Marcia, 2021, Towards a speech therapy support system
based on phonological processes early detection, Computer Speech & Language, ISSN
0885-2308, Vol. 65
Taleb T., Samdanis K., Mada B., Flinck H., Dutta S., Sabella D., 2017, On Multi-Access
Edge Computing: A Survey of the Emerging 5G Network Edge Cloud Architecture and Orchestration,
IEEE Communications Surveys & Tutorials, Vol. 19, No. 3, pp. 1657-1681
Amengual J. C., Castaño A., Castellanos A., Jiménez V.M., Llorens D., Marzal A., Prat
F., Vilar J.M., Benedi J.M., Casacuberta F., Pastor M., Vidal E., 2000, The EuTrans
spoken language translation system, Machine Translation, Vol. 15, pp. 75-103
Atha D. J., Jahanshahi M. R., 2018, Evaluation of deep learning approaches based on
convolutional neural networks for corrosion detection, Struct. Health Monit., Vol.
17, No. 5, pp. 1110-1128
Shahnawazuddin S., Sinha Rohit, 2017, Sparse coding over redundant dictionaries for
fast adaptation of speech recognition system, Computer Speech & Language, ISSN 0885-2308,
Vol. 43, pp. 1-17
Author
Satya Prakash Yadav is currently on the faculty of the Information Technology Department,
ABES Institute of Technology (ABESIT), Ghaziabad (India). A seasoned academician having
more than 13 years of experience, he has published three books (Programming in C,
Programming in C++, and Blockchain and Cryptocurrency) under I.K. International Publishing
House Pvt. Ltd. He has undergone industrial training programs during which he was
involved in live projects with companies in the areas of SAP, Railway Traffic Management
Systems, and Visual Vehicles Counter and Classification (used in the Metro rail network
design). He is an alumnus of Netaji Subhas Institute of Technology (NSIT), Delhi University.
A prolific writer, Mr. Yadav has filed two patents and authored many research papers
in the Web of Science indexed journals. Additionally, he has presented research papers
at many conferences in the areas of image processing and programming, such as Image
Processing, Feature Extraction, and Information Retrieval. He is also a lead editor
with CRC Press, Taylor and Francis Group (U.S.A.), Science Publishing Group (U.S.A.),
and Eureka Journals, Pune (India).
Vineet Vashisht is currently a research scholar in the Information Technology Department
at Dr. A.P.J. Abdul Kalam Technical University, Lucknow. Vineet Vashisht is supervised
by Asst. Prof. Satya Prakash Yadav of the Information Technology Department, ABES Institute
of Technology (ABESIT).
Aditya Kumar Pandey is currently a research scholar in the Information Technology
Department at Dr. A.P.J. Abdul Kalam Technical University, Lucknow. Aditya Kumar Pandey
is supervised by Asst. Prof. Satya Prakash Yadav of the Information Technology Department,
ABES Institute of Technology (ABESIT).