Wideband Speech Codec Algorithm based on Compressed Sensing and Fractional Calculus
Wang Xiuhuan
(College of General Education, Chongqing Vocational and Technical University of Mechatronics,
Chongqing, 402760, China
wang_xiuhuan0@163.com
)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Calculus, Broadband voice, Encoding and decoding, Compressed sensing, Wavelet transform, Speech synthesis
1. Introduction
With the continuous development of information technology, voice communication plays
an important role in our daily lives [1]. From telephone communication to modern applications such as voice chat, voice assistants,
and speech recognition, voice communication is one of the most important ways for
people to communicate and interact. However, speech coding and decoding algorithms
face a number of challenges under the requirements of high quality, high efficiency,
and low latency.
Speech coding and decoding algorithms typically use digital signal processing and
compression techniques to reduce the amount of data in speech signals and achieve
efficient transmission and storage [2]. However, these algorithms have several limitations when processing speech signals.
First, they typically require a large amount of computational resources and storage
space, resulting in insufficient real-time and practicality [3]. Second, they have limited accuracy in recovering speech signals and their ability
to preserve details, so they cannot meet users' needs for high-quality speech. In
addition, the algorithms have poor processing performance for non-stationary signals,
which limits their application in practical environments.
In order to overcome the limitations of speech coding and decoding algorithms, new
technologies such as compressed sensing and fractional calculus have been introduced
in the field of speech signal processing in recent years. Compressed sensing is a
signal processing technique based on sparse representation and compressed sampling
that can achieve efficient representation and recovery of signals through compressed
sampling and reconstruction of signals. Fractional calculus is a mathematical tool
that generalizes differentiation and integration from integer orders to arbitrary
real orders and can process non-stationary signals more accurately and flexibly [4].
A wideband speech codec algorithm based on compressed sensing and fractional
calculus has been a focus of recent research. Such an algorithm achieves
high-quality encoding and decoding of speech signals through sparse representation
and compressed sampling of wideband speech signals and then reconstructs and restores
the speech signals through a fractional calculus algorithm.
One speech signal encoding and decoding algorithm is based on predictive coding and
vector coding, combining linear predictive coding, Self-Organizing Map (SOM) neural
network vector coding, and Huffman coding [5]. Speech signal encoding and decoding experiments were carried out with MATLAB. The
experimental results showed that the algorithm retains the good decoded speech quality
and algorithmic simplicity of the waveform coding algorithm. It also had
a smaller compression rate than the general waveform coding algorithm. The coding
rate of the algorithm reached 12.8 kbit/s, but the algorithm had the problem
of long coding and decoding delay.
Another distributed speech codec algorithm was based on the ITU G.722.1 speech encoder
[6]. The algorithm used a complementary encoder based on the G.722.1 encoder. At the
encoding end, the same frame of speech was encoded with the G.722.1 encoder and its
complementary encoder. At the decoding end, when receiving any of the speech code
streams, the G.722.1 decoder was used for decoding, and the speech quality was not
lower than that of the G.722.1 encoder.
When two speech code streams were received, the G.722.1 decoder was used to decode
each of the two speech code streams and then process the decoding results together.
The final speech quality was significantly improved (i.e., there was a certain coding
gain). Simulation results showed a clear anti-packet-loss effect of the algorithm,
but there was a problem of low coding and decoding rate.
A high-throughput implementation method of speech coding and decoding algorithm was
designed through an analysis of the G.729 speech coding algorithm and the characteristics
of real-time speech processing [7]. Focusing on the calculation process, filtering process, and reading characteristics
of SRAM in hardware design, a high-throughput speech encoding and decoding algorithm
was achieved by combining parallel and pipeline structures. The clock cycle required
to complete the same calculation was only 1/68 of the cycles required by the optimized
DSP design.
A minimum mean square error algorithm based on calculus was proposed [8]. This algorithm adopted a new concept of parameter-free error-related energy and
signal normalization and ensured high convergence, stability, and low steady-state
error. It can also automatically adjust the learning rate according to the error and
showed good performance in a signal enhancement experiment. A signal enhancement technique
based on fractional calculus was proposed to solve the problem of noise affecting
speech signal processing [9]. The results showed that it can achieve noise-free and fast processing.
In summary, current methods mainly utilize signal decomposition to achieve enhancement,
but their application in speech encoding and decoding still faces some challenges.
First, the compressed sensing algorithm makes certain assumptions about the sparsity
of the signal, but the sparsity of the speech signal in the time and frequency domains
is not significant, resulting in an unsatisfactory effect in speech coding and decoding.
Second, the compressed sensing algorithm needs to sparsely represent the signal,
which may introduce additional distortion and affect the quality of speech.
In order to overcome these problems, this paper discusses how to use a compressed
sensing algorithm to achieve efficient encoding and decoding of speech signals. Fractional
calculus is also introduced to improve the signal sparsity and quality of speech encoding
and decoding. At the same time, the feasibility and limitations of this algorithm
were studied in practical applications, and its potential application value in fields
such as speech communication and speech storage was explored. The results could contribute
to improving the development of broadband speech coding and decoding and the efficiency
and quality of speech communication and storage. It could also support the widespread
application of speech information in internet and multimedia applications. At the
same time, this research could also provide new ideas and references for the application
of compressed sensing and fractional calculus in other fields.
2. Broadband Voice Processing
2.1 Compressed Sensing of Broadband Speech Signal
To improve the encoding and decoding rate and reduce the delay of speech encoding
and decoding, compressed sensing theory [10] is applied to the encoding and decoding of speech signals. This method can improve
the encoding and decoding efficiency and provides a theoretical basis for studying
speech encoding and decoding methods with lower complexity. In compressed sensing
theory, an important premise is that the signal has some sparsity, meaning that most
elements in the signal are 0 or close to 0, while a few elements are non-zero [11]. However, in the real world, not all signals are sparse.
Therefore, such a signal can be expressed sparsely in a transform domain that
meets the requirements of compressed sensing theory. The premise of this transformation
is a sparse basis, which can be described mathematically as:

$\phi =ws$

where $w$ represents an $N\times N$ sparse transformation matrix (generally a wavelet
basis), $s$ represents the original signal, and $\phi $ represents the sparse
coefficients after the transformation. The wavelet transform
is a well-known sparse representation method [12]. Each wavelet transform operation divides the signal into two parts: a low-frequency
part and a high-frequency part. The low-frequency coefficient often represents the
main information of the signal, while the high-frequency coefficient determines the
detailed information of the signal. The low-frequency coefficients play an extremely
important role in speech signal reconstruction [13].
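To illustrate the sparsity premise, the following sketch applies one level of a Haar wavelet transform (an illustrative choice of basis, not necessarily the one used in this paper) to a smooth test signal and measures how much of the energy falls into the low-frequency coefficients:

```python
import numpy as np

def haar_step(x):
    """One level of the Haar wavelet transform: split a signal into
    low-frequency (approximation) and high-frequency (detail) halves."""
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)    # approximation coefficients
    high = (even - odd) / np.sqrt(2.0)   # detail coefficients
    return low, high

# A smooth test signal standing in for one frame of speech.
n = 1024
t = np.arange(n) / 8000.0                 # 8-kHz sampling, as in the paper
x = np.sin(2 * np.pi * 200 * t) * np.hanning(n)

low, high = haar_step(x)
# Most of the energy lands in the low-frequency coefficients,
# which is the sparsity premise compressed sensing relies on.
energy_low = np.sum(low ** 2)
energy_high = np.sum(high ** 2)
print(energy_low / (energy_low + energy_high))  # close to 1.0
```

Because the detail coefficients are small, most of them can be discarded or coarsely quantized, which is exactly the kind of approximate sparsity the measurement step exploits.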
The reconstruction methods of compressed sensing are mainly divided into convex optimization
methods and greedy methods. Convex optimization methods are represented by the basis
pursuit method. The signal recovery effect of a convex optimization method is good,
but the computational complexity is high, resulting in slow reconstruction. Greedy
algorithms [14] are represented by the orthogonal matching pursuit algorithm. A greedy algorithm has
a fast recovery time, but the reconstruction quality is lower.
Considering their advantages and disadvantages and the requirements of speech codecs
for coding quality and delay, a compressed sensing reconstruction method called the
``in-crowd'' method was selected. This method can greatly reduce the reconstruction
time and improve the recovery efficiency by considering multiple non-zero elements
at a time instead of a single non-zero element at a time. The algorithm steps are as follows:
Step 1: Set the initial $s_{0}$ as an $n\times 1$ zero vector and the residual $R=h-Qs_{0}$.
Step 2: Set the active set $U$ as an empty set.
Step 3: For each $k$ in the complement $U^{c}$ of $U$, compute the usefulness $u_{k}=\left| \langle R,Q_{k}\rangle \right| $,
where $Q_{k}$ is the $k$-th column of $Q$.
Step 4: If there is no $u_{k}> \delta $ in the complement $U^{c}$, the program
terminates.
Step 5: Otherwise, add the components with the largest values of $u_{k}$ to the set
$U$, but do not add any component with $u_{k}< \delta $.
Step 6: Obtain the solution in the subspace composed of all components in $U$ and
use the current value of $s_{0}$ to ``warm start'' the solver.
Step 7: Remove the zero-valued elements of the exact solution obtained in step 6 from
$U$.
Step 8: Set all components of $s_{0}$ to 0 except those in $U$, and set the components
in $U$ to the values found by the exact solution from step 6.
Step 9: Update the residual $R=h-Qs_{0}$ and return to step 3. Note that since $s_{j}=0$
for all $j$ in $U^{c}$, the subproblem in step 6 only involves the components in $U$.
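As a minimal sketch of the steps above, the following code replaces the exact subproblem of step 6 with an ordinary least-squares solve on the active columns (an assumption made here for brevity; the actual in-crowd method solves a basis-pursuit subproblem with a warm start):

```python
import numpy as np

def in_crowd_sketch(Q, h, delta=1e-3, batch=5, max_iter=50):
    """Simplified sketch of the in-crowd steps described above."""
    m, n = Q.shape
    s = np.zeros(n)                       # step 1: s0 = 0
    active = np.zeros(n, dtype=bool)      # step 2: U = empty set
    for _ in range(max_iter):
        R = h - Q @ s                     # residual (steps 1/9)
        u = np.abs(Q.T @ R)               # step 3: usefulness of each column
        u[active] = 0.0                   # only look outside the active set
        candidates = np.flatnonzero(u > delta)
        if candidates.size == 0:          # step 4: nothing useful left
            break
        # step 5: add the most useful components (several at a time)
        best = candidates[np.argsort(u[candidates])[::-1][:batch]]
        active[best] = True
        # step 6 (simplified): least squares on the active columns
        idx = np.flatnonzero(active)
        coef, *_ = np.linalg.lstsq(Q[:, idx], h, rcond=None)
        # steps 7/8: drop (near-)zero components, keep the rest
        keep = np.abs(coef) > 1e-12
        active[:] = False
        active[idx[keep]] = True
        s[:] = 0.0
        s[idx[keep]] = coef[keep]
    return s

# Recover a sparse vector from compressed measurements.
rng = np.random.default_rng(0)
n, m, k = 64, 32, 4
Q = rng.standard_normal((m, n)) / np.sqrt(m)
s_true = np.zeros(n)
s_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
h = Q @ s_true
s_hat = in_crowd_sketch(Q, h)
```

Because several non-zero components are admitted per iteration, far fewer residual updates are needed than in a one-component-at-a-time greedy search, which is the source of the speed-up claimed above.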
2.2 Wavelet Reconstruction of Broadband Speech Signal
Before wideband speech coding and decoding, a lifting wavelet module is used to divide
the compressed-sensed wideband speech signal into sub-bands, and then the signal in
each sub-band is linearly predicted. The signal correlation within a sub-band is more
prominent, so the linear prediction result is more accurate, which improves the
compression performance for the whole wideband speech signal.
Any wavelet transform can be realized by cascading a finite number of lifting steps. In addition,
the implementation framework of the lifting transform can be realized by the integer
wavelet transform [15]. Fig. 1 shows the basic steps of the lifting wavelet transform: splitting, prediction, and updating.
Fig. 1. Wavelet decomposition scheme.
As shown in Fig. 1, the separation module separates $y\left[n\right]$ into an odd sequence $y_{0}\left[n\right]=y\left[2n+1\right]$
and an even sequence $y\left[2n\right]$. The prediction module uses the correlation
between the odd and even sequences to estimate the odd sequence from the even sequence, thus
removing the redundancy between the data and preserving the high-frequency details
of the signal:

$g\left[n\right]=y_{0}\left[n\right]-\varepsilon \left(y\left[2n\right]\right)$

where $g\left[n\right]$ represents the prediction residual, and $\varepsilon $ represents
the prediction operator.
The update module obtains the low-frequency information of $y\left[n\right]$: the
prediction residual $g\left[n\right]$ is weighted by the update operator $E$ and
added to the even sequence $y\left[2n\right]$:

$c\left[n\right]=y\left[2n\right]+E\left(g\left[n\right]\right)$

where $c\left[n\right]$ is the low-frequency (approximation) sequence. These equations
form the decomposition algorithm of the lifting scheme; the corresponding reconstruction
algorithm inverts each module in Fig. 1, as shown in Fig. 2.
Fig. 2. Wavelet reconstruction scheme.
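The split, predict, and update steps, together with their exact inversion in Fig. 2, can be sketched with the simplest Haar-style predict and update operators (an illustrative assumption, not the codec's actual lifting filters):

```python
import numpy as np

def lifting_forward(y):
    """One lifting step (Haar flavour): split, predict, update."""
    even, odd = y[0::2].copy(), y[1::2].copy()
    g = odd - even            # predict: residual = odd minus estimate from even
    low = even + g / 2.0      # update: low-frequency part of the signal
    return low, g

def lifting_inverse(low, g):
    """Exact inverse: undo the update, undo the prediction, then merge."""
    even = low - g / 2.0
    odd = g + even
    y = np.empty(low.size + g.size)
    y[0::2], y[1::2] = even, odd
    return y

y = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 34.0])
low, g = lifting_forward(y)
y_rec = lifting_inverse(low, g)
print(np.allclose(y_rec, y))  # True: lifting steps are perfectly invertible
```

Whatever predict and update operators are chosen, each step can be undone by subtracting what was added, which is why the lifting scheme is guaranteed to be perfectly invertible.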
2.3 Speech Signal Enhancement based on Fractional Calculus
The fractional differential operator can enhance the signal in the high-frequency
range, and the larger the fractional order is, the stronger the nonlinear enhancement
ability will be. Therefore, the fractional differential can not only provide denoising
but also ensure that the speech signal is not affected by the filter.
Most speech enhancement techniques based on fractional calculus perform numerical
operations in the Fourier transform domain, where the change of the calculus order
with frequency can be observed more intuitively.
For any square-integrable real function $l\left(t\right)$, the Fourier transform is
expressed as:

$L\left(\omega \right)=\int _{-\infty }^{+\infty }l\left(t\right)e^{-j\omega t}dt$

The corresponding inverse transformation is expressed as:

$l\left(t\right)=\frac{1}{2\pi }\int _{-\infty }^{+\infty }L\left(\omega \right)e^{j\omega t}d\omega $

The general real $\xi $-order differential of $l\left(t\right)$ can be expressed as:

$D^{\xi }l\left(t\right)=\frac{d^{\xi }l\left(t\right)}{dt^{\xi }}$

According to Eq. (6), the form of $D^{\xi }l\left(t\right)$ in the Fourier transform domain can be obtained
as follows:

$D^{\xi }l\left(t\right)=\frac{1}{2\pi }\int _{-\infty }^{+\infty }\left(j\omega \right)^{\xi }L\left(\omega \right)e^{j\omega t}d\omega $

The derivative of the signal in the frequency domain follows from the basics of
signal processing. The Fourier transform of the fractional derivative is:

$FT\left[D^{\xi }l\left(t\right)\right]=\left(j\omega \right)^{\xi }L\left(\omega \right)=A^{\xi }L\left(\omega \right)$

where $A^{\xi }=\left(j\omega \right)^{\xi }$ represents the $\xi $-order differential
operator in the frequency domain. Through the transformation and inverse transformation,
the noise in speech can be effectively suppressed, and the speech signal can be effectively enhanced.
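A minimal sketch of this frequency-domain fractional differentiation, with an illustrative test tone and order:

```python
import numpy as np

def fractional_derivative(x, xi, fs):
    """Compute the xi-order derivative of a real signal in the Fourier
    domain by multiplying its spectrum with (j*omega)**xi, following
    the equations above."""
    n = x.size
    X = np.fft.rfft(x)
    omega = 2 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs)
    H = np.zeros(omega.size, dtype=complex)
    H[omega > 0] = (1j * omega[omega > 0]) ** xi  # DC term maps to zero
    return np.fft.irfft(X * H, n)

fs = 8000.0                       # 8-kHz sampling, as in the experiments
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 125 * t)   # 125 Hz: an integer number of periods
y = fractional_derivative(x, 0.5, fs)
# For a pure tone, the 0.5-order derivative scales the amplitude by
# omega**0.5 and shifts the phase by pi/4.
```

Using the real FFT keeps the spectrum Hermitian, so the output is real, and the gain $\omega ^{\xi }$ grows more gently with frequency for $0<\xi <1$ than for a full first-order derivative, which is the nonlinear high-frequency enhancement described above.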
3. Wideband Speech Coding and Decoding Algorithm
3.1 Random Excitation Linear Prediction Synthesis Model
The basic idea of the random excitation linear prediction synthesis model is to use
a signal to excite two time-varying linear recursive filters. There is a predictor
for each filter feedback loop, and one of them is a long-term predictor (or pitch
predictor) $V\left(z\right)$. It is used to generate the pitch structure (the fine
structure of the spectrum) of voiced speech. The other is a short-term predictor $C\left(z\right)$,
which is used to recover the short-term spectral envelope of speech. The linear prediction
model with random excitation is derived from its inverse process. The normalized residual
signal obtained by two-level prediction approximately follows a standard normal distribution.
In general, the short-time predictor transfer function [16] is expressed as:

$C\left(z\right)=\sum _{i=1}^{r}\omega _{i}z^{-i}$

where $\omega _{i}$ represents the predictor coefficients, and $r$ represents the order
of the predictor, which is generally between 8 and 16. At the receiving end, the transfer
function of the short-term synthesis filter is:

$H\left(z\right)=\frac{1}{1-C\left(z\right)}$

where $1-C\left(z\right)$ is the linear prediction error filter. The predictor
coefficients $\omega _{i}$ are generally corrected every 20 to 30 ms.
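The predictor coefficients $\omega _{i}$ can be estimated frame by frame with the autocorrelation method. The sketch below uses a synthetic second-order autoregressive signal in place of real speech (an illustrative assumption), with a 20-ms frame as mentioned above:

```python
import numpy as np

def lpc_autocorr(x, order):
    """Estimate short-term predictor coefficients with the
    autocorrelation method by solving the Yule-Walker equations."""
    r = np.array([np.dot(x[:x.size - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# One 20-ms frame at 8 kHz (160 samples) of a synthetic AR(2) process
# standing in for speech.
rng = np.random.default_rng(1)
x = np.zeros(160)
e = rng.standard_normal(160)
for n in range(2, 160):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + 0.1 * e[n]

a = lpc_autocorr(x, order=2)
print(a)  # close to the true AR coefficients [1.3, -0.6]
```

The recovered coefficients drive both the analysis filter $1-C\left(z\right)$ at the encoder and the synthesis filter at the decoder.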
The transfer function of the pitch predictor is:

$V\left(z\right)=Dz^{-\eta }$

where $\eta $ represents the pitch delay, and $D$ represents the pitch predictor coefficient.
Generally, $D$ is corrected together with $\eta $, and the correction rate is usually
higher than that of the short-term predictor coefficients; it is typically corrected
every 5 to 10 ms. Based on the speech synthesis model in Fig. 3, the transfer function of the pitch synthesis filter is:

$\frac{1}{1-V\left(z\right)}=\frac{1}{1-Dz^{-\eta }}$
The excitation parameter optimization process of the model in Fig. 3 uses the perceptually weighted minimum mean square error criterion instead of the ordinary
minimum mean square error criterion. The reason is that at a low bit rate, the average
number of bits allocated to each speech sample is usually less than 1, which makes
it very difficult to accurately match the speech waveform. Therefore, the weighted
criterion was adopted to avoid this situation and improve the quality of speech synthesis [17].
Fig. 3. Speech synthesis model.
3.2 Speech Codec
A codebook is a table that stores a series of fixed-length vectors, or code words.
In the encoding process, the input signal is represented by finding the code word
that best matches the input speech signal. When decoding, the corresponding code word
is retrieved from the codebook, and the original speech signal is recovered. According
to the speech state and the actual codec rate, the structure and bit allocation
of the codebook can be adjusted. Structure refers to the internal organization
of the codebook, such as the number of code words and the vector length. Bit allocation
refers to the number of bits assigned to the codebook, which determines the precision
and representational power of the code words. According to specific needs, the structure
and bit allocation of the codebook can be adjusted to better adapt to different
speech conditions and codec rates. Embedded codec refers to the technology of embedding
the encoding and decoding process into a device or system. In the embedded codec of
wideband speech, codebook pulses can be added layer by layer to assist the codec
process and achieve higher-quality audio transmission. These pulses serve as additional
information to enhance the encoding of the original speech signal.
In the process of encoding and decoding, the searches for pulses are not independent
of each other; they have an embedding and inclusion relationship. The pulses of each
stage are obtained by retaining those of the previous stage and searching for several
additional pulses that give more detail for the updated error signal. In addition, the
algorithm updates the adaptive codebook and synthesis filter states of each rate independently
to ensure the synchronization between the encoder and decoder. Fig. 4 shows the principle block diagram of the speech codec.
Fig. 4. Principle of speech codec.
As shown in Fig. 4, the process of the wideband speech codec is mainly based on the CELP coding model,
including ① preprocessing, ② short-term linear prediction, ③ adaptive codebook search,
and ④ generation codebook search in a core layer and enhancement layers. The codebook
gains include the adaptive codebook gain and the generation codebook gains of the
core layer, enhancement layer 1, and enhancement layer 2.
Broadband speech is obtained by passing the excitation signal through a synthesis
filter based on the short-term predictor coefficients. The transfer function of the
synthesis filter is:

$\frac{1}{\theta \left(z\right)}=\frac{1}{1-\sum _{i=1}^{j}\mu _{i}z^{-i}}$

where $\mu _{i}$ represents the linear prediction coefficients, $j$ represents the
prediction order, and $\theta \left(z\right)$ is obtained by linear prediction analysis.
The excitation signal is formed by the weighted sum of the adaptive codebook excitation
vector and the generation codebook excitation vector.
The generation codebook excitation vector can be divided into a core layer generation
codebook excitation vector and an enhanced generation codebook excitation vector.
The weight is the gain of each codebook. Each codebook and its gain are determined
by the synthesis analysis process by minimizing the perceptually weighted mean square
error between the input speech and the synthetic speech. The weighted transfer function
of the perceptual filter is:

$W\left(z\right)=\frac{\theta \left(z/e_{1}\right)}{\theta \left(z/e_{2}\right)}$

where $e_{1}$ and $e_{2}$ control the broadening of the spectral dynamic range from low
frequency to high frequency in broadband speech coding. The weighted
synthesis filter uses the quantized LP coefficients, and the excitation vector is
obtained layer by layer: first the adaptive codebook is searched, and then the
generation codebooks of the core layer, enhancement layer 1, and enhancement layer 2.
In speech codecs, the parameters
that need to be quantized and transmitted include the linear prediction coefficients,
the optimal pitch delay, the adaptive codebook filter flag, the core and enhancement
layer generation codebook indices, the core layer generation codebook gain, the adaptive
codebook gain quantization index, and the enhancement layer generation codebook gain
ratio quantization index [18].
The excitation vector of the adaptive codebook, the generation codebook excitation
vectors, and their respective gains are decoded according to the decoding rate. The excitation
vector of the adaptive codebook is multiplied by the corresponding gain to obtain
a pulse-like excitation vector. The generation codebook vectors of each layer are multiplied
by their respective gains to obtain the noise-like excitation vectors of each layer.
Finally, the pulse-like excitation vector and the noise-like excitation vectors are
added, and the reconstructed synthetic speech is obtained through the synthesis
filter. Additionally, applying a series of post-processing operations to the excitation
signal before synthesis can improve the quality of the synthetic speech [19].
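The decoding steps above (scale each codebook vector by its gain, sum the results, and pass them through the synthesis filter) can be sketched as follows; the gains, vectors, and single-coefficient filter are illustrative values only:

```python
import numpy as np

def synthesize(exc, a):
    """Pass an excitation frame through the all-pole synthesis filter
    1 / (1 - sum_i a_i z^-i) by direct recursion."""
    y = np.zeros_like(exc)
    for n in range(exc.size):
        y[n] = exc[n] + sum(a[i] * y[n - 1 - i] for i in range(min(a.size, n)))
    return y

g_adaptive, g_fixed = 0.8, 0.5               # decoded gains (example values)
v_adaptive = np.ones(40) * 0.1               # pulse-like adaptive-codebook vector
v_fixed = np.zeros(40); v_fixed[::10] = 1.0  # sparse codebook pulses
exc = g_adaptive * v_adaptive + g_fixed * v_fixed
speech = synthesize(exc, np.array([0.5]))    # one-coefficient filter for brevity
```

In a layered codec, each enhancement layer simply contributes one more gain-scaled vector to `exc` before the single synthesis-filter pass, which is why layers can be dropped without resynchronizing the decoder.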
4. Experiment with Speech Encoder based on Calculus
Experimental analysis was carried out to verify the effectiveness of the proposed
broadband speech coding and decoding algorithm. During the experiment, Kankanahalli's
algorithm [20], a speech signal codec algorithm based on predictive coding and vector coding, and
a speech codec algorithm based on the ITU G.722.1 speech encoder were used for comparison
with the proposed algorithm.
4.1 Experimental Environment
The speech codec was programmed in C using MATLAB and VC++ 6.0, and the test speech
was sp\_bchmk.pcm. The test material included four voice signals: a male voice speaking Chinese,
a female voice speaking Chinese, a male voice speaking English, and a female voice
speaking English. In the experiment, the input was a 64-kbit/s speech sample signal
quantized with 8-kHz sampling and 8 bits per sample. The length of the main frame was
35 ms, and each main frame was divided into four subframes for processing. The maximum
pitch delay was 147 samples, and the minimum was 20 samples. Fig. 5 shows the time-domain waveform of the original speech.
Fig. 5. Time-domain waveform of original voice.
4.2 Evaluation Criteria of Speech Codec Performance
The basic goal of a speech codec is to reconstruct the speech synthesis with the highest
possible quality at the lowest possible coding rate. At the same time, the codec delay
and algorithm complexity should be minimized. Therefore, the three factors of coding
rate, speech quality evaluation, and codec delay are naturally basic indicators for
evaluating the performance of a speech codec algorithm. These three factors are closely
related, so when evaluating the advantages and disadvantages of a speech coding algorithm,
all three must be considered together according to the actual situation:
(1) The coding rate directly reflects the degree of compression of voice information
by voice coding. The coding rate can be measured in bit/s and represents the total
coding rate.
(2) Speech quality is the most widely used indicator based on the mean opinion score
(MOS) evaluation method. Table 1 lists the MOS standards and corresponding voice-quality levels.
As shown in Table 1, speech quality is generally considered high when the MOS is in the range of 4.0-5.0.
When the MOS is about 3.5, the speech quality may be perceived as degraded, but this
does not interfere with normal listening and can meet the requirements for speech
use. If the MOS is below 3.0, the speech is generally clear enough, but the naturalness
is insufficient. If the MOS is below 2.0, the speech has serious distortion
and is difficult to understand when listening to it.
(3) Codec delay is generally expressed as the time required for a single encoding-decoding
pass. The higher the delay, the lower the efficiency of the speech codec.
Table 1. MOS criteria.

| Quality grade | Score | Specific description |
|---|---|---|
| Excellent | 5 | When listening to the sound, one can completely relax without paying attention, and the sound is not distorted |
| Good | 4 | Listening to the sound requires attention, but there is no need to be too focused, as the sound may appear slightly distorted |
| Fair | 3 | Listening to the voice requires moderate attention, and voice distortion can be observed |
| Poor | 2 | When listening to the sound, attention must be paid, and distortion can be clearly detected |
| Bad | 1 | It is difficult to understand the voice, and the distortion problem is serious |
4.3 Application Effect Analysis of the Speech Encoder based on Calculus
(1) Codec rate
The codec rate was used as an experimental index, and the speech codec effects of
the four algorithms were compared. The results are shown in Fig. 6. As the number of iterations increases, the speech codec rates
of the different algorithms show a decreasing trend. When the number of iterations is
less than 3, the decrease in the codec rate of the four algorithms is not significant,
but when the number of iterations is greater than 3, the codec rate of the four
algorithms shows a significant downward trend.
The analysis shows that the codec rate of the proposed algorithm is always higher
than those of the speech signal codec algorithm based on predictive coding and vector
coding and the speech codec algorithm based on the ITU G.722.1 speech encoder. Its
lowest codec rate was 6.9 bits/s, while the lowest codec rates of the other three
algorithms were 4.9 bits/s, 4.5 bits/s, and 4.3 bits/s, respectively. It can be seen
that the speech codec rate of the proposed algorithm is higher, and the effect is
better.
Fig. 6. Comparison results of encoding and decoding rates.
(2) Speech quality evaluation
Six different speech segments were selected for the experiments, and the four algorithms
were used to encode and decode them. The quality of the speech encoding and decoding
of the four algorithms was evaluated. The evaluation results are shown in
Table 2.
According to the data in Table 2, the MOSs of the proposed algorithm were all higher than 4.0. The maximum MOS of the
speech signal encoding and decoding algorithm based on predictive coding and vector
coding was 3.76, that of the speech encoding and decoding algorithm based on the
ITU G.722.1 speech encoder was 4.02, and that of Kankanahalli's algorithm [20] was 4.13. The comparison shows that the speech quality evaluation score of the proposed
algorithm was higher, which shows that it can meet usage requirements with high
naturalness and fluency.
Table 2. Speech quality evaluation results.

| Speech segment | MOS, proposed algorithm | MOS, predictive coding and vector coding | MOS, ITU G.722.1 voice encoder | MOS, Kankanahalli's algorithm [20] |
|---|---|---|---|---|
| 1 | 4.73 | 3.34 | 2.51 | 3.67 |
| 2 | 4.80 | 3.69 | 3.27 | 3.52 |
| 3 | 4.36 | 3.52 | 2.99 | 3.34 |
| 4 | 4.59 | 3.07 | 3.46 | 3.16 |
| 5 | 4.76 | 3.65 | 4.02 | 4.13 |
| 6 | 4.21 | 3.76 | 3.94 | 3.92 |
(3) Codec delay
Using the codec delay as the experimental index, the speech codec effects of the four
algorithms were compared, and the results are shown in Fig. 7. The encoding and decoding delay of the proposed algorithm was
always less than 2.0 s, while those of the other three algorithms showed a significant
upward trend after the number of iterations reached 3. The test results show that
with the proposed algorithm, the encoding and decoding processing efficiency
was significantly improved, the required computing resources were reduced, real-time
requirements can be met, and the computing overhead can be reduced.
Fig. 7. Comparison results of codec delay.
(4) Time-domain speech waveform
The four algorithms were used to encode and decode a specific section of speech, and
the processed time-domain speech waveforms were compared to the original time-domain
waveform (Fig. 5). The comparison results are shown in Fig. 8. The proposed speech codec algorithm was better at restoring the speech information.
The synthesized speech envelope of this algorithm was relatively close to that of
the original speech signal. Compared with the original speech, the speech spectrum
was more consistent with the pitch period and formant structure and could more
realistically restore the speech information. The other algorithms could not accurately
restore the original speech. Therefore, the speech codec effect of the proposed algorithm
is better.
Fig. 8. Comparison results of speech time domain waveform.
5. Conclusion
The focus of this paper was a wideband speech codec algorithm based on compressed
sensing and fractional calculus and its application in voice communication. Compressed
sensor processing and speech signal sub-band processing were used to improve the performance
of speech compression, and then fractional calculus was used to achieve noise suppression.
By adding codebook pulses layer by layer, the embedded encoding and decoding of
wideband speech was completed. The experimental results are summarized below:
(1) In the codec rate experiment, the lowest coding and decoding rate of the algorithm
was 6.9 bits/s, which was better than those of the other three methods. The speech
codec rate was higher, and the effect was better.
(2) In speech quality evaluation, the maximum MOS of Kankanahalli’s algorithm [20] was 4.13, that of the speech coding and decoding algorithm based on the ITU G.722.1
speech encoder was 4.02, and that of the speech signal coding and decoding algorithm
based on predictive coding and vector coding was 3.76. The MOS of the proposed method
was higher than 4.0, and the result had high naturalness and fluency. By using fractional
calculus algorithms to reconstruct speech signals, the details and characteristics
of the original speech signal can be more accurately restored. Compared with other
integer-order calculus algorithms, fractional-order calculus algorithms have better
adaptability and flexibility in processing non-stationary signals. Therefore, this
algorithm can provide higher-quality speech signals and improve the user's listening
experience.
(3) In the comparison of the codec delay, the encoding and decoding delay of the proposed
algorithm was always below 2.0 s, which greatly improves the processing efficiency
of the codec and reduces the computational cost to a greater extent.
Future research directions could include the following aspects. The complexity and
computational efficiency of the algorithms could be optimized to improve their practicality
and feasibility. The application of the algorithm in complex situations, such as multi-speaker
speech processing and speech recognition, could be explored and improved. In addition,
combination with other advanced signal processing algorithms, such as deep learning
and neural networks, could improve the performance and effectiveness of wideband speech
coding and decoding algorithms. The wideband speech codec algorithm based on compressed
sensing and fractional calculus has broad application prospects in the field of speech
communication. This study has provided a theoretical basis and experimental verification
for the research and application of the algorithm. It could also provide a reference
for related research and engineering applications.
ACKNOWLEDGMENTS
The research was supported by “Exploration and practice of integrating intelligent
classroom into modularized teaching of higher mathematics in the context of curriculum
thinking” (No. Z213009).
REFERENCES
[1] V. Parthasaradi, P. Kailasapathi, ``A novel MFCC-NN learning model for voice communication through Li-Fi for motion control of a robotic vehicle,'' Soft Computing, 2019, vol. 23, no. 18, pp. 8651-8660.
[2] K. A. Berg, J. H. Noble, B. Dawant, R. Dwyer, R. Labadie, R. H. Gifford, ``Effect of number of channels and speech coding strategy on speech recognition in mid-scala electrode recipients,'' The Journal of the Acoustical Society of America, 2019, vol. 145, no. 3, pp. 1796-1797.
[3] T. Bentsen, S. J. Mauger, A. A. Kressner, T. May, T. Dau, ``The impact of noise power estimation on speech intelligibility in cochlear-implant speech coding strategies,'' The Journal of the Acoustical Society of America, 2019, vol. 145, no. 2, pp. 818-821.
[4] S. Parida, M. G. Heinz, ``Effects of noise-induced hearing loss on speech-in-noise envelope coding: Inferences from single-unit and non-invasive measures in animals,'' The Journal of the Acoustical Society of America, 2019, vol. 145, no. 3, pp. 1716-1716.
[5] C. Yang, Y. F. Liu, X. X. Xu, H. Zhu, C. H. Liu, ``Speech signal encoding algorithm based on predictive coding and vector coding,'' Modern Electronics Technique, 2018, vol. 41, no. 24, pp. 128-131.
[6] Y. N. He, Z. Chen, F. L. Yin, ``Distributed Speech Coding Based on G.722.1 Codec,'' Journal of Signal Processing, 2020, vol. 36, no. 6, pp. 894-901.
[7] J. Zhou, Y. Q. Xue, Y. Zhan, J. H. Jiang, ``High throughput implementation of a speech coding algorithm,'' Microelectronics & Computer, 2020, vol. 37, no. 3, pp. 9-13.
[8] A. Sadiq, S. Khan, I. Naseem, R. Togneri, M. Bennamoun, ``Enhanced q-least mean square,'' Circuits, Systems, and Signal Processing, 2019, vol. 38, no. 10, pp. 4817-4839.
[9] R. Ram, M. N. Mohanty, ``Application of fractional calculus in speech enhancement: a novel approach,'' 2018 2nd International Conference on Data Science and Business Analytics (ICDSBA), 2018, pp. 123-126.
[10] G. Zangerl, M. Haltmeier, ``Multiscale Factorization of the Wave Equation with Application to Compressed Sensing Photoacoustic Tomography,'' SIAM Journal on Imaging Sciences, 2021, vol. 14, no. 2, pp. 558-579.
[11] G. S. Alberti, P. Campodonico, M. Santacesaria, ``Compressed Sensing Photoacoustic Tomography Reduces to Compressed Sensing for Undersampled Fourier Measurements,'' SIAM Journal on Imaging Sciences, 2021, vol. 14, no. 3, pp. 1039-1077.
[12] T. Tadakuma, M. Rogers, K. Nishi, M. Joko, M. Shoyama, ``Carrier Stored Layer Density Effect Analysis of Radiated Noise at Turn-On Switching via Gabor Wavelet Transform,'' IEEE Transactions on Electron Devices, 2021, vol. 68, no. 4, pp. 1827-1834.
[13] O. Ahmad, N. A. Sheikh, ``Novel special affine wavelet transform and associated uncertainty principles,'' International Journal of Geometric Methods in Modern Physics, 2021, vol. 18, no. 4, pp. 1801-1803.
[14] R. Ranjan, N. Agrawal, S. Joshi, ``Interference mitigation and capacity enhancement of cognitive radio networks using modified greedy algorithm/channel assignment and power allocation techniques,'' IET Communications, 2020, vol. 14, no. 9, pp. 1502-1509.
[15] V. A. Rani, S. Lalithakumari, ``Efficient Medical Image Fusion Using 2-Dimensional Double Density Wavelet Transform to Improve Quality Metrics,'' IEEE Instrumentation and Measurement Magazine, 2021, vol. 24, no. 4, pp. 35-41.
[16] M. A. Majeed, A. Benjeddou, M. Al-Ajmi, ``Distributed transfer function-based unified static solutions for piezoelectric short/open-circuit sensing and voltage/charge actuation of beam cantilevers,'' Acta Mechanica, 2021, vol. 232, no. 3, pp. 1025-1044.
[17] M. Seujski, S. Suzic, D. Pekar, A. Smirnov, T. Nosek, ``Speaker/Style-Dependent Neural Network Speech Synthesis Based on Speaker/Style Embedding,'' Journal of Universal Computer Science, 2020, vol. 26, no. 4, pp. 434-453.
[18] N. Naderi, B. Nasersharif, A. Nikoofard, ``Persian speech synthesis using enhanced tacotron based on multi-resolution convolution layers and a convex optimization method,'' Multimedia Tools and Applications, 2021, vol. 81, no. 3, pp. 3629-3645.
[19] R. Wu, J. Y. Tao, ``Simulation Research on Speech Compression and Recognition of Mine Wireless Through-the-earth Communication,'' Computer Simulation, 2018, vol. 35, no. 5, pp. 186-190.
[20] S. Kankanahalli, ``End-to-end optimized speech coding with deep neural networks,'' 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2521-2525.
Author
Xiuhuan Wang received her Master of Science in Applied Mathematics from Chongqing
University, China, in 2012. Presently, she is working as an associate professor at
Chongqing Vocational and Technical University of Mechatronics, Chongqing 402760. Her
areas of interest include advanced mathematics research, probability and statistics,
and mathematical applications.