Wideband Speech Codec Algorithm based on Compressed Sensing and Fractional Calculus
Wang Xiuhuan
(College of General Education, Chongqing Vocational and Technical University of Mechatronics,
Chongqing, 402760, China
wang_xiuhuan0@163.com
)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Calculus, Broadband voice, Encoding and decoding, Compressed sensing, Wavelet transform, Speech synthesis
1. Introduction
With the continuous development of information technology, voice communication plays
an important role in our daily lives [1]. From telephone communication to modern applications such as voice chat, voice assistants,
and speech recognition, voice communication is one of the most important ways for
people to communicate and interact. However, speech coding and decoding algorithms
face a number of challenges under the requirements of high quality, high efficiency,
and low latency.
Speech coding and decoding algorithms typically use digital signal processing and
compression techniques to reduce the amount of data in speech signals and achieve
efficient transmission and storage [2]. However, these algorithms have several limitations when processing speech signals.
First, they typically require a large amount of computational resources and storage
space, resulting in insufficient real-time and practicality [3]. Second, they have limited accuracy in recovering speech signals and their ability
to preserve details, so they cannot meet users' needs for high-quality speech. In
addition, the algorithms have poor processing performance for non-stationary signals,
which limits their application in practical environments.
In order to overcome the limitations of speech coding and decoding algorithms, new
technologies such as compressed sensing and fractional calculus have been introduced
in the field of speech signal processing in recent years. Compressed sensing is a
signal processing technique based on sparse representation and compressed sampling
that can achieve efficient representation and recovery of signals through compressed
sampling and reconstruction of signals. Fractional calculus is a mathematical tool
that generalizes differentiation and integration from integer orders to arbitrary
real orders and can process non-stationary signals more accurately and flexibly [4].
A wideband speech codec algorithm based on compressed sensing and fractional
calculus has been a focus of recent research. Such an algorithm achieves
high-quality encoding and decoding of speech signals through sparse representation
and compressed sampling of wideband speech signals and then reconstructs and restores
the speech signals through a fractional calculus algorithm.
One speech signal encoding and decoding algorithm is based on predictive coding and
vector coding, combining linear predictive coding, Self-Organizing Map (SOM) neural
network vector coding, and Huffman coding [5]. Speech signal encoding and decoding experiments were carried out with MATLAB. The
experimental results showed that the algorithm retains the good decoded speech quality
and algorithmic simplicity of the waveform coding algorithm. It also had
a smaller compression rate than the general waveform coding algorithm. The coding
rate of the algorithm reached 12.8 kbit/s, but the algorithm had the problem
of long coding and decoding delay.
Another distributed speech codec algorithm was based on the ITU G.722.1 speech encoder
[6]. The algorithm used a complementary encoder based on the G.722.1 encoder. At the
encoding end, the same frame of speech was encoded with the G.722.1 encoder and its
complementary encoder. At the decoding end, when receiving any of the speech code
streams, the G.722.1 decoder was used for decoding, and the speech quality was not
lower than that of the G.722.1 encoder.
When two speech code streams were received, the G.722.1 decoder was used to decode
each of the two speech code streams and then process the decoding results together.
The final speech quality was significantly improved (i.e., there was a certain coding
gain). Simulation results showed a clear anti-packet-loss effect of the algorithm,
but there was a problem of low coding and decoding rate.
A high-throughput implementation method of speech coding and decoding algorithm was
designed through an analysis of the G.729 speech coding algorithm and the characteristics
of real-time speech processing [7]. Focusing on the calculation process, filtering process, and reading characteristics
of SRAM in hardware design, a high-throughput speech encoding and decoding algorithm
was achieved by combining parallel and pipeline structures. The clock cycle required
to complete the same calculation was only 1/68 of the cycles required by the optimized
DSP design.
A minimum mean square error algorithm based on calculus was proposed [8]. This algorithm adopted a new concept of parameter-free error-related energy and
signal normalization and ensured high convergence, stability, and low steady-state
error. It can also automatically adjust the learning rate according to the error and
showed good performance in a signal enhancement experiment. A signal enhancement technique
based on fractional calculus was proposed to solve the problem of noise affecting
speech signal processing [9]. The results showed that it can achieve noise-free and fast processing.
In summary, current methods mainly utilize signal decomposition to achieve enhancement,
but their application in speech encoding and decoding still faces some challenges.
First, the compressed sensing algorithm makes certain assumptions about the sparsity
of the signal, but the sparsity of the speech signal in the time and frequency domains
is not significant, resulting in an unsatisfactory effect in speech coding and decoding.
Second, the compressed sensing algorithm needs to sparsely represent the signal,
which may introduce additional distortion and affect the quality of speech.
In order to overcome these problems, this paper discusses how to use a compressed
sensing algorithm to achieve efficient encoding and decoding of speech signals. Fractional
calculus is also introduced to improve the signal sparsity and quality of speech encoding
and decoding. At the same time, the feasibility and limitations of this algorithm
were studied in practical applications, and its potential application value in fields
such as speech communication and speech storage was explored. The results could contribute
to improving the development of broadband speech coding and decoding and the efficiency
and quality of speech communication and storage. It could also support the widespread
application of speech information in internet and multimedia applications. At the
same time, this research could also provide new ideas and references for the application
of compressed sensing and fractional calculus in other fields.
2. Broadband Voice Processing
2.1 Compressed Sensing of Broadband Speech Signal
To improve the encoding and decoding rate and reduce the delay of speech encoding
and decoding, compressed sensing theory [10] is applied to the encoding and decoding of speech signals. This method can improve
the encoding and decoding efficiency and provides a theoretical basis for studying
speech encoding and decoding methods with lower complexity. In compressed sensing
theory, an important premise is that the signal has some sparsity, meaning that most
elements in the signal are 0 or close to 0, while a few elements are non-zero [11]. However, in the real world, not all signals are sparse.
Therefore, such a signal can be expressed sparsely in a transform domain that
meets the requirements of compressed sensing theory. The premise of this transformation
is a sparse basis, which can be described mathematically as:

$\phi =ws$

where $w$ represents an $N\times N$ sparse transformation matrix (generally a wavelet
basis), $s$ represents the original signal, and $\phi $ represents the sparse
coefficients after the transformation. The wavelet transform
is a well-known sparse representation method [12]. Each wavelet transform operation divides the signal into two parts: a low-frequency
part and a high-frequency part. The low-frequency coefficient often represents the
main information of the signal, while the high-frequency coefficient determines the
detailed information of the signal. The low-frequency coefficients play an extremely
important role in speech signal reconstruction [13].
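To illustrate the sparsity premise, the following sketch applies one level of a Haar wavelet transform (an illustrative choice of basis, not necessarily the one used in this paper) to a smooth test signal and measures how much of the energy falls into the low-frequency coefficients:

```python
import numpy as np

def haar_step(x):
    """One level of the Haar wavelet transform: split a signal into
    low-frequency (approximation) and high-frequency (detail) halves."""
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)    # approximation coefficients
    high = (even - odd) / np.sqrt(2.0)   # detail coefficients
    return low, high

# A smooth test signal standing in for one frame of speech.
n = 1024
t = np.arange(n) / 8000.0                 # 8-kHz sampling, as in the paper
x = np.sin(2 * np.pi * 200 * t) * np.hanning(n)

low, high = haar_step(x)
# Most of the energy lands in the low-frequency coefficients,
# which is the sparsity premise compressed sensing relies on.
energy_low = np.sum(low ** 2)
energy_high = np.sum(high ** 2)
print(energy_low / (energy_low + energy_high))  # close to 1.0
```

Because the detail coefficients are small, most of them can be discarded or coarsely quantized, which is exactly the kind of approximate sparsity the measurement step exploits.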
The reconstruction methods of compressed sensing are mainly divided into convex optimization
methods and greedy methods. Convex optimization methods are represented by the basis
pursuit method. The signal recovery effect of a convex optimization method is good,
but the computational complexity is high, resulting in slow reconstruction. Greedy
algorithms [14] are represented by the orthogonal matching pursuit algorithm. A greedy algorithm has
a fast recovery time, but the reconstruction quality is lower.
Considering their advantages and disadvantages and the requirements of speech codecs
for coding quality and delay, a compressed sensing reconstruction method called the
``in-crowd'' method was selected. This method can greatly reduce the reconstruction
time and improve the recovery efficiency by considering multiple non-zero elements
at a time instead of a single non-zero element at a time. The algorithm steps are as follows:
Step 1: Set the initial $s_{0}$ as an $n\times 1$ zero vector and the residual $R=h-Qs_{0}$.
Step 2: Set the active set $U$ as an empty set.
Step 3: For each $k$ in the complement $U^{c}$ of $U$, compute the usefulness $u_{k}=\left| \langle R,Q_{k}\rangle \right| $,
where $Q_{k}$ is the $k$-th column of $Q$.
Step 4: If there is no $u_{k}> \delta $ in the complement $U^{c}$, the program
terminates.
Step 5: Otherwise, add the components with the largest values of $u_{k}$ to the set
$U$, but do not add any component with $u_{k}< \delta $.
Step 6: Obtain the solution in the subspace composed of all components in $U$ and
use the current value of $s_{0}$ to ``warm start'' the solver.
Step 7: Remove the zero-valued elements of the exact solution obtained in step 6 from
$U$.
Step 8: Set all components of $s_{0}$ to 0 except those in $U$, and set the components
in $U$ to the values found by the exact solution from step 6.
Step 9: Update the residual $R=h-Qs_{0}$ and return to step 3. Note that since $s_{j}=0$
for all $j$ in $U^{c}$, the subproblem in step 6 only involves the components in $U$.
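As a minimal sketch of the steps above, the following code replaces the exact subproblem of step 6 with an ordinary least-squares solve on the active columns (an assumption made here for brevity; the actual in-crowd method solves a basis-pursuit subproblem with a warm start):

```python
import numpy as np

def in_crowd_sketch(Q, h, delta=1e-3, batch=5, max_iter=50):
    """Simplified sketch of the in-crowd steps described above."""
    m, n = Q.shape
    s = np.zeros(n)                       # step 1: s0 = 0
    active = np.zeros(n, dtype=bool)      # step 2: U = empty set
    for _ in range(max_iter):
        R = h - Q @ s                     # residual (steps 1/9)
        u = np.abs(Q.T @ R)               # step 3: usefulness of each column
        u[active] = 0.0                   # only look outside the active set
        candidates = np.flatnonzero(u > delta)
        if candidates.size == 0:          # step 4: nothing useful left
            break
        # step 5: add the most useful components (several at a time)
        best = candidates[np.argsort(u[candidates])[::-1][:batch]]
        active[best] = True
        # step 6 (simplified): least squares on the active columns
        idx = np.flatnonzero(active)
        coef, *_ = np.linalg.lstsq(Q[:, idx], h, rcond=None)
        # steps 7/8: drop (near-)zero components, keep the rest
        keep = np.abs(coef) > 1e-12
        active[:] = False
        active[idx[keep]] = True
        s[:] = 0.0
        s[idx[keep]] = coef[keep]
    return s

# Recover a sparse vector from compressed measurements.
rng = np.random.default_rng(0)
n, m, k = 64, 32, 4
Q = rng.standard_normal((m, n)) / np.sqrt(m)
s_true = np.zeros(n)
s_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
h = Q @ s_true
s_hat = in_crowd_sketch(Q, h)
```

Because several non-zero components are admitted per iteration, far fewer residual updates are needed than in a one-component-at-a-time greedy search, which is the source of the speed-up claimed above.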
2.2 Wavelet Reconstruction of Broadband Speech Signal
Before wideband speech coding and decoding, a lifting wavelet module is used to divide
the compressed-sensed wideband speech signal into sub-bands, and then the signal in
each sub-band is linearly predicted. The signal correlation within a sub-band is more
prominent, so the linear prediction result is more accurate, which improves the
compression performance for the whole wideband speech signal.
Any wavelet transform can be realized by cascading a finite number of lifting steps. In addition,
the implementation framework of the lifting transform can be realized by the integer
wavelet transform [15]. Fig. 1 shows the basic steps of the lifting wavelet transform: splitting, prediction, and updating.
Fig. 1. Wavelet decomposition scheme.
As shown in Fig. 1, the separation module separates $y\left[n\right]$ into an odd sequence $y_{0}\left[n\right]=y\left[2n+1\right]$
and an even sequence $y\left[2n\right]$. The prediction module uses the correlation
between the odd and even sequences to estimate the odd sequence from the even sequence, thus
removing the redundancy between the data and preserving the high-frequency details
of the signal:

$g\left[n\right]=y_{0}\left[n\right]-\varepsilon \left(y\left[2n\right]\right)$

where $g\left[n\right]$ represents the prediction residual, and $\varepsilon $ represents
the prediction operator.
The update module obtains the low-frequency information of $y\left[n\right]$: the
prediction residual $g\left[n\right]$ is weighted by the update operator $E$ and
added to the even sequence $y\left[2n\right]$:

$c\left[n\right]=y\left[2n\right]+E\left(g\left[n\right]\right)$

where $c\left[n\right]$ is the low-frequency (approximation) sequence. These equations
form the decomposition algorithm of the lifting scheme; the corresponding reconstruction
algorithm inverts each module in Fig. 1, as shown in Fig. 2.
Fig. 2. Wavelet reconstruction scheme.
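The split, predict, and update steps, together with their exact inversion in Fig. 2, can be sketched with the simplest Haar-style predict and update operators (an illustrative assumption, not the codec's actual lifting filters):

```python
import numpy as np

def lifting_forward(y):
    """One lifting step (Haar flavour): split, predict, update."""
    even, odd = y[0::2].copy(), y[1::2].copy()
    g = odd - even            # predict: residual = odd minus estimate from even
    low = even + g / 2.0      # update: low-frequency part of the signal
    return low, g

def lifting_inverse(low, g):
    """Exact inverse: undo the update, undo the prediction, then merge."""
    even = low - g / 2.0
    odd = g + even
    y = np.empty(low.size + g.size)
    y[0::2], y[1::2] = even, odd
    return y

y = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 34.0])
low, g = lifting_forward(y)
y_rec = lifting_inverse(low, g)
print(np.allclose(y_rec, y))  # True: lifting steps are perfectly invertible
```

Whatever predict and update operators are chosen, each step can be undone by subtracting what was added, which is why the lifting scheme is guaranteed to be perfectly invertible.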
2.3 Speech Signal Enhancement based on Fractional Calculus
The fractional differential operator can enhance the signal in the high-frequency
range, and the larger the fractional order is, the stronger the nonlinear enhancement
ability will be. Therefore, the fractional differential can not only provide denoising
but also ensure that the speech signal is not affected by the filter.
Most speech enhancement techniques based on fractional calculus perform numerical
operations in the Fourier transform domain, where the change of the calculus order
with frequency can be observed more intuitively.
For any square-integrable real function $l\left(t\right)$, the Fourier transform is
expressed as:

$L\left(\omega \right)=\int _{-\infty }^{+\infty }l\left(t\right)e^{-j\omega t}dt$

The corresponding inverse transformation is expressed as:

$l\left(t\right)=\frac{1}{2\pi }\int _{-\infty }^{+\infty }L\left(\omega \right)e^{j\omega t}d\omega $

The general real $\xi $-order differential of $l\left(t\right)$ can be expressed as:

$D^{\xi }l\left(t\right)=\frac{d^{\xi }l\left(t\right)}{dt^{\xi }}$

According to Eq. (6), the form of $D^{\xi }l\left(t\right)$ in the Fourier transform domain can be obtained
as follows:

$D^{\xi }l\left(t\right)=\frac{1}{2\pi }\int _{-\infty }^{+\infty }\left(j\omega \right)^{\xi }L\left(\omega \right)e^{j\omega t}d\omega $

The derivative of the signal in the frequency domain follows from the basics of
signal processing. The Fourier transform of the fractional derivative is:

$FT\left[D^{\xi }l\left(t\right)\right]=\left(j\omega \right)^{\xi }L\left(\omega \right)=A^{\xi }L\left(\omega \right)$

where $A^{\xi }=\left(j\omega \right)^{\xi }$ represents the $\xi $-order differential
operator in the frequency domain. Through the transformation and inverse transformation,
the noise in speech can be effectively suppressed, and the speech signal can be effectively enhanced.
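A minimal sketch of this frequency-domain fractional differentiation, with an illustrative test tone and order:

```python
import numpy as np

def fractional_derivative(x, xi, fs):
    """Compute the xi-order derivative of a real signal in the Fourier
    domain by multiplying its spectrum with (j*omega)**xi, following
    the equations above."""
    n = x.size
    X = np.fft.rfft(x)
    omega = 2 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs)
    H = np.zeros(omega.size, dtype=complex)
    H[omega > 0] = (1j * omega[omega > 0]) ** xi  # DC term maps to zero
    return np.fft.irfft(X * H, n)

fs = 8000.0                       # 8-kHz sampling, as in the experiments
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 125 * t)   # 125 Hz: an integer number of periods
y = fractional_derivative(x, 0.5, fs)
# For a pure tone, the 0.5-order derivative scales the amplitude by
# omega**0.5 and shifts the phase by pi/4.
```

Using the real FFT keeps the spectrum Hermitian, so the output is real, and the gain $\omega ^{\xi }$ grows more gently with frequency for $0<\xi <1$ than for a full first-order derivative, which is the nonlinear high-frequency enhancement described above.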
3. Wideband Speech Coding and Decoding Algorithm
3.1 Random Excitation Linear Prediction Synthesis Model
The basic idea of the random excitation linear prediction synthesis model is to use
a signal to excite two time-varying linear recursive filters. There is a predictor
for each filter feedback loop, and one of them is a long-term predictor (or pitch
predictor) $V\left(z\right)$. It is used to generate the pitch structure (the fine
structure of the spectrum) of voiced speech. The other is a short-term predictor $C\left(z\right)$,
which is used to recover the short-term spectral envelope of speech. The linear prediction
model with random excitation is derived from its inverse process. The normalized residual
signal obtained by two-level prediction approximately follows a standard normal distribution.
In general, the short-time predictor transfer function [16] is expressed as:

$C\left(z\right)=\sum _{i=1}^{r}\omega _{i}z^{-i}$

where $\omega _{i}$ represents the predictor coefficients, and $r$ represents the order
of the predictor, which is generally between 8 and 16. At the receiving end, the transfer
function of the short-term synthesis filter is:

$H\left(z\right)=\frac{1}{1-C\left(z\right)}$

where $1-C\left(z\right)$ is the linear prediction error filter. The predictor
coefficients $\omega _{i}$ are generally corrected every 20 to 30 ms.
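The predictor coefficients $\omega _{i}$ can be estimated frame by frame with the autocorrelation method. The sketch below uses a synthetic second-order autoregressive signal in place of real speech (an illustrative assumption), with a 20-ms frame as mentioned above:

```python
import numpy as np

def lpc_autocorr(x, order):
    """Estimate short-term predictor coefficients with the
    autocorrelation method by solving the Yule-Walker equations."""
    r = np.array([np.dot(x[:x.size - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# One 20-ms frame at 8 kHz (160 samples) of a synthetic AR(2) process
# standing in for speech.
rng = np.random.default_rng(1)
x = np.zeros(160)
e = rng.standard_normal(160)
for n in range(2, 160):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + 0.1 * e[n]

a = lpc_autocorr(x, order=2)
print(a)  # close to the true AR coefficients [1.3, -0.6]
```

The recovered coefficients drive both the analysis filter $1-C\left(z\right)$ at the encoder and the synthesis filter at the decoder.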
The transfer function of the pitch predictor is:

$V\left(z\right)=Dz^{-\eta }$

where $\eta $ represents the pitch delay, and $D$ represents the pitch predictor coefficient.
Generally, $D$ is corrected together with $\eta $, and the correction rate is usually
higher than that of the short-term predictor coefficients; it is typically corrected
every 5 to 10 ms. Based on the speech synthesis model in Fig. 3, the transfer function of the pitch synthesis filter is:

$\frac{1}{1-V\left(z\right)}=\frac{1}{1-Dz^{-\eta }}$
The excitation parameter optimization process of the model in Fig. 3 uses the perceptually weighted minimum mean square error criterion instead of the ordinary
minimum mean square error criterion. The reason is that at a low bit rate, the average
number of bits allocated to each speech sample is usually less than 1, which makes
it very difficult to accurately match the speech waveform. Therefore, the weighted
criterion was adopted to avoid this situation and improve the quality of speech synthesis [17].
Fig. 3. Speech synthesis model.
3.2 Speech Codec
A codebook is a table that stores a series of fixed-length vectors, or code words.
In the encoding process, the input signal is represented by finding the code word
that best matches the input speech signal. When decoding, the corresponding code word
is retrieved from the codebook, and the original speech signal is recovered. According
to the speech state and the actual codec rate, the structure and bit allocation
of the codebook can be adjusted. Structure refers to the internal organization
of the codebook, such as the number of code words and the vector length. Bit allocation
refers to the number of bits assigned to the codebook, which determines the precision
and representational power of the code words. According to specific needs, the structure
and bit allocation of the codebook can be adjusted to better adapt to different
speech conditions and codec rates. Embedded codec refers to the technology of embedding
the encoding and decoding process into a device or system. In the embedded codec of
wideband speech, codebook pulses can be added layer by layer to assist the codec
process and achieve higher-quality audio transmission. These pulses serve as additional
information to enhance the encoding of the original speech signal.
In the process of encoding and decoding, the searches for pulses are not independent
of each other; they have an embedding and inclusion relationship. The pulses of each
stage are obtained by retaining those of the previous stage and searching for several
additional pulses that give more detail for the updated error signal. In addition, the
algorithm updates the adaptive codebook and synthesis filter states of each rate independently
to ensure the synchronization between the encoder and decoder. Fig. 4 shows the principle block diagram of the speech codec.
Fig. 4. Principle of speech codec.
As shown in Fig. 4, the process of the wideband speech codec is mainly based on the CELP coding model,
including ① preprocessing, ② short-term linear prediction, ③ adaptive codebook search,
and ④ generation codebook search in a core layer and enhancement layers. The codebook
gains include the adaptive codebook gain and the generation codebook gains of the
core layer, enhancement layer 1, and enhancement layer 2.
Broadband speech is obtained by passing the excitation signal through a synthesis
filter based on the short-term predictor coefficients. The transfer function of the
synthesis filter is:

$\frac{1}{\theta \left(z\right)}=\frac{1}{1-\sum _{i=1}^{j}\mu _{i}z^{-i}}$

where $\mu _{i}$ represents the linear prediction coefficients, $j$ represents the
prediction order, and $\theta \left(z\right)$ is obtained by linear prediction analysis.
The excitation signal is formed by the weighted sum of the adaptive codebook excitation
vector and the generation codebook excitation vector.
The generation codebook excitation vector can be divided into a core layer generation
codebook excitation vector and an enhanced generation codebook excitation vector.
The weight is the gain of each codebook. Each codebook and its gain are determined
by the synthesis analysis process by minimizing the perceptually weighted mean square
error between the input speech and the synthetic speech. The weighted transfer function
of the perceptual filter is:

$W\left(z\right)=\frac{\theta \left(z/e_{1}\right)}{\theta \left(z/e_{2}\right)}$

where $e_{1}$ and $e_{2}$ control the broadening of the spectral dynamic range from low
frequency to high frequency in broadband speech coding. The weighted
synthesis filter uses the quantized LP coefficients, and the excitation vector is
obtained layer by layer: first the adaptive codebook is searched, and then the
generation codebooks of the core layer, enhancement layer 1, and enhancement layer 2.
In speech codecs, the parameters
that need to be quantized and transmitted include the linear prediction coefficients,
the optimal pitch delay, the adaptive codebook filter flag, the core and enhancement
layer generation codebook indices, the core layer generation codebook gain, the adaptive
codebook gain quantization index, and the enhancement layer generation codebook gain
ratio quantization index [18].
The excitation vector of the adaptive codebook, the generation codebook excitation
vectors, and their respective gains are decoded according to the decoding rate. The excitation
vector of the adaptive codebook is multiplied by the corresponding gain to obtain
a pulse-like excitation vector. The generation codebook vectors of each layer are multiplied
by their respective gains to obtain the noise-like excitation vectors of each layer.
Finally, the pulse-like excitation vector and the noise-like excitation vectors are
added, and the reconstructed synthetic speech is obtained through the synthesis
filter. Additionally, applying a series of post-processing operations to the excitation
signal before synthesis can improve the quality of the synthetic speech [19].
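The decoding steps above (scale each codebook vector by its gain, sum the results, and pass them through the synthesis filter) can be sketched as follows; the gains, vectors, and single-coefficient filter are illustrative values only:

```python
import numpy as np

def synthesize(exc, a):
    """Pass an excitation frame through the all-pole synthesis filter
    1 / (1 - sum_i a_i z^-i) by direct recursion."""
    y = np.zeros_like(exc)
    for n in range(exc.size):
        y[n] = exc[n] + sum(a[i] * y[n - 1 - i] for i in range(min(a.size, n)))
    return y

g_adaptive, g_fixed = 0.8, 0.5               # decoded gains (example values)
v_adaptive = np.ones(40) * 0.1               # pulse-like adaptive-codebook vector
v_fixed = np.zeros(40); v_fixed[::10] = 1.0  # sparse codebook pulses
exc = g_adaptive * v_adaptive + g_fixed * v_fixed
speech = synthesize(exc, np.array([0.5]))    # one-coefficient filter for brevity
```

In a layered codec, each enhancement layer simply contributes one more gain-scaled vector to `exc` before the single synthesis-filter pass, which is why layers can be dropped without resynchronizing the decoder.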
4. Experiment with Speech Encoder based on Calculus
Experimental analysis was carried out to verify the effectiveness of the proposed
broadband speech coding and decoding algorithm. During the experiment, Kankanahalli's
algorithm [20], a speech signal codec algorithm based on predictive coding and vector coding, and
a speech codec algorithm based on the ITU G.722.1 speech encoder were used for comparison
with the proposed algorithm.
4.1 Experimental Environment
The speech codec was programmed in C using MATLAB and VC++ 6.0, and the test speech
was sp\_bchmk.pcm. The test material included four voice signals: a male voice speaking Chinese,
a female voice speaking Chinese, a male voice speaking English, and a female voice
speaking English. In the experiment, the input was a 64-kbit/s speech sample signal
quantized with 8-kHz sampling and 8 bits per sample. The length of the main frame was
35 ms, and each main frame was divided into four subframes for processing. The maximum
pitch delay was 147 samples, and the minimum was 20 samples. Fig. 5 shows the time-domain waveform of the original speech.
Fig. 5. Time-domain waveform of original voice.
4.2 Evaluation Criteria of Speech Codec Performance
The basic goal of a speech codec is to reconstruct the speech synthesis with the highest
possible quality at the lowest possible coding rate. At the same time, the codec delay
and algorithm complexity should be minimized. Therefore, the three factors of coding
rate, speech quality evaluation, and codec delay are naturally basic indicators for
evaluating the performance of a speech codec algorithm. These three factors are closely
related, so when evaluating the advantages and disadvantages of a speech coding algorithm,
all three must be considered together according to the actual situation:
(1) The coding rate directly reflects the degree of compression of voice information
by voice coding. The coding rate can be measured in bit/s and represents the total
coding rate.
(2) Speech quality is the most widely used indicator based on the mean opinion score
(MOS) evaluation method. Table 1 lists the MOS standards and corresponding voice-quality levels.
As shown in Table 1, speech quality is generally considered high when the MOS is in the range of 4.0-5.0.
When the MOS is about 3.5, the speech quality may be perceived as degraded, but this
does not interfere with normal listening and can meet the requirements for speech
use. If the MOS is below 3.0, the speech is generally clear enough, but the naturalness
is insufficient. If the MOS is below 2.0, the speech has serious distortion
and is difficult to understand when listening to it.
(3) Codec delay is generally expressed as the time required for a single encoding-decoding
pass. The higher the delay, the lower the efficiency of the speech codec.
Table 1. MOS criteria.

| Quality grade | Score | Specific description |
|---|---|---|
| Excellent | 5 | When listening to the sound, one can completely relax without paying attention, and the sound is not distorted |
| Good | 4 | Listening to the sound requires attention, but there is no need to be too focused, as the sound may appear slightly distorted |
| Fair | 3 | Listening to the voice requires moderate attention, and voice distortion can be observed |
| Poor | 2 | When listening to the sound, attention must be paid, and distortion can be clearly detected |
| Bad | 1 | It is difficult to understand the voice, and the distortion problem is serious |
4.3 Application Effect Analysis of the Speech Encoder based on Calculus
(1) Codec rate
The codec rate was used as an experimental index, and the speech codec effects of
the four algorithms were compared. The results are shown in Fig. 6. As the number of iterations increases, the speech codec rates
of the different algorithms show a decreasing trend. When the number of iterations is
less than 3, the decrease in the codec rate of the four algorithms is not significant,
but when the number of iterations is greater than 3, the codec rate of the four
algorithms shows a significant downward trend.
The analysis shows that the codec rate of the proposed algorithm is always higher
than those of the speech signal codec algorithm based on predictive coding and vector
coding and the speech codec algorithm based on the ITU G.722.1 speech encoder. Its
lowest codec rate was 6.9 bits/s, while the lowest codec rates of the other three
algorithms were 4.9 bits/s, 4.5 bits/s, and 4.3 bits/s, respectively. It can be seen
that the speech codec rate of the proposed algorithm is higher, and the effect is
better.
Fig. 6. Comparison results of encoding and decoding rates.
(2) Speech quality evaluation
Six different speech segments were selected for the experiments, and the four algorithms
were used to encode and decode them. The quality of the speech encoding and decoding
of the four algorithms was evaluated. The evaluation results are shown in
Table 2.
According to the data in Table 2, the MOSs of the proposed algorithm were all higher than 4.0. The maximum MOS of the
speech signal encoding and decoding algorithm based on predictive coding and vector
coding was 3.76, that of the speech encoding and decoding algorithm based on the
ITU G.722.1 speech encoder was 4.02, and that of Kankanahalli's algorithm [20] was 4.13. The comparison shows that the speech quality evaluation score of the proposed
algorithm was higher, which shows that it can meet usage requirements with high
naturalness and fluency.
Table 2. Speech quality evaluation results.

| Speech segment | MOS, proposed algorithm | MOS, predictive coding and vector coding | MOS, ITU G.722.1 voice encoder | MOS, Kankanahalli's algorithm [20] |
|---|---|---|---|---|
| 1 | 4.73 | 3.34 | 2.51 | 3.67 |
| 2 | 4.80 | 3.69 | 3.27 | 3.52 |
| 3 | 4.36 | 3.52 | 2.99 | 3.34 |
| 4 | 4.59 | 3.07 | 3.46 | 3.16 |
| 5 | 4.76 | 3.65 | 4.02 | 4.13 |
| 6 | 4.21 | 3.76 | 3.94 | 3.92 |
(3) Codec delay
Using the codec delay as the experimental index, the speech codec effects of the four
algorithms were compared, and the results are shown in Fig. 7. The encoding and decoding delay of the proposed algorithm was
always less than 2.0 s, while those of the other three algorithms showed a significant
upward trend after the number of iterations reached 3. The test results show that
with the proposed algorithm, the encoding and decoding processing efficiency
was significantly improved, the required computing resources were reduced, real-time
requirements can be met, and the computing overhead can be reduced.
Fig. 7. Comparison results of codec delay.
(4) Time-domain speech waveform
The four algorithms were used to encode and decode a specific section of speech, and
the processed time-domain speech waveforms were compared to the original time-domain
waveform (Fig. 5). The comparison results are shown in Fig. 8. The proposed speech codec algorithm was better at restoring the speech information.
The synthesized speech envelope of this algorithm was relatively close to that of
the original speech signal. Compared with the original speech, the speech spectrum
was more consistent with the pitch period and formant structure and could more
realistically restore the speech information. The other algorithms could not accurately
restore the original speech. Therefore, the speech codec effect of the proposed algorithm
is better.
Fig. 8. Comparison results of speech time domain waveform.
5. Conclusion
The focus of this paper was a wideband speech codec algorithm based on compressed
sensing and fractional calculus and its application in voice communication. Compressed
sensor processing and speech signal sub-band processing were used to improve the performance
of speech compression, and then fractional calculus was used to achieve noise suppression.
By adding codebook pulses layer by layer, the embedded encoding and decoding of
wideband speech was completed. The experimental results are summarized below:
(1) In the codec rate experiment, the lowest coding and decoding rate of the algorithm
was 6.9 bits/s, which was better than those of the other three methods. The speech
codec rate was higher, and the effect was better.
(2) In speech quality evaluation, the maximum MOS of Kankanahalli’s algorithm [20] was 4.13, that of the speech coding and decoding algorithm based on the ITU G.722.1
speech encoder was 4.02, and that of the speech signal coding and decoding algorithm
based on predictive coding and vector coding was 3.76. The MOS of the proposed method
was higher than 4.0, and the result had high naturalness and fluency. By using fractional
calculus algorithms to reconstruct speech signals, the details and characteristics
of the original speech signal can be more accurately restored. Compared with other
integer-order calculus algorithms, fractional-order calculus algorithms have better
adaptability and flexibility in processing non-stationary signals. Therefore, this
algorithm can provide higher-quality speech signals and improve the user's listening
experience.
(3) In the comparison of the codec delay, the encoding and decoding delay of the proposed
algorithm was always below 2.0 s, which greatly improves the processing efficiency
of the codec and reduces the computational cost to a greater extent.
Future research directions could include the following aspects. The complexity and
computational efficiency of the algorithms could be optimized to improve their practicality
and feasibility. The application of the algorithm in complex situations, such as multi-speaker
speech processing and speech recognition, could be explored and improved. In addition,
combination with other advanced signal processing algorithms, such as deep learning
and neural networks, could improve the performance and effectiveness of wideband speech
coding and decoding algorithms. The wideband speech codec algorithm based on compressed
sensing and fractional calculus has broad application prospects in the field of speech
communication. This study has provided a theoretical basis and experimental verification
for the research and application of the algorithm. It could also provide a reference
for related research and engineering applications.
ACKNOWLEDGMENTS
The research was supported by “Exploration and practice of integrating intelligent
classroom into modularized teaching of higher mathematics in the context of curriculum
thinking” (No. Z213009).
REFERENCES
[1] V. Parthasaradi, P. Kailasapathi, ``A novel MFCC-NN learning model for voice communication through Li-Fi for motion control of a robotic vehicle,'' Soft Computing, 2019, vol. 23, no. 18, pp. 8651-8660.
[2] K. A. Berg, J. H. Noble, B. Dawant, R. Dwyer, R. Labadie, R. H. Gifford, ``Effect of number of channels and speech coding strategy on speech recognition in mid-scala electrode recipients,'' The Journal of the Acoustical Society of America, 2019, vol. 145, no. 3, pp. 1796-1797.
[3] T. Bentsen, S. J. Mauger, A. A. Kressner, T. May, T. Dau, ``The impact of noise power estimation on speech intelligibility in cochlear-implant speech coding strategies,'' The Journal of the Acoustical Society of America, 2019, vol. 145, no. 2, pp. 818-821.
[4] S. Parida, M. G. Heinz, ``Effects of noise-induced hearing loss on speech-in-noise envelope coding: Inferences from single-unit and non-invasive measures in animals,'' The Journal of the Acoustical Society of America, 2019, vol. 145, no. 3, pp. 1716-1716.
[5] C. Yang, Y. F. Liu, X. X. Xu, H. Zhu, C. H. Liu, ``Speech signal encoding algorithm based on predictive coding and vector coding,'' Modern Electronics Technique, 2018, vol. 41, no. 24, pp. 128-131.
[6] Y. N. He, Z. Chen, F. L. Yin, ``Distributed Speech Coding Based on G.722.1 Codec,'' Journal of Signal Processing, 2020, vol. 36, no. 6, pp. 894-901.
[7] J. Zhou, Y. Q. Xue, Y. Zhan, J. H. Jiang, ``High throughput implementation of a speech coding algorithm,'' Microelectronics & Computer, 2020, vol. 37, no. 3, pp. 9-13.
[8] A. Sadiq, S. Khan, I. Naseem, R. Togneri, M. Bennamoun, ``Enhanced q-least mean square,'' Circuits, Systems, and Signal Processing, 2019, vol. 38, no. 10, pp. 4817-4839.
[9] R. Ram, M. N. Mohanty, ``Application of fractional calculus in speech enhancement: a novel approach,'' 2018 2nd International Conference on Data Science and Business Analytics (ICDSBA), 2018, pp. 123-126.
[10] G. Zangerl, M. Haltmeier, ``Multiscale Factorization of the Wave Equation with Application to Compressed Sensing Photoacoustic Tomography,'' SIAM Journal on Imaging Sciences, 2021, vol. 14, no. 2, pp. 558-579.
[11] G. S. Alberti, P. Campodonico, M. Santacesaria, ``Compressed Sensing Photoacoustic Tomography Reduces to Compressed Sensing for Undersampled Fourier Measurements,'' SIAM Journal on Imaging Sciences, 2021, vol. 14, no. 3, pp. 1039-1077.
[12] T. Tadakuma, M. Rogers, K. Nishi, M. Joko, M. Shoyama, ``Carrier Stored Layer Density Effect Analysis of Radiated Noise at Turn-On Switching via Gabor Wavelet Transform,'' IEEE Transactions on Electron Devices, 2021, vol. 68, no. 4, pp. 1827-1834.
[13] O. Ahmad, N. A. Sheikh, ``Novel special affine wavelet transform and associated uncertainty principles,'' International Journal of Geometric Methods in Modern Physics, 2021, vol. 18, no. 4, pp. 1801-1803.
[14] R. Ranjan, N. Agrawal, S. Joshi, ``Interference mitigation and capacity enhancement of cognitive radio networks using modified greedy algorithm/channel assignment and power allocation techniques,'' IET Communications, 2020, vol. 14, no. 9, pp. 1502-1509.
[15] V. A. Rani, S. Lalithakumari, ``Efficient Medical Image Fusion Using 2-Dimensional Double Density Wavelet Transform to Improve Quality Metrics,'' IEEE Instrumentation and Measurement Magazine, 2021, vol. 24, no. 4, pp. 35-41.
[16] M. A. Majeed, A. Benjeddou, M. Al-Ajmi, ``Distributed transfer function-based unified static solutions for piezoelectric short/open-circuit sensing and voltage/charge actuation of beam cantilevers,'' Acta Mechanica, 2021, vol. 232, no. 3, pp. 1025-1044.
[17] M. Seujski, S. Suzic, D. Pekar, A. Smirnov, T. Nosek, ``Speaker/Style-Dependent Neural Network Speech Synthesis Based on Speaker/Style Embedding,'' Journal of Universal Computer Science, 2020, vol. 26, no. 4, pp. 434-453.
[18] N. Naderi, B. Nasersharif, A. Nikoofard, ``Persian speech synthesis using enhanced tacotron based on multi-resolution convolution layers and a convex optimization method,'' Multimedia Tools and Applications, 2021, vol. 81, no. 3, pp. 3629-3645.
[19] R. Wu, J. Y. Tao, ``Simulation Research on Speech Compression and Recognition of Mine Wireless Through-the-earth Communication,'' Computer Simulation, 2018, vol. 35, no. 5, pp. 186-190.
[20] S. Kankanahalli, ``End-to-end optimized speech coding with deep neural networks,'' 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2521-2525.
Author
Xiuhuan Wang received her Master of Science in Applied Mathematics from Chongqing
University, China, in 2012. Presently, she is working as an associate professor at
Chongqing Vocational and Technical University of Mechatronics, Chongqing 402760. Her
areas of interest include advanced mathematics research, probability and statistics,
and mathematical applications.