Bahaa Al-Sheikh1, Mohammad Shukri Salman1, and Alaa Eleyan2

1 College of Engineering and Technology, American University of the Middle East, Kuwait
{bahaa.al-sheikh, mohammad.salman}@aum.edu.kw

2 Electrical & Electronics Engineering Department, Ankara Science University, Ankara, Turkey
alaa.eleyan@ankarabilim.edu.tr
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
HRTF, Spectral notches, 3D auditory display, Wavelet, Multi-resolution analysis, Auto-detection
1. Introduction
A Head-Related Transfer Function (HRTF) is defined as the ratio of the frequency response at the eardrum to that at the sound source, for each of the two ears. HRTFs comprise all acoustical cues to a sound-source location that are available from that location [1]. In addition to the interaural time difference (ITD) and interaural level difference (ILD), HRTFs are considered the main cues for sound-source localization. They are usually measured for individuals at a limited set of locations in azimuth and elevation.
HRTFs are required to create 3D virtual auditory displays (VADs) using headphones.
VADs have many applications, including psychoacoustic and physiological research,
industry, virtual training, driving training simulation [2], virtual aviation [3], and battle environment simulation [4]. There are also many other applications for VADs in communication, multimedia, mobile
products, and clinical auditory evaluations [5].
A Head-Related Impulse Response (HRIR) is the time-domain counterpart of the HRTF. One of the most common methods to measure it is to generate a Dirac delta impulse at a sound source and measure the output at a microphone located at a subject's eardrum in an anechoic room. This must be done for each direction in 3D space, because the result depends significantly on the direction. Different techniques and algorithms have been proposed and implemented to build HRTFs at locations other than the measured ones or at finer resolution [6-8].
Scattering and reflections from a subject's torso and shoulders characterize the HRTF at frequencies below 3 kHz [9]. Accordingly, the geometry of these body parts does not affect the shape of the HRTF above 3 kHz. Previous studies on humans show that there are prominent spectral ``notches'' and ``peaks'' in HRTFs above 4-5 kHz. These are dominant cues for the elevation and azimuth angles of a sound-source location, which are essential for sound localization, especially for elevation and for determining whether a sound is in front of or behind an observer [10]. Many of these spectral features are caused by pinna reflections and diffractions, which act as a filter in the frequency domain.
The absence or presence of the peaks gives a strong indication of the sound-source elevation [10]. For example, a one-octave peak at 7-9 kHz has been presented as an indication of elevations around 90$^{\circ}$ [11]. However, spectral peaks do not show as smooth a trend with changes in elevation as the spectral notches do [10].
The spectral location of the first prominent spectral minimum is called the first notch. Human data have shown that its center frequency shifts from around 6 kHz to around 12 kHz as the elevation of a sound source varies from -15$^{\circ}$ to 45$^{\circ}$ at a fixed azimuth angle [12]. The first notch is due to the concha of the ear and is considered one of the important features for elevation perception of a sound source [11].
HRTFs depend on the shape and geometry of the head, external ears, and body parts,
which interact with received sound waves. Because of this, HRTFs can be quite different
for different individuals for a given location in space [13]. In order to have a full implementation of a complete VAD for a certain individual,
the HRTFs need to be measured or synthesized for all directions (i.e., all elevation
and all azimuth angles). Higher directional resolution results in smoother and more
effective directional hearing for VADs. The most popular far-field HRTF databases
use a directional resolution of 5$^{\circ}$ to 15$^{\circ}$ for both azimuths and
elevations. However, measuring the HRTFs in all directions for each subject is expensive and impractical, and it requires extensive preparation.
One solution to this problem is using a structural model of a subject in order to build an individualized HRTF. These models synthesize the HRTFs from the anthropometry of the subject, especially the geometry of the pinna, head, torso, and shoulders [14]. Therefore, we hypothesize that if the notch and peak frequencies of a certain individual's HRTF are close to those of another individual, then using the HRTF of the first individual for the second is more suitable than using the HRTFs of other individuals with significantly different spectral notch and peak frequencies.
A few measurements taken at certain locations for an individual can be used to auto-detect the frequencies of the spectral notches and peaks and to compare their values with the notch and peak frequencies in currently available HRTF databases. The database HRTFs with the closest notch and peak frequencies at the same measured locations can then serve as candidates for suitable HRTFs for that individual. To do this, we need to automatically detect the main notches in a measured HRTF for an individual for comparison. Wavelet multi-resolution analysis has been successfully used for auto-detection of events, including notches and peaks in non-stationary signals [15,16]. It was used in this study for the auto-detection of the main spectral notches in measured HRTFs.
The rest of the paper is organized as follows. Section 2 discusses the database used and the 3D reference coordinate system, data pre-processing, the role of the spectral notches in direction estimation, and the discrete wavelet transform. Section 3 discusses the results of applying wavelet multi-resolution analysis to the HRTFs to auto-detect the spectral notches. The paper is concluded in Section 4.
2. Methods
2.1 Database and Coordinate System
An interaural polar coordinate system was used in this study. The elevation (EL)
represents the latitude, and the source azimuth (AZ) represents the longitude. The
location at (AZ=0$^{\circ}$, EL=0$^{\circ}$) corresponds to the direction in front
of the subject. Negative elevations are below the horizontal plane, and positive elevations
are above it. EL=90$^{\circ}$ corresponds to the direction directly above the subject’s
head, and (AZ=180$^{\circ}$, EL=0$^{\circ}$) corresponds to the direction directly
behind it. Negative azimuth angles are to the left side, and positive ones are to
the right of the subject.
In this study, we used HRIRs from the Center for Image Processing and Integrated
Computing-University of California (CIPIC) database [17]. It contains HRIRs for 43 subjects with 27 anthropometric measurements for subjects’
heads, torsos, and pinnae. For each subject, HRIRs are measured at azimuth angles
between -80$^{\circ}$ and 80$^{\circ}$ and elevation angles between -45$^{\circ}$
and 230.625$^{\circ}$. There are a total of 1,250 directions for each subject, and the sampling frequency is 44.1 kHz. In this study, HRTFs were calculated from the HRIRs by taking a 512-point Fourier transform, which gives a frequency resolution Δf of 86.13 Hz.
For the purpose of this study, we used directions at elevations between -45$^{\circ}$
and 45$^{\circ}$ in the median plane as an example (i.e., at an azimuth angle of 0$^{\circ}$
for the right ear of some randomly selected subjects from the CIPIC database). Subjects
3, 8, 9, 10, 11, and 12 were selected in this study. Matlab® 2014 was used for reading
the data, pre-processing, wavelet multi-resolution analysis, and notch auto-detection
of the HRTFs’ spectral notches.
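As an illustration of this pre-computation, the following MATLAB sketch loads one CIPIC HRIR and computes the corresponding HRTF magnitude with a 512-point FFT. The file name 'hrir_final.mat', the variable hrir_r (25 azimuths x 50 elevations x 200 samples), and the index mapping are taken from the standard CIPIC release; treat this as a sketch rather than the exact code used in this study.

    % Load a right-ear HRIR and compute its HRTF magnitude (Sec. 2.1).
    load('hrir_final.mat', 'hrir_r');     % one CIPIC subject, e.g., subject 10
    fs   = 44100;                         % CIPIC sampling frequency (Hz)
    az   = 13;                            % index of AZ = 0 deg (13th of 25 azimuths)
    el   = 1;                             % index of EL = -45 deg (1st of 50 elevations)
    h    = squeeze(hrir_r(az, el, :));    % 200-sample HRIR
    Nfft = 512;                           % 512-point FFT -> df = 44100/512 = 86.13 Hz
    H    = fft(h, Nfft);
    f    = (0:Nfft/2-1) * fs/Nfft;        % frequency axis (Hz)
    HdB  = 20*log10(abs(H(1:Nfft/2)));    % HRTF magnitude in dB
    plot(f/1000, HdB); xlabel('Frequency (kHz)'); ylabel('Magnitude (dB)');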
2.2 Pre-processing of Data
HRIRs were windowed to 2 ms using a Hanning window in order to remove echoes in the raw data, including some reflections caused by torsos, shoulders, and knees. This causes indirect smoothing of the HRTFs. Smoothing in the frequency domain does not affect the localization capability, provided that the main spectral features are kept [18]. Phase responses were ignored, and only the magnitudes of the HRTFs were considered, because many studies have shown that HRTFs can be accurately represented by their minimum-phase spectra. The reason is that the auditory system is not sensitive to the absolute phase of a sound applied to a single ear [18,19].
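A minimal sketch of this pre-processing step is given below, continuing the variables from the previous snippet and assuming the response is aligned at the start of the recording (2 ms at 44.1 kHz is 88 samples):

    % Window the HRIR with a 2-ms Hanning window and keep only the
    % HRTF magnitude (phase discarded under the minimum-phase assumption).
    Nw   = round(2e-3 * fs);              % 2 ms -> 88 samples
    w    = hann(Nw);                      % Hanning window (Signal Processing Toolbox)
    hWin = h(1:Nw) .* w;                  % windowed HRIR
    Hmag = abs(fft(hWin, Nfft));          % magnitude-only HRTF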
2.3 HRTF Spectral Notches
Notches and peaks in the HRTFs are direction-dependent, so they indirectly provide
information about the direction of a sound. In addition, they depend on the shape
and size of the pinna, which are different among individuals. Fig. 1 presents an example of the notches and peaks in the right-ear HRTF at the location
of (AZ=0$^{\circ}$, EL=-45$^{\circ}$) for subject 10 of the CIPIC database.
Fig. 1. Example of spectral notches for right-ear HRTF at 0° azimuth and -45° elevation
for subject 10 of CIPIC database.
2.4 Discrete Wavelet Transform
A wavelet is a limited-duration waveform that is irregular and often non-symmetrical.
Its average value equals zero, and it has the capability to describe abnormalities,
pulses, and other events. Wavelet analysis involves the decomposition of a signal
using an orthonormal group of basis functions, such as the sines and cosines in a
Fourier series. Scaling or dilation in wavelet terminology means stretching the wavelet
in time, which is related to the frequency in Fourier series terminology. Translation
in wavelet terminology is the shifting of the wavelet to the right or left in the
time domain. A ``mother wavelet'' refers to an unstretched wavelet. A Continuous Wavelet Transform (CWT) allows all possible shifting and stretching factors, while a Discrete Wavelet Transform (DWT) stretches and shifts on a dyadic scale using powers of 2 (e.g., 2, 4, 8, 16, etc.) [20].
Wavelet decomposition splits a signal into two parts using high-pass and low-pass
filters. Using more filters splits the signal into more parts. A low-pass filter (scaling
function filter) gives a smoothed version and approximation of the signal, while a
high-pass filter (wavelet filter) gives the details. When details and approximations
are added together, they can reconstruct the original signal.
Usually, each approximation is split into further approximations and details, and so on. Selecting certain levels of details or approximations can be used to isolate certain events or parts of a signal that occupy a certain range of frequencies. Convolution of the wavelet function $\psi(t)$ with the signal $x(t)$ gives the wavelet transform coefficients, $T$, while convolution of $x(t)$ with the scaling function $\phi(t)$ produces the approximation coefficients, $S$.
The discrete wavelet transform (DWT) can be expressed as:

$T_{m,n}=\int_{-\infty}^{\infty}x(t)\,\psi_{m,n}(t)\,dt,\qquad \psi_{m,n}(t)=2^{-m/2}\,\psi\!\left(2^{-m}t-n\right)$ (1)

The coefficient of the signal approximation at scale $m$ and location $n$ can be expressed as:

$S_{m,n}=\int_{-\infty}^{\infty}x(t)\,\phi_{m,n}(t)\,dt$ (2)
Fig. 2. A 3-level discrete wavelet transform. Each filter (high pass or low pass) is followed
by decimation or down-sampling by 2. cA1 represents the first-level approximation
coefficients, cD2 represents the second-level detail coefficients, cA3 represents
the third-level approximation coefficients, etc.
For a discrete input signal of finite length and a range of scales $0<m<M$, a discrete approximation of the signal can be expressed as [21]:

$x(t)=x_{M}(t)+\sum_{m=1}^{M}d_{m}(t)$ (3)

where $x_{M}(t)=\sum_{n}S_{M,n}\,\phi_{M,n}(t)$ is the signal approximation at scale $M$, and the signal detail at scale $m$ is expressed as:

$d_{m}(t)=\sum_{n}T_{m,n}\,\psi_{m,n}(t)$ (4)
Usually, approximations are repeatedly divided into low frequencies (approximations)
and high frequencies (details) to find the next level of wavelet analysis using more
filters, as shown in Fig. 2. This figure shows a three-level wavelet decomposition as an example. The low and
high pass filters’ impulse responses are dependent on the chosen wavelet.
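The decomposition just described can be reproduced with MATLAB's Wavelet Toolbox; the following sketch (continuing the variables above) decomposes the log-magnitude HRTF into six levels with Symlet5 and reconstructs each detail signal:

    % Six-level DWT of the one-sided HRTF magnitude using Symlet5.
    x = 20*log10(Hmag(1:Nfft/2)).';       % row vector, 256 frequency samples
    [C, L] = wavedec(x, 6, 'sym5');       % decomposition coefficients
    D = zeros(6, numel(x));
    for k = 1:6
        D(k, :) = wrcoef('d', C, L, 'sym5', k);   % detail signal at level k
    end
    A6 = wrcoef('a', C, L, 'sym5', 6);    % level-6 approximation
    % x is recovered (to numerical precision) as A6 + sum(D, 1).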
There are many kinds of wavelets, such as the Haar, Daubechies, Biorthogonal, Symlet, and Coiflet wavelets. Symlets 2 through 8 and Daubechies 2 and 3 wavelets were tested in this study because of their similarity in shape to the HRTF spectral notches at the different directions among the subjects, which makes them suitable for the auto-detection problem. The decomposition analysis was done up to level 6 for each wavelet. Fig. 3 shows examples of some of the Symlet wavelets used in this study. The proposed algorithm determines the most suitable HRTF set for an individual from a database, as presented in Fig. 4.
Fig. 3. Examples of some mother wavelets: (a) Symlet2; (b) Symlet3; (c) Symlet5; (d)
Daubechies2.
Fig. 4. Proposed algorithm description to choose best individualized HRTF.
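Fig. 4 is described only at a high level, so the following MATLAB fragment is a hedged sketch of one plausible selection rule, not the authors' exact criterion: given notch frequencies auto-detected at a few measured directions of a new listener (fNew{d}) and the corresponding notch frequencies of each database subject (fDb{s}{d}), it picks the subject whose notches are closest overall. Both cell arrays are hypothetical inputs.

    % Choose the database subject whose notch frequencies best match
    % those measured for the new listener (illustrative criterion).
    bestSubj = 0; bestCost = inf;
    for s = 1:numel(fDb)
        cost = 0;
        for d = 1:numel(fNew)
            for fn = fNew{d}(:).'         % each measured notch frequency
                cost = cost + min(abs(fDb{s}{d} - fn));  % nearest database notch
            end
        end
        if cost < bestCost, bestCost = cost; bestSubj = s; end
    end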
3. Results and Discussion
Fig. 5 shows an example of an HRTF wavelet decomposition using one of the tested wavelets,
Symlet5, up to level 6. Low-level details represent the highest frequencies in the
HRTF. The energy of the main notches was evident in all detail levels from level one to level five (i.e., $D_1$ to $D_5$). The first three levels ($D_1$, $D_2$, and $D_3$) show a clearer resemblance to the spectral notches, compared with the other signal information at these levels. Therefore, these three levels were used for the auto-detection of the main notches in the HRTFs.
Fig. 5. HRTF at 0° azimuth and -45° elevation (AZ=0°, EL= -45°) for subject 10 and
its wavelet decomposition up to level 6 using Symlet5.
To give more significance to the highest-frequency components, the detail $D_1$ coefficients were multiplied by a higher factor. The signal reconstructed from wavelet levels $D_1$, $D_2$, and $D_3$ was computed according to the following proposed equation, which gives higher weight to the lower-level details. The weight of each level was selected empirically over the notches of the database subjects to maximize the auto-detection sensitivity:

$R=\mathrm{IDWT}\left(w_{1}D_{1}+w_{2}D_{2}+w_{3}D_{3}\right),\qquad w_{1}>w_{2}>w_{3}$ (5)

where $R$ represents the reconstructed signal obtained using the inverse discrete wavelet transform (IDWT) from the weighted $D_1$, $D_2$, and $D_3$ details, and $w_1$, $w_2$, and $w_3$ are the empirically selected weights.
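In code, Eq. (5) reduces to a weighted sum of the reconstructed detail signals (by linearity of the IDWT). The weights below are illustrative placeholders only; the paper states just that they were chosen empirically, with higher weight on the lower levels:

    % Weighted recombination of the first three detail levels (Eq. (5)).
    wts = [4 2 1];                        % hypothetical weights, w1 > w2 > w3
    R   = wts(1)*D(1,:) + wts(2)*D(2,:) + wts(3)*D(3,:);
    R2  = R.^2;                           % squared signal used for peak picking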
The reconstructed signal from these details for the HRTF example in Fig. 5 is shown in Fig. 6. Locations of the spectral notches are simply auto-detected and marked as local peaks
of the squared-absolute reconstructed signal in Figs. 6 and 7. These figures show
examples of notch auto-detection at two different directions using Symlet5 for subjects
10 and 11, respectively.
Local peaks were selected from the squared absolute reconstructed signal as frequency samples that are larger than their neighboring samples, restricted to one peak per 1-kHz window, since it is unusual to have more than one main spectral notch within this frequency range. All peaks were detected as long as they exceeded 1% of the maximum of the squared absolute reconstructed signal, which served as the peak amplitude threshold.
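This peak-picking rule maps directly onto findpeaks from MATLAB's Signal Processing Toolbox; a sketch under the stated settings (1-kHz minimum separation, 1% amplitude threshold) follows:

    % Auto-detect notch candidates as local peaks of the squared
    % reconstructed signal R2 (from the previous sketch).
    df      = 44100/512;                  % frequency resolution, 86.13 Hz/bin
    minDist = round(1000/df);             % 1-kHz window -> about 12 bins
    minHt   = 0.01 * max(R2);             % 1% of the maximum as threshold
    [pks, locs] = findpeaks(R2, 'MinPeakDistance', minDist, ...
                                'MinPeakHeight',  minHt);
    notchFreqs = (locs - 1) * df;         % candidate notch frequencies (Hz)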
According to the Fourier transform applied to the HRIRs, the frequency resolution of the processed data, Δf, is 86.13 Hz. The analysis was done on CIPIC subjects 3, 8, 9, 10, 11, and 12. Spectral notches located between 4 kHz and 16 kHz were considered for the analysis in this study because pinna cues usually lie in this range of frequencies [22]. Furthermore, this range contains essential cues for sound localization [12]. The total number of notches for the selected subjects at the stated locations is 238.
Fig. 6. (a) HRTF at (AZ=0°, EL=-45°) for Subject 10; (b) Reconstructed signal according to Eq. (5) using Symlet5; (c) Absolute square of the signal in (b) with the auto-detected local peaks marked as small red circles.
Fig. 7. (a) HRTF at (AZ=0°, EL=-39.375°) for Subject 11; (b) Reconstructed signal according to Eq. (5) using Symlet5; (c) Absolute square of the signal in (b) with the auto-detected local peaks marked as small red circles.
The performance of the auto-detection capability of the selected wavelets is presented
in Table 1. The results were sorted according to their auto-detection sensitivity.
The sensitivity $S$ is defined as:

$S=\frac{TP}{TP+FN}\times 100\%$ (6)

where $TP$ and $FN$ represent the number of true positives (correctly detected notches) and the number of false negatives (missed notches), respectively.
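For scoring, a detection can be counted as a true positive when it falls within a small tolerance of a manually annotated notch; the sketch below uses ±2Δf, matching the largest deviation reported in this section, and assumes a hypothetical vector refNotchFreqs of manually identified notch frequencies:

    % Sensitivity of the auto-detection (Eq. (6)) against manual labels.
    tol = 2*df;                           % tolerance of +/- 2*df
    TP  = sum(arrayfun(@(f0) any(abs(notchFreqs - f0) <= tol), refNotchFreqs));
    FN  = numel(refNotchFreqs) - TP;      % annotated notches that were missed
    S   = 100 * TP / (TP + FN);           % sensitivity in percent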
Table 1. Performance of wavelets on HRTF spectral notch auto-detection.

    Wavelet    Sensitivity (%)
    sym2       100
    db2        99.6
    sym3       92.9
    db3        92.4
    sym4       90.8
    sym5       89.1
    sym6       87.4
    sym7       86.1
    sym8       86.1
Around 70% of the auto-detected notches were accurately detected with the exact
central frequency compared to the manually examined ones. Around 28% were auto-detected
with a difference of ${\pm}$Δf from the actual notch frequency, and 2% were different
by ${\pm}$2Δf from the actual central frequency. These values are almost the same
among all wavelets tested. A slight difference occurred between the actual notch frequency
and the auto-detected ones when the actual notch was very shallow and not deep enough
to be auto-detected accurately. However, these shallow spectral notches do not play
an important role in sound-source localization for humans compared to the deep spectral
notches because they are not associated with significant reflections.
All deep spectral notches were auto-detected accurately, without any difference from the original notches' central frequencies. Usually, the measured HRIRs and the calculated HRTFs are normalized, so when the depth of a notch is discussed, we refer to the relative attenuation in the frequency response. A steeper slope and lower relative amplitude of a spectral notch result in a higher amplitude in the squared absolute reconstructed signal, which gives a direct indication of the drop in the notch amplitude.
Many studies have proposed different algorithms to find individualized HRTFs. Some of these studies describe the relation between the anthropometric parameters of the subjects, especially their pinnae, and the HRTF features at different locations [14]. HRTFs are modeled accordingly, given that the HRTF describes the interaction between the sound waves and the geometry of the human head, torso, and shoulders. This approach is complicated and requires accurate estimation of the anthropometric parameters as well as a clear understanding of these parameters and how they characterize the HRTF.
Other studies model the HRTFs measured at certain directions and then estimate
HRTFs at all other locations using different interpolation methods [6,8]. Most of these studies validated the interpolation in a limited range of directions
in terms of azimuth and elevation angles, and some of the models have high computational
complexity. Even though the proposed algorithm does not create or model an individualized set of HRTFs for a subject, it can be used to find the closest set of HRTFs among available HRTF databases that have already been measured at different institutions and labs around the world. Thus, it can save time and effort while providing a good approximation of individualized HRTFs for a subject.
4. Conclusions
Spectral notches of HRTFs play important roles as spectral cues for sound-source
localization for humans. Accurate auto-detection and estimation of the spectral notches
is considered an important step to check the similarity between HRTFs of a certain
subject and ones in databases in order to find a suitable HRTF set for that subject.
Wavelet multi-resolution analysis using the first three decomposition levels of the Symlet2 to Symlet8, Daubechies2, and Daubechies3 wavelets has been used successfully to auto-detect the frequencies of spectral notches in the HRTFs.
Symlet2 outperformed the other tested wavelets in terms of auto-detection capability, detecting all spectral notches in all tested HRTFs. Most of the auto-detected notches were found at the exact central frequency of the notch. Nevertheless, future work remains to validate the proposed method through subjective listening tests, as well as to test more directions and more subjects.
REFERENCES
[1] Middlebrooks J.C., Green D.M., 1992, Observations on a principal components analysis of head-related transfer functions, J. Acoust. Soc. Am., Vol. 92, No. 1, pp. 597-599.
[2] Krebber W., Gierlich H.W., Genuit K., 2000, Auditory virtual environments: basics and applications for interactive simulations, Signal Processing, Vol. 80, No. 11, pp. 2307-2322.
[3] Doerr K.U., Rademacher H., Huesgen S., 2007, Evaluation of a low-cost 3D sound system for immersive virtual reality training systems, IEEE Trans. on Visualization and Computer Graphics, Vol. 13, No. 2, pp. 204-212.
[4] Jones D.L., Stanney K.M., Foaud H., 2005, An optimized spatial audio system for virtual training simulations: design and evaluation, in Proceedings of the Eleventh Meeting of the International Conference on Auditory Display (ICAD 05), Limerick, Ireland, pp. 223-227.
[5] Xie B., 2013, Head-Related Transfer Function and Virtual Auditory Display, J. Ross Publishing, Plantation, FL, USA.
[6] Gamper H., 2013, Head-related transfer function interpolation in azimuth, elevation, and distance, J. Acoust. Soc. Am., Vol. 134, pp. 547-554.
[7] Al-Sheikh B.W., Matin M.A., Tollin D.J., 2009, All-pole and all-zero models of human and cat head related transfer functions, Proc. SPIE 7444, Mathematics for Signal and Information Processing, 74440X.
[8] Al-Sheikh B., Matin M.A., Tollin D.J., 2019, Head related transfer function interpolation based on finite impulse response models, Seventh International Conference on Digital Information Processing and Communications (ICDIPC), Trabzon, Turkey, pp. 8-11.
[9] Algazi V.R., Avendano C., Duda R.O., 2001, Elevation localization and head-related transfer function analysis at low frequencies, J. Acoust. Soc. Am., Vol. 109, No. 3, pp. 1110-1122.
[10] Raykar V.C., Duraiswami R., Yegnanarayana B., 2005, Extracting the frequencies of the pinna spectral notches in measured head related impulse responses, J. Acoust. Soc. Am., Vol. 118, No. 1, pp. 364-374.
[11] Hebrank J., Wright D., 1974, Spectral cues used in the localization of sound sources on the median plane, J. Acoust. Soc. Am., Vol. 56, No. 6, pp. 1829-1834.
[12] Langendijk E.H., Bronkhorst A.W., 2002, Contribution of spectral cues to human sound localization, J. Acoust. Soc. Am., Vol. 112, No. 4, pp. 1583-1596.
[13] Middlebrooks J.C., 1999, Individual differences in external-ear transfer functions reduced by scaling in frequency, J. Acoust. Soc. Am., Vol. 106, No. 3, pp. 1480-1492.
[14] Algazi V.R., Duda R.O., Satarzadeh P., 2007, Physical and filter pinna models based on anthropometry, in Proc. AES 122nd Convention.
[15] Pal S., Mitra M., 2010, Detection of ECG characteristic points using multiresolution wavelet analysis based selective coefficient method, Measurement, Vol. 43, No. 2, pp. 255-261.
[16] Sammaiah A., Narsimha B., Suresh E., Reddy M.S., 2011, On the performance of wavelet transform improving eye blink detections for BCI, International Conference on Emerging Trends in Electrical and Computer Technology, Nagercoil, pp. 800-804.
[17] Algazi V.R., Duda R.O., Thompson D.M., Avendano C., 2001, The CIPIC HRTF database, IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 21-24.
[18] Kulkarni A., Isabelle S.K., Colburn H.S., 1999, Sensitivity of human subjects to head-related transfer-function phase spectra, J. Acoust. Soc. Am., Vol. 105, No. 5, pp. 2821-2840.
[19] Kistler D.J., Wightman F.L., 1992, A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction, J. Acoust. Soc. Am., Vol. 91, No. 3, pp. 1637-1647.
[20] Fugal D.L., 2009, Conceptual Wavelets in Digital Signal Processing: An In-depth Practical Approach for the Non-Mathematician, Space & Signals Technical Publishing, San Diego, CA.
[21] Saritha C., Sukanya V., Murthy Y.N., 2008, ECG signal analysis using wavelet transforms, Bulg. J. Phys., Vol. 35, pp. 68-77.
[22] Hebrank J., Wright D., 1974, Spectral cues used in the localization of sound sources on the median plane, J. Acoust. Soc. Am., Vol. 56, No. 6, pp. 1829-1834.
Authors
Bahaa Al-Sheikh received a B.Sc. degree in electronics engineering from Yarmouk
University, Jordan, an MSc in electrical engineering from Colorado State University,
Colorado, USA, and a PhD in biomedical engineering from the University of Denver,
Colorado, USA, in 2000, 2005, and 2009, respectively. Between 2009 and 2015, he worked
for Yarmouk University as an assistant professor in the department of Biomedical Systems
and Medical Informatics Engineering and served as the department chairman between
2010 and 2012. He served as a part-time consultant for Sand-hill Scientific Inc.,
Highlands Ranch, Colorado, USA, in biomedical signal processing between 2009 and 2014.
Currently, he is an associate professor at the Electrical Engineering Department at
the American University of the Middle East in Kuwait. His research interests include
digital signal and image processing, biomedical systems modeling, medical instrumentation,
and sound-source localization systems.
Mohammad Shukri Salman received B.Sc., M.Sc. and Ph.D. degrees in electrical and
electronics engineering from Eastern Mediterranean University (EMU) in 2006, 2007,
and 2011, respectively. From 2006 to 2010, he was a teaching assistant in the Electrical
and Electronics Engineering department at EMU. In 2010, he joined the Department of
Electrical and Electronic Engineering at European University of Lefke (EUL) as a senior
lecturer. From 2011 to 2015, he worked as an assistant professor in the Department of Electrical and Electronics Engineering, Mevlana (Rumi) University, Turkey. Currently, he is an associate professor in the Electrical Engineering Department at the American University of the Middle East in Kuwait. He has served as a general chair, program chair, and TPC
member for many international conferences. His research interests include signal processing,
adaptive filters, image processing, sparse representation of signals, control systems,
and communications systems.
Alaa Eleyan received B.Sc. and M.Sc. degrees in electrical & electronics engineering
from Near East University, Northern Cyprus, in 2002 and 2004, respectively. In 2009,
he finished his PhD degree in electrical and electronics engineering at Eastern Mediterranean
University, Northern Cyprus. Dr. Eleyan did his post-doctorate studies at Bilkent
University in 2010. He has nearly two decades of working experience as both a research
assistant and faculty member in different universities in Northern Cyprus and Turkey.
Currently, he is working as an associate professor at Ankara Science University in
Turkey. His current research interests are computer vision, signal & image processing,
pattern recognition, machine learning, and robotics. He has more than 60 published
journal articles and conference papers in these research fields. Dr. Eleyan has served
as a general chair of many international conferences, such as ICDIPC2019, DIPECC2018,
TAEECE2018, and DICTAP2016.