Open access
Research Article
14 December 2020

Measurements of Transmission Characteristics Related to Bone-Conducted Speech Using Excitation Signals in the Oral Cavity

Publication: Journal of Speech, Language, and Hearing Research
Volume 63, Number 12
Pages 4252-4264

Abstract

Purpose

Psychoacoustical studies on transmission characteristics related to bone-conducted (BC) speech, perceived by speakers during vocalization, are important for further understanding the relationship between speech production and perception, especially auditory feedback. For exploring how the outer ear part contributes to BC speech transmission, this article aims to measure the transmission characteristics of bone conduction focusing on the vibration of the regio temporalis (RT) and sound radiation in the ear canal (EC) due to the excitation in the oral cavity (OC).

Method

While an excitation signal was presented through a loudspeaker located in the enclosed cavity below the hard palate, transmitted signals were measured on the RT and in the EC. The transfer functions of the RT vibration and EC sound pressure relative to OC sound pressure were determined from the measured signals using the sweep-sine method.

Results

Our findings obtained from the measurements of five participants are as follows: (a) the transfer function of the RT vibration relative to the OC sound pressure attenuated the frequency components above 1 kHz and (b) the transfer function of the EC relative to the OC sound pressure emphasized the frequency components between 2 and 3 kHz.

Conclusions

The vibration of the soft tissue or the skull bone has an effect of low-pass filtering, whereas the sound radiation in the EC has an effect of 2–3 kHz bandpass filtering. Considering the perceptual effect of low-pass filtering in BC speech, our findings suggest that the transmission to the outer ear may not be a dominant contributor to BC speech perception during vocalization.
During speaking, humans perceive their own voices to control their speech production systems (Denes & Pinson, 1993). To clarify human speaking/hearing mechanisms, the relationship between speech production and perception needs to be better understood. There are two different types of sound transmission of one's own voice: air-conducted (AC) and bone-conducted (BC) speech. While the voice is generated by the larynx, modified by the vocal tract, and then transmitted as AC speech to the auditory system through lip radiation and diffraction, the sound inside the vocal tract is also transmitted to the auditory system through the soft tissue and the skull bone as BC speech. To further explore these mechanisms, both AC and BC speech transmission processes during the perception of one's own voice need to be understood.
Previous investigations related to own-voice perception have focused only on AC speech perception (Chen et al., 2007; Lee, 1950; Okazaki et al., 2010). However, a strong influence of BC speech transmission on the perception of one's own voice has been pointed out (Madaule, 2001; Sundberg, 1987). Recently, the effect of BC speech perception on one's speech production has been investigated using a delayed auditory feedback technique (Toya et al., 2016), showing that BC speech perception affects one's speech production similarly to AC speech perception. However, the acoustical characteristics and transmission process of BC speech have not been well determined.
Several studies have been conducted for the purpose of understanding the characteristics of own-voice perception focusing on two different types of transmission pathways (AC and BC). von Békésy (1949) showed that BC speech transmission has almost the same order of magnitude as AC speech transmission. Later, Pörschmann (2000) investigated the perceptual relationship between AC and BC speech as a function of frequency, concluding that the magnitude relationship between AC and BC speech is frequency dependent but that BC speech has perceptually greater magnitude at frequencies between 0.7 and 1.2 kHz than AC speech. Reinfeldt et al. (2010) also investigated the perceptual relationship between AC and BC speech during 10-phoneme vocalization on the basis of the ear canal (EC) sound pressure at hearing thresholds for AC and BC stimulations. The relationship was shown to be phoneme dependent, but the magnitude of BC speech dominated at frequencies between 1 and 2 kHz in most cases. This frequency dependency is thought to be caused by AC and BC speech transmission pathways having completely different frequency characteristics. Here, one's own perceived voice (i.e., the combination of AC and BC speech) is assumed to have a perceptually low-pass effect relative to AC speech alone (Nakayama, 1997). This suggests that BC pathways may mainly have a low-pass effect. The systematic factors characterizing this effect in the BC speech pathways should be carefully determined.
From physiological studies, BC stimulation at the skull bone is assumed to be transmitted to the outer, middle, and inner ear through multiple pathways (Tonndorf, 1976). On the basis of this, the transmission characteristics of the outer and middle ear during BC stimulation have been measured using cadaver heads or several parts of the auditory peripheral system (e.g., Stenfelt, 2015; Stenfelt et al., 2002; Stenfelt et al., 2003). On the other hand, in the case of BC speech transmission during vocalization, the sound source should be regarded as the sound field in the vocal tract, rather than the skull bone. In this case, while AC speech is transmitted from the lips to pinna by sound diffraction outside the head, the sound pressure inside the vocal tract stimulates the soft tissue and the skull bone to cause the outer, middle, and inner ear parts of BC speech transmission. Therefore, BC speech can be produced as a result of spectral modifications during the vibration/sound pressure transmission from the glottal source to each part of the auditory system through the vocal tract, soft tissue, or skull bone.
Stenfelt (2011) previously proposed a model of BC speech transmission pathways for one's own voice. This model shows the relationships between the soft tissue/skull bone vibration and each part of the auditory system on the basis of the physiological aspects of BC hearing. Vibrations of the soft tissue and of the skull bone were each hypothesized to produce EC sound pressure. Moreover, the skull bone vibration was hypothesized to cause inertial forces in both the middle ear ossicles and the cochlear fluid. However, the contribution of the speech production process (i.e., glottal source energy or vocal tract sound pressure) to BC speech perception still has not been carefully considered.
Here, our hypothesized model of the transmission process for AC and BC speech during vocalization is shown in Figure 1. In AC speech transmission, the glottal source signal is assumed to be modified by the vocal tract filter and the lip radiation characteristics to reach the outer ear, according to source–filter theory (Kent & Read, 1992). In BC speech transmission, on the other hand, the sound pressure in the oral cavity (OC) is assumed to excite the soft tissue and the skull bone to reach the outer, middle, and inner ear. At that time, the mechanical vibration of the vocal folds is also assumed to excite the soft tissue. Although the transmission characteristics of the middle and inner ear parts cannot be directly measured, the transmission characteristics from the speech production systems to the outer ear through the soft tissue and the skull bone can be measured physically, which leads to further understanding the relationship between speech production and perception.
Figure 1. Our assumed model of transmission pathways of AC and BC speech. In AC speech transmission, the glottal source signal is assumed to be modified by the vocal tract filters and the lip radiation characteristics to reach the outer ear through diffraction. In BC speech transmission, on the other hand, the sound pressure in the OC is assumed to cause vibration of the soft tissue and the skull bone to reach the outer, middle, and inner ear. In this study, mechanical coupling of the vocal folds and soft tissue is ignored. Three measurement positions (OC, RT, and EC) are also marked.
For the observable transmission processes, this study focuses on the vibration of the regio temporalis (RT) and the sound radiation in the EC. In fact, the previous model of BC transmission hypothesized that both the skin/soft tissue vibration and skull bone vibration independently produce the EC sound pressure (Stenfelt, 2011). In investigations with live humans, both skin/soft tissue and skull bone vibrations are difficult to measure clearly in isolation from each other. Therefore, this article hypothesized a model with the skin/soft tissue and the skull bone as a single composite system to determine the characteristics of the whole composite system in vivo. Under the hypothesis, the authors previously investigated the transmission characteristics between speech production processes (i.e., both vocal fold vibration and the OC sound pressure) and the RT vibration and EC sound radiation, using transcutaneous excitation on the neck (Toya et al., 2019). The outcome from the study showed the relative low-pass effect for the RT vibration and the 1- to 3-kHz bandpass effect for the EC sound radiation. However, the extent of the contribution of the OC sound pressure to BC speech transmission has not been clarified. This type of contribution should be independently investigated since the sound transmission within the vocal tract strongly affects the representation of the phonemic information.
This study aims to investigate the transmission characteristics of BC speech focusing on how the OC sound pressure originating from the glottal source affects the BC transmission to the RT and EC, for further understanding the transmission process related to sound pressure from the glottal source to the soft tissue/skull bone and the EC. The transfer functions from the OC to RT (|H ocrt(f)|) and those from the OC to EC (|H ocec(f)|) are physically measured, to explore whether/how the transmission characteristics assayed at the RT or in the EC are strongly related to the perceptual low-pass effect of BC speech.
This article is organized as follows. In the Method section, we describe the methods for measuring the transmission characteristics of the soft tissue/skull bone and of the EC. In the Results section, we present the measurement results of these transmission characteristics. In the Discussion section, we interpret the trends in the physical transmission characteristics and discuss the relationships between the obtained results and previous physiological/psychological findings. In the last section, we draw conclusions.

Method

Participants

Five university students (three men and two women, aged 23–27 years) participated in the measurements of transmission characteristics. All participants were native Japanese speakers who had normal hearing, and none had a speaking disorder.

Materials and Apparatus

Experimental Excitation Setup in the OC

Figure 2 shows the excitation setup for the measurements. A small loudspeaker (VECO 32KC08-1) was used for excitation in the OC. The loudspeaker was enclosed in a rubber enclosure. The purpose of the rubber enclosure was to generate a quasistatic sound field in the OC and induce vibrations of the palate (and hence the whole skull) above. The enclosure for each participant was individually designed and molded, using a plaster replica of the participant's teeth and upper jaw, so that it fitted the hard palate well. The hardness of the rubber enclosure was Shore 30 A, which was assumed to be more flexible than the hard palate. A probe microphone (Etymotic Research ER-10C) with an ear-tip (ER10C-03) was used to measure the sound pressure actually generated inside the enclosure.
Figure 2. The excitation setup for measuring the transmission characteristics. The loudspeaker (VECO 32KC08-1) was embedded in an enclosure fitted to the hard palate to make a quasistatic sound field in the oral cavity. The probe microphone was also inserted in the sound field for the measurements.

Apparatus

Figure 3 schematically illustrates the setup for the measurements. The measurements were conducted in a soundproof room. Excitation signals (see below) were generated by software (Steinberg Cubase Pro 9) running on a PC (LG Sharkoon, with Windows 10) to drive the loudspeaker in the participants' OCs via an audio interface (Steinberg UR44). The response signals were simultaneously recorded on the skin over the left RT through a BC microphone (TEMCO HG70), which is a kind of accelerometer, and in the right EC through another probe microphone (the same type as used above).
Figure 3. Experimental setup for the measurements of the transmission characteristics. While the loudspeaker was driven in the excitation setup (shown in Figure 2), the response signals were simultaneously recorded on the three microphones.
Figure 4 schematically illustrates the layout of the loudspeaker and the microphones. The HG70 BC microphone was fixed by a belt attachment. The ER-10C probe for measuring EC sound pressure had a foam ear-tip (ER10C-14A) to occlude the EC. The purpose of the occlusion was to attenuate the AC sound leaking from the excitation setup to the EC microphone. The probe was inserted in the EC by at least half of the probe ear-tip length (7 mm) and at most the full length (14 mm).
Figure 4. Layout of the loudspeaker and the microphones during measurements. The excitation setup (shown in Figure 2) was located in the oral cavity. The bone-conduction microphone (TEMCO HG70) and the probe microphone (Etymotic Research ER-10C) were located on the left regio temporalis and in the right ear canal, respectively, to measure the response signals. RT = regio temporalis.
The recorded signals were amplified by built-in amplifiers. The sampling frequency was 44.1 kHz, and the number of quantizing bits was 16.

Measurements

The participants were asked to close their mouths as much as possible with the enclosed loudspeakers fixed in their OCs. A logarithmic sweep-tone signal lasting 15 s, which had a steady-state duration of 13 s and rise/fall times of 1 s each, was used for excitation. The total frequency band of the signal was 0.1 to 7 kHz, which includes the frequency band valid for the equipment (0.2–5 kHz). During excitation, the vibration of the participants' RTs and the sound pressure in their ECs were measured. The total number of measurement trials for each participant was 10.
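As an illustration of the excitation signal described above, the following minimal sketch generates a 15-s logarithmic sweep from 0.1 to 7 kHz at 44.1 kHz with 1-s rise and fall times. The raised-cosine ramp shape and the function name `log_sweep` are assumptions made for this example; the text does not state how the ramps were shaped.

```python
import numpy as np

def log_sweep(f_start=100.0, f_end=7000.0, duration=15.0, ramp=1.0, fs=44100):
    """Return (t, x): a logarithmic sine sweep with rise/fall ramps."""
    n = int(round(duration * fs))
    t = np.arange(n) / fs
    rate = np.log(f_end / f_start)
    # Phase whose derivative gives an instantaneous frequency rising
    # exponentially from f_start (t = 0) to f_end (t = duration).
    phase = 2.0 * np.pi * f_start * duration / rate * (np.exp(t * rate / duration) - 1.0)
    x = np.sin(phase)

    # Assumed envelope: raised-cosine fade-in and fade-out of `ramp` seconds each.
    n_ramp = int(round(ramp * fs))
    fade = 0.5 * (1.0 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    envelope = np.ones(n)
    envelope[:n_ramp] = fade
    envelope[n - n_ramp:] = fade[::-1]
    return t, x * envelope

t, x = log_sweep()
# Duration of the full-amplitude part between the two 1-s ramps.
steady_state_s = (len(x) - 2 * int(round(1.0 * 44100))) / 44100
```

With the default arguments the full-amplitude portion lasts 13 s, matching the steady-state duration given in the text.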
The input voltage of the excitation setup was determined so that the output signal from the ER-10C microphone in the rubber enclosure was not distorted, and the maximum sound pressure levels (SPLs) in the rubber enclosure (i.e., the OC) were approximately 100 dB, which is assumed to be typical for the SPL in the OC due to loud speaking.

Analysis

Let x(t) denote the logarithmic sweep-tone signal and y_p(t) (p ∈ {OC, RT, EC}) the response signals at the positions OC, RT, and EC shown in Figure 1. The impulse response at each position, h_p(t), was calculated as follows:
$$h_p(t) = y_p(t) * x_{\mathrm{inv}}(t), \tag{1}$$
where x_inv(t) is a time-mirrored and amplitude-modified version of x(t) (Farina, 2000) and "*" denotes convolution. The spectra of the response signals, H_p(f), were calculated as follows:
$$H_p(f) = \mathcal{F}[h_p(t)], \tag{2}$$
where $\mathcal{F}[\cdot]$ denotes the Fourier transform. In this study, the absolute value |H_p(f)| (i.e., the amplitude spectrum) was used to determine the transmission characteristics. Here, |H_p(f)| was averaged across the 10 measurements for each participant.
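Equations 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the peak normalization of the inverse filter, and the toy "system" (a 40-sample delay with a gain of 0.5) are assumptions. The inverse filter is the time-mirrored sweep multiplied by an exponentially decaying gain, which is one common reading of the sweep method attributed to Farina (2000).

```python
import numpy as np
from scipy.signal import fftconvolve

def exp_sweep(f1, f2, duration, fs):
    """Logarithmic sine sweep from f1 to f2 Hz."""
    t = np.arange(int(round(duration * fs))) / fs
    rate = np.log(f2 / f1)
    return np.sin(2.0 * np.pi * f1 * duration / rate * (np.exp(t * rate / duration) - 1.0))

def inverse_filter(x, f1, f2, fs):
    """Time-mirrored sweep with an exponentially decaying gain."""
    duration = len(x) / fs
    t = np.arange(len(x)) / fs
    x_inv = x[::-1] * np.exp(-t * np.log(f2 / f1) / duration)
    # Assumed scaling: the sweep convolved with its inverse filter peaks at 1.
    return x_inv / np.max(np.abs(fftconvolve(x, x_inv)))

def impulse_response(y, x_inv):
    """Equation 1: h_p(t) = y_p(t) * x_inv(t)."""
    return fftconvolve(y, x_inv)

def amplitude_spectrum(h, fs):
    """Magnitude of Equation 2: |H_p(f)| = |F[h_p(t)]|."""
    return np.fft.rfftfreq(len(h), 1.0 / fs), np.abs(np.fft.rfft(h))

# Toy check with a known system: a 40-sample delay and a gain of 0.5.
fs, f1, f2 = 8000, 100.0, 3000.0
x = exp_sweep(f1, f2, 2.0, fs)
x_inv = inverse_filter(x, f1, f2, fs)
y = np.concatenate([np.zeros(40), 0.5 * x])
h = impulse_response(y, x_inv)
reference_peak = int(np.argmax(np.abs(fftconvolve(x, x_inv))))
f, H = amplitude_spectrum(h, fs)
```

In the toy check, the recovered impulse response peaks 40 samples after the reference peak with a height of 0.5. Averaging |H_p(f)| over repeated trials would then be a mean over the trial axis.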
The response signal of the RT (y RT(t)) was derived from the acceleration of the RT vibration since the BC microphone (i.e., the accelerometer) converts acceleration into voltage. On the other hand, the response signals y OC(t) and y EC(t) were derived from the sound pressure, as measured by the microphone. Theoretically, the responses of the accelerometer at the RT and that of the microphone in the EC cannot be quantitatively compared. However, the frequency responses from capacitor microphones (i.e., sound pressure response) were reported to be similar to the double integral of those from accelerometers (i.e., displacement response of vibration) for mechanomyogram measurements (Watanabe et al., 2001). On the basis of this relationship, the absolute acceleration response of the BC microphone |H RT(f)| was converted into the absolute displacement response |H RT_d(f)| as follows:
$$|H_{\mathrm{RT\_d}}(f)| = \frac{|H_{\mathrm{RT}}(f)|}{(2\pi f)^2}, \tag{3}$$
where the division by (2πf)^2 in the frequency domain corresponds to double integration in the time domain. The gain of |H_RT_d(f)| obtained from Equation 3 was then calibrated so that its total power |H_RT_d(f)|^2 in the valid frequency band (0.2–5 kHz) equals that of the original absolute acceleration response, |H_RT(f)|^2. This manipulation retains the total power of the RT output signal. Note that the displacement response |H_RT_d(f)| for the RT vibration and the sound pressure response |H_EC(f)| for the EC sound radiation can be compared only in terms of relative spectral shape, not quantitatively.
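The conversion in Equation 3 and the subsequent power calibration can be illustrated with the short sketch below. The function name and the choice to sum power over discrete frequency bins are assumptions; the text specifies only that total power in the 0.2–5 kHz band is preserved.

```python
import numpy as np

def acceleration_to_displacement(f, H_acc, band=(200.0, 5000.0)):
    """Apply Equation 3 inside `band`, then rescale to keep total in-band power."""
    f = np.asarray(f, dtype=float)
    H_acc = np.asarray(H_acc, dtype=float)
    in_band = (f >= band[0]) & (f <= band[1])
    f_b, acc_b = f[in_band], H_acc[in_band]
    # Double integration in time corresponds to division by (2*pi*f)^2.
    disp_b = acc_b / (2.0 * np.pi * f_b) ** 2
    # Gain chosen so that the summed |H|^2 matches the acceleration response.
    gain = np.sqrt(np.sum(acc_b ** 2) / np.sum(disp_b ** 2))
    return f_b, gain * disp_b

# Toy input: a flat acceleration spectrum sampled at octave spacing.
f = np.array([100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0, 6400.0])
H_acc = np.ones_like(f)
f_b, H_disp = acceleration_to_displacement(f, H_acc)
```

For a flat acceleration spectrum, the displacement magnitude drops by a factor of 4 (about 12 dB) per octave, while the summed in-band power is unchanged by construction.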

Compensation

The BC microphone (TEMCO HG70) had a nonflat acceleration sensitivity in the frequency band between 0.2 and 5 kHz, as shown in Figure 5. This sensitivity was measured in advance, using the same equipment and method as the microphone-response measurements reported by Shimizu et al. (2009). The acceleration response of the RT vibration |H RT(f)| was compensated for this frequency sensitivity before Equation 3 was applied.
Figure 5. Frequency sensitivity of the bone-conduction microphone (TEMCO HG70) used in the measurement. This sensitivity was measured in advance by using the same equipment and method as the microphone response measurements by Shimizu et al. (2009).
When the sound radiation in the EC due to the excitation was recorded by the probe microphone, the participants' ECs were occluded. The open and occluded EC sound pressures have different frequency characteristics due to the occlusion effect. Figure 6 shows the frequency characteristics of the occlusion effect estimated from the measurements by Stenfelt, Wild, et al. (2003), which were derived from nine normal intact ECs with BC stimulation from a transducer on the skull and a probe insertion depth of 8 mm. To estimate the transmission characteristics with the ECs open, the amplitude spectrum of the EC sound radiation |H EC(f)| was compensated for the occlusion effect.
Figure 6. Frequency characteristics of the occlusion effect in the ear canal (EC). It was derived from the measurement of the EC sound pressure induced by the mastoid BC stimulation for nine normal intact ECs with the case of 8-mm insertion of probes, conducted by Stenfelt, Wild, et al. (2003).
Table 1 shows the adopted compensation data and the compensated outputs with the frequency ranges. Here, since ER-10C microphone sensitivity is reported to be nearly flat in the frequency region from 0.2 to 10 kHz, the ER-10C microphone response was not compensated for besides the occlusion effect.
Table 1. Adopted compensation data and the compensated outputs with the frequency ranges.
Compensated output	Compensation data	Frequency range for compensation
HG70 accelerometer output (|H_RT(f)|)	Frequency response of HG70 (see Figure 5)	200–5000 Hz
ER-10C microphone output (|H_EC(f)|)	Occlusion effect measured by Stenfelt, Wild, et al. (2003) for 8-mm insertion of probes (see Figure 6)	200–5000 Hz
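Both compensations in Table 1 amount to removing a known frequency-dependent curve from a measured amplitude spectrum. A minimal sketch, assuming the curve is given in decibels at a few reference frequencies and interpolated on a log-frequency axis, is shown below. The curve values are hypothetical placeholders, not the data of Figure 5 or Figure 6, and the interpolation scheme is an assumption.

```python
import numpy as np

def compensate(f, H, f_ref, ref_db, band=(200.0, 5000.0)):
    """Subtract a reference curve (in dB) from an amplitude spectrum inside `band`."""
    f = np.asarray(f, dtype=float)
    H = np.asarray(H, dtype=float)
    in_band = (f >= band[0]) & (f <= band[1])
    # Interpolate the correction curve on a log-frequency axis.
    corr_db = np.interp(np.log10(f[in_band]), np.log10(f_ref), ref_db)
    H_db = 20.0 * np.log10(H[in_band]) - corr_db
    return f[in_band], 10.0 ** (H_db / 20.0)

# Hypothetical correction curve (placeholder values only).
f_ref = np.array([200.0, 1000.0, 5000.0])
ref_db = np.array([6.0, 0.0, -6.0])

# Toy measurement whose in-band shape equals the curve, so the result is flat.
f = np.array([100.0, 200.0, 1000.0, 5000.0, 7000.0])
H = 10.0 ** (np.array([0.0, 6.0, 0.0, -6.0, 0.0]) / 20.0)
f_c, H_c = compensate(f, H, f_ref, ref_db)
```

The same routine would serve for the accelerometer sensitivity and for the occlusion effect, since in both cases the curve describes what the measurement adds relative to the desired quantity.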

Calculating Transfer Functions Relative to OC Sound Pressure

From the obtained amplitude spectra, the transfer functions of the RT vibration and the EC sound pressure relative to the OC sound pressure were calculated as follows:
$$|H_{\mathrm{ocrt}}(f)| = \frac{|H_{\mathrm{RT\_d}}(f)|}{|H_{\mathrm{OC}}(f)|}, \tag{4}$$
$$|H_{\mathrm{ocec}}(f)| = \frac{|H_{\mathrm{EC}}(f)|}{|H_{\mathrm{OC}}(f)|}. \tag{5}$$
The obtained transfer functions were compared among speakers. Additionally, the averaged magnitude differences between |H ocrt(f)| and |H ocec(f)| were calculated as mean(|H_ocec(f)|)/mean(|H_ocrt(f)|), where mean(⋅) denotes the average across participants.
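The ratios in Equations 4 and 5 and the averaged magnitude difference can be sketched as below. The arrays are toy values, and averaging the linear magnitudes (rather than the decibel values) across participants is an assumption; the text does not specify which was used.

```python
import numpy as np

def transfer_function(H_out, H_oc):
    """Equations 4 and 5: output amplitude spectrum relative to the OC spectrum."""
    return np.asarray(H_out, dtype=float) / np.asarray(H_oc, dtype=float)

def averaged_difference_db(H_ocrt, H_ocec):
    """mean(|H_ocec|) / mean(|H_ocrt|) across participants (axis 0), in dB."""
    ratio = np.mean(H_ocec, axis=0) / np.mean(H_ocrt, axis=0)
    return 20.0 * np.log10(ratio)

# Toy data: rows are participants, columns are frequency bins.
H_oc = np.array([[2.0, 2.0], [4.0, 4.0]])
H_rt_d = np.array([[2.0, 2.0], [4.0, 4.0]])
H_ec = np.array([[20.0, 0.2], [40.0, 0.4]])

H_ocrt = transfer_function(H_rt_d, H_oc)
H_ocec = transfer_function(H_ec, H_oc)
diff_db = averaged_difference_db(H_ocrt, H_ocec)
```

Positive values of `diff_db` indicate bins where the EC transfer function exceeds the RT transfer function.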

Results

SPLs in the OC

Figure 7 shows the OC SPLs in the rubber enclosure for five participants. Figures 7a–7e show the OC SPLs for participants P01 to P05, respectively. Figure 7f shows the average across the five participants. Two thin-dashed lines in each figure represent the maximum level and the level 10 dB below the maximum. The thin-darkened areas in Figures 7a–7e show the frequency bands in which the SPL was more than 10 dB below the maximum. The thin-darkened areas in Figure 7f show the union of the areas in Figures 7a–7e. Except for participant P05, the OC SPLs were approximately between 90 and 100 dB in the frequency band below 3 kHz. On the other hand, the OC SPLs partially decreased by approximately 20 dB in the frequency band above 3 kHz. For participant P05, the OC SPL varied between 72.8 and 100 dB in the frequency band below 2.5 kHz. It should be noted that those nonflat characteristics, especially the two troughs above 3 kHz for all participants and the overall shape for P05, could affect the results of the measurement. These results show that the quasistatic sound field in the OC could hold up to approximately 3 kHz.
Figure 7. Oral cavity (OC) sound pressure levels in the rubber enclosure: (a) participant P01, (b) participant P02, (c) participant P03, (d) participant P04, (e) participant P05, and (f) average across five participants. Each light-colored area in (a)–(e) represents the standard deviation. Two thin-dashed lines in each figure represent the maximum level and the level 10 dB below the maximum. The thin-darkened areas in (a)–(e) show the frequency bands in which the sound pressure level was more than 10 dB below the maximum. The thin-darkened areas in (f) show the union of the areas in (a)–(e). SPL = sound pressure level.
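The darkened bands described for Figure 7 can be thought of as a per-participant mask of frequency bins lying more than 10 dB below that participant's maximum SPL, combined across participants by a logical union. The sketch below uses made-up SPL values and a hypothetical function name to show the idea.

```python
import numpy as np

def low_level_mask(spl_db, drop_db=10.0):
    """True where the SPL lies more than `drop_db` below its own maximum."""
    spl_db = np.asarray(spl_db, dtype=float)
    return spl_db < spl_db.max() - drop_db

# Toy SPLs (dB): rows are participants, columns are frequency bins.
spl = np.array([
    [100.0, 95.0, 85.0, 98.0],
    [100.0, 92.0, 99.0, 88.0],
])
masks = np.array([low_level_mask(row) for row in spl])
union = masks.any(axis=0)  # bins flagged for at least one participant
```

Here the third bin is flagged for the first participant and the fourth bin for the second, so the union covers both.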

Transfer Functions From OC to RT and EC

Figure 8 shows the measured transfer functions |H ocrt(f)| and |H ocec(f)| for five participants. Figures 8a– 8e show the results for participants P01–P05, respectively. Figure 8f shows the results averaged across the five participants. The dotted and solid lines represent the averaged transfer functions across the 10 repetitions. Each light-colored area represents the standard deviation of each transfer function. Thin-darkened frequency bands correspond to those in Figure 7.
Figure 8. Measured transfer functions of the regio temporalis vibration and the ear canal sound pressure relative to the oral cavity (OC) sound pressure (i.e., |H ocrt(f)| and |H ocec(f)|): (a) participant P01, (b) participant P02, (c) participant P03, (d) participant P04, (e) participant P05, and (f) average across five participants. Each light-colored area in (a)–(e) represents the standard deviation. Thin-darkened frequency bands correspond to those in Figure 7.
The transfer function |H ocrt(f)| had a magnitude between approximately −40 and 20 dB for most participants. In most cases, the frequency components below 250 Hz were partially emphasized, while the frequency components above 1 kHz were gradually attenuated as the frequency increased. Additionally, two common peaks were found around 0.7 and 1.4 kHz.
The transfer function |H ocec(f)| had a magnitude between approximately −50 and 0 dB for all participants. In most cases, the frequency components below 500 Hz were gradually attenuated as the frequency increased, while the amplitude increased between 1 and 3 kHz. Overall, as shown in Figure 8f, a common peak around 3 kHz was found. For participants P02 (see Figure 8b) to P04 (see Figure 8d), another peak around 2.5 kHz was also found. On the other hand, for participant P05 (see Figure 8e), the above trends were not found.
Figure 9 compares the measured transfer functions (|H ocrt(f)| and |H ocec(f)|) and those obtained by transcutaneous excitation on the neck in the previous study. The solid line shows the same transfer functions as shown in Figure 8f. The dotted line shows the mean transfer functions from the larynx to the RT/EC across seven participants, which were derived from the RT vibration and the EC sound radiation induced by the direct vibration on the neck close to the larynx (Toya et al., 2019). Thin-darkened frequency bands correspond to those in Figure 7f. Here, it should be taken into account that those two transfer functions can only be compared with respect to the relative shapes of the magnitudes since the excitation methods were different.
Figure 9. Comparison of measured transfer functions (|H ocrt(f)| and |H ocec(f)|) with the previously measured transfer function, obtained by transcutaneous excitation on the neck. The solid line shows the same transfer functions as shown in Figure 8f. The dotted line shows the mean transfer functions from the larynx to the regio temporalis (RT)/ear canal (EC) across seven participants, which were derived from the RT vibration and the EC sound radiation induced by the direct vibration on the neck close to the larynx (Toya et al., 2019). Thin-darkened frequency bands correspond to those in Figure 7f.
As shown in Figure 9a, the averaged |H ocrt(f)| has two peaks at approximately 500 and 800 Hz and a trough at approximately 1.2 kHz. Those trends were similarly observed in the averaged transfer function from the larynx to the RT measured previously. Overall, |H ocrt(f)| had a global magnitude attenuation of approximately −7.8 dB/octave, while the transfer function from the larynx to the RT had approximately −4.9 dB/octave attenuation.
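A global attenuation figure in dB/octave can be estimated by fitting a straight line to the magnitude in decibels against log2 of frequency. The sketch below shows this on synthetic data; the least-squares fit is an assumed procedure, since the text does not state how the slopes were obtained.

```python
import numpy as np

def slope_db_per_octave(f, H_db):
    """Least-squares slope of magnitude (dB) against log2(frequency)."""
    slope, _intercept = np.polyfit(np.log2(f), H_db, 1)
    return slope

# Synthetic spectrum falling by exactly 7.8 dB per octave from 200 Hz.
f = 200.0 * 2.0 ** np.arange(5)          # 200, 400, 800, 1600, 3200 Hz
H_db = 10.0 - 7.8 * np.log2(f / 200.0)
slope = slope_db_per_octave(f, H_db)
```

On measured data the result would depend on the frequency range chosen for the fit, so the band limits should be reported alongside the slope.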
As shown in Figure 9b, the average |H ocec(f)| increased gradually in magnitude by approximately 15 dB as the frequency increased from 0.5 to 1.4 kHz. A similar trend was observed in the average transfer function from the larynx to the EC measured previously. However, a common peak at 2 kHz appeared in the transfer function from the larynx to the EC, whereas it was not found in |H ocec(f)|.
Figure 10 shows the averaged magnitude difference between |H ocrt(f)| and |H ocec(f)|. Thin-darkened frequency bands correspond to those in Figure 7f. The magnitude differences were about −30 to 20 dB. The frequency components between 0.4 and 1 kHz were most attenuated while the frequency components between 2 and 4 kHz were most emphasized. In the frequency band corresponding to the assumed quasistatic sound field (below 3 kHz), the relative magnitude was greatest approximately between 2 and 2.5 kHz.
Figure 10. Averaged magnitude difference between |H ocrt(f)| and |H ocec(f)|. The difference was obtained as the ratio of the averaged |H ocec(f)| to the averaged |H ocrt(f)| across participants. Thin-darkened frequency bands correspond to those in Figure 7f.

Discussion

Transmission characteristics of the RT vibration and the EC sound radiation induced by the excitation in the OC were measured, hypothesizing the transmission model shown in Figure 1. Regarding the excitation setup shown in Figure 2, effects of the rubber enclosure on the derived results have not been taken into account in this study: The rubber enclosure itself may partly transmit the mechanical vibration of the loudspeaker to the palate, which is a secondary source of excitation of the skull bone. Hence, the obtained transfer functions may differ from those during natural speaking. Overall, as shown in Figure 7, the OC sound pressure was relatively stable in the frequency band below 3 kHz but partially decreased and had two troughs by approximately 20 dB in the frequency band between 3 and 5 kHz, which may correspond to the common peaks at 3.5 kHz shown in the derived transfer functions (see Figure 8). The troughs might come from some interaction between the rubber enclosure and the ER-10C probe. The standing waves in the enclosure may have partially altered the probe output and canceled the desired probe response to the sound pressure inside the enclosure due to phase interaction. The rubber cavity length was 39 mm, but it could be slightly increased/decreased because of its flexibility. Considering this point, the troughs in the OC sound pressure may correspond to the first eigenfrequency (approximately 3.5–4.5 kHz) of the standing waves. Additionally, the rubber enclosure may have altered the transmission and affected the transfer function slightly, since the enclosure touched each participant's palate. The hardness of the rubber was Shore 30 A, which is assumed to be more flexible than their palate. Hence, the rubber could affect the soft tissue/skull bone transmission at lower frequencies. This also might cause the nonflat characteristics of the OC SPLs especially below 0.35 kHz, as shown in Figure 7. 
In this case, not only the sound from the loudspeaker but also the mechanical vibration of the loudspeaker or the enclosure could influence the RT vibration and EC sound pressure. The concerns above should be further investigated in future measurements.
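The link between the 39-mm cavity length and an eigenfrequency of roughly 3.5–4.5 kHz can be checked with a simple tube estimate. The sketch assumes a half-wavelength resonance (both ends acoustically closed) and a sound speed of 343 m/s; neither assumption is stated in the text, and the real enclosure is unlikely to behave as an ideal tube.

```python
def first_eigenfrequency_hz(length_m, speed_of_sound=343.0):
    """First half-wavelength resonance of a tube closed at both ends."""
    return speed_of_sound / (2.0 * length_m)

def length_for_frequency_m(freq_hz, speed_of_sound=343.0):
    """Tube length whose first half-wavelength resonance is `freq_hz`."""
    return speed_of_sound / (2.0 * freq_hz)

f_nominal = first_eigenfrequency_hz(0.039)     # nominal 39-mm cavity
length_low = length_for_frequency_m(3500.0)    # length giving 3.5 kHz
length_high = length_for_frequency_m(4500.0)   # length giving 4.5 kHz
```

Under these assumptions the nominal cavity resonates near 4.4 kHz, and the 3.5–4.5 kHz range corresponds to effective lengths of about 49 mm down to 38 mm.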
Other factors could also contribute to the nonflat characteristics of the OC SPL shown in Figure 7. Although the rubber enclosure in the OC for each participant was individually designed and molded so that it fitted the hard palate well, individual differences in palate shape could influence the characteristics of the sound fields. Even slight movements of the mouth or body may have changed the OC SPL during the measurements, especially for participant P05. If the nonflat characteristics in the thin-darkened frequency bands shown in Figure 7 were offset, more accurate transfer functions might be estimated. Improving the excitation setup (see Figure 2) may reduce the nonflat frequency bands shown in Figure 7. Additionally, intersubject variability of the OC SPL should be considered. One possible reason for the variability is intersubject variability in the force with which the enclosure was held between the upper and lower jaws. Differences in this force may influence the volumes of the sound fields or the vibration modes of the speaker/enclosure, which might cause the intersubject variability of the sound pressure in the rubber enclosure.
In this study, the RT vibration was characterized as a change in displacement, while the EC sound radiation was characterized as a change in sound pressure. Basically, these two characteristics should be difficult to compare directly. However, it has been pointed out that the capacitor microphone responses to body vibration indicate the displacement characteristics (Watanabe et al., 2001). Hence, in this study, the displacement characteristics of the RT vibration and the sound-pressure characteristics of the EC sound radiation (which is derived from a capacitor microphone response) were compared. It should be noted that this manipulation just enables the shapes of RT and EC transfer functions to be compared globally or relatively but not quantitatively. The correspondence of the derived transfer function |H ocrt(f)| to perceptual properties of BC speech should be clarified.
The measurement results of the RT vibration were regarded as the vibrations of the skull bone (or the soft tissue). Naturally, the vibration characteristics of the temporal bone are not identical to those of the whole skull bone. In this study, the BC microphone was fixed on a specific position in the left RT as shown in Figure 4. In general, measurements at different BC microphone locations during BC stimulation cause different characteristics (Stenfelt et al., 2000). In the case of recording BC speech during vocalization, obvious effects of BC microphone placement on the intensity and spectral characteristics of the recorded BC speech have also been pointed out (Tran et al., 2013). However, the present aim of the RT vibration measurement was to obtain the BC transmission characteristics from the OC to the EC via soft tissue/skull bone and not to explore the detailed vibration characteristics of the whole skull bone. Stenfelt et al. (2002) stimulated a cored temporal bone specimen to simulate the middle ear part of BC transmission. Likewise, the RT vibration characteristics obtained in this study (|H ocrt(f)|) may be the main factors contributing to the middle or inner ear parts of BC speech transmission.
Another concern is the EC occlusion during the measurement. The purpose of the occlusion was to attenuate the AC sound leaking from the excitation setup to the EC microphone. During the experiment, the EC SPL due to the OC excitation was approximately 70–80 dB. At that time, the SPL of the AC leakage at the OC entrance was less than 64 dB. Assuming that the ER10C-14A foam ear tips provided approximately 15 dB of attenuation, the SPL of the AC sound leaking from the OC entrance into the occluded EC is estimated to be less than 50 dB. The effect of the AC leakage on the measured EC sound pressure should nevertheless be carefully considered. Here, the present measurement results of the EC sound radiation include compensation for the mean occlusion effect derived from the EC sound pressure induced by mastoid BC stimulation (Stenfelt, Wild, et al., 2003). This value was reported to be derived from nine normal intact ECs with BC stimulation from a transducer on the skull, for a probe insertion depth of 8 mm. In fact, the actual magnitude of the occlusion effect is reported to depend strongly on the probe insertion depth (Stenfelt & Reinfeldt, 2007). In this study, the probe insertion depth ranged from approximately 7 to 14 mm. Considering this, the compensation in this study is assumed to be typical for a shallower placement. Additionally, the frequency characteristics of the occlusion effect induced by frontal BC stimulation (Fagelson & Martin, 1998; Goldstein & Hayes, 1965) are reported to be different. In the case of BC speech transmission during vocalization, the frequency characteristics of the occlusion effect are likely to change, too. Instead of EC occlusion with occlusion effect compensation, using large earmuffs is another method to attenuate AC sound without causing the occlusion effect (Pörschmann, 2000; Reinfeldt et al., 2010). The influence of the occlusion effect compensation on the measurement results in this study should be further investigated.
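The leakage estimate in this paragraph is simple level arithmetic, and a rough sketch of it may help. The levels below (64 dB leakage bound, 15 dB ear-tip attenuation, 70–80 dB EC SPL) are taken from the text; treating the leakage and the BC-induced sound as an incoherent (power) sum, with an in-phase coherent sum as a worst case, is an assumption made here for illustration and is not part of the reported analysis. The function names are hypothetical.

```python
import math


def incoherent_sum_db(level_a_db: float, level_b_db: float) -> float:
    """Level of the power sum of two uncorrelated components (assumption)."""
    power = 10.0 ** (level_a_db / 10.0) + 10.0 ** (level_b_db / 10.0)
    return 10.0 * math.log10(power)


def coherent_sum_db(level_a_db: float, level_b_db: float) -> float:
    """Level of the in-phase amplitude sum of two components (worst case)."""
    amplitude = 10.0 ** (level_a_db / 20.0) + 10.0 ** (level_b_db / 20.0)
    return 20.0 * math.log10(amplitude)


leak_at_oc_entrance_db = 64.0  # upper bound stated in the text
ear_tip_attenuation_db = 15.0  # attenuation assumed in the text
leak_in_ec_db = leak_at_oc_entrance_db - ear_tip_attenuation_db  # 49 dB

for ec_level_db in (70.0, 80.0):  # EC SPL range stated in the text
    margin_db = ec_level_db - leak_in_ec_db
    extra_incoherent = incoherent_sum_db(ec_level_db, leak_in_ec_db) - ec_level_db
    extra_coherent = coherent_sum_db(ec_level_db, leak_in_ec_db) - ec_level_db
    print(f"EC {ec_level_db:.0f} dB: margin {margin_db:.0f} dB, "
          f"leak adds {extra_incoherent:.3f} dB (incoherent) "
          f"or {extra_coherent:.3f} dB (in-phase worst case)")
```

Under these assumptions the leakage stays at least 21 dB below the EC level, so it would change the measured level by well under 1 dB even in the in-phase worst case.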
The global magnitude attenuation of around −78 dB/octave as the frequency increases, seen in the averaged |H ocrt(f)|, is assumed to be derived especially from the low-pass effect of the soft tissue. Although the local peaks between 0.7 and 1 kHz in |H ocrt(f)| for all participants may correspond to the peaks shown in the OC SPL, those peaks in |H ocrt(f)| might also be affected by the first resonance of the skull bone itself, which is reported to be 0.8–1.2 kHz (Håkansson et al., 1994). On the other hand, the relative increase of magnitude around 2–2.5 kHz in |H ocec(f)| for most participants is assumed to be related to the open EC resonance due to the occlusion effect compensation. The EC transfer function for AC sound is known to have a peak around 2–3 kHz (Mehrgardt & Mellert, 1977). A similar trend is also shown in the EC sound pressure during BC stimulation (Stenfelt, Wild, et al., 2003). However, the range of the relative magnitude between |H ocrt(f)| and |H ocec(f)| (45–50 dB) was greater than that of the reported EC transfer functions (Mehrgardt & Mellert, 1977: around 15 dB). Thus, this suggests that the EC transmission characteristics in BC and AC speech transmission differ.
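A global attenuation expressed in dB/octave can be estimated by fitting a straight line to the magnitude (in dB) against the base-2 logarithm of frequency. The sketch below shows one way to do this; it is not necessarily the procedure used in this study. The data are synthetic, and the slope of −6 dB/octave is a placeholder chosen for illustration, not a measured value.

```python
import numpy as np


def slope_db_per_octave(freq_hz, magnitude_db):
    """Least-squares slope of magnitude (dB) against log2(frequency)."""
    octaves = np.log2(np.asarray(freq_hz, dtype=float))
    slope, _intercept = np.polyfit(octaves, np.asarray(magnitude_db, dtype=float), 1)
    return float(slope)


# Synthetic magnitude response with a known, hypothetical slope.
freq_hz = np.array([250.0, 500.0, 1000.0, 2000.0, 4000.0])
placeholder_slope = -6.0  # dB/octave, illustrative only
magnitude_db = placeholder_slope * np.log2(freq_hz / freq_hz[0])

estimated_slope = slope_db_per_octave(freq_hz, magnitude_db)
print(f"estimated slope: {estimated_slope:.2f} dB/octave")
```

With measured data, local peaks such as those between 0.7 and 1 kHz would pull the fit, so the frequency range used for the regression matters.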
As shown in Figure 9, the previous measurement of transmission characteristics from the excitation in the larynx to the RT vibration and EC sound radiation showed global trends similar to the current characteristics of |H ocrt(f)| and |H ocec(f)|, except for the trends of the frequency components below 400 Hz and around 2000 Hz (Toya et al., 2019). In this article, the currently and previously measured transfer functions could not be quantitatively compared. However, the local differences between the two measurement results may suggest the effect of the mechanical coupling of the vocal folds and soft tissue. Although it is still unclear whether BC speech transmission via OC sound pressure is more effective than that via the mechanical coupling of the vocal folds and soft tissue, the OC sound pressure is hypothesized to be a meaningful contributor to BC speech perception. The reasons for this hypothesis are as follows:
1. Delayed auditory feedback presented through BC causes stuttering speech, as it does when presented through AC (Toya et al., 2016). This suggests that BC speech includes phoneme information (which comes from the vocal tract filter), and one reason for the phenomenon may be the temporal mismatch of the phoneme information between production and perception.
2. A phoneme-related characteristic is observed in BC speech signals recorded by a BC microphone (Rahman & Shimamura, 2019).
Future studies should further investigate which of the two pathways is more effective.
Perceptual properties of BC speech during vocalization have been investigated on the basis of the hearing threshold. Pörschmann (2000) measured the masked threshold for AC and BC speech to determine the perceptual relationship between AC and BC speech as a function of frequency. He showed the transfer function of the BC part of one's own voice, in which a gradual attenuation of amplitude (around 20 dB) was found as the frequency exceeds 1 kHz. This behavior is similar to that of the transfer function |H ocrt(f)| measured in our study. Considering that his perceptually obtained transfer functions should be influenced by all parts (outer, middle, and inner ear) of BC speech transmission, own-voice perception is likely to be strongly affected by the low-pass effect of the tissue/skull bone vibration, as assumed in Figure 1. Reinfeldt et al. (2010) measured the physical and perceptual relationship between AC and BC speech during vocalization on the basis of EC sound pressure at hearing thresholds for AC and BC stimulation. Overall, they showed that the BC relative to AC sound pressure is greater in the 1–2 kHz frequency region but less in the frequency region below 500 Hz. Theoretically, their physically measured characteristics of EC sound pressure should be affected by both the soft tissue/skull bone vibration and EC sound radiation, without effects of the middle or inner ear parts of the transmission. The low-attenuation and high-emphasis effects of |H ocec(f)| found in this study are likely to account for their results. In this study, the change of the transmission characteristics depending on the articulation was not examined, whereas previous measurements (Reinfeldt et al., 2010) showed phoneme-dependent transmission characteristics. Since the amplitude behavior of recorded BC speech is reported to be sensitive to the first formant locations (Rahman & Shimamura, 2019), the transmission characteristics related to BC speech are likely to change depending on the size of the OC or tongue height.
One's own perceived voice (i.e., the combination of AC and BC speech) has a perceptual low-pass effect relative to AC speech (Nakayama, 1997). Therefore, BC speech obviously has a greater magnitude in the lower frequency region and less magnitude in the higher frequency region. Considering the 2- to 3-kHz bandpass effect of EC sound radiation or BC relative to AC sensitivity due to EC sound pressure reported in Reinfeldt et al. (2010), it is suggested that the BC transmission to the outer ear may not be a dominant contributor to BC speech perception during vocalization. Stenfelt and Goode (2005) hypothesize that the inertia of the cochlear fluids is the most important contributor to BC hearing. Frequency dependency of the fluid displacement ratio at the oval and the round window due to BC stimuli (Stenfelt, Hato, & Goode, 2003) may suggest a nonflat frequency response in the cochlea. However, vibration of the soft tissue and the skull bone is likely to affect the inner ear directly due to inertial forces in the cochlear fluid. Hence, BC speech, initially characterized by the low-pass effect (i.e., |H ocrt(f)|) in the soft tissue and the skull bone, can be assumed to be transmitted to the inner ear as a main contributor to BC speech perception, especially in the lower frequency region. Unfortunately, no physical method has been developed for measuring the inner ear part of BC speech transmission in vivo. Instead, for example, subjective modification of the AC speech spectrum by using the |H ocrt(f)| and |H ocec(f)| transfer functions to simulate the outer/inner ear parts of BC speech transmission may be useful to further investigate perceptual properties. At the same time, transmission characteristics in the middle ear part of BC should also be considered. The middle ear velocity level is reported to be greater in the frequency region around 2 kHz than in the other regions (Stenfelt et al., 2002). The effect of these characteristics on BC speech perception should be further investigated.
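The idea of modifying the AC speech spectrum with a measured magnitude response could be sketched as follows. This is a minimal illustration, not the authors' procedure. The function name, the test signal, and the gain values are hypothetical placeholders rather than the measured |H ocrt(f)| or |H ocec(f)|. Because only amplitude characteristics were measured, the filter is applied with zero phase, which is itself an assumption.

```python
import numpy as np


def apply_magnitude_response(signal, fs, freq_hz, gain_db):
    """Scale the spectrum of `signal` by a magnitude response (zero phase).

    The gain is given in dB at a few frequencies and linearly interpolated
    onto the FFT bins; outside the given range the end values are held.
    Filtering the whole signal with one FFT implies circular convolution.
    """
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    bin_freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    gain_on_bins_db = np.interp(bin_freqs, freq_hz, gain_db)
    gain_linear = 10.0 ** (gain_on_bins_db / 20.0)
    return np.fft.irfft(spectrum * gain_linear, n=signal.size)


# Stand-in for AC speech: two tones, 1 s at 16 kHz (hypothetical test signal).
fs = 16000
t = np.arange(fs) / fs
ac_like = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 4000 * t)

# Placeholder low-pass-like response; NOT the measured transfer function.
placeholder_freq_hz = [100.0, 1000.0, 2000.0, 4000.0, 8000.0]
placeholder_gain_db = [0.0, 0.0, -8.0, -16.0, -24.0]

bc_like = apply_magnitude_response(ac_like, fs, placeholder_freq_hz, placeholder_gain_db)
```

For real speech, a frame-based or FIR implementation would avoid the circular-convolution artifacts of this whole-signal version.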
During the measurement of the transfer functions |H ocrt(f)| and |H ocec(f)| for each participant (see Figure 8), both the soft tissue/skull bone vibration and the EC sound radiation were assumed to be linear systems, as shown in Figure 1. The skull bone sound transmission is reported to behave linearly in the frequency region between 0.1 and 10 kHz (Håkansson et al., 1996). In this study, a nonlinear relationship between the soft tissue/skull bone vibration and the EC sound radiation was not considered. For example, a nonlinear relationship should be considered if the middle ear muscle reflex (MEMR) is activated by loud excitation in the OC. Although the EC SPL during the experiments was approximately 70–80 dB, the experimental stimuli (i.e., the excitation signal in the OC) could be transmitted not only to the EC but also to the middle ear. Therefore, the total energy reaching the auditory system could be greater. An obvious impedance change due to the muscle reflex once the stimulus SPL exceeds 90 dB was reported long ago (Møller, 1962). If the MEMR were activated, the assumption that each system is linear would not hold. However, even if the ossicular chain were stiffened due to the MEMR, this would not be an important factor affecting the transfer function from the OC to the EC via soft tissue/skull bone. It is argued that the EC sound radiation due to BC excitation is produced mainly by the deformation of the EC itself (Stenfelt, 2011). Considering this point, impedance changes of the ossicular chain or the tympanic membrane could barely affect the transfer functions derived in this article.
In the model of the transmission in Figure 1, |H ocec(f)| was assumed to be a serial composite system of the |H ocrt(f)| and the EC sound radiation effect. Here, the trends of the relative magnitude difference between |H ocrt(f)| and |H ocec(f)| shown in Figure 10 were similar to those of the open EC sound pressure due to BC stimulation derived from Stenfelt, Wild, et al. (2003). If the difference between |H ocrt(f)| and |H ocec(f)| represents a modifier from the soft tissue/skull bone vibration to the EC sound radiation, the assumption of the serial composite system may account for the similarity stated above. However, this is only a simple interpretation and much uncertainty remains. For example, the relative velocity between the RT vibration and the middle ear ossicle vibration could cause the sound radiation in the EC via the tympanic membrane (Stenfelt et al., 2002). If the effect of this transmission is strong, the assumption of the serial composite system is not entirely valid. To further investigate this point, subjective evaluations of the voice quality of the modified AC speech spectrum using the |H ocrt(f)| and |H ocec(f)| may be useful.
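The serial composite assumption can be written out explicitly. The symbol $H_{\mathrm{rtec}}(f)$ for the modifier from the soft tissue/skull bone vibration to the EC sound radiation is introduced here only for illustration and does not appear in the model of Figure 1.

```latex
\begin{align}
|H_{\mathrm{ocec}}(f)| &= |H_{\mathrm{ocrt}}(f)| \cdot |H_{\mathrm{rtec}}(f)| \\
20\log_{10}|H_{\mathrm{rtec}}(f)| &= 20\log_{10}|H_{\mathrm{ocec}}(f)| - 20\log_{10}|H_{\mathrm{ocrt}}(f)|
\end{align}
```

Under this assumption, the difference of the two measured magnitudes in dB corresponds to the modifier. Because the RT response reflects displacement and the EC response reflects sound pressure, only the shape of this difference across frequency, not its absolute level, can be interpreted.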
In this study, the amplitude characteristics of the transfer function related to BC speech were measured. However, in real life, speakers perceive both their own AC and BC speech at the same time. Hence, this measurement does not fully represent the speakers' real-life situation. For example, there could be phase interactions between the AC and BC speech perceived by speakers themselves. In this article, phase characteristics of the transfer functions were not investigated because of the several compensations and manipulations applied to the amplitude characteristics. However, considering the effect of phase interaction, not only the amplitude characteristics but also the phase characteristics of the transfer functions for both AC and BC speech transmission need to be investigated. In future studies, the amplitude/phase characteristics of the EC sound pressure induced by the OC sound pressure could be measured using nonoccluding microphones. Here, the transfer function measured in this study would be useful to determine whether/how the perception of one's own voice is affected by the phase interaction between AC and BC speech.
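Why phase matters can be seen from a single-frequency sketch in which the AC and BC components are added as complex amplitudes. The amplitudes and phase differences below are hypothetical values chosen for illustration, not measured quantities, and the function name is an assumption.

```python
import numpy as np


def combined_level_db(ac_amplitude, bc_amplitude, phase_diff_rad):
    """Level of AC + BC at one frequency, in dB relative to AC alone."""
    phase = np.asarray(phase_diff_rad, dtype=float)
    total = ac_amplitude + bc_amplitude * np.exp(1j * phase)
    return 20.0 * np.log10(np.abs(total) / ac_amplitude)


# Hypothetical case: BC component at half the AC amplitude (about 6 dB lower).
phase_diffs = np.array([0.0, np.pi / 2, np.pi])
levels_db = combined_level_db(1.0, 0.5, phase_diffs)

for phase, level in zip(phase_diffs, levels_db):
    print(f"phase difference {np.degrees(phase):5.1f} deg: {level:+.2f} dB re AC alone")
```

With the same two amplitudes, the combined level ranges from a boost of a few dB (in phase) to a loss of several dB (in antiphase), so amplitude characteristics alone cannot predict the summed response.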

Conclusions

This article measured the transmission characteristics of the observable BC speech pathways (the soft tissue/skull bone vibration and the sound radiation in the EC), focusing on the RT vibration and the EC sound pressure induced by the OC sound pressure, to explore how the outer ear part contributes to BC speech transmission. Our findings derived from the measurements of five participants are as follows: (a) The transfer function of the RT vibration relative to the OC sound pressure had a low-pass effect above 1 kHz, and (b) the transfer function of the EC relative to the OC sound pressure had a 2- to 3-kHz bandpass effect. These findings, although derived from the excitation in the OC, correspond to several physiological trends from measurements using excitation on the skull bone. The findings also accounted for some perceptual properties related to BC speech perception. Since BC speech is known to have a perceptual low-pass effect, our findings suggest that the transmission to the outer ear may not be a dominant contributor to BC speech perception during vocalization. The obtained transmission characteristics can be used to simulate the outer, middle, and inner ear parts of BC speech transmission for further investigations.

Author Contributions

Teruki Toya: Conceptualization (Lead), Investigation (Lead), Methodology (Lead), Project administration (Lead), Software (Lead), Validation (Lead), Visualization (Lead), Writing – original draft (Lead). Peter Birkholz: Conceptualization (Supporting), Methodology (Supporting), Supervision (Supporting), Writing – review & editing (Supporting). Masashi Unoki: Conceptualization (Supporting), Methodology (Supporting), Supervision (Equal), Writing – review & editing (Supporting).

Acknowledgments

This work was supported by Japan Society for the Promotion of Science KAKENHI Grants 16H01669 (to Teruki Toya), 17J03679 (to Masashi Unoki), and 18H05004 (to Masashi Unoki). We are grateful to Naoto Nishide, associate dean of Houju Memorial Hospital, and all dental technicians in this hospital for helping us make the plaster replicas of the upper jaws for the measurements. We are also grateful to Tatsuya Hirahara of Toyama Prefectural University for providing us with the equipment and know-how for measuring the frequency response of the bone-conduction microphone. We also thank Shumpei Taniguchi, a research fellow at Japan Advanced Institute of Science and Technology, for providing us with the equipment and know-how for making the rubber enclosures used in the measurements.

References

Chen, S. H., Liu, H., Xu, Y., & Larson, C. R. (2007). Voice F 0 responses to pitch-shifted voice feedback during English speech. The Journal of the Acoustical Society of America, 121(2), 1157–1163.
Denes, P. B., & Pinson, E. N. (1993). The speech chain (2nd ed.). Freeman.
Fagelson, M. A., & Martin, F. N. (1998). The occlusion effect and ear canal sound pressure level. American Journal of Audiology, 7(2), 50–54.
Farina, A. (2000). Simultaneous measurement of impulse response and distortion with a swept-sine technique. Proceedings–Audio Engineering Society Convention, 108, 1–24. http://www.aes.org/e-lib/browse.cfm?elib=10211
Goldstein, D. P., & Hayes, C. S. (1965). The occlusion effect in bone conduction hearing. Journal of Speech and Hearing Research, 8(2), 137–148.
Håkansson, B., Brandt, A., Carlsson, P., & Tjellström, A. (1994). Resonance frequencies of the human skull in vivo. The Journal of the Acoustical Society of America, 95, 1474–1481.
Håkansson, B., Carlsson, P., Brandt, A., & Stenfelt, S. (1996). Linearity of sound transmission through the human skull in vivo. The Journal of the Acoustical Society of America, 99(4), 2239–2243.
Kent, R. D., & Read, C. (1992). The acoustic analysis of speech. Singular.
Lee, B. S. (1950). Effects of delayed speech feedback. The Journal of the Acoustical Society of America, 22(6), 824–826.
Madaule, P. (2001). Listening and singing. Journal of Singing, 57(5), 15–20. https://www.listeningcentre.com/articles/listening-and-singing
Mehrgardt, S., & Mellert, V. (1977). Transformation characteristics of the external human ear. The Journal of the Acoustical Society of America, 61(6), 1567–1576.
Møller, A. R. (1962). Acoustic reflex in man. The Journal of the Acoustical Society of America, 34(9), 1524–1534.
Nakayama, I. (1997). Voice timbre in autophonic production compared with that in extraphonic production. Journal of the Acoustical Society of Japan, 18(2), 67–71.
Okazaki, S., Mori, K., & Cai, C. (2010). Effect of delayed auditory feedback on the vocal time-reproduction. Acoustical Science and Technology, 31(6), 408–410.
Pörschmann, C. (2000). Influences of bone conduction and air conduction on the sound of one's own voice. Acta Acustica United With Acustica, 86(6), 1038–1045.
Rahman, M. S., & Shimamura, T. (2019). Amplitude variation of bone-conducted speech compared with air-conducted speech. Acoustical Science and Technology, 40(5), 293–301.
Reinfeldt, S., Östli, P., Håkansson, B., & Stenfelt, S. (2010). Hearing one's own voice during phoneme vocalization—Transmission by air and bone conduction. The Journal of the Acoustical Society of America, 128(2), 751–762.
Shimizu, S., Otani, M., & Hirahara, T. (2009). Frequency characteristics of several non-audible murmur (NAM) microphones. Acoustical Science and Technology, 30(2), 139–142.
Stenfelt, S. (2011). Acoustic and physiological aspects of bone conduction hearing. Advances in Oto-Rhino-Laryngology, 71, 10–21.
Stenfelt, S. (2015). Inner ear contribution to bone conduction hearing in the human. Hearing Research, 329, 41–51.
Stenfelt, S., & Goode, R. L. (2005). Bone-conducted sound: Physiological and clinical aspects. Otology and Neurotology, 26(6), 1245–1261.
Stenfelt, S., Håkansson, B., & Tjellström, A. (2000). Vibration characteristics of bone conducted sound in vitro. The Journal of the Acoustical Society of America, 107(1), 422–431.
Stenfelt, S., Hato, N., & Goode, R. L. (2002). Factors contributing to bone conduction: The middle ear. The Journal of the Acoustical Society of America, 111(2), 947–959.
Stenfelt, S., Hato, N., & Goode, R. L. (2003). Fluid volume displacement at the oval and round windows with air and bone conduction stimulation. The Journal of the Acoustical Society of America, 115(2), 797–812.
Stenfelt, S., & Reinfeldt, S. (2007). A model of the occlusion effect with bone-conducted stimulation. International Journal of Audiology, 46(10), 595–608.
Stenfelt, S., Wild, T., Hato, N., & Goode, R. L. (2003). Factors contributing to bone conduction: The outer ear. The Journal of the Acoustical Society of America, 113(2), 902–913.
Sundberg, J. (1987). The science of the singing voice. Northern Illinois University Press.
Tonndorf, J. (1976). Bone conduction. In W. D. Keidel & W. D. Neff (Eds.), Handbook of sensory physiology (pp. 5/3–5/37, 84). Springer.
Toya, T., Birkholz, P., & Unoki, M. (2019). Estimates of transmission characteristics related to perception of bone-conducted speech using real utterances and transcutaneous vibration on larynx. In A. A. Salah, A. Karpov, & R. Potapova (Eds.), Speech and computer. SPECOM 2019. Lecture notes in computer science (Vol. 11658, pp. 491–500). Springer Nature Switzerland.
Toya, T., Ishikawa, D., Miyauchi, R., Nishimoto, K., & Unoki, M. (2016). Study on effects of speech production during delayed auditory feedback for air-conducted and bone-conducted speech. Journal of Signal Processing, 20(4), 197–200.
Tran, P. K., Letowski, T., & McBride, M. (2013). The effect of bone conduction microphone placement on intensity and spectrum of transmitted speech items. The Journal of the Acoustical Society of America, 133(6), 3900–3908.
von Békésy, G. (1949). The structure of the middle ear and the hearing of one's own voice by bone conduction. The Journal of the Acoustical Society of America, 21(3), 217–232.
Watanabe, M., Mita, K., Akataki, K., & Itoh, Y. (2001). Mechanical behavior of condenser microphone in mechanomyography. Medical and Biological Engineering and Computing, 39(2), 195–201.

Information & Authors

Information

Published In

Journal of Speech, Language, and Hearing Research
Volume 63, Number 12, 14 December 2020
Pages: 4252-4264
PubMed: 33170762

History

  • Received: Feb 28, 2020
  • Revised: Jun 13, 2020
  • Accepted: Aug 20, 2020
  • Published online: Nov 10, 2020
  • Published in issue: Dec 14, 2020

Authors

Affiliations

Teruki Toya
Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Ishikawa
Author Contributions: Conceptualization, Investigation, Methodology, Project administration, Software, Validation, Visualization, and Writing – original draft.
Peter Birkholz
Institute of Acoustics and Speech Communications, Technische Universität Dresden, Germany
Author Contributions: Conceptualization, Methodology, Supervision, and Writing – review & editing.
Masashi Unoki
Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Ishikawa
Author Contributions: Conceptualization, Methodology, Supervision, and Writing – review & editing.

Notes

Disclosure: The authors have declared that no competing interests existed at the time of publication.
Correspondence to Teruki Toya: [email protected]
Editor-in-Chief: Frederick (Erick) Gallun
Editor: Daniel Rasetshwane
