Acoustic Analysis of Dialog Speech
Fumitada ITAKURA and Shoji KAJITA
School of Engineering, Nagoya University
Furo-cho 1, Chikusa-ku, Nagoya 464-01, JAPAN
e-mail: ita@itakura.nuee.nagoya-u.ac.jp
In dialog speech recognition systems, speech analysis is the essential
front end, built on acoustic signal processing. Because acoustic
features lost at this stage cannot easily be recovered in later
stages, deciding which features to extract from the acoustic signal,
and how to extract them, is one of the most important problems in
speech recognition. More intensive research on acoustic analysis is
therefore required to achieve robust speech recognition in realistic
acoustic environments.
To tackle this problem, we have proposed a technique based on subband
processing and autocorrelation analysis, namely subband-autocorrelation
(SBCOR) analysis. SBCOR analysis is designed to extract the
periodicity in each subband at the delay time equal to the inverse of
that subband's center frequency. SBCOR has been shown to be robust
under multiplicative signal-dependent white noise, whose SNR is
constant at every point of the signal.
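The idea can be illustrated with a minimal sketch. It is not the
authors' implementation: the Gaussian band weighting, the relation
bandwidth = fc / Q, and the names power_spectrum and sbcor are
assumptions made here for illustration. For each center frequency fc,
the sketch computes the autocorrelation of the subband signal at the
lag 1/fc, normalized by its value at lag 0, working in the frequency
domain (the autocorrelation is the inverse transform of the power
spectrum).

```python
import cmath
import math


def power_spectrum(x):
    """One-sided power spectrum |X(k)|^2 of a real frame (direct DFT)."""
    n = len(x)
    spectrum = []
    for k in range(n // 2 + 1):
        xk = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(xk) ** 2)
    return spectrum


def sbcor(x, fs, center_freqs, q):
    """Normalized subband autocorrelation at lag 1/fc for each fc.

    Assumptions (not taken from the paper): Gaussian band weighting
    and bandwidth = fc / q. The autocorrelation is circular over the
    frame.
    """
    n = len(x)
    p = power_spectrum(x)
    values = []
    for fc in center_freqs:
        sigma = 0.5 * fc / q   # assumed: half of the bandwidth fc / q
        tau = 1.0 / fc         # delay equal to the inverse of fc
        num = 0.0
        den = 0.0
        for k, pk in enumerate(p):
            f = k * fs / n
            # Bins other than DC and Nyquist stand for +f and -f.
            mult = 1.0 if k == 0 or 2 * k == n else 2.0
            w = math.exp(-0.5 * ((f - fc) / sigma) ** 2)  # subband power weight
            num += mult * w * pk * math.cos(2.0 * math.pi * f * tau)  # lag tau
            den += mult * w * pk                                      # lag 0
        values.append(num / den if den > 0.0 else 0.0)
    return values
```

For a frame x sampled at fs Hz, sbcor(x, fs, center_freqs, q) returns
one value in [-1, 1] per center frequency. A larger q narrows each
band under the assumed bandwidth relation.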
In this paper, we investigate to what extent SBCOR analysis is robust
against severe waveform distortion and noise.
First, it is shown that SBCOR is robust against severe waveform
distortion such as zero-crossing distortion. Although zero-crossing
distortion degrades the performance of conventional recognition
systems, the distorted signals remain intelligible to humans.
Analysis examples of SBCOR and the smoothed group delay spectrum (SGDS)
show that the SBCOR spectrum is stable under this distortion, whereas
the formant structure extracted by SGDS is significantly affected.
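As an illustration only, the sketch below assumes that zero-crossing
distortion means infinite peak clipping, that is, each sample is
replaced by its sign so that only the zero-crossing positions survive.
This interpretation and the function names are assumptions, not
details given in the text.

```python
def zero_crossing_distort(x):
    """Replace every sample by its sign (assumed form of the distortion)."""
    return [1.0 if v >= 0.0 else -1.0 for v in x]


def count_zero_crossings(x):
    """Count sign changes between consecutive samples (0 counts as positive)."""
    signs = [v >= 0.0 for v in x]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)
```

Under this assumption the distorted waveform keeps every zero crossing
of the original and discards all amplitude information.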
In the recognition experiments, a standard speaker-dependent
isolated-word recognizer based on DTW is used. The recognition task is
the discrimination of 68 pairs of phonetically similar city names,
selected from a database of 550 Japanese city names recorded twice by 5
Japanese male speakers. The first recording is used as the reference
patterns and the second, which was spoken a week later, as the test
patterns. The test signals are distorted by zero-crossing. The
experimental results show that SBCOR (Q=1.0) scores about 19% higher
than SGDS on the distorted test signals. These results indicate that
the speech features extracted by SBCOR are much more robust against
zero-crossing distortion.
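A pair discrimination with DTW can be sketched as follows. The
symmetric step pattern, the Euclidean local distance, the path-length
normalization, and the names dtw_distance and discriminate are
assumptions chosen for illustration. The settings of the recognizer
used in the experiments are not specified here.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two sequences of feature vectors.

    Assumed settings: Euclidean local distance, steps (1,0), (0,1),
    (1,1), and normalization by len(a) + len(b).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m] / (n + m)


def discriminate(test, reference_pair):
    """Return the name of the reference pattern closer to the test pattern.

    reference_pair maps each of the two word names to its reference
    feature sequence.
    """
    return min(reference_pair, key=lambda name: dtw_distance(test, reference_pair[name]))
```

Each feature vector would be one analysis frame, for example an SBCOR
or SGDS spectrum, and a trial counts as correct when discriminate
returns the word that was actually spoken.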
Second, it is shown that SBCOR is more robust than SGDS against
multiplicative signal-dependent white noise, Gaussian white noise, and
human speech-like noise. The human speech-like noise was generated by
superposing independent speech waveforms of 3200 phrases spoken by 30
males and 34 females in the Continuous Speech Corpus for Research
edited by the ASJ. The experimental results, obtained with the same
DTW word recognizer as above, show that the SBCOR spectrum performs as
well as SGDS under clean conditions and better under noisy conditions,
for all three noise types. In addition, the best Q is 1.5 for the
white noises and 2.0 for the human speech-like noise. The reason seems
to be that narrowing the bandwidth attenuates the noise effects due to
the low frequencies. The advantage of SBCOR is larger for white noise
than for human speech-like noise.
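The construction of the speech-like noise and its addition to a test
signal can be sketched as below. The truncation to the shortest
waveform, the power-based SNR definition, and the function names are
assumptions for illustration. The text states only that independent
speech waveforms were superposed.

```python
import math


def superpose(waveforms):
    """Sum waveforms sample by sample, truncated to the shortest one."""
    length = min(len(w) for w in waveforms)
    return [sum(w[t] for w in waveforms) for t in range(length)]


def power(x):
    """Mean squared amplitude of a waveform."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(signal, noise, snr_db):
    """Add noise to signal, scaled so that the overall SNR equals snr_db.

    Assumes the noise has non-zero power and is at least as long as
    the signal; extra noise samples are ignored.
    """
    gain = math.sqrt(power(signal) / (power(noise[:len(signal)]) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(signal, noise)]
```

With many superposed talkers the sum loses intelligibility but keeps a
long-term spectrum close to that of speech, which is what
distinguishes it from the two white noises.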
Finally, we evaluate the robustness at the phonemic level. The task is
speaker-dependent recognition of 23 phonemes,
/a,i,u,e,o,b,d,g,m,n,N,p,t,k,s,h,r,y,w,z,ts,ch,sh/, using HMMs. Each
HMM is a left-to-right model with seven mixtures. The parameters were
estimated from the 2620 even-numbered words in the ATR Japanese
5240-word speech database (two male and two female speakers). The test
data were taken from the 2620 odd-numbered words. The sampling
frequency is 10 kHz. To examine the robustness against noise,
multiplicative signal-dependent white noise is added to the test data.
The experimental results show that the best Q gradually decreases as
the SNR falls. Because the best Q for low SNR is not the best for high
SNR and vice versa, Q=1.5 is the best compromise across SNRs.
Moreover, although the performance of SBCOR (Q=1.5) is slightly worse
than that of SGDS under clean conditions, SBCOR performs much better
than SGDS at SNRs of 20 dB (+6%) and 10 dB (+15%).
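One way to generate multiplicative signal-dependent white noise with a
constant SNR at every point is sketched below. The model
y[n] = x[n] (1 + g w[n]), with w[n] standard Gaussian white noise and
g = 10^(-SNR/20), is an assumption made here, as is the function name.
The exact noise model used in the experiments is not given in the
text.

```python
import math
import random


def add_multiplicative_noise(x, snr_db, seed=0):
    """Return x[n] * (1 + g * w[n]) with w[n] ~ N(0, 1).

    The noise term g * x[n] * w[n] has expected power g^2 * x[n]^2,
    so the local SNR is 1 / g^2 at every sample (assumed model).
    """
    rng = random.Random(seed)
    g = 10.0 ** (-snr_db / 20.0)
    return [v * (1.0 + g * rng.gauss(0.0, 1.0)) for v in x]
```

Because the noise amplitude follows the signal amplitude, silent
segments stay silent and loud segments receive proportionally more
noise, unlike additive Gaussian white noise.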
In this paper, we showed, using a DTW recognizer, that SBCOR is robust
against severe waveform distortion such as zero-crossing distortion
and against three types of noise. These results indicate that SBCOR
extracts speech features that are not sufficiently captured by
conventional speech analyses. Robustness at the phonemic level was
verified only for multiplicative signal-dependent white noise; the
other noise types require further investigation.
Keywords: SBCOR analysis, waveform distortion, noise, DTW, HMM