
2. Speech analysis techniques

This section provides a brief scientific overview of the speech signal analysis techniques involved in SPro with a particular focus on variable resolution spectral analysis. It also defines the equations and methods implemented in SPro.

2.1 Pre-emphasis and windowing  Short term windows and pre-emphasis
2.2 Variable resolution spectral analysis  
2.3 Filter-bank analysis  Filter-bank speech analysis
2.4 Linear predictive analysis  Linear prediction speech analysis
2.5 Cepstral analysis  
2.6 Deltas and normalization  Delta, acceleration and feature normalization


2.1 Pre-emphasis and windowing

Speech is intrinsically a highly non-stationary signal. Therefore, speech analysis, whether FFT-based or LPC-based, must be carried out on short segments across which the speech signal is assumed to be stationary. Typically, feature extraction is performed on 20 to 30 ms windows with a 10 to 15 ms shift between two consecutive windows. This principle is illustrated in the figure below.

To avoid problems due to the truncation of the signal, a weighting window with the appropriate spectral properties must be applied to the analyzed chunk of signal. SPro implements three such windows:
HAMMING w_i = 0.54 - 0.46 \cos(2 \pi i / N)
HANNING w_i = (1 - \cos(2 \pi i / N)) / 2
BLACKMAN w_i = 0.42 - 0.5 \cos(2 \pi i / N) + 0.08 \cos(4 \pi i / N)
where N is the number of samples in the window and i \in [0,N-1].
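
The framing and windowing steps can be sketched as follows. This is a minimal illustration and not SPro's code: the function names are hypothetical, and the cosine argument is assumed to be 2 \pi i / N (some implementations divide by N-1 instead).

```python
import math

def make_window(kind, n):
    """Return the n weights w_0..w_{n-1} of the requested window."""
    weights = []
    for i in range(n):
        phase = 2.0 * math.pi * i / n  # assumed cosine argument 2*pi*i/N
        if kind == "hamming":
            w = 0.54 - 0.46 * math.cos(phase)
        elif kind == "hanning":
            w = (1.0 - math.cos(phase)) / 2.0
        elif kind == "blackman":
            w = 0.42 - 0.5 * math.cos(phase) + 0.08 * math.cos(2.0 * phase)
        else:
            raise ValueError("unknown window type: %r" % kind)
        weights.append(w)
    return weights

def split_frames(signal, length, shift):
    """Cut the signal into overlapping frames of `length` samples."""
    last_start = len(signal) - length
    return [signal[s:s + length] for s in range(0, last_start + 1, shift)]

def windowed_frames(signal, length, shift, kind="hamming"):
    """Apply the weighting window to every frame of the signal."""
    weights = make_window(kind, length)
    return [[w * x for w, x in zip(weights, frame)]
            for frame in split_frames(signal, length, shift)]
```

For a signal sampled at 16 kHz, a 20 ms window with a 10 ms shift corresponds to length=320 and shift=160 samples.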

Pre-emphasis is also traditionally used to compensate for the -6dB/octave spectral slope of the speech signal. This step consists in filtering the signal with a first-order high-pass filter H(z) = 1 - k z^{-1}, with k \in [0,1[. The pre-emphasis filter is applied to the input signal before windowing.
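
In the time domain, the filter H(z) = 1 - k z^{-1} amounts to y(n) = x(n) - k x(n-1). A minimal sketch, assuming the sample before the start of the signal is zero; the function name and the default k = 0.95 are illustrative choices, not values taken from SPro:

```python
def pre_emphasis(signal, k=0.95):
    """Apply y[n] = x[n] - k * x[n-1] to a sequence of samples."""
    if not 0.0 <= k < 1.0:
        raise ValueError("k must lie in [0, 1[")
    out = []
    previous = 0.0  # assumed value of the sample preceding the signal
    for x in signal:
        out.append(x - k * previous)
        previous = x
    return out
```

With k = 0 the signal is left untouched, which gives a simple way to disable the step.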


2.2 Variable resolution spectral analysis

Classical spectral analysis has a constant resolution over the frequency axis. The idea of variable resolution spectral analysis(1) is to vary the spectral resolution as a function of the frequency. This is achieved by applying a bilinear transformation of the frequency axis, the transformation being controlled by a single parameter a. The bilinear warping of the frequency axis is defined by

f' = arctan |(1 - a^2) sin f / ((1 + a^2) cos f - 2a) | ,
where f and f' are the frequencies on the original and transformed axis respectively and a \in ]-1,1[. The axis transformation is depicted in the following figure
Spectral analysis is done with a constant resolution on the warped axis f' and therefore with a variable resolution on the original axis. Positive values of a lead to a higher low-frequency resolution, while negative values give a better high-frequency resolution. If a equals zero, the transformation is the identity, which results in a classical constant resolution spectral analysis.
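
The warping can be evaluated numerically as in the sketch below. It assumes that f is a normalized angular frequency in [0, \pi] and uses the two-argument arctangent so that the warped frequency also spans [0, \pi]. Both the function name and this choice of branch are assumptions made for illustration.

```python
import math

def warp_frequency(f, a):
    """Map a normalized frequency f in [0, pi] onto the warped axis."""
    if not -1.0 < a < 1.0:
        raise ValueError("a must lie strictly between -1 and 1")
    num = (1.0 - a * a) * math.sin(f)
    den = (1.0 + a * a) * math.cos(f) - 2.0 * a
    # atan2 keeps the result in [0, pi] (assumed branch of the arctangent)
    return math.atan2(num, den)
```

For a = 0.5, the mid-band frequency \pi / 2 is mapped above \pi / 2, so the lower half of the original axis occupies more than half of the warped axis. For a = 0 the function returns f unchanged.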

Using variable resolution spectral analysis with a filter-bank is straightforward, since it simply consists in determining the central frequencies of the filters according to the warping. See section 2.3 Filter-bank analysis.

Linear predictive modeling with variable resolution spectral analysis is also possible. Very briefly, the idea consists in solving the normal equations on the generalized auto-correlation rather than on the traditional auto-correlation sequence. The generalized auto-correlation r(p) is the correlation between the original signal filtered by a corrective filter \mu(z) = (1 - a^2) / (1 - a z^{-1})^2 and the latter filtered p times by a correction filter of response

H(z) = ((1 / z) - a) / (1 - a / z)
See section 2.4 Linear predictive analysis, for more details.


2.3 Filter-bank analysis

Filter-bank analysis is a classical spectral analysis technique which consists in representing the signal spectrum by the log-energies at the output of a filter-bank, where the filters are overlapping band-pass filters spread along the frequency axis. This representation gives a rough approximation of the signal spectral shape while smoothing out the harmonic structure, if any. When using variable resolution analysis, the central frequencies of the filters are determined so as to be evenly spread on the warped axis and all filters share the same bandwidth on the warped axis. This also applies to MEL frequency warping, a very popular warping in speech analysis which mimics the spectral resolution of the human ear. The MEL warping is approximated by mel(f) = 2595 \log_{10}(1 + f / 700).
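
As an illustration, central frequencies evenly spread on the MEL axis can be computed as follows. This is a sketch under stated assumptions: the function names are hypothetical, frequencies are in Hz, and the two band edges are returned together with the centers.

```python
import math

def hz_to_mel(f):
    """MEL approximation mel(f) = 2595 log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spaced_frequencies(n_filters, f_min, f_max):
    """Return n_filters + 2 frequencies (Hz) evenly spaced on the MEL axis.

    The first and last values are the band edges; the values in between
    are the central frequencies of the n_filters channels.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]
```

The spacing between consecutive frequencies is constant in MEL and therefore grows with frequency in Hz, which is what gives the finer resolution at low frequencies.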

SPro provides an implementation of filter-bank analysis with triangular filters on the FFT modulus, as depicted below.

The energy at the output of channel i is given by
e_i = \log \sum_{j=1}^{N} h_i(j) ||X(j)||
where N is the FFT length(2) and h_i is the filter's frequency response as depicted above. The filter's response is a triangle centered at frequency f_i with bandwidth [f_{i-1}, f_{i+1}], assuming the f_i's are the central frequencies of the filters determined according to the desired spectral warping.
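
The channel energies can be sketched as below, assuming the FFT magnitudes ||X(j)|| are already available and the frequencies f_i are expressed in FFT bin units. The function names and the small floor that protects the logarithm are assumptions of this sketch, not details taken from SPro.

```python
import math

def triangle(j, left, center, right):
    """Triangular response: 0 at `left` and `right`, 1 at `center`."""
    if left < j <= center:
        return (j - left) / (center - left)
    if center < j < right:
        return (right - j) / (right - center)
    return 0.0

def filterbank_log_energies(magnitudes, freqs, floor=1e-10):
    """Log-energy of each channel.

    magnitudes: FFT magnitudes indexed by bin.
    freqs: f_0 .. f_{M+1} in bin units; channel i spans [f_{i-1}, f_{i+1}].
    """
    energies = []
    for i in range(1, len(freqs) - 1):
        total = 0.0
        for j, mag in enumerate(magnitudes):
            total += triangle(j, freqs[i - 1], freqs[i], freqs[i + 1]) * mag
        energies.append(math.log(max(total, floor)))  # floor is an assumption
    return energies
```

Because neighbouring triangles overlap, each FFT bin generally contributes to two adjacent channels.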


2.4 Linear predictive analysis

Linear prediction is a popular speech coding analysis method which relies on a source/filter model of the speech production process. The vocal tract is modeled by an all-pole filter of order p whose response is given by

H(z) = 1 / (1 + \sum_{i=1}^{p} a_i z^{-i}) .
The coefficients a_i are the prediction coefficients, obtained by minimizing the mean square prediction error. The minimization is implemented in SPro using the auto-correlation method.

The idea of the resolution algorithm is to iteratively estimate the prediction coefficients for each prediction order until the required order is reached. Assuming the prediction coefficients for order n-1 are known and yield a prediction error e_{n-1}, the estimation of the coefficients for order n relies on the n'th reflection coefficient, defined as

k_n = - (1 / e_{n-1}) \sum_{i=0}^{n-1} a_{n-1}(i) r(n-i) ,
where r is the autocorrelation of the signal and a_{n-1}(0) = 1. Given the reflection coefficient k_n, the prediction coefficients are obtained using the recursion
a_n(i) = a_{n-1}(i) + k_n a_{n-1}(n-i)
for i=1,\ldots,n-1 and a_n(n) = k_n. Finally, the prediction error for order n is given by
e_n = e_{n-1} ( 1 - k_n^2 ) ,
starting from e_0 = r(0).
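
The recursion can be sketched as follows, with a_n(0) = 1 and e_0 = r(0) as starting point. This is an illustration of the equations of this section rather than SPro's implementation, and the function names are hypothetical.

```python
def autocorrelation(signal, order):
    """Return r(0)..r(order) with r(p) = sum_n x(n) x(n-p)."""
    n = len(signal)
    return [sum(signal[t] * signal[t - p] for t in range(p, n))
            for p in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations from the autocorrelation r(0)..r(order).

    Returns (a, k, err): prediction coefficients a_1..a_p, reflection
    coefficients k_1..k_p and the final prediction error e_p.
    """
    if r[0] <= 0.0:
        raise ValueError("r(0) must be positive")
    a = [1.0] + [0.0] * order  # a[0] = 1 by convention
    err = r[0]                 # e_0 = r(0)
    reflections = []
    for n in range(1, order + 1):
        acc = sum(a[i] * r[n - i] for i in range(n))
        k = -acc / err
        updated = a[:]
        for i in range(1, n):
            updated[i] = a[i] + k * a[n - i]
        updated[n] = k
        a = updated
        err *= 1.0 - k * k
        reflections.append(k)
    return a[1:], reflections, err
```

As a sanity check, the autocorrelation sequence 1, 0.5, 0.25 of a first-order process yields a_1 = -0.5 and a_2 = 0, that is H(z) = 1 / (1 - 0.5 z^{-1}).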

For variable resolution, the generalized auto-correlation sequence is used instead of the traditional auto-correlation. See section 2.2 Variable resolution spectral analysis, for details on the generalized auto-correlation.

The all-pole filter coefficients can be represented in several equivalent ways. First, the linear prediction coefficients a_i can be used directly. The reflection (or partial correlation) coefficients k_i \in ]-1,1[ used in the resolution algorithm can also be used to represent the filter. The log-area ratio, defined as

g_i = 10 \log_{10}((1 + k_i) / (1 - k_i)) ,
is also a popular way to define the prediction filter. Last, the line spectrum frequencies (a.k.a. line spectrum pairs) are another representation derived from linear predictive analysis which is very popular in speech coding.
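
The conversion between reflection coefficients and log-area ratios is a direct transcription of the definition, shown here with its inverse. The function names are hypothetical.

```python
import math

def reflection_to_lar(k):
    """Log-area ratio g = 10 log10((1 + k) / (1 - k)) for k in ]-1, 1[."""
    if not -1.0 < k < 1.0:
        raise ValueError("reflection coefficient must lie in ]-1, 1[")
    return 10.0 * math.log10((1.0 + k) / (1.0 - k))

def lar_to_reflection(g):
    """Inverse mapping, obtained by solving the definition for k."""
    ratio = 10.0 ** (g / 10.0)
    return (ratio - 1.0) / (ratio + 1.0)
```

The mapping is odd in k and expands the regions where |k| is close to one.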


2.5 Cepstral analysis

Probably the most popular features for speech recognition, the cepstral coefficients can be derived both from the filter-bank and linear predictive analyses. From the theoretical point of view, the cepstrum is defined as the inverse Fourier transform of the logarithm of the Fourier transform modulus. Therefore, by keeping only the first few cepstral coefficients and setting the remaining coefficients to zero, it is possible to smooth the harmonic structure of the spectrum(3). Cepstral coefficients are therefore very convenient coefficients to represent the speech spectral envelope.

In practice, cepstral coefficients can be obtained from the filter-bank energies e_i via a discrete cosine transform (DCT) given by

c_i = \sqrt{2/N} \sum_{j=1}^{N} e_j \cos(\pi i (j - 0.5) / N) ,
where N is the number of channels in the filter-bank and i \in [1,M] (M <= N). Cepstral coefficients can also be obtained from the linear prediction coefficients a_i according to
c_i = -a_i - (1 / i) \sum_{j=1}^{i-1} (i - j) a_j c_{i-j} ,
for i \in [1,M] with M <= p, the prediction order.
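
Both conversions can be sketched as follows. The function names are hypothetical. In the second function, the sign of the sum is the one obtained by differentiating log H(z) with the convention H(z) = 1 / (1 + \sum a_i z^{-i}) of section 2.4, which gives c_i = -a_i - (1 / i) \sum_{j=1}^{i-1} (i - j) a_j c_{i-j}.

```python
import math

def filterbank_to_cepstrum(energies, n_ceps):
    """DCT of the N filter-bank log-energies, returning c_1..c_M."""
    n = len(energies)
    if n_ceps > n:
        raise ValueError("M must not exceed the number of channels")
    scale = math.sqrt(2.0 / n)
    ceps = []
    for i in range(1, n_ceps + 1):
        total = 0.0
        for j in range(1, n + 1):
            total += energies[j - 1] * math.cos(math.pi * i * (j - 0.5) / n)
        ceps.append(scale * total)
    return ceps

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_M of H(z) = 1 / (1 + sum_i a_i z^-i); a[0] is a_1."""
    if n_ceps > len(a):
        raise ValueError("M must not exceed the prediction order")
    ceps = []
    for i in range(1, n_ceps + 1):
        total = 0.0
        for j in range(1, i):
            total += (i - j) * a[j - 1] * ceps[i - j - 1]
        ceps.append(-a[i - 1] - total / i)
    return ceps
```

For a flat set of log-energies, all the coefficients c_1..c_M vanish, since the overall level is not captured by the coefficients of index i >= 1. For a_1 = -0.5 and a_2 = a_3 = 0, the second function returns 0.5^i / i, the known cepstrum of a single real pole at 0.5.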

Cepstral coefficients have rather different dynamics, the higher coefficients showing the smallest variances. It may sometimes be desirable to have a constant dynamic across coefficients for modeling purposes. One way to reduce these differences is liftering, which consists in applying a weight to each coefficient. The weight for the i'th coefficient is defined in a parametric way according to

h_i = 1 + (L / 2) \sin(i \pi / L) ,
where L is the lifter parameter, typically equal to 2M.
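
A minimal sketch of liftering, with hypothetical function names, where the weight is read as h_i = 1 + (L / 2) \sin(i \pi / L):

```python
import math

def lifter_weights(n_ceps, lifter):
    """Weights h_1..h_M with h_i = 1 + (L / 2) sin(i pi / L)."""
    return [1.0 + 0.5 * lifter * math.sin(i * math.pi / lifter)
            for i in range(1, n_ceps + 1)]

def apply_lifter(ceps, lifter):
    """Scale the cepstral coefficients c_1..c_M by their lifter weights."""
    weights = lifter_weights(len(ceps), lifter)
    return [h * c for h, c in zip(weights, ceps)]
```

With L = 2M, the sine increases over the whole range i = 1..M, so the weights grow with the coefficient index and compensate for the smaller variance of the higher coefficients.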


2.6 Deltas and normalization

Feature normalization can be used to reduce the mismatch between signals recorded in different conditions. In SPro, normalization consists in mean removal and, optionally, variance normalization. Cepstral mean subtraction (CMS) is probably the most popular compensation technique for convolutive distortions. In addition, variance normalization consists in normalizing the feature variance to one and is a rather popular technique in speaker recognition to deal with noise and channel mismatch. Normalization can be global or local. In the first case, the mean and standard deviation are computed globally, while in the second case they are computed on a window centered around the current time.
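
Global and local normalization of a single feature trajectory can be sketched as below. The function name and its arguments are hypothetical, and truncating the window at the signal boundaries is an assumption of this sketch. Statistics are recomputed for every frame for clarity rather than speed.

```python
import math

def normalize(track, window=None, variance=False):
    """Normalize one feature trajectory (a list of values over time).

    window=None uses global statistics; otherwise the statistics are
    computed on `window` frames centered on the current frame.
    """
    n = len(track)
    out = []
    for t in range(n):
        if window is None:
            segment = track  # global statistics
        else:
            half = window // 2  # local statistics around frame t
            segment = track[max(0, t - half):min(n, t + half + 1)]
        mean = sum(segment) / len(segment)
        value = track[t] - mean
        if variance:
            var = sum((v - mean) ** 2 for v in segment) / len(segment)
            if var > 0.0:
                value /= math.sqrt(var)
        out.append(value)
    return out
```

Applied to each cepstral coefficient independently with variance=False, the global case corresponds to cepstral mean subtraction.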

To account for the dynamic nature of speech, it is possible to append the first and second order derivatives of the chosen features to the original feature vector. In SPro, the first order derivative of a feature y_i is approximated using a second-order limited expansion given by

y_i'(t) = (y_i(t+1) - y_i(t-1) +2 (y_i(t+2) - y_i(t-2))) / 10 .
Second order differences, known as accelerations, are obtained by differentiating the first order differences. It is therefore not possible to have the acceleration without the delta features.
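
The delta and acceleration computations can be sketched as follows for a single feature trajectory. The function names are hypothetical, and repeating the first and last frames at the boundaries is an assumption of this sketch, since the text does not specify how the edges are handled.

```python
def deltas(track):
    """First order differences over a window of two frames on each side."""
    n = len(track)

    def at(t):
        # Repeat the first/last frame outside the signal (assumed edge rule).
        return track[min(max(t, 0), n - 1)]

    return [(at(t + 1) - at(t - 1) + 2.0 * (at(t + 2) - at(t - 2))) / 10.0
            for t in range(n)]

def accelerations(track):
    """Second order differences, obtained by applying deltas twice."""
    return deltas(deltas(track))
```

On a linear ramp with unit slope, the interior frames have a delta of exactly one; the values near the boundaries are smaller because of the repeated frames.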


This document was generated by Guillaume Gravier on March, 5 2004 using texi2html