The analysis of periodic signals works best on continuous, stable curves. The sounds of speech, however, are anything but stable, which makes them harder to analyze. Praat works around this by assuming that speech is sufficiently stable within small enough fragments, which are called ‘analysis windows’.

This is the sound we’ll be working with: a complex sound wave with a fundamental frequency of 140 Hz and a harmonic at 280 Hz.
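
A minimal Python sketch of such a signal follows. The 44.1 kHz sampling rate, the 40 ms duration and the relative amplitude of the harmonic are assumptions chosen for illustration; only the two frequencies come from the description above.

```python
import math

SAMPLE_RATE = 44100        # samples per second (assumed)
F0 = 140.0                 # fundamental frequency in Hz
HARMONIC = 280.0           # harmonic frequency in Hz
HARMONIC_AMPLITUDE = 2.0   # assumed: the harmonic is louder than the fundamental


def synthesize(duration_s, fs=SAMPLE_RATE):
    """Return samples of sin(2*pi*F0*t) + HARMONIC_AMPLITUDE * sin(2*pi*HARMONIC*t)."""
    n_samples = int(round(duration_s * fs))
    signal = []
    for i in range(n_samples):
        t = i / fs
        signal.append(math.sin(2 * math.pi * F0 * t)
                      + HARMONIC_AMPLITUDE * math.sin(2 * math.pi * HARMONIC * t))
    return signal


signal = synthesize(0.04)
period_in_samples = SAMPLE_RATE / F0   # 315.0 samples per fundamental period
```

At this sampling rate one fundamental period spans exactly 315 samples, which makes the later lag values easy to check by hand.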

Each window is tapered so that the signal fades smoothly to zero at its edges, which facilitates analysis. The taper used in this demonstration is a Hanning window (Boersma, 1993). According to the Praat documentation, a Hanning window is more responsive when working with 3 periods per analysis window, while a Gaussian window is better when working with a larger analysis window (the Gaussian window is twice as long as the Hanning window).

Praat uses both of these windows by default, depending on the task and the degree of precision required.

We apply the window by multiplying the signal by the window function, sample by sample.
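
The multiplication can be sketched in a few lines of Python. The window is built from the usual Hanning definition, $w[n] = 0.5 - 0.5\cos(2\pi n / (N - 1))$; the function names and the constant test frame are illustrative assumptions, not part of Praat.

```python
import math


def hanning(n_samples):
    """Hanning window: zero at both edges, one in the middle."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def apply_window(frame, window):
    """Multiply frame and window sample by sample."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [x * w for x, w in zip(frame, window)]


frame = [1.0] * 9   # a constant frame makes the taper easy to see
windowed = apply_window(frame, hanning(len(frame)))
print([round(v, 3) for v in windowed])
```

The edges of the frame are pushed to zero while the center sample is left unchanged.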

To detect a sound’s pitch, Praat uses autocorrelation: each windowed segment is compared with time-shifted copies of itself.

An autocorrelation plot shows, on the Y-axis, how strongly the signal correlates with its shifted copy and, on the X-axis, the time lag of each comparison. If the signal is periodic, the autocorrelation curve should show a peak when the lag is equal to the signal’s period.
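
To make this concrete, here is a small sketch of a normalized autocorrelation, $r(\tau) = \sum_t x_t x_{t+\tau} / \sum_t x_t^2$, applied to a pure 140 Hz tone. The sampling rate, the frame length and the 75 to 600 Hz search range are assumptions made for the example.

```python
import math

SAMPLE_RATE = 44100   # assumed sampling rate
FRAME_LENGTH = 1764   # 40 ms at 44.1 kHz (assumed)


def autocorrelation(x, max_lag):
    """Normalized autocorrelation for lags 0..max_lag, so that r[0] == 1."""
    energy = sum(v * v for v in x)
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag)) / energy
            for lag in range(max_lag + 1)]


tone = [math.sin(2 * math.pi * 140.0 * i / SAMPLE_RATE) for i in range(FRAME_LENGTH)]

min_lag = math.ceil(SAMPLE_RATE / 600)   # shortest period searched (600 Hz)
max_lag = SAMPLE_RATE // 75              # longest period searched (75 Hz)
r = autocorrelation(tone, max_lag)

# Skip the trivial maximum at lag 0 and look for the highest remaining peak.
peak_lag = max(range(min_lag, max_lag + 1), key=lambda lag: r[lag])
print(peak_lag, SAMPLE_RATE / 140.0)     # the peak falls close to the 315-sample period
```

Because the frame is finite, the peak can sit a sample or so away from the exact period and is somewhat lower than 1.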

The autocorrelation is always highest at a time lag of 0, so to find significant periodicity we need to look for peaks at lags greater than 0. In this case, however, since we are working with a complex sound wave with a loud harmonic, the autocorrelation curve shows a false peak (red line) at a shorter lag than the one that corresponds to the sound’s actual fundamental period (blue line), which is aligned with a lower peak.

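To correct for this, we divide the normalized autocorrelation of the windowed signal, $r_a(\tau)$, by the normalized autocorrelation of the window function itself, $r_w(\tau)$: $r_x(\tau) \approx r_a(\tau) / r_w(\tau)$ (Boersma, 1993).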

The result is an estimate of the autocorrelation of the original signal, which is both robust and better suited to the analysis of speech signals than the methods used previously. Note that this estimate becomes increasingly unreliable beyond roughly half the length of the analysis window (Boersma, 1993).
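
The correction can be sketched as follows for the 140 Hz + 280 Hz test signal. The sampling rate, the 40 ms frame, the harmonic amplitude and all function names are assumptions for illustration, and this is not Praat's actual implementation. The window's own autocorrelation is computed numerically here, and lags are limited to less than half the window length.

```python
import math

SAMPLE_RATE = 44100        # assumed sampling rate
FRAME_LENGTH = 1764        # 40 ms frame (assumed)
HARMONIC_AMPLITUDE = 2.0   # assumed: the 280 Hz harmonic is louder than the fundamental


def autocorrelation(x, max_lag):
    """Normalized autocorrelation for lags 0..max_lag, so that r[0] == 1."""
    energy = sum(v * v for v in x)
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag)) / energy
            for lag in range(max_lag + 1)]


signal = [math.sin(2 * math.pi * 140.0 * i / SAMPLE_RATE)
          + HARMONIC_AMPLITUDE * math.sin(2 * math.pi * 280.0 * i / SAMPLE_RATE)
          for i in range(FRAME_LENGTH)]
window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (FRAME_LENGTH - 1))
          for n in range(FRAME_LENGTH)]
windowed = [s * w for s, w in zip(signal, window)]

max_lag = FRAME_LENGTH // 2 - 1            # stay below half the window length
r_a = autocorrelation(windowed, max_lag)   # autocorrelation of the windowed signal
r_w = autocorrelation(window, max_lag)     # autocorrelation of the window itself
r_x = [a / w for a, w in zip(r_a, r_w)]    # estimate for the original signal

period = 315                               # 44100 / 140 samples
print(round(r_a[period], 2), round(r_x[period], 2))
```

With these settings, `r_a` at the true period should come out at roughly 0.8, because the window attenuates the correlation at longer lags, while the corrected `r_x` should be close to 1.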

By finding the maximum of this estimated curve at a time lag greater than 0, we can calculate the pitch of the original signal by converting that lag from samples to Hz:

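$f_0 = \frac{1}{lag_{max} / f_s} = \frac{f_s}{lag_{max}}$

where $lag_{max}$ is the lag of that maximum in samples and $f_s$ is the sampling frequency in Hz.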
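
As a worked example of this conversion, the sketch below picks the highest peak of an autocorrelation curve within a plausible lag range and turns it into a frequency. To keep it short, the curve is the ideal autocorrelation of an infinitely long 140 Hz + 280 Hz signal rather than an estimate computed from audio; the sampling rate, the harmonic amplitude and the 75 to 600 Hz search range are assumptions.

```python
import math

SAMPLE_RATE = 44100        # f_s, assumed
HARMONIC_AMPLITUDE = 2.0   # assumed relative amplitude of the 280 Hz harmonic


def ideal_autocorrelation(lag):
    """Autocorrelation of sin(2*pi*140*t) + A*sin(2*pi*280*t), equal to 1 at lag 0."""
    a2 = HARMONIC_AMPLITUDE ** 2
    return (math.cos(2 * math.pi * 140.0 * lag / SAMPLE_RATE)
            + a2 * math.cos(2 * math.pi * 280.0 * lag / SAMPLE_RATE)) / (1 + a2)


def lag_to_hz(lag_samples, fs=SAMPLE_RATE):
    """f0 = 1 / (lag / fs) = fs / lag."""
    return fs / lag_samples


min_lag = math.ceil(SAMPLE_RATE / 600)   # ignore lags shorter than a 600 Hz period
max_lag = SAMPLE_RATE // 75              # ignore lags longer than a 75 Hz period
lag_max = max(range(min_lag, max_lag + 1), key=ideal_autocorrelation)

print(lag_max, lag_to_hz(lag_max))       # 315 140.0
```

Restricting the search to a lag range is a design choice of this sketch: it excludes the trivial maximum at lag 0 and any multiples of the period that lie outside the expected pitch range.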