Sunday, 30 November 2014

Framing and reconstructing speech signals

This post will deal with framing and overlap-add resynthesis. This is also known as AMS (Analysis-Modification-Synthesis) when doing things like speech enhancement. First of all, what is the point of framing? An audio signal is constantly changing, so we assume that on short time scales it doesn't change much (by "doesn't change" we mean it is statistically stationary; obviously the samples themselves are constantly changing, even on short time scales). This is why we frame the signal into 20-40ms frames. If the frame is much shorter, we don't have enough samples to get a reliable spectral estimate; if it is longer, the signal changes too much throughout the frame and the FFT will end up smearing the contents of the frame.

What is involved: frame the speech signal into short, overlapping frames. Typically frames are taken to be about 25ms long. For a 16kHz sampled audio file, this corresponds to 0.025s * 16,000 samples/s = 400 samples in length. We then use an overlap of 50%, i.e. 200 samples. This means the first frame starts at sample 0, the second starts at sample 200, the third at 400, etc.

MATLAB code for framing: frame_sig.m, and for unframing: deframe_sig.m.

Framing the signal is pretty simple. The only thing to note is that the signal is padded with zeros so that it makes an integer number of frames. A window function is also applied to each frame. The overlap-add process has a few things that make it tricky. As well as adding up the overlapped frames, we also add up the window correction, which is basically what our signal would be if every frame was just the window. This is important since the windowed frames won't necessarily add up to give the original signal back, so the summed frames are divided by the window correction. You can see this by plotting the window_correction variable in deframe_sig.m and thinking about how it gets like that. We also have to add eps (this is just a very small constant, i.e. epsilon) to the window correction in case it is ever zero; this prevents infs appearing in our reconstructed signal.
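The framing and overlap-add steps described above can be sketched roughly as follows. This is a minimal illustration, not the contents of frame_sig.m or deframe_sig.m: the variable names (frame_len, frame_step, frames, rec) are made up for this example, the test signal is a synthetic tone, and the Hamming window is computed by hand so that no toolbox is needed.

```matlab
% Sketch of framing and overlap-add resynthesis (illustrative only).
fs = 16000;                        % sampling frequency (Hz)
frame_len = 400;                   % 25 ms at 16 kHz
frame_step = 200;                  % 50% overlap
sig = sin(2*pi*440*(0:999)'/fs);   % 1000-sample test tone (column vector)

% Hamming window computed by hand to avoid a toolbox dependency
n = (0:frame_len-1)';
win = 0.54 - 0.46*cos(2*pi*n/(frame_len-1));

% Pad with zeros so the signal fills an integer number of frames
siglen = length(sig);
num_frames = 1 + ceil(max(siglen - frame_len, 0) / frame_step);
padlen = (num_frames-1)*frame_step + frame_len;
padded = [sig; zeros(padlen - siglen, 1)];

% Framing: one windowed frame per row
frames = zeros(num_frames, frame_len);
for k = 1:num_frames
    idx = (k-1)*frame_step + (1:frame_len);
    frames(k,:) = (padded(idx) .* win)';
end

% (any per-frame modification, e.g. spectral processing, would go here)

% Overlap-add: sum the frames and, separately, sum the windows
rec = zeros(padlen, 1);
window_correction = zeros(padlen, 1);
for k = 1:num_frames
    idx = (k-1)*frame_step + (1:frame_len);
    rec(idx) = rec(idx) + frames(k,:)';
    window_correction(idx) = window_correction(idx) + win;
end

% Divide out the summed windows; eps guards against division by zero
rec = rec ./ (window_correction + eps);
rec = rec(1:siglen);               % drop the zero padding

max_err = max(abs(rec - sig));     % reconstruction error with no modification
```

With no modification between framing and overlap-add, rec should match sig to within rounding error, which is a handy sanity check before putting any enhancement step in the middle.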

To see how the AMS framework can be used for spectral subtraction, have a look at this spectral subtraction tutorial. The framing and deframing routines on this page can be used to implement the enhancement routines there. Some example code for the tutorial above would look something like this:

Processing lists of files in MATLAB

It is often necessary to apply a certain operation to many files all at once e.g. adding noise to a database, creating feature files from a list of audio files, enhancing files, etc. This post will explain the method I use for processing lists of files, because I tend to use it quite often.

The main points are to use the unix find command to generate the file list (of course this is not necessary; any method for generating file lists is fine), then use MATLAB's fgetl to read each file name and process it. The code for adding noise to a list of NIST files is below. You can find the read_NIST_file script here, and the addnoise script can be found here.

Note that we check for end of file by checking ischar(wavname): once the end of file is reached, fgetl returns -1, so wavname (a line from the file list) will no longer contain character data.
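A minimal sketch of the fgetl loop described above is given here. It is not the author's script: the file names are hypothetical, the list is written to a temporary file only so the example runs on its own, and the per-file processing is left as a comment. In practice the list might come from something like find /path/to/db -name "*.wav" > list.txt.

```matlab
% Sketch: loop over a list of file names, one per line.
list_file = [tempname() '.txt'];
fid = fopen(list_file, 'w');
fprintf(fid, '%s\n', 'a.wav', 'b.wav', 'c.wav');   % hypothetical file names
fclose(fid);

fid = fopen(list_file, 'r');
names = {};
wavname = fgetl(fid);
while ischar(wavname)            % fgetl returns -1 (numeric) at end of file
    % per-file processing would go here, e.g. read, add noise, write
    names{end+1} = wavname;
    wavname = fgetl(fid);
end
fclose(fid);
delete(list_file);
```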

Saturday, 29 November 2014

Adding Noise of a certain SNR to audio files

A common task when dealing with audio is to add noise to files, e.g. if you want to test the performance of a speech recognition system in the presence of noise. This is based on computing the Signal to Noise Ratio (SNR) of the speech vs. noise. To compute the energy in a speech file, just sum the squares of all the samples:

\[ E_{Speech} = \sum_{i=1}^{N} s(i)^2 \]

where \(s(i)\) is the vector of speech samples you read with a function like wavread. We will also need some noise, which we can generate using a function like randn(N,1) where N is the length of the speech signal. Alternatively we can use a dedicated noise file containing e.g. babble noise and just truncate it at the correct length. When using a noise file, it is important to randomise the start position for each file so you don't always have e.g. a door banging or a guy laughing at the same point in every file. This can mess with classifiers. Anyway, now compute the energy of the noise:

\[ E_{Noise} = \sum_{i=1}^{N} n(i)^2 \]

where \(n(i)\) is the noise vector. To compute the SNR of a speech file compared to noise:

\[ SNR = 10\log_{10} \left( \dfrac{E_{Speech}}{E_{Noise}} \right) \]

If you don't have the pure noise, only a corrupted version of the original, you can compute the noise as \(n(i) = x(i) - s(i)\), where \(x(i)\) is the corrupted signal.

Now we want to scale the noise by a certain amount and add it to the original speech signal so that the SNR is correct. This assumes we have a target SNR; for the sake of this post, assume we want the noise to be at 20dB SNR. We now use the following formula (if the noise has been normalised to unit energy, i.e. \(E_{Noise} = 1\), the \(E_{Noise}\) term drops out):

\[ K = \sqrt{ \dfrac{E_{Speech}}{E_{Noise} \cdot 10^{20/10}} } \]

Once we have done this we need to create \(\hat{n}(i) = K\times n(i)\) for our noise samples. Our noisy speech file is calculated as:

\[ x(i) = s(i) + \hat{n}(i) \]

for \(i = 1 \ldots N\). You should be able to compute the SNR between the new noisy signal and the original signal and it should come out to be very close to 20dB (it could be 19.9 or 20.1 or something). Function for computing snr in matlab: snr.m, function for adding white noise to files: addnoise.m.
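The steps above can be sketched as follows. This is a rough illustration rather than the contents of addnoise.m or snr.m: the speech is replaced by a synthetic tone, and the noise energy is included explicitly in the scale factor so the noise does not have to be normalised beforehand.

```matlab
% Sketch: add white noise to a signal at a target SNR (illustrative only).
target_snr_db = 20;
N = 16000;
s = sin(2*pi*200*(0:N-1)'/16000);    % stand-in for the speech samples
n = randn(N, 1);                     % white Gaussian noise

E_speech = sum(s.^2);
E_noise  = sum(n.^2);

% Scale factor chosen so that E_speech / sum((K*n).^2) equals 10^(SNR/10)
K = sqrt(E_speech / (E_noise * 10^(target_snr_db/10)));
n_hat = K * n;
x = s + n_hat;                       % noisy signal

% Check: recover the noise as x - s and recompute the SNR
measured_snr_db = 10*log10(sum(s.^2) / sum((x - s).^2));
```

Because K here is computed from the actual energy of this particular noise vector, measured_snr_db comes out at the target up to rounding error.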

Thursday, 27 November 2014

Reading and writing NIST, RAW and WAV files in MATLAB

To open files (NIST or WAV) when you are not sure which it could be, use audioread.m, which depends on the read_X_file.m explained below.

NIST files

NIST files are very common when doing speech processing, for example the TIMIT and RM1 speech databases are in NIST format. The NIST format consists of 1024 bytes at the start of the file consisting of ASCII text, after this header the speech data follows. For TIMIT and RM1 the speech data is 16-bit signed integers which can be read directly from the file and treated as a signal.

To get the sampling frequency, you'll have to parse the text that forms the header. In any case, the functions to read and write NIST files are supplied here: read_NIST_file.m and write_NIST_file.m.

Note that writing a NIST file using the scripts above requires a header. The easiest way to get a header is to read another NIST file. So if you want to modify a NIST file then you would use something like:

[signal, fs, header] = read_NIST_file('/path/to/file.wav');

Passing the header returned by read_NIST_file back to write_NIST_file reuses it, and this works fine. If you want to create completely new files, i.e. there is no header to copy, I recommend not creating NIST formatted files; create wav files instead, as they are far better supported by everything.

Here is an example header from the TIMIT file timit/test/dr3/mrtk0/sx283.wav. Note that the magic string NIST_1A forms the first 7 bytes of the file. The actual audio data starts at byte 1024; the space between the end of the header text and byte 1024 is just newline characters.

database_id -s5 TIMIT
database_version -s3 1.0
utterance_id -s10 rtk0_sx283
channel_count -i 1
sample_count -i 50791
sample_rate -i 16000
sample_min -i -2780
sample_max -i 4675
sample_n_bytes -i 2
sample_byte_format -s2 01
sample_sig_bits -i 16
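As a rough sketch of the header parsing mentioned earlier, the sample rate and sample count can be pulled out of the header text with a regular expression. This is not taken from read_NIST_file.m. The header_text variable below is a hand-built stand-in containing a few of the fields from the example; for a real file, the first 1024 bytes could be read with something like fread(fid, 1024, 'uint8=>char')'.

```matlab
% Sketch: extract fields from NIST header text (illustrative only).
header_text = sprintf('%s\n', ...
    'database_id -s5 TIMIT', ...
    'channel_count -i 1', ...
    'sample_count -i 50791', ...
    'sample_rate -i 16000', ...
    'sample_n_bytes -i 2');

% Each integer field has the form "name -i value"
tok = regexp(header_text, 'sample_rate -i (\d+)', 'tokens', 'once');
fs = str2double(tok{1});

tok = regexp(header_text, 'sample_count -i (\d+)', 'tokens', 'once');
num_samples = str2double(tok{1});
```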

RAW files

RAW files are just pure audio data without a header. This can make it a little difficult to figure out what is actually in them; often you may just have to try things until you get meaningful data coming out. A common setting would be 16-bit signed integer samples. You'll also have to take care of endianness: if the files were written with a different byte order from the one your machine uses (e.g. big-endian data read on a little-endian machine), all the integers will be byte-swapped and you'll have to swap them back.

read_RAW_file.m and write_RAW_file.m.
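The endianness issue can be seen with a small round trip. The sketch below is not the code in read_RAW_file.m or write_RAW_file.m. It writes a few made-up 16-bit samples as big-endian to a temporary file, then reads them back with the right and the wrong byte order.

```matlab
% Sketch: 16-bit RAW data and byte order (illustrative only).
fname = [tempname() '.raw'];
samples = int16([0; 1; -2; 1000; -32768; 32767]);

fid = fopen(fname, 'w', 'ieee-be');      % write with big-endian byte order
fwrite(fid, samples, 'int16');
fclose(fid);

% Correct: tell fopen the byte order the file was written with
fid = fopen(fname, 'r', 'ieee-be');
good = fread(fid, inf, 'int16=>int16');
fclose(fid);

% Wrong byte order gives byte-swapped values; swapbytes repairs them
fid = fopen(fname, 'r', 'ieee-le');
swapped = fread(fid, inf, 'int16=>int16');
fclose(fid);
fixed = swapbytes(swapped);

delete(fname);
```

If a file read with one byte order looks like noise with huge values, trying the other byte order is usually the first thing to check.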

WAV files

WAV files are supported natively by MATLAB, so you can just use MATLAB's wavread and wavwrite functions (in newer MATLAB releases these have been replaced by the built-in audioread and audiowrite).

Generating Pretty Spectrograms in MATLAB

Spectrograms are a time-frequency representation of speech (or any other) signals. It can be difficult to make them pretty, as there are a lot of settings that change various properties. This post will supply some code for generating spectrograms in MATLAB, along with an explanation of all the settings that affect the final spectrogram.

The file itself can be found here: myspectrogram.m.

An example of what it can do:

We will now look at some of the default settings. To call it, all you need to do is: myspectrogram(signal,fs);. This assumes signal is a sequence of floats that represents the time domain samples of an audio file, and fs is the sampling frequency. fs is used to display the time (in seconds) at the bottom of the spectrogram.

There are a lot more settings, the full call with everything included is:

[handle] = myspectrogram(s, fs, nfft, T, win, Slim, alpha, cmap, cbar);

s     - speech signal
fs    - sampling frequency
nfft  - fft analysis length, default 1024
T     - vector of frame width and frame shift (ms), i.e. [Tw, Ts], default [18,1] in ms
win   - analysis window handle, default @hamming
Slim  - vector of spectrogram limits (dB), i.e. [Smin Smax], default [-45, -2]
alpha - fir pre-emphasis filter coefficients, default false (no preemphasis)
cmap  - color map, default 'default'
cbar  - color bar (boolean), default false (no colorbar)

nfft is the number of points used in the FFT. Larger values of nfft will show more detail, but with diminishing returns. 1024 is a good value.

The frame lengths and shifts can be important, shorter window lengths give better time resolution but poorer frequency resolution. Long window lengths give better frequency resolution but poorer time resolution. By playing with these numbers you can get a better idea of how they work.
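To get a feel for these numbers, here is a small sketch that converts the default T = [18, 1] ms into samples. It assumes a 16kHz sampling rate, which is only an example value, and the variable names are made up. The fs/Nw figure is only a rough guide to frequency resolution; the actual resolution also depends on the window shape.

```matlab
% Sketch: convert frame width and shift from ms to samples (illustrative only).
fs = 16000;                          % assumed sampling frequency (Hz)
T  = [18, 1];                        % [Tw, Ts] in ms
Nw = round(fs * T(1) / 1000);        % frame width in samples
Ns = round(fs * T(2) / 1000);        % frame shift in samples
freq_res_hz = fs / Nw;               % rough frequency resolution of one frame (Hz)
```

Doubling Tw roughly halves freq_res_hz, but each frame then covers twice as much time, which is the trade-off described above.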

The window function is specified as a function handle. @hamming is the MATLAB hamming function. If you want blackman you would use @blackman. For a parameterised window like the Chebyshev, you can use @(x)chebwin(x,30), using 30dB as the Chebyshev parameter. For a rectangular window you can use @(x)ones(x,1).

The vector of spectrogram limits clips the spectrogram at these points. First the highest point of the log spectrogram is set to 0, then everything outside the limits is set to the limits. This can make spectrograms look much cleaner if there is noise below e.g. -50dB, which is not audible but makes the spectrogram look messy. You can remove it with the limits.
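The limiting step can be sketched like this. It is an illustration of the idea rather than code taken from myspectrogram.m. The matrix S is a tiny made-up magnitude spectrogram, and the 20*log10 conversion is an assumption about how the dB values are formed.

```matlab
% Sketch: normalise a log spectrogram to 0 dB and clip it (illustrative only).
Slim = [-45, -2];                          % [Smin, Smax] in dB
S = [1 0.5 1e-4; 0.1 1e-3 0.9];            % stand-in magnitude spectrogram
S_db = 20*log10(S + eps);                  % log magnitude (eps avoids log of zero)
S_db = S_db - max(S_db(:));                % highest point becomes 0 dB
S_db = min(max(S_db, Slim(1)), Slim(2));   % everything outside the limits is set to the limits
```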