MMSE (Minimum Mean Square Error) estimators are commonly used in speech enhancement, with many different estimators applied to different speech representations. In its most basic form, an MMSE estimator is used to find the most likely clean signal given a noisy signal. I will start by explaining a time domain MMSE estimator (which won't be very useful), and then we can move on to better ones.

Some notation: \(x_n\) is our clean signal (\(n\) is the sample index), \(y_n\) is our noisy signal, and \(d_n\) will be our noise. We'll assume our noise is additive i.e. \[ y_n = x_n + d_n \]

We will now start applying some assumptions: we assume both \(x_n\) and \(d_n\) are gaussian variables, with different variances. As an assumption, this is not a very good one. Time domain speech is not gaussian, and most real noise is also not gaussian. One of the things this assumption has going for it though is that it results in a very simple estimator.

So I should now explain what an estimator actually is. Given a probability distribution function for a random variable, we want to pick the most likely value the random variable can take. There are two main ways of doing this: MMSE estimators, and MAP estimators. I'll explain MAP estimators in another post, they tend to be simpler to compute but they don't usually perform as well as MMSE estimators.

MMSE estimators are derived as follows: given a pdf \(p(x_n|y_n)\) (this is the probability of the clean signal given the noisy signal) we wish to find \(\hat{x}_n\) (this is called "x hat"),which is our estimate of the clean signal. To do this we find the expected value of the pdf \(p(x_n|y_n)\). We use the following formula for expectation:

\[\hat{x}_n = \int_{-\infty}^{\infty}x_n p(x_n|y_n)dx_n\]

Lets look at the pdfs we know (remember we assumed both \(x\) and \(d\) are gaussian distributed):

\[p(x_n) = \dfrac{1}{\sqrt{2\pi\sigma_x^2}} \exp \left( \dfrac{-x_n^2}{2\sigma_x^2} \right) \]
\[p(d_n) = \dfrac{1}{\sqrt{2\pi\sigma_d^2}} \exp \left( \dfrac{-d_n^2}{2\sigma_d^2} \right) \]

Using the 2 pdfs above, we can derive the pdf for \(p(y_n|x_n)\) (the pdf of the noisy signal given the clean signal). Since we assumed \(y_n = x_n + d_n\), if we are given the value \(x_n\), we know \(y_n\) must be close to it with a variance of \(\sigma_d^2\) i.e. the noise variance. This results in:

\[p(y_n|x_n) = \dfrac{1}{\sqrt{2\pi\sigma_d^2}} \exp \left(\dfrac{ -(y_n-x_n)^2 }{ 2\sigma_d^2 } \right) \]

Despite how easy it is to calculate \(p(y_n|x_n)\), it is not the pdf we want. We can compute \(p(x_n|y_n)\) using Bayes rule:

\[p(x_n|y_n) = \dfrac{p(y_n|x_n)p(x_n)}{\int_{-\infty}^\infty p(y_n|x)p(x) dx}\]

The value we actually want is:

\[\hat{x}_n = \int x_n p(x_n|y_n) dx_n = \int \dfrac{x_n p(y_n|x_n)p(x_n)}{\int_{-\infty}^\infty p(y_n|x)p(x) dx} dx_n\]

Now we substitute the expressions for the pdfs into our equations to get a bit of a monster:

\[\hat{x}_n = \int{ \dfrac{x_n \dfrac{1}{\sqrt{2\pi\sigma_d^2}} \exp \left(\dfrac{ -(y_n-x_n)^2 }{ 2\sigma_d^2 } \right) \dfrac{1}{\sqrt{2\pi\sigma_x^2}} \exp \left( \dfrac{-x_n^2}{2\sigma_x^2} \right)}{\int_{-\infty}^\infty \dfrac{1}{\sqrt{2\pi\sigma_d^2}} \exp \left(\dfrac{ -(y_n-x)^2 }{ 2\sigma_d^2 } \right) \dfrac{1}{\sqrt{2\pi\sigma_x^2}} \exp \left( \dfrac{-x^2}{2\sigma_x^2} \right) dx} dx_n}\]

If you want to do it by hand you can simplify the above expression a bit and use a book of integrals to solve it, but the easier way is just to use mathematica to do it for you (or this page which helps with integrals). The result is the following very elegant solution:

\[\hat{x}_n = \dfrac{\sigma_x^2}{\sigma_x^2 + \sigma_d^2}y_n \]

If \(x_n\) and \(d_n\) were actually gaussian random variables, then the estimator we derived above is optimal i.e. no better estimator of \(y_n\) exists. In reality, we would also need to know \(\sigma_d^2\) and \(\sigma_x^2\), and these can be very tricky to estimate. In the next post I will use the same method but different probability distributions to derive other estimators.

In the next post I'll do a slightly better MMSE estimator that doesn't use gaussians. By using more complicated pdfs we can improve our estimator, but we will also get much more complicated integrals. In some cases the integrals won't be analytically solvable at all, so we'll have to be happy approximating them numerically. If this is not possible because of e.g. timing constraints, we'll have to settle on a MAP estimator.