The Wiener filter is the optimal filter to use if you want to remove Gaussian noise from a Gaussian signal. Most useful applications though involve multiple (possibly correlated) signals with additive noise, so a multivariate Wiener filter is required. In this post I'll derive the multivariate Wiener filter as a Maximum A-Posteriori (MAP) estimator of the clean signal given the noisy signal. Because Gaussians are symmetric, the MAP estimator will be identical to the MMSE estimator, with the advantage that the MAP estimator is much easier to compute.
So our assumption is we have a noisy signal \(y(n)\) which is the result of adding gaussian noise \(d(n)\) to our clean signal \(x(n)\). Remember each of these are vectors, so \(y(n)\), \(d(n)\) and \(x(n)\) are vectors with \(k\) dimensions.
\[ y(n) = x(n) + d(n)\]What we want to do is find the most likely value of \(x(n)\) given \(y(n)\). We never get to know \(d(n)\), but we do have an estimate of the covariance of \(d(n)\) which we will call \(\Sigma_d\). We assumed initially that \(x(n)\) and \(d(n)\) were distributed according to the multivariate Gaussian distribution, but because the pdf takes up so much space, I'll be using the following notation:
\[ a \sim \mathcal{N}(\boldsymbol\mu_a,\boldsymbol\Sigma_a) = (2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma_a|^{-\frac{1}{2}}\, e^{ -\frac{1}{2}(\mathbf{a}-\boldsymbol\mu_a)'\boldsymbol\Sigma_a^{-1}(\mathbf{a}-\boldsymbol\mu_a) } \]This means \(a\) is distributed according to a multivariate Gaussian with mean \(\boldsymbol\mu_a\) and covariance \(\boldsymbol\Sigma_a\). So we have \(d(n) \sim \mathcal{N}(0,\boldsymbol\Sigma_d)\) and \(x(n) \sim \mathcal{N}(\boldsymbol\mu_x,\boldsymbol\Sigma_x)\).
From our intial assumtion that \(y(n) = x(n) + d(n)\), we can also determine the pdf of \(y(n)\)
This will give us the pdf of the clean signal given the noisy, then we just have to find the maximum of the distribution to find our estimate of \(x\). Note that \(p(y)\) in the above equation does not effect the maximisation (because it does not depend on \(x\)), so it will be left out. From here we write out everything we have, using the values for \(p(x)\) and \(p(y|x)\) we worked out above, hopefully fitting everything on the page:
\[ p(x | y) = p(y|x)p(x)\] \[=(2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma_d|^{-\frac{1}{2}}\, e^{ -\frac{1}{2}(y - x)'\boldsymbol\Sigma_d^{-1}(y-x)} \times (2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma_x|^{-\frac{1}{2}}\, e^{ -\frac{1}{2}(x-\boldsymbol\mu_x)'\boldsymbol\Sigma_x^{-1}(x-\boldsymbol\mu_x)}\]We want to maximise the equation above, unfortunately it is rather complicated. To get around this we take the log of the whole thing, since maximising a function and maximising its log will give the same point. The result is much easier to differentiate:
\[ \log(p(x | y)) = \log(C) - \frac{1}{2}(y - x)'\boldsymbol\Sigma_d^{-1}(y-x)- \frac{1}{2}(x-\boldsymbol\mu_x)'\boldsymbol\Sigma_x^{-1}(x-\boldsymbol\mu_x) \]where \(C\) consists of all the bits that don't depend on \(x\).
\[C = (2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma_d|^{-\frac{1}{2}} (2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma_x|^{-\frac{1}{2}}\]Now, to maximise \(\log(p(x|y))\), we differentiate it. This involves some vector calculus, the rule we use can be seen in The Matrix Cookbook (page 11, eqn 85 and 86).
\[ \dfrac{d\log(p(x | y))}{dx} = -\boldsymbol\Sigma_d^{-1}(y-x) + \boldsymbol\Sigma_x^{-1}(x-\boldsymbol\mu_x) \]To find the maximum we set the left hand side to zero, then solve for \(x\):
\[ 0 = -\boldsymbol\Sigma_d^{-1}(y-x) + \boldsymbol\Sigma_x^{-1}(x-\boldsymbol\mu_x) \]With a tiny bit of algebra, we can manipulate the above equation to isolate \(x\),
\[\hat{x'} = \boldsymbol\Sigma_x(\boldsymbol\Sigma_x + \boldsymbol\Sigma_d)^{-1}(y - \boldsymbol\mu_x) + \boldsymbol\mu_x \]and that is it. To get an estimate for the clean signal, we just use the equation above which takes the noisy signal. This is optimal (i.e. the best that is theoretically possible) if \(x\) and \(d\) are both Gaussian. It can still be O.K. if they are not, but you can easily repeat the above steps for different distributions. If the distributions aren't symmetric, then the MMSE estimator will usually perform better than the MAP estimator.