Anisotropic Gaussians¶
Today, we’ll take a look at anisotropic Gaussians: normal distributions whose isocontours are ellipsoids rather than perfect spheres. Another way to put this: the standard deviation is larger in some directions than in others. In this lecture, we’ll look at the consequences of this for LDA and QDA. Different directions, different variances.
Recall the multivariate normal PDF is:

\[ f(x) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma|}}\,\exp\!\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right) = n(q(x)), \]

where \(q(x) = (x-\mu)^T\Sigma^{-1}(x-\mu)\) is the quadratic part of \(f\), mapping a d-dimensional vector to a scalar, and \(n(q)\) is the outer part, an exponential map taking that scalar to another scalar:

\[ n(q) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma|}}\,e^{-q/2}. \]
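To make the decomposition concrete, here is a minimal NumPy sketch (the names `quadratic_part`, `outer_part`, and `gaussian_pdf` are illustrative, not from the lecture) that evaluates \(f(x) = n(q(x))\):

```python
import numpy as np

def quadratic_part(x, mu, Sigma):
    """q(x) = (x - mu)^T Sigma^{-1} (x - mu): maps a d-vector to a scalar."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

def outer_part(q, Sigma):
    """n(q) = exp(-q/2) / sqrt((2*pi)^d |Sigma|): maps a scalar to a scalar."""
    d = Sigma.shape[0]
    return np.exp(-q / 2) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def gaussian_pdf(x, mu, Sigma):
    """Anisotropic multivariate normal density f(x) = n(q(x))."""
    return outer_part(quadratic_part(x, mu, Sigma), Sigma)

# Example: an anisotropic Gaussian in 2D with more variance along the first axis.
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.0],
                  [1.0, 1.0]])
print(gaussian_pdf(np.array([1.0, -0.5]), mu, Sigma))
```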
Fitting a Gaussian to the sample points in each class by maximum likelihood estimation gives the conditional covariance for points in class \(C\):

\[ \hat{\Sigma}_C = \frac{1}{n_C}\sum_{i:\,y_i = C} (X_i - \hat{\mu}_C)(X_i - \hat{\mu}_C)^T. \]

For each sample point in class \(C\), we subtract the sample mean \(\hat{\mu}_C\) and take the outer product of that vector with itself. We sum these outer-product matrices over all the points in class \(C\), then divide by \(n_C\), the number of points in class \(C\).
For QDA, we calculate \(\hat{\Sigma}_C\) for each class \(C\), as well as the priors and means as usual. Once we have these parameters, we have the Gaussian PDF that best fits the sample points \(X_i\) in each class, and doing QDA/LDA is straightforward.
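As a sketch of this fitting step (the function name `fit_qda` and the arrays `X` of sample points and `y` of class labels are hypothetical names, not from the lecture), the priors, means, and conditional covariances could be estimated as:

```python
import numpy as np

def fit_qda(X, y):
    """Estimate prior, mean, and conditional covariance for each class.

    X: (n, d) array of sample points; y: length-n array of class labels.
    """
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]                 # points in class C
        n_c = len(Xc)
        mu_c = Xc.mean(axis=0)         # sample mean of class C
        diffs = Xc - mu_c
        # Sum of outer products (X_i - mu_C)(X_i - mu_C)^T, divided by n_C.
        Sigma_c = diffs.T @ diffs / n_c
        params[c] = {"prior": n_c / n, "mean": mu_c, "cov": Sigma_c}
    return params
```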
Note that \(\hat{\Sigma}_C\) is a sample covariance matrix: it is always positive semidefinite, but not necessarily positive definite (it can have eigenvalues of zero, corresponding to directions with zero variance). If it’s not positive definite, it doesn’t actually define a true normal distribution, and the isocontours degenerate from ellipsoids into flat, line-like shapes!
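One way to check this in practice (a small sketch, not part of the lecture; the name `is_positive_definite` is made up) is to look at the eigenvalues of \(\hat{\Sigma}_C\):

```python
import numpy as np

def is_positive_definite(Sigma, tol=1e-10):
    """A symmetric PSD matrix is positive definite iff all eigenvalues are > 0."""
    eigvals = np.linalg.eigvalsh(Sigma)   # eigenvalues of a symmetric matrix
    return bool(np.all(eigvals > tol))

# A singular sample covariance (e.g., from too few points) fails the check.
Sigma_singular = np.array([[1.0, 1.0],
                           [1.0, 1.0]])  # eigenvalues are 2 and 0
print(is_positive_definite(Sigma_singular))  # False
```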
What about LDA? There, we assume \(\Sigma\) is the same for all classes. The sample covariance matrix in LDA, then, is:

\[ \hat{\Sigma} = \frac{1}{n}\sum_C \sum_{i:\,y_i = C} (X_i - \hat{\mu}_C)(X_i - \hat{\mu}_C)^T = \frac{1}{n}\sum_C n_C\,\hat{\Sigma}_C. \]

So what we’re basically doing is taking a weighted average of the covariance matrices of the classes, weighted by the number of points in each class. This is called the pooled within-class covariance matrix: “pooled” because we sum over all classes, and “within-class” because each point is centered at its own class mean \(\hat{\mu}_C\), which tends to make the covariance smaller than centering at the overall mean would. Once we have this matrix, we just use it for every class.
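Continuing the earlier sketch (again with hypothetical names), the pooled within-class covariance can be computed directly by centering each point at its own class mean:

```python
import numpy as np

def pooled_covariance(X, y):
    """Pooled within-class covariance: sum the per-class outer products,
    each taken about its own class mean, then divide by the total n."""
    X, y = np.asarray(X), np.asarray(y)
    n, d = X.shape
    Sigma = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        diffs = Xc - Xc.mean(axis=0)   # subtract the within-class mean
        Sigma += diffs.T @ diffs       # equals n_C * Sigma_hat_C
    return Sigma / n
```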
QDA¶
Now that we’ve used MLE to fit our best-fit Gaussians, we turn to prediction.
The big idea, of course, is to choose the class \(C\) that maximizes the posterior, or equivalently maximizes \(f(X=x|Y=C)\,\pi_C\): the product of the class-conditional density and the prior. In QDA, this is equivalent to maximizing the quadratic discriminant function \(Q_C(x)\), the logarithm of that product with the class-independent constant dropped:

\[ Q_C(x) = -\frac{1}{2}(x-\mu_C)^T\Sigma_C^{-1}(x-\mu_C) - \frac{1}{2}\ln|\Sigma_C| + \ln \pi_C. \]
So we just compute \(Q_C(x)\) for each class and pick the class with the highest discriminant value.
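A sketch of the prediction step, reusing the hypothetical `params` dictionary from the `fit_qda` sketch above (names are illustrative, not from the lecture):

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, prior):
    """Q_C(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 ln|Sigma| + ln(prior)."""
    diff = x - mu
    q = diff @ np.linalg.solve(Sigma, diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * q - 0.5 * logdet + np.log(prior)

def qda_predict(x, params):
    """Return the class with the largest quadratic discriminant function."""
    return max(params,
               key=lambda c: qda_discriminant(x, params[c]["mean"],
                                              params[c]["cov"],
                                              params[c]["prior"]))
```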
In the two-class case, it’s even simpler: our decision function is \(Q_C(x) - Q_D(x)\), which is quadratic in \(x\). However, its quadratic term may be indefinite, because we are subtracting two positive semidefinite matrices (\(\Sigma_C^{-1}\) and \(\Sigma_D^{-1}\)). The Bayes decision boundary \(\{x : Q_C(x) - Q_D(x) = 0\}\) is always the solution set of a multivariate quadratic equation. We can also recover the posterior probability as \(P(Y=C|X=x) = s(Q_C(x) - Q_D(x))\), where \(s\) is the logistic (sigmoid) function.
Let’s take a look at this graphically. Below is a graph of the two PDFs that we fit to our classes, \(f_C(x)\) and \(f_D(x)\) (the x and y axes are the features; the z axis is the PDF value).
This example actually has a hyperbolic decision boundary, which is possible because quadrics in 2D are conic sections. This means the set of points predicted to be in one class isn’t connected: it consists of two disjoint regions!
Now, if we plot the decision function \(Q_C(x) - Q_D(x)\), we see that its zero level set, the decision boundary, is indeed a hyperbola:
The decision boundary is much clearer in this view: we don’t need the Gaussian PDFs to find it. Also, since the decision function is a quadratic polynomial, it can be evaluated without computing any exponentials, so the boundary can be computed much more quickly.
If we want the posterior probability, i.e. the probability that our prediction is correct, we just pass the decision function through the sigmoid: \(P(Y=C|X=x) = s(Q_C(x) - Q_D(x))\).
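A minimal sketch of this two-class posterior, reusing the hypothetical `qda_discriminant` and `params` from the earlier sketches (all names are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(t):
    """Logistic function s(t) = 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + np.exp(-t))

def two_class_posterior(x, params, C, D):
    """P(Y = C | X = x) = s(Q_C(x) - Q_D(x)) in the two-class case.

    Depends on qda_discriminant() defined in the earlier sketch.
    """
    decision = (qda_discriminant(x, params[C]["mean"], params[C]["cov"], params[C]["prior"])
                - qda_discriminant(x, params[D]["mean"], params[D]["cov"], params[D]["prior"]))
    return sigmoid(decision)
```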