\documentclass[10pt]{article}
\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{graphicx}
\usepackage{ulem}
\usepackage{framed}
\usepackage{microtype}
\usepackage{booktabs}
\usepackage{subfigure}
\usepackage{hyperref}
\usepackage{tabularx}
\usepackage[capitalise,noabbrev]{cleveref}
\usepackage[usenames,dvipsnames]{xcolor}
\newcommand{\theHalgorithm}{\arabic{algorithm}}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\usepackage[letterpaper, margin=1in]{geometry}
\sloppy
\renewcommand{\familydefault}{ppl}
\newcommand{\bx}{{\boldsymbol x}}
\newcommand{\bX}{\mathbf {X}}
\newcommand{\bz}{{\boldsymbol z}}
\newcommand{\bh}{{\boldsymbol h}}
\newcommand{\bw}{{\boldsymbol w}}
\newcommand{\bK}{\mathbf{K}}
\newcommand{\bmu}{{\boldsymbol \mu}}
\newcommand{\balpha}{{\boldsymbol \alpha}}
\newcommand{\bbeta}{{\boldsymbol \beta}}
\newcommand{\bSigma}{{\boldsymbol \Sigma}}
\newcommand{\btheta}{{\boldsymbol \theta}}
\newcommand{\reals}{{\mathbb R}}
\DeclareMathOperator*{\E}{{\mathbb E}}
\DeclareMathOperator*{\normal}{{\cal N}}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\AND}{AND}
\DeclareMathOperator*{\OR}{OR}
\DeclareMathOperator*{\XOR}{XOR}
\DeclareMathOperator*{\sign}{sign}
\newcommand{\solution}[1]{{\color{PineGreen} \textbf{Solution:} #1}}
\newcommand{\todo}[1]{\textbf{\textcolor{MidnightBlue}{to do: #1}} }
\begin{document}
\section*{Machine Learning Fall 2015 Homework 4}
Make sure to explain your reasoning or show your derivations. Except for answers that are especially straightforward, you will lose points for unjustified answers, even if they are correct.
\section*{General Instructions}
Submit your homework electronically on Canvas. We recommend using LaTeX, especially for the written problems. But you are welcome to use anything as long as it is neat and readable.
Include a README file that describes all other included files. Indicate which files you modified. You are welcome to create additional functions, files, or scripts, and you are also welcome to modify the included interfaces for existing functions if you prefer a different organization.
Since we will work on some of the homework in class, clearly indicate which parts of your submitted homework are work done in class and which are your own work.
Relatedly, cite all outside sources of information and ideas. \textbf{List any students you discussed the homework with.}
\section*{Written Problems}
\begin{enumerate}
\item We will derive the expectation-maximization (EM) algorithm using \textit{variational} analysis. Let $X = \{\bx_1, \bx_2, \ldots, \bx_n\}$ be a set of data vectors. Let $Z = \{z_1, \ldots, z_n\}$ be a set of multinomial variables, where $z_i$ indicates which Gaussian generated $\bx_i$. Let the Gaussian parameters be means $\{\bmu_1, \ldots, \bmu_K\}$ and covariance matrices $\{\bSigma_1, \ldots, \bSigma_K\}$. Let $\Theta = \{\theta_1, \ldots, \theta_K\}$ be the multinomial prior probabilities over which Gaussian each data point is sampled from.
Each data point is generated by first sampling a Gaussian index from $p(z | \Theta)$, then sampling from the Gaussian $\normal(\bmu_z, \bSigma_z)$. The log likelihood of any observations given these mixture model parameters is
\begin{align}
L(X, \bmu, \bSigma, \Theta) &= \sum_{i=1}^n \log \left(\sum_{k = 1}^K \theta_k \normal(\bx_i | \bmu_k, \bSigma_k) \right)
\label{eq:likelihood}
\end{align}
Since we will never observe any $z$ variables, these are considered \textit{hidden} or \textit{latent} variables. When computing the likelihood of observed variables, we sum over all possible states of the latent variables, weighted by the probability of those states.
\begin{enumerate}
\item (2 points) We start by doing something a little weird. We create an independent distribution $q$ for the latent variables such that
\begin{align}
q(z_1, \ldots, z_n) &:= q(z_1) q(z_2) \ldots q(z_n).
\end{align}
Rewrite the log likelihood from \cref{eq:likelihood} so that each data point's likelihood is multiplied by the $q$ distribution divided by itself. In other words, multiply the terms in the innermost summation by $\frac{q(z_i = k)}{q(z_i = k)}$.
\item (6 points) Jensen's inequality guarantees that for any convex function $\varphi$ and any random variable $X$,
\begin{align}
\varphi\left( \E \left[ X \right] \right) \le \E \left[ \varphi(X) \right]
\label{eq:jensen}
\end{align}
Use Jensen's inequality and the fact that $\log$ is a \textbf{concave} function (i.e., $-\log$ is a convex function) to form a lower bound on the log likelihood. You should get an expression that looks like a sum of ``cross-entropies'' and entropies.
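Before working through the derivation, it can help to convince yourself numerically that the inequality flips for the concave $\log$, i.e., $\E[\log X] \le \log \E[X]$. The following quick sanity check (not part of the required derivation) verifies this on random positive samples:

```python
import math
import random

random.seed(0)
xs = [random.uniform(0.5, 5.0) for _ in range(10000)]  # positive samples

mean_of_log = sum(math.log(x) for x in xs) / len(xs)   # E[log X]
log_of_mean = math.log(sum(xs) / len(xs))              # log E[X]

# For concave log, Jensen's inequality flips: E[log X] <= log E[X].
assert mean_of_log <= log_of_mean
```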
\item (6 points) You now have a lower bound objective function that depends on the Gaussian mixture model parameters $\bmu$, $\bSigma$, and $\Theta$ and the variational distribution $q$. If we hold $q$ fixed, maximizing the Gaussian mixture model parameters gets us the Gaussian EM updates (the so-called ``m-step''). You will prove this for one of the mixture parameters (you are welcome to try the other parameters too for fun). Show that maximizing the lower bound with respect to the cluster mixture probabilities $\Theta$ is exactly the EM update for these variables. You will need to use a Lagrange multiplier to enforce that $1 = \sum_{k=1}^K \theta_k$.
\item (6 points) Show how to find the $q$ parameters for each data point that maximize the lower bound. Hints: You'll need to use Lagrange multipliers to enforce that $1 = \sum_k q(z_i = k)$, and you should be able to consider each data point's $q$ distribution independently.
\end{enumerate}
\item (10 points) Project proposal. See instructions on project homepage.
\end{enumerate}
\section*{Programming Assignment}
For this programming assignment, we have provided a lot of starter code. Your task is to complete the code in a few specific places; this will require you to read and understand most of the provided code, but to write only a small amount of code yourself.
The script \texttt{makeSyntheticData.m} generates synthetic data using the following procedure:
\begin{itemize}
\item First, it generates a 2D random Gaussian mixture model with five Gaussians.
\item It then generates $n = 2000$ 2D data points from the mixture model.
\item It projects the 2D data into 64-dimensional data using a random projection matrix.
\item Finally, it adds random noise in 64-dimensional space to the projected data.
\end{itemize}
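The procedure above can be sketched in Python/NumPy (a hypothetical analogue of \texttt{makeSyntheticData.m}; the noise scale and mixture parameters here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, d_low, d_high = 2000, 5, 2, 64

# Random 2D mixture model: means, covariances, and mixing weights.
means = rng.normal(scale=5.0, size=(K, d_low))
covs = np.stack([np.eye(d_low) * rng.uniform(0.5, 2.0) for _ in range(K)])
theta = rng.dirichlet(np.ones(K))

# Sample a component index for each point, then sample from that Gaussian.
z = rng.choice(K, size=n, p=theta)
X2 = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])

# Project into 64 dimensions and add random noise in the high-dim space.
P = rng.normal(size=(d_low, d_high))
X = X2 @ P + rng.normal(scale=0.1, size=(n, d_high))
```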
The resulting data comes from a 2D mixture model, but it is presented in 64 dimensions. You will use PCA and expectation maximization to attempt to fit a mixture model to the data.
The main experiment script is \texttt{runSyntheticExperiment.m}. It first uses PCA to transform the data into the two most descriptive dimensions. Then it runs expectation maximization using different numbers of Gaussians and records the log-likelihood of held-out validation data for each cluster count parameter.
Since MATLAB includes the functionality to do PCA and Gaussian mixture fitting, you will not be allowed to submit code that uses these built-in functions. You are, however, allowed to use the built-in numerical linear algebra functions \texttt{det}, \texttt{eig}, \texttt{svd}, \texttt{eigs}, \texttt{svds}, or \texttt{chol} (you don't need all of these, but you may want to use one or two).
\begin{enumerate}
\item (10 points) Complete \texttt{myPca.m}. This function should take a data matrix as input and return a transformed data matrix and the variance of the transformed data. The dimensions of the transformed data should be ordered such that the earlier indices have the highest variance. Using this ordering, one should be able to truncate to a lower-dimensional space while preserving the highest reconstruction accuracy by using the first few dimensions of the returned data.
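For intuition (and not as a substitute for your MATLAB implementation), the required behavior can be sketched in Python via the SVD of the centered data matrix:

```python
import numpy as np

def my_pca(X):
    """Return the data projected onto its principal axes, ordered by
    decreasing variance, along with the variance of each projected axis."""
    Xc = X - X.mean(axis=0)                         # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = Xc @ Vt.T                                   # rotated (transformed) data
    var = s**2 / (X.shape[0] - 1)                   # variance along each axis
    return T, var
```

Because the singular values come out sorted, truncating to the first few columns of the returned matrix keeps the highest-variance dimensions, which is exactly the ordering property the assignment asks for.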
\item (12 points) Complete \texttt{gmm.m}. This function should take a data matrix and a number of clusters as input and fit the Gaussians using expectation maximization. The main framework for the algorithm is set up for you already. You need to complete the code for updating the parameters (the Gaussian means and covariances and the cluster priors) and the latent variable probabilities (the cluster assignments).
One important note for fitting Gaussians is that sometimes we can get degenerate data that creates numerically unstable covariances. To avoid this problem, one fix is to add a small constant to the diagonal of the estimated covariance matrix. The matrix \texttt{reg} in \texttt{gmm.m} is set up for you to do this. You can argue that this trick is regularization, or if you're more honest, it's just a reasonable hack to make things work.
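In the same spirit as the \texttt{reg} matrix, a hypothetical Python version of a responsibility-weighted covariance update with a diagonal regularizer might look like this (the function name and the value of \texttt{eps} are made up for illustration):

```python
import numpy as np

def regularized_cov(X, mu, resp, eps=1e-6):
    """Responsibility-weighted covariance for one mixture component, with a
    small constant added to the diagonal to keep it numerically stable."""
    diff = X - mu                                   # (n, d) deviations from the mean
    w = resp / resp.sum()                           # normalized responsibilities
    cov = (w[:, None] * diff).T @ diff              # weighted scatter matrix
    return cov + eps * np.eye(X.shape[1])           # diagonal regularizer
```

With uniform responsibilities and the sample mean, this reduces to the ordinary (biased) sample covariance plus the ridge term.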
\item (8 points) Complete \texttt{gmmLL.m}. This function should take a data matrix and Gaussian mixture model parameters as input and output a log likelihood. The formula for the log likelihood is
\begin{equation}
\begin{aligned}
\mathrm{ll}(\bX, \bmu, \bSigma, \btheta) &= \log \prod_{i=1}^n p(\bx_i | \bmu, \bSigma, \btheta)\\
&= \log \prod_{i=1}^n \sum_{k = 1}^K \theta_k \normal(\bx_i | \bmu_k, \bSigma_k)\\
&= \sum_{i=1}^n \log \left(\sum_{k = 1}^K \theta_k \normal(\bx_i | \bmu_k, \bSigma_k) \right)\\
&= \sum_{i=1}^n \log \left(\sum_{k = 1}^K \frac{\theta_k}{\sqrt{(2\pi)^d|\bSigma_k|}} \exp\left(- \frac{1}{2} (\bx_i - \bmu_k)^\top \bSigma_k^{-1} (\bx_i - \bmu_k)\right) \right),
\end{aligned}
\end{equation}
where $d$ is the dimensionality of the data.
With some manipulation, you should be able to use the log-sum-exp trick to compute the terms in the summation while avoiding numerical underflow.
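One way to see the trick (a Python sketch, not the required MATLAB code): subtract the per-point maximum log term before exponentiating, then add it back outside the log, so that at least one exponent is exactly zero and nothing underflows.

```python
import numpy as np

def log_sum_exp(log_terms, axis=1):
    """Compute log(sum(exp(log_terms))) stably along the given axis."""
    m = log_terms.max(axis=axis, keepdims=True)    # per-row maximum log term
    return (m + np.log(np.exp(log_terms - m).sum(axis=axis,
                                                 keepdims=True))).squeeze(axis)

# Example: log terms so negative that naive exponentiation underflows to zero.
log_probs = np.array([[-1000.0, -1001.0, -1002.0]])
naive = np.log(np.exp(log_probs).sum(axis=1))      # -inf: exp underflowed
stable = log_sum_exp(log_probs)                    # finite, approx -999.59
```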
\end{enumerate}
\begin{table}[h]
\caption{Included files and brief descriptions. Files whose description is marked \textbf{you will complete} are placeholders for code you should finish; you should not need to implement or modify the other files.}
\label{tab:files}
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabularx}{\textwidth}{lX}
\toprule
\textbf{File Path} & \textbf{Description}\\
\midrule
\textbf{synthData.mat} & Synthetic data of 64-dimensional points\\
\textbf{trueData.mat} & Synthetic 2D data that was used to generate the 64-dimensional points. Included for reference if you're interested in comparing.\\
\textbf{src/gaussianLL.m} & Script that computes the log likelihood of points under a Gaussian distribution\\
\textbf{src/gmm.m} & Function \textbf{you will complete} that fits a Gaussian mixture model to data using expectation maximization\\
\textbf{src/gmmLL.m} & Function \textbf{you will complete} that computes the log likelihood of data under a Gaussian mixture distribution\\
\textbf{src/makeSyntheticData.m} & Script that generates the synthetic data. You do not need to run this script since \texttt{synthData.mat} saves the relevant output.\\
\textbf{src/myPca.m} & Function \textbf{you will complete} that finds an orthogonal basis for the input data such that early dimensions capture the most variance in the data.\\
\textbf{src/plotGMM.m} & Function that plots ellipses representing the Gaussians in a mixture model.\\
\textbf{src/runSyntheticExperiment.m} & Script that runs the main experiment. It runs PCA and EM using different numbers of clusters and reports the log likelihoods.\\
\textbf{src/sampleGaussian.m} & Script that samples data from a multivariate Gaussian distribution. You won't need this, but it's used to generate the synthetic data.\\
\textbf{src/sampleMultinomial.m} & Script that samples data from a multinomial distribution. You won't need this, but it's used to generate the synthetic data.\\
\bottomrule
\end{tabularx}
\end{center}
\end{table}
\end{document}