\documentclass[10pt]{article}
\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{graphicx}
\usepackage{ulem}
\usepackage{framed}
\usepackage{microtype}
\usepackage{booktabs}
\usepackage{subfigure}
\usepackage{hyperref}
\usepackage{tabularx}
\usepackage[capitalise,noabbrev]{cleveref}
\usepackage[usenames,dvipsnames]{xcolor}
\newcommand{\theHalgorithm}{\arabic{algorithm}}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\usepackage[letterpaper, margin=1in]{geometry}
\usepackage{todonotes}
\sloppy
\renewcommand{\familydefault}{ppl}
\newcommand{\bx}{{\boldsymbol x}}
\newcommand{\bX}{\mathbf {X}}
\newcommand{\bz}{{\boldsymbol z}}
\newcommand{\bh}{{\boldsymbol h}}
\newcommand{\bw}{{\boldsymbol w}}
\newcommand{\bK}{\mathbf{K}}
\newcommand{\bmu}{{\boldsymbol \mu}}
\newcommand{\balpha}{{\boldsymbol \alpha}}
\newcommand{\bbeta}{{\boldsymbol \beta}}
\newcommand{\bSigma}{{\boldsymbol \Sigma}}
\newcommand{\btheta}{{\boldsymbol \theta}}
\newcommand{\reals}{{\mathbb R}}
\DeclareMathOperator*{\E}{{\mathbb E}}
\DeclareMathOperator*{\normal}{{\cal N}}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\AND}{AND}
\DeclareMathOperator*{\OR}{OR}
\DeclareMathOperator*{\XOR}{XOR}
\DeclareMathOperator*{\sign}{sign}
\newcommand{\solution}[1]{{\color{PineGreen} \textbf{Solution:} #1}}
\begin{document}
\section*{Machine Learning Fall 2015 Homework 5}
Make sure to explain your reasoning or show your derivations. Except for answers that are especially straightforward, you will lose points for unjustified answers, even if they are correct.
\section*{General Instructions}
Submit your homework electronically on Canvas. We recommend using LaTeX, especially for the written problems. But you are welcome to use anything as long as it is neat and readable.
Include a README file that describes all other included files. Indicate which files you modified. You are welcome to create additional functions, files, or scripts, and you are also welcome to modify the included interfaces for existing functions if you prefer a different organization.
Since we will work on some of the homework in class, clearly indicate which parts of your submitted homework are work done in class and which are your own work.
Relatedly, cite all outside sources of information and ideas. \textbf{List any students you discussed the homework with.}
\section*{Written Problems}
\begin{enumerate}
\item Consider a hidden Markov model (HMM) with Gaussian emission probabilities. This type of HMM represents the joint distribution of a sequence of real-valued observations $Y = \{y_1, \ldots, y_T\}$, where each observation is a real-valued vector $y_t \in \reals^d$, and a sequence of hidden, latent states $X = \{x_1, \ldots, x_T\}$, where each latent state is a multinomial value $x_t \in \{1, \ldots, K\}$. Using a temporal interpretation of the sequence, the HMM's power comes from the assumption that the past is independent of the future given the present.
\begin{enumerate}
\item (2 points) Write the factorized probability distribution for the HMM in terms of (1) the prior state probability $p(x_1)$, (2) the transition probabilities $p(x_t | x_{t-1})$, and (3) the observation probabilities $p(y_t | x_t)$.
\item (2 points) Using the factorized form, write the formula for computing the joint probability of the first observation and the first latent state $p(x_1, y_1)$. In other words, show how marginalizing out all the variables except $x_1$ and $y_1$ leads to a simple formula.
\item (3 points) Letting $\alpha_1(x_1) := p(x_1, y_1)$, and more generally $\alpha_t(x_t) := p(x_t, y_1, \ldots, y_t)$, derive the formula for $\alpha_2$ (i.e., $p(x_2, y_1, y_2)$) in terms of $\alpha_1$. Note that this time you will need to marginalize out $x_1$.
\item (2 points) Show the generalization of the formula you derived in the previous subproblem: the formula for $p(x_t, y_1, \ldots, y_t)$ or $\alpha_t(x_t)$ in terms of $\alpha_{t-1}$. You should get the standard forward update of the famous forward-backward algorithm.
\item (3 points) Returning to the full joint probability, write the formula for $p(y_T | x_{T-1})$ and simplify. We will call this quantity $\beta_{T-1}(x_{T-1})$.
\item (4 points) Write the formula for $\beta_{T-2}(x_{T-2})$, which will need to sum over all values of $x_{T-1}$ (i.e., $p(y_{T-1}, y_{T} | x_{T-2} )$), and then write the general formula for $\beta_{t-1}(x_{t-1})$ (i.e., $p(y_t, \ldots, y_T | x_{t-1})$). You should get the standard backward update of the famous forward-backward algorithm, which uses the values of $\beta_{t}(x_t)$. Show how this formula comes from marginalizing out the irrelevant variables.
\item (3 points) Using $\alpha$ and $\beta$, write the expression for $p(x_t, y_1, \ldots, y_T)$, or the joint probability of any one state $x_t$ and all the observations $y_1, \ldots, y_T$. Then show the formula for the conditional probability $p(x_t | Y)$.
\item (6 points) When running forward-backward, in practice, one must normalize the $\alpha$ and $\beta$ messages to avoid numerical underflow. For forward messages $\alpha$, normalization amounts to computing $p(x_t | y_1, \ldots, y_t)$ instead of the joint probability $p(x_t, y_1, \ldots, y_t)$. There is not as natural an interpretation for the normalized $\beta$ message $\tilde{\beta}_t(x_t) := \beta_t(x_t) / \sum_{x_t'} \beta_t(x_t')$. Show that normalizing the $\alpha$ and $\beta$ messages to $\tilde{\alpha}$ and $\tilde{\beta}$ then normalizing the final product still yields the correct conditional probability $p(x_t | Y)$. Hint: use the conditional definition of $\tilde{\alpha}$, and write the definition of $\tilde{\beta}$ in terms of the unnormalized $\beta$. Use a placeholder $Z_t$ for the normalizing constant of each $\beta_t$ and show that terms cancel in the last stage of the normalization.
\end{enumerate}
\item (5 points) Project status report. Write a 1-2 page summary of your progress so far on the project. Describe what you have accomplished so far on your project, whether any project goals have evolved since the proposal, and what is planned for the rest of the project period.
\end{enumerate}
\section*{Programming Assignment}
For this programming assignment, we have provided a lot of starter code. Your task is to complete the code in a few specific places; this will require you to read and understand most of the provided code, but you will only need to write a small amount of code yourself.
\begin{enumerate}
\item (10 points) Complete \texttt{markovChainTrain.m}, which takes a sequence of data as input and learns the maximum likelihood transitions. Given data $\{x_1, \ldots, x_T\}$ with each variable one of $K$ states, the model should train a prior
\begin{align}
p(x = i) \leftarrow \frac{1}{T + K \alpha} \left( \alpha + \sum_{t = 1}^T I(x_t = i) \right), \forall i
\end{align}
and a transition probability
\begin{align}
p(x_t = i | x_{t-1} = j) \leftarrow \frac{(\textrm{\# time steps where state $j$ transitions to state $i$}) + \alpha}{(\textrm{\# of times process is in state $j$ in the first $T-1$ steps}) + K \alpha}, \forall i, j
\end{align}
Test this function using \texttt{runObama.m}, which loads a speech transcript by President Obama and trains a Markov chain on his word usage. It should generate somewhat convincing text that sometimes reads like real English. Try running the same procedure on other text documents!
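The smoothed counting estimators above can be sketched in Python (the assignment itself uses MATLAB; the function and variable names here, such as \texttt{markov\_chain\_train}, \texttt{seq}, and \texttt{alpha}, are illustrative choices, not the starter code's interface, and states are 0-indexed as in NumPy rather than 1-indexed as in MATLAB):

```python
import numpy as np

def markov_chain_train(seq, K, alpha=1.0):
    """Estimate a smoothed prior and transition matrix from a state
    sequence `seq` with values in {0, ..., K-1}."""
    seq = np.asarray(seq)
    T = len(seq)

    # Prior: p(x = i) = (alpha + count of state i) / (T + K * alpha)
    prior = (alpha + np.bincount(seq, minlength=K)) / (T + K * alpha)

    # Transition counts: trans_counts[j, i] = # of j -> i transitions
    trans_counts = np.zeros((K, K))
    for prev, curr in zip(seq[:-1], seq[1:]):
        trans_counts[prev, curr] += 1

    # Each row j is normalized by (# visits to j in first T-1 steps) + K*alpha
    trans = (trans_counts + alpha) / (
        trans_counts.sum(axis=1, keepdims=True) + K * alpha)
    return prior, trans
```

With $\alpha > 0$, every row of the transition matrix is a proper distribution even for states never visited in the training sequence, which is what lets the sampler keep generating text without hitting zero-probability transitions.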
\item (12 points) Complete \texttt{myHmmInferStates.m}, which performs the forward-backward algorithm to infer the latent state probabilities of a hidden Markov model. Refer to the textbook and the class videos (and what you derived in the written section) for the formulas you'll need to implement. You are responsible for three segments of this algorithm implementation: the forward message-passing, the backward message-passing, and the fusion of the forward and backward messages to compute the marginal probabilities.
Once you get this function working, part of the \texttt{runSyntheticExperiments.m} script will run. The first test this script runs is to evaluate the log likelihood of the synthetic data using the true models they were generated from. If your code is working, you should see that the models that generated the corresponding data sets will have the highest log likelihood (i.e., model 1 should have the highest for data 1, model 2 for data 2, and model 3 for data 3). The rest of the script will not work yet, until you finish the next step.
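Evaluating log likelihoods this way relies on numerically stable summation of probabilities stored in log space, which is what the provided \texttt{logsumexp.m} does. A Python equivalent of that idea (a sketch with illustrative names, not the starter code itself) shifts by the maximum before exponentiating so neither overflow nor underflow occurs:

```python
import numpy as np

def logsumexp(log_vals):
    """Compute log(sum(exp(log_vals))) stably by factoring out the max,
    since log sum_i exp(a_i) = m + log sum_i exp(a_i - m) for m = max_i a_i."""
    m = np.max(log_vals)
    if not np.isfinite(m):  # all entries -inf: the sum is 0, so return -inf
        return m
    return m + np.log(np.sum(np.exp(log_vals - m)))
```

For example, normalizing a forward message kept in log space is just \texttt{log\_alpha - logsumexp(log\_alpha)}, which corresponds to the conditional $p(x_t \mid y_1, \ldots, y_t)$ discussed in the written problems.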
\item (6 points) Complete \texttt{myHmmTrain.m}. You are only responsible for a small snippet of this function: the M-step update for the Gaussian parameters. It will be very helpful to understand the rest of the provided implementation to write this snippet.
Once you complete this step, you will be able to run the full synthetic experiment. The experiment trains HMMs on the three data types, then evaluates the log likelihood of each data set (train and test) under each learned HMM. If everything is working correctly, you should see significantly higher likelihoods for the correct HMMs. Finally, the script plots the Gaussians of each state from the trained HMMs under the Gaussians of each state from the true HMM that generated the data. The trained Gaussians should look similar to the true Gaussian if your code is working correctly.
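The Gaussian M-step you are asked to write is the standard responsibility-weighted mean and covariance update from EM. A Python sketch of that update (the assignment is in MATLAB; \texttt{gaussian\_m\_step}, \texttt{Y}, and \texttt{gamma} are names chosen here for illustration, where \texttt{gamma[t, k]} plays the role of $p(x_t = k \mid Y)$ from inference):

```python
import numpy as np

def gaussian_m_step(Y, gamma):
    """Update Gaussian emission parameters from data Y (T x d) and
    state responsibilities gamma (T x K)."""
    T, d = Y.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                      # effective count per state
    mu = (gamma.T @ Y) / Nk[:, None]            # weighted means, K x d
    Sigma = np.zeros((K, d, d))
    for k in range(K):
        diff = Y - mu[k]                        # deviations from state mean
        # Responsibility-weighted outer products, normalized by N_k
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return mu, Sigma
```

When the responsibilities are hard assignments (all 0/1), this reduces to the ordinary sample mean and covariance of the points assigned to each state, which is a useful sanity check for your implementation.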
\item (0 points) Your code should run on the real data, which is a time series of four daily weather measurements (low and high temperature, overall precipitation, and snowfall) taken from the Blacksburg NOAA weather station over approximately 30 years. Run the script \texttt{runWeatherExperiment.m} (which should take a few minutes to complete, even if you vectorize your implementations). The script will train HMMs using different numbers of hidden states and evaluate them on held-out data. The script is not set up to run with very high numbers of hidden states, even though the initial experiments indicate that more states will yield a better fit model. Instead, once it sweeps through different numbers of hidden states, it visualizes the trained parameters of the HMM with four hidden states as well as the inferred state probabilities for the test period. The learned states have an intuitive interpretation.
\item (2 points) Prepare a short writeup describing which files you modified and any special instructions for us to understand your submission.
\end{enumerate}
\begin{table}[h]
\caption{Included files and brief descriptions.}
\label{tab:files}
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabularx}{\textwidth}{lX}
\toprule
\textbf{File Path} & \textbf{Description}\\
\midrule
\textbf{synthObservations.mat} & Synthetic observations from HMMs\\
\textbf{synthStates.mat} & States of sampled HMM sequence used to generate observations\\
\textbf{trueModels.mat} & HMM parameters used to generate synthetic data\\
\textbf{src/gaussianLL.m} & Script that computes the log likelihood of points under a Gaussian distribution\\
\textbf{src/loadWordSequence.m} & Function that loads a text file into a sequence of indices from a dictionary of words\\
\textbf{src/logsumexp.m} & Function that computes log(sum(exp(X))) in a numerically stable manner\\
\textbf{src/makeSyntheticData.m} & Script that generates the synthetic data. You do not need to run this script since all the relevant output is saved in mat files\\
\textbf{src/markovChainTrain.m} & Function \textbf{you will complete} to train a Markov model (without hidden variables) on an observed sequence\\
\textbf{src/myHmmInferStates.m} & Function \textbf{you will complete} that performs inference via the forward-backward algorithm on an HMM\\
\textbf{src/myHmmTrain.m} & Function \textbf{you will complete} to train a hidden Markov model on an observed sequence of observations\\
\textbf{src/plotGMM.m} & Function that plots ellipses representing the Gaussians in a mixture model.\\
\textbf{src/runSyntheticExperiment.m} & Script that runs the main HMM experiment.\\
\textbf{src/runObama.m} & Script that runs the Markov chain experiment.\\
\textbf{src/runWeatherExperiment.m} & Script that runs the real-data HMM experiment, training HMMs on Blacksburg weather data.\\
\textbf{src/sampleGaussian.m} & Script that samples data from a multivariate Gaussian distribution. You won't need this, but it's used to generate the synthetic data.\\
\textbf{src/sampleMultinomial.m} & Script that samples data from a multinomial distribution. You won't need this, but it's used to generate the synthetic data.\\
\bottomrule
\end{tabularx}
\end{center}
\end{table}
\end{document}