Machine Learning Fall 2017 Notes

This high-level outline includes the major takeaway points from the topics we covered so far this semester.

Types of Machine Learning

Supervised learning
- Learner gets examples of inputs and outputs
- naive Bayes, decision trees, logistic regression, SVM, neural nets, batch perceptron
Unsupervised learning
- Learner doesn’t get examples of output
- Use other insights to extract knowledge from data.
Online learning
- Learner gets one example at a time, must make decision on that example, then gets to update model once the label is revealed
- Learner is evaluated while it is learning
- E.g., perceptron
Parametric model: finite dimensional parameter describing model
- Can still have very high complexity (e.g., large neural networks or rich kernels)
Non-parametric models: arbitrary complexity models, can have infinite complexity to fit infinite data
- Still usually have parameters to control complexity (otherwise non-parametric models would never work)

General Themes

Modularity shows up a lot
- Loss functions for optimization-based methods
- Squashing functions for neural nets
- Kernels for SVM
- Features are exchangeable
Optimization approach to learning:
- formulate learning as optimizing an objective function
- solve with general-purpose optimizer

Model Selection

Usually measure performance with loss function
Tradeoff between bias and variance
- bias: error caused by using the wrong model class
- variance: error caused by randomness in sample
If we use a really complex model, we can do really well on training data
- low bias, high variance
- our excellent training-time performance is not a good estimate of how we’ll do at test time
If we use a really simple model, we might do poorly on training data
- high bias, low variance
- good estimate of how we’ll do at test time, but we know we won’t do well
Regularization to control model complexity
Cross-validation to estimate true risk, find best tradeoff

Naive Bayes

Independence assumption: observations independent given label
Can substitute various probability functions (like multinomial, Bernoulli, or Gaussian distributions)
A simple Bayesian network

Decision Trees

Training: greedily split input space (and data set) by criterion
- information gain
- misclassification rate
- Gini index (expected misclassification rate)
Prediction: recurse down subtree according to decision rule
Pruning:
- faster to train a full tree, then prune away subtrees
- one strategy is to prune to obtain the best held-out validation performance

Logistic Regression

Linear classifier
Logistic function (sigmoid) to squash to zero-one range
Maximum likelihood training via convex programming
Stochastic methods work too
- Repeat: randomly sample mini batch of examples, and take gradient step for those examples
Multiclass version exponentiates linear score and normalizes

Perceptron

Another linear classifier
Online update
- Binary class:
  - add input features to w if wrong and label is positive
  - subtract input features to w if wrong and label is negative
  - do nothing if correct
- Multi class:
  - If wrong: add input features to correct class weight vector
  - If wrong: subtract input features to incorrect class weight vector
Mistake bound: guaranteed maximum number of updates, scales with margin and input bound

Neural Networks

Each node takes input activations, combines them and squashes
We studied feed-forward structures
- layers of “units”, fully-connected between layers
Train by gradient methods on loss function of output
- e.g., log likelihood
Back propagation: chain rule of gradients
- save activation values during forward propagation
- use saved activation values to compute gradients
- propagate error gradient backwards through weights
Modular: choose any differentiable squashing function
Get non-linearity, but lose convexity
- local solutions are usually good

Optimization

Convex objective functions are nice
- (sub)gradient descent
- line search, conjugate gradient, second-order methods
“Efficient” non-convex methods:
- local optimization
- bound minimization
Methods for handling constraints:
- Lagrange or KKT multipliers: adversarial terms to make constraint satisfaction a game
  - Technical jargon: Lagrange multipliers are for equality constraints, KKT for inequality constraints
- Penalty/barrier functions

Duality

Generally: a dual is an alternate optimization that tells you something about the original, primal, problem
Usual scenario #1: solution to dual allows you to solve for the solution of the primal
Usual scenario #2: solution to dual allows you to solve for an approximate solution of the primal
Lagrangian dual for constrained optimization:
- take constraints from original problem, write Lagrangian (or KKT) form
- solve objective for primal variables
- remaining optimization over KKT multipliers is a dual problem

Support Vector Machines

Large margin linear classifier
Solution found via quadratic optimization
Each labeled example is a constraint: linear classifier score must correctly classify by at least 1.0 (or –1.0)
Minimizing norm of weight vector is equivalent to maximizing the margin
Slack: change constraints to have nonnegative slack variables
- penalize slack variable
- Equivalently, express objective with hinge-loss on misclassification
Learning theoretic analysis shows large margin reduces model complexity (VC dimension)
Dual SVM is also a quadratic program
- solve directly for KKT multipliers on misclassification constraints
- Form of dual QP allows kernelization, since all input data appears in inner products
- dual variables will be sparse. Nonzero values for the support vectors, which are the inputs that affect the solution
SMO: do closed-form optimization of dual objective on two alphas at a time
Stochastic: take gradient step of SVM objective for small mini-batch

Kernels

A way to take a linear classifier and make it non-linear
replace all inner products with kernel functions
- kernel function: inner product in some vector space
- ideally efficiently computable without explicitly doing inner products, especially if high dimensional
Example kernels
- polynomial: take input features and group them into different polynomial terms
  - can be efficiently computed implicitly by taking linear kernel (K + 1).^p
- radial basis function (Gaussian): kernel value is Gaussian density of point x with y as mean
  - implies an infinite dimensional vector space (via a Taylor expansion of the kernel value)
  - easier to think about as the kernel function than the infinite dimensional vector space

Learning Theory

Formalize analysis, theorems given mathematical assumptions
Ex. 1: mistake bounds for online learning
- perceptron will make a finite number of mistakes if data is separable by a margin and bounded
Ex. 2: generalization analysis, risk bounds
- Bound with high probability the true risk
- Bound includes the empirical risk and some measure of model complexity
- Related to bias-variance
- Vapnik-Chervonenkis dimension:
  - how many points can the hypothesis class shatter
  - label in all possible binary configurations
- Smaller complexity -> smaller gap between empirical risk and true risk

Principal Components Analysis

Realign data vectors along new basis
New dimensions ordered by amount of variance
(Usually) truncate to first few dimensions, removing noise from directions of less variance
Computed by eigendecomposition of covariance matrix
- Or equivalently, singular-value decomposition of data matrix

EM for Gaussian Mixture Models

Mixture model: randomly sample from set of sub-distributions, then sample from chosen distribution
Gaussian mixture model: each sub-distribution is a Gaussian
Often used for clustering
Latent variables: which Gaussian each data point came from
- never observed, so we keep track of it as a random variable from a multinomial probability
Latent variables make log-likelihood non-convex
Expectation maximization:
- E-step: given current Gaussian parameters and prior, estimate the probability of each latent variable
- M-step: given current latent variable probabilities, set Gaussian parameters to maximize the expected log likelihood under those distributions.

Variational EM and K-Means

EM can be interpreted as maximizing a lower bound on the true log likelihood
First scale each setting in the expectation summation by a variational distribution q divided by itself (so 1.0)
Bound comes from applying Jensen’s inequality when we move the log from outside to inside the expectation
Limiting q distribution to independent multinomials for each example gets the standard EM algorithm
Limiting q to point distributions (and making the Gaussians spherical) gets k-means
K-means:
- E-step: assign all points to their nearest mean
- M-step: recompute mean of points assigned to each cluster
Also a bound on log-likelihood, but has discrete point distributions, so optimization isn’t as smooth

Bayesian Networks

Describe conditional independence relationships with a directed acyclic graph
Nodes are variables
Edges between variables
The factorized probability distribution includes the conditional probability of child given its parents
Explaining away
- observing a child makes the parents dependent
- but if child is unobserved, parents are independent
Markov blanket: variables that need to be observed to make a variable independent of all other variables
- Children, parents
- Parents of children
Variable elimination algorithm for exact inference
- dynamic programming method to avoid redundant computation
- choose variable to eliminate
- sum out all settings of variable over relevant factors
- compute new factor
  - function of any other variables in summed out factors
- rewrite probability using new factor

Hidden Markov Models

Special structured Bayes nets
Used for modeling sequences, often time series
Hidden states form a Markov chain
- past is independent of future given present
Observed values depend on hidden state
Forward-backward inference
- compute forward messages: probabilities of state and previous evidence
- compute backward messages: probability of future evidence given state
- combine and normalize to get marginal probabilities

Markov Random Fields

Undirected graph to describe independence structure
Probability factorized into potential functions for each clique of the graph
Can convert Bayes net to MRF, but lose independence of parents if child is unobserved
- connect all mutual parents
Markov blanket in MRFs: all neighbors in graph
Collect and distribute belief propagation in trees is exact
- Choose node to be root
- compute messages from leaves, then leaf-parents, etc. until root has all messages from entire tree
- compute outgoing messages from root, then from root-children, etc. until leaf has messages from root
- Fuse messages into beliefs
Loopy belief propagation is only an approximation in general, but it works well in practice

Structured Prediction

A structured predictor is a function that outputs a complex object. Typically the object is a set of dependent variable states, such as a graph structure, a tree, a path, etc.
Structured output learning is the task of learning the parameters for a structured predictor.
Aside from probabilistic graphical models, other frameworks for structured prediction typically involve learning parameters for an optimization, such as learning the weights for a minimum-spanning-tree problem.
Structured perceptron tries to enforce that the score of the ground-truth structures are better than the scores for any other possible output.

Reinforcement Learning

Learner receives (1) state, (2) action, (3) result, (4) reward.
(Depending on setting, the learner may also get to choose the action.)
Key challenge: delayed rewards require learner to figure out what states and actions led to the reward.
Most algorithms represent problems as Markov decision processes.
- Pr(s' | s, a): probability of transitioning to next state given the current state and action.
- Allows for randomness as a result of an action.
A policy is a function that reads in a state and returns an action.
We can evaluate a policy by considering the expected reward if the learner uses the policy.
The expected discounted reward scales future rewards by an exponentially decaying weight, allowing us to reason about the expected reward of a learner running for infinite steps into the future.
The expected reward of the optimal policy has a recursuve form, known as the Bellman equation.
Q-learning attempts to learn a function approximating the expected discounted future reward given the current state and an action (the q-function).
Can be updated by taking a gradient step in the loss, which is the difference between the empirical estimate of the q-function and its current value.
The q-function can be approximated by any other function, so modern reinforcement learning is typically done by training a neural network to mimic the q-function.

Deep Learning

Aside from larger datasets and faster computation (e.g., on GPUs), various techniques have been developed to overcome vanishing gradients to allow training of much deeper models than in previous years.
- New activation functions like rectified linear units (ReLUs) partially prevent gradients from vanishing by not squashing one side of the function. Many other variations have been proposed and experimented with.
- Stochastic optimization methods, especially adaptive ones that adjust learning rates per parameter.
- Dropout regularization disables units randomly on gradient updates.
- Batch normalization normalizes the inputs to each layer, rather than just the input to the entire neural network.
Convolutional neural networks are structured such that convolutional layers run linear filters over the input grid, structuring the neural network to be spatially coherent.
Recurrent neural networks output vectors that are fed into itself, allowing the network to retain "memory."
Long short term memory (LSTM) explicitly organizes this so there is a long-term memory vector that is passed onto the next time step, with modules that decide what parts of the memory to remove and replace with new information.
RNNs (and LSTMs) are currently good tools for processing text and other sequental data. ConvNets are good tools for processing images, video, and other spatial data.
Mixing these structures together, we can form models like sequence-to-sequence models, which use a RNN to read in data and output sequences via another RNN.
Generative adversarial networks (GANs) are trained in a setup that includes a generative model and a discriminative model. The discriminator tries to distinguish between generated examples and real data, and the generator tries to create realistic generated data.