Machine Learning Fall 2015 Notes
This high-level outline includes the major takeaway points from the topics we covered this semester.
Types of Machine Learning
- Supervised learning
- Learner gets examples of inputs and outputs
- naive Bayes, decision trees, logistic regression, SVM, neural nets, least-squares regression, batch perceptron
- Unsupervised learning
- Learner doesn’t get examples of output
- Clustering (K-means), density estimation (fitting a distribution to data, as in Gaussian mixtures), dimensionality reduction (e.g., PCA)
- (The learning problem is not as well defined as in supervised learning)
- Online learning
- Learner gets one example at a time, must make decision on that example, then gets to update model once the label is revealed
- Learner is evaluated while it is learning
- E.g., perceptron
- Parametric model: a finite-dimensional parameter vector describes the model
- Can still have very high complexity (e.g., large neural networks or rich kernels)
- Non-parametric models: model complexity is not fixed in advance and can grow without bound as the data set grows
- Still usually have parameters to control complexity (otherwise non-parametric models would never work)
General Themes
- Modularity shows up a lot
- Loss functions for optimization-based methods
- Squashing functions for neural nets
- Kernels for SVM
- Graphical model pieces
- Feature representations are swappable, too
- Optimization approach to learning:
- formulate learning as optimizing an objective function
- solve with general-purpose optimizer
Model Selection
- Usually measure performance with loss function
- Tradeoff between bias and variance
- bias: error caused by using the wrong model class
- variance: error caused by randomness in sample
- If we use a really complex model, we can do really well on training data
- low bias, high variance
- our excellent training-time performance is not a good estimate of how we’ll do at test time
- If we use a really simple model, we might do poorly on training data
- high bias, low variance
- good estimate of how we’ll do at test time, but we know we won’t do well
- Regularization to control model complexity
- Cross-validation to estimate true risk, find best tradeoff
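A minimal k-fold cross-validation sketch (numpy; `train` and `loss` are hypothetical placeholders for any learner and any loss function):

```python
import numpy as np

def cross_validate(X, y, train, loss, k=5, seed=0):
    """Estimate true risk by averaging held-out loss over k folds."""
    folds = np.array_split(np.random.RandomState(seed).permutation(len(y)), k)
    scores = []
    for i in range(k):
        held_out = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], y[train_idx])             # fit on k-1 folds
        scores.append(loss(model, X[held_out], y[held_out]))  # score held-out fold
    return np.mean(scores)
```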
Naive Bayes
- Independence assumption: features are conditionally independent given the label
- Can substitute other per-feature distributions, e.g., Gaussians for continuous features
- A simple Bayesian network
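A sketch of the Gaussian variant (numpy; substitutes a per-feature Gaussian for each class, per the bullet above):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class prior plus per-feature Gaussian mean/variance."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(params, X):
    classes, priors, means, variances = params
    # log P(y) + sum_j log N(x_j | mu_yj, var_yj): features independent given label
    log_lik = -0.5 * (np.log(2 * np.pi * variances[None]) +
                      (X[:, None, :] - means[None]) ** 2 / variances[None]).sum(axis=2)
    return classes[np.argmax(np.log(priors)[None] + log_lik, axis=1)]
```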
Decision Trees
- Training: greedily split input space (and data set) by criterion
- information gain
- misclassification rate
- Gini index (expected misclassification rate)
- Prediction: recurse down subtree according to decision rule
- Pruning:
- faster to train a full tree, then prune away subtrees
- one strategy is to prune to obtain the best held-out validation performance
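A sketch of the information-gain split criterion (numpy; a full trainer would score every candidate split with this and recurse greedily):

```python
import numpy as np

def entropy(y):
    """Entropy (in bits) of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(y, left_mask):
    """Reduction in entropy from splitting y by a boolean mask."""
    n, n_left = len(y), left_mask.sum()
    if n_left == 0 or n_left == n:
        return 0.0  # degenerate split
    h_children = (n_left * entropy(y[left_mask]) +
                  (n - n_left) * entropy(y[~left_mask])) / n
    return entropy(y) - h_children
```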
Logistic Regression
- Linear classifier
- Logistic function (sigmoid) to squash to zero-one range
- Maximum likelihood training via convex programming
- Stochastic methods work too
- Repeat: randomly sample a mini-batch of examples, and take a gradient step for those examples
- Multiclass version exponentiates each class's linear score and normalizes (softmax)
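A stochastic sketch of the training loop above (numpy; labels assumed 0/1, fixed step size for simplicity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, lr=0.1, epochs=100, batch=32, seed=0):
    """Maximum-likelihood logistic regression by mini-batch gradient steps."""
    rng = np.random.RandomState(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.choice(len(y), size=batch)       # random mini-batch
        p = sigmoid(X[idx] @ w)                    # predicted P(y=1 | x)
        grad = X[idx].T @ (p - y[idx]) / batch     # gradient of negative log-likelihood
        w -= lr * grad
    return w
```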
Perceptron
- Another linear classifier
- Online update
- Binary class:
- add input features to w if wrong and label is positive
- subtract input features from w if wrong and label is negative
- do nothing if correct
- Multiclass:
- If wrong: add input features to correct class weight vector
- If wrong: subtract input features to incorrect class weight vector
- Mistake bound: guaranteed maximum number of updates; grows with the bound on input norms and shrinks as the margin grows
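A minimal sketch of the binary-class update rule (numpy; labels in {-1, +1}, so a single `w += y*x` covers both the add and subtract cases):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Online perceptron: update only on mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:   # wrong (or on the boundary)
                w += y_i * x_i         # add for positive label, subtract for negative
    return w
```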
Neural Networks
- Each node takes input activations, combines them linearly, and squashes the result
- We studied feed-forward structures
- layers of “units”, fully-connected between layers
- Train by gradient methods on loss function of output
- Backpropagation: the chain rule applied to gradients
- save activation values during forward propagation
- use saved activation values to compute gradients
- propagate error gradient backwards through weights
- Modular: choose any differentiable squashing function
- Get non-linearity, but lose convexity
- local solutions are usually good
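A minimal backprop sketch for one gradient computation (numpy; a two-layer net with sigmoid hidden units, a linear scalar output, and squared-error loss, all simplifying choices):

```python
import numpy as np

def mlp_grad(x, y, W1, w2):
    """One forward/backward pass. W1: (H, D) hidden weights, w2: (H,) output weights."""
    # forward propagation: save activations for the backward pass
    h = 1.0 / (1.0 + np.exp(-(W1 @ x)))   # sigmoid hidden activations
    out = w2 @ h                          # scalar prediction
    # backward propagation: chain rule using the saved activations
    d_out = out - y                       # derivative of 0.5 * (out - y)^2
    d_w2 = d_out * h
    d_pre = d_out * w2 * h * (1.0 - h)    # back through the sigmoid squashing
    d_W1 = np.outer(d_pre, x)
    return d_W1, d_w2
```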
Optimization
- Convex objective functions are nice
- (sub)gradient descent
- line search, conjugate gradient, second-order methods
- “Efficient” non-convex methods:
- local optimization
- bound minimization
- Methods for handling constraints:
- Lagrange or KKT multipliers: adversarial terms to make constraint satisfaction a game
- Technical jargon: Lagrange multipliers are for equality constraints, KKT for inequality constraints
- Penalty/barrier functions
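A minimal sketch of the penalty-function idea in the last bullet (numpy; `grad_f`, `g`, and `grad_g` are hypothetical callables for the objective gradient, a scalar inequality constraint g(x) <= 0, and the constraint gradient):

```python
import numpy as np

def penalty_gd(grad_f, g, grad_g, x0, c=10.0, lr=0.01, steps=1000):
    """Gradient descent on f(x) + c * max(0, g(x))**2, a quadratic penalty
    that activates only when the constraint is violated."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        grad = grad_f(x)
        if g(x) > 0:  # constraint violated: add the penalty's gradient
            grad = grad + 2.0 * c * g(x) * grad_g(x)
        x = x - lr * grad
    return x
```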
Duality
- Generally: a dual is an alternate optimization problem that tells you something about the original (primal) problem
- Usual scenario #1: solution to dual allows you to solve for the solution of the primal
- Usual scenario #2: solution to dual allows you to solve for an approximate solution of the primal
- Lagrangian dual for constrained optimization:
- take constraints from original problem, write Lagrangian (or KKT) form
- solve objective for primal variables
- remaining optimization over KKT multipliers is a dual problem
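A math sketch of the recipe above, for the primal problem min_x f(x) subject to g_i(x) <= 0:

```latex
% Write the Lagrangian with a nonnegative multiplier per inequality constraint:
L(x, \alpha) = f(x) + \sum_i \alpha_i g_i(x), \qquad \alpha_i \ge 0
% Solve the objective for the primal variables:
d(\alpha) = \min_x L(x, \alpha)
% The remaining optimization over multipliers is the dual problem,
% and it lower-bounds the primal (weak duality):
\max_{\alpha \ge 0} d(\alpha) \;\le\; \min_{x :\, g_i(x) \le 0} f(x)
```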
Support Vector Machines
- Large margin linear classifier
- Solution found via quadratic optimization
- Each labeled example is a constraint: the classifier score must be at least +1 for positive examples and at most -1 for negative examples
- Minimizing norm of weight vector is equivalent to maximizing the margin
- Slack: change constraints to have nonnegative slack variables
- penalize slack variable
- Equivalently, express objective with hinge-loss on misclassification
- Learning theoretic analysis shows large margin reduces model complexity (VC dimension)
- Dual SVM is also a quadratic program
- solve directly for KKT multipliers on misclassification constraints
- Form of dual QP allows kernelization, since all input data appears in inner products
- dual variables are sparse: nonzero only for the support vectors, the inputs that determine the solution
- SMO: do closed-form optimization of dual objective on two alphas at a time
- Stochastic: take gradient step of SVM objective for small mini-batch
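A sketch of the stochastic approach in the last bullet (numpy; subgradient steps on the hinge-loss primal objective with single-example batches and a Pegasos-style 1/(lam*t) step size, which is one common choice rather than the only one):

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=100, seed=0):
    """Stochastic subgradient descent on lam/2 * ||w||^2 + mean hinge loss.
    Labels y in {-1, +1}."""
    rng = np.random.RandomState(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)
            margin_ok = y[i] * (w @ X[i]) >= 1   # outside the margin?
            w = (1.0 - eta * lam) * w            # shrink: regularizer gradient
            if not margin_ok:                    # hinge active: subgradient step
                w = w + eta * y[i] * X[i]
    return w
```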
Kernels
- A way to take a linear classifier and make it non-linear
- replace all inner products with kernel functions
- kernel function: inner product in some vector space
- ideally efficiently computable without explicitly doing inner products, especially if high dimensional
- Example kernels
- polynomial: take input features and group them into different polynomial terms
- can be computed efficiently without enumerating the terms: K(x, z) = (x·z + 1)^p, i.e., the elementwise power (K + 1)^p of the linear-kernel Gram matrix (see the kernel sketches after this list)
- radial basis function (Gaussian): kernel value is Gaussian density of point x with y as mean
- implies an infinite dimensional vector space (via a Taylor expansion of the kernel value)
- easier to think about as the kernel function than the infinite dimensional vector space
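Minimal sketches of the two example kernels (numpy; a kernelized classifier would call these wherever it previously used inner products):

```python
import numpy as np

def polynomial_kernel(x, z, p=2):
    """Implicit inner product over all polynomial terms up to degree p."""
    return (x @ z + 1.0) ** p

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel: unnormalized Gaussian density of x with mean z.
    The corresponding feature space is infinite dimensional."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
```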
Regression
- Regression: technically learning to predict any output from any input
- Usually used to distinguish between continuous output (regression) and discrete output (classification)
- Change in loss function, e.g., least-squares loss
- minimize squared error between predicted output and true output
- closed-form solution w = (X^T X)^(-1) X^T y requires inverting the matrix X^T X (see the least-squares sketch at the end of this section)
- SVM regression
- find tube around hyperplane (hypertube?) that includes all points
- constraints hold points inside the tube, can be slackened
- can be kernelized
- Neural network regression: plug in a regression loss function (like squared error) and backpropagate
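The closed-form least-squares solution mentioned above, as a sketch (numpy; `solve` factors X^T X rather than forming an explicit inverse):

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations (X^T X) w = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```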
Principal Components Analysis
- Realign data vectors along new basis
- New dimensions ordered by amount of variance
- (Usually) truncate to first few dimensions, removing noise from directions of less variance
- Computed by eigendecomposition of covariance matrix
- Or equivalently, singular-value decomposition of data matrix
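A minimal PCA sketch matching the bullets above (numpy):

```python
import numpy as np

def pca(X, k):
    """Top-k principal directions via eigendecomposition of the covariance."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = Xc.T @ Xc / (len(X) - 1)           # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]      # dimensions ordered by variance
    W = eigvecs[:, top]
    return W, Xc @ W                         # basis and projected data
```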
EM for Gaussian Mixture Models
- Mixture model: randomly sample from set of sub-distributions, then sample from chosen distribution
- Gaussian mixture model: each sub-distribution is a Gaussian
- Often used for clustering
- Latent variables: which Gaussian each data point came from
- never observed, so we track it as a random variable with a multinomial distribution
- Latent variables make log-likelihood non-convex
- Expectation maximization:
- E-step: given current Gaussian parameters and prior, estimate the probability of each latent variable
- M-step: given current latent variable probabilities, set Gaussian parameters to maximize the expected log likelihood under those distributions.
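A minimal EM sketch for a Gaussian mixture (numpy; spherical covariances are a simplifying assumption to keep the code short, not part of the notes above):

```python
import numpy as np

def em_gmm(X, k, iters=50, seed=0):
    """EM for a mixture of spherical Gaussians."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)].astype(float)  # init means at data points
    var = np.full(k, X.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior probability of each latent assignment
        sq = ((X[:, None, :] - mu[None]) ** 2).sum(axis=2)            # (n, k)
        log_r = (np.log(pi) - 0.5 * d * np.log(2 * np.pi * var))[None] - sq / (2 * var)[None]
        log_r -= log_r.max(axis=1, keepdims=True)                     # for stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: refit prior, means, variances to maximize expected log likelihood
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None]) ** 2).sum(axis=2)            # use new means
        var = (r * sq).sum(axis=0) / (d * nk)
    return pi, mu, var
```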
Variational EM and K-Means
- EM can be interpreted as maximizing a lower bound on the true log likelihood
- First multiply each term in the expectation summation by a variational distribution q divided by itself (so, by 1.0)
- Bound comes from applying Jensen’s inequality when we move the log from outside to inside the expectation
- Limiting q distribution to independent multinomials for each example gets the standard EM algorithm
- Limiting q to point distributions (and making the Gaussians spherical) gets k-means
- K-means:
- E-step: assign all points to their nearest mean
- M-step: recompute mean of points assigned to each cluster
- Also a bound on log-likelihood, but has discrete point distributions, so optimization isn’t as smooth
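The same loop with hard assignments gives k-means (numpy sketch):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """K-means as hard-assignment EM: assign, then recompute means."""
    rng = np.random.RandomState(seed)
    mu = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        # "E-step": assign every point to its nearest mean
        dist = ((X[:, None, :] - mu[None]) ** 2).sum(axis=2)
        z = dist.argmin(axis=1)
        # "M-step": recompute the mean of the points assigned to each cluster
        for j in range(k):
            if np.any(z == j):
                mu[j] = X[z == j].mean(axis=0)
    return mu, z
```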
Bayesian Networks
- Describe conditional independence relationships with a directed acyclic graph
- Nodes are variables
- Directed edges encode direct dependence of each child on its parents
- The factorized probability distribution is the product, over variables, of the conditional probability of each child given its parents
- Explaining away
- observing a shared child makes its parents dependent
- but if the child (and its descendants) are unobserved, the parents are independent
- Markov blanket: variables that need to be observed to make a variable independent of all other variables
- Children, parents
- Parents of children (co-parents)
- Variable elimination algorithm for exact inference
- dynamic programming method to avoid redundant computation
- choose variable to eliminate
- sum out all settings of variable over relevant factors
- compute new factor
- function of any other variables in summed out factors
- rewrite probability using new factor
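A toy elimination run on a chain A -> B -> C with made-up binary tables (numpy; eliminate A, then B, leaving the marginal over C):

```python
import numpy as np

p_a = np.array([0.6, 0.4])              # P(A)
p_b_given_a = np.array([[0.7, 0.3],     # rows: A, cols: B
                        [0.2, 0.8]])
p_c_given_b = np.array([[0.9, 0.1],     # rows: B, cols: C
                        [0.4, 0.6]])

# Eliminate A: sum out A over the factors that mention it -> new factor f(B)
f_b = p_a @ p_b_given_a                 # f(B) = sum_a P(a) P(B|a)
# Eliminate B: sum out B over the remaining factors -> the marginal P(C)
p_c = f_b @ p_c_given_b                 # P(C) = sum_b f(b) P(C|b)
print(p_c)                              # sums to 1
```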
Hidden Markov Models
- Special structured Bayes nets
- Used for modeling sequences, often time series
- Hidden states form a Markov chain
- past is independent of future given present
- Observed values depend on hidden state
- Forward-backward inference
- compute forward messages: probabilities of state and previous evidence
- compute backward messages: probability of future evidence given state
- combine and normalize to get marginal probabilities
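A minimal forward-backward sketch (numpy; messages are left unnormalized, so this is only numerically safe for short sequences):

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """pi: (S,) initial state distribution; A: (S, S) transitions with
    A[i, j] = P(next=j | current=i); B: (S, O) emissions; obs: observation indices."""
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))          # forward: P(state_t, evidence up to t)
    beta = np.ones((T, S))            # backward: P(future evidence | state_t)
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                               # combine
    return gamma / gamma.sum(axis=1, keepdims=True)    # normalize to marginals
```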
Markov Random Fields
- Undirected graph to describe independence structure
- Probability factorized into potential functions for each clique of the graph
- Can convert a Bayes net to an MRF, but we lose the independence of parents whose shared child is unobserved
- connect all mutual parents ("moralization")
- Markov blanket in MRFs: all neighbors in graph
- Collect and distribute belief propagation in trees is exact
- Choose node to be root
- compute messages from leaves, then leaf-parents, etc. until root has all messages from entire tree
- compute outgoing messages from root, then from root-children, etc. until leaf has messages from root
- Fuse messages into beliefs
- Loopy belief propagation is only an approximation in general, but it works well in practice
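A collect/distribute sketch on a tiny chain MRF with made-up pairwise potentials (numpy; a chain is a tree, so the fused beliefs are exact):

```python
import numpy as np

# Binary chain x1 - x2 - x3 with one potential per edge; root the tree at x3.
psi12 = np.array([[4.0, 1.0], [1.0, 4.0]])   # potential on edge (x1, x2)
psi23 = np.array([[4.0, 1.0], [1.0, 4.0]])   # potential on edge (x2, x3)

# Collect phase: pass messages from the leaf up to the root
m1_to_2 = psi12.sum(axis=0)                  # sum_{x1} psi12(x1, x2)
m2_to_3 = psi23.T @ m1_to_2                  # sum_{x2} psi23(x2, x3) m1_to_2(x2)

# Distribute phase: pass messages from the root back down
m3_to_2 = psi23.sum(axis=1)                  # sum_{x3} psi23(x2, x3)
m2_to_1 = psi12 @ m3_to_2                    # sum_{x2} psi12(x1, x2) m3_to_2(x2)

# Fuse incoming messages into a belief (exact marginal on a tree)
b2 = m1_to_2 * m3_to_2
print(b2 / b2.sum())                         # P(x2)
```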
Learning Theory
- Formal analysis: theorems that hold given mathematical assumptions
- Ex. 1: mistake bounds for online learning
- perceptron makes a finite number of mistakes if the data are bounded in norm and separable with a margin (the bound is written out after this list)
- Ex. 2: generalization analysis, risk bounds
- Bound with high probability the true risk
- Bound includes the empirical risk and some measure of model complexity
- Related to bias-variance
- Vapnik-Chervonenkis dimension:
- the largest number of points the hypothesis class can shatter
- shatter: produce every possible binary labeling of those points
- Smaller complexity -> smaller gap between empirical risk and true risk
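The perceptron mistake bound from Ex. 1, written out (one standard statement, assuming inputs bounded in norm by R and a unit-norm separator with margin gamma):

```latex
% If \|x_i\| \le R for all i and some unit vector u satisfies
% y_i (u \cdot x_i) \ge \gamma > 0, then the perceptron makes at most
\#\text{mistakes} \;\le\; \left(\frac{R}{\gamma}\right)^{2}
```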