# Machine Learning Fall 2017 Notes

This high-level outline collects the major takeaway points from the topics we have covered so far this semester.

## Types of Machine Learning

• Supervised learning
  • Learner gets examples of inputs and outputs
  • naive Bayes, decision trees, logistic regression, SVM, neural nets, batch perceptron
• Unsupervised learning
  • Learner doesn’t get examples of outputs
  • Use other insights to extract knowledge from data
• Online learning
  • Learner gets one example at a time, must make a decision on that example, then gets to update the model once the label is revealed
  • Learner is evaluated while it is learning
  • E.g., perceptron
• Parametric models: a finite-dimensional parameter describes the model
  • Can still have very high complexity (e.g., large neural networks or rich kernels)
• Non-parametric models: arbitrary-complexity models; complexity can grow without bound to fit unbounded data
  • Still usually have parameters to control complexity (otherwise non-parametric models would never work)

## General Themes

• Modularity shows up a lot
  • Loss functions for optimization-based methods
  • Squashing functions for neural nets
  • Kernels for SVMs
  • Features are exchangeable
• Optimization approach to learning:
  • formulate learning as optimizing an objective function
  • solve with a general-purpose optimizer

## Model Selection

• Usually measure performance with a loss function
• Tradeoff between bias and variance
  • bias: error caused by using the wrong model class
  • variance: error caused by randomness in the sample
• If we use a really complex model, we can do really well on training data
  • low bias, high variance
  • our excellent training-time performance is not a good estimate of how we’ll do at test time
• If we use a really simple model, we might do poorly on training data
  • high bias, low variance
  • good estimate of how we’ll do at test time, but we know we won’t do well
• Regularization to control model complexity
• Cross-validation to estimate true risk and find the best tradeoff
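
A minimal sketch of k-fold cross-validation in plain Python; the `train_fn` and `loss_fn` callables are placeholders for whatever model and loss you are evaluating:

```python
import random

def k_fold_cross_validation(data, labels, train_fn, loss_fn, k=5, seed=0):
    """Estimate true risk by averaging held-out loss over k folds."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    losses = []
    for held_out in folds:
        held_set = set(held_out)
        train_idx = [i for i in indices if i not in held_set]
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        # average loss on the held-out fold estimates risk for this split
        fold_loss = sum(loss_fn(model, data[i], labels[i])
                        for i in held_out) / len(held_out)
        losses.append(fold_loss)
    return sum(losses) / k
```

To compare model complexities, run this once per candidate model and keep the one with the lowest estimated risk.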

## Naive Bayes

• Independence assumption: features are independent given the label
• Can substitute various probability models for the features (e.g., multinomial, Bernoulli, or Gaussian distributions)
• A simple Bayesian network
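
A minimal sketch of naive Bayes prediction with Bernoulli features, assuming the per-class priors and conditional probabilities have already been estimated:

```python
import math

def bernoulli_nb_log_score(x, prior, cond):
    """Log joint score log P(y) + sum_j log P(x_j | y) under the naive
    independence assumption, for binary features x_j in {0, 1}.
    prior: P(y); cond[j]: P(x_j = 1 | y)."""
    score = math.log(prior)
    for xj, pj in zip(x, cond):
        score += math.log(pj if xj == 1 else 1.0 - pj)
    return score

def nb_predict(x, priors, conds):
    # pick the label with the highest log joint score
    scores = [bernoulli_nb_log_score(x, p, c)
              for p, c in zip(priors, conds)]
    return max(range(len(scores)), key=lambda y: scores[y])
```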

## Decision Trees

• Training: greedily split input space (and data set) by a criterion
  • information gain
  • misclassification rate
  • Gini index (expected misclassification rate)
• Prediction: recurse down the tree according to the decision rule at each node
• Pruning:
  • faster to train a full tree, then prune away subtrees
  • one strategy is to prune to obtain the best held-out validation performance
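
The information-gain criterion can be sketched as the entropy of the parent minus the weighted entropy of a candidate binary split:

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, left, right):
    """Reduction in entropy from splitting `parent` into `left` + `right`."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

A perfect split of a balanced binary parent yields a gain of 1 bit; a split that leaves both children as mixed as the parent yields a gain of 0.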

## Logistic Regression

• Linear classifier
• Logistic function (sigmoid) squashes the linear score to the zero-one range
• Maximum-likelihood training via convex optimization
  • Stochastic methods work too
  • Repeat: randomly sample a mini-batch of examples, and take a gradient step for those examples
• Multiclass version exponentiates each class’s linear score and normalizes (softmax)
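
A minimal sketch of the stochastic training loop described above; the learning rate and batch construction here are illustrative choices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, batch, lr=0.1):
    """One mini-batch gradient ascent step on the log likelihood.
    batch: list of (features, label) pairs with label in {0, 1}."""
    grad = [0.0] * len(w)
    for x, y in batch:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for j, xj in enumerate(x):
            grad[j] += (y - p) * xj   # gradient of the log likelihood
    return [wi + lr * g / len(batch) for wi, g in zip(w, grad)]
```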

## Perceptron

• Another linear classifier
• Online update
• Binary class:
  • add input features to w if wrong and the label is positive
  • subtract input features from w if wrong and the label is negative
  • do nothing if correct
• Multiclass:
  • if wrong: add input features to the correct class’s weight vector
  • if wrong: subtract input features from the incorrectly predicted class’s weight vector
• Mistake bound: guaranteed maximum number of updates, which scales with the margin and a bound on the input norms
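
The binary update rule above can be sketched as follows, with labels in {+1, -1} so the add/subtract cases fold into a single `w += y * x` update:

```python
def perceptron_train(data, labels, epochs=10):
    """Binary perceptron; data is a list of feature lists, labels in {+1, -1}."""
    w = [0.0] * len(data[0])
    for _ in range(epochs):
        for x, y in zip(data, labels):
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:          # mistake (or tie): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
            # do nothing if correct
    return w
```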

## Neural Networks

• Each node takes input activations, combines them, and squashes the result
• We studied feed-forward structures
  • layers of “units”, fully connected between layers
• Train by gradient methods on a loss function of the output
  • e.g., log likelihood
• Back propagation: chain rule for gradients
  • save activation values during forward propagation
  • use saved activation values to compute gradients
  • propagate the error gradient backwards through the weights
• Modular: choose any differentiable squashing function
• Get non-linearity, but lose convexity
  • local solutions are usually good
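
A minimal sketch of back propagation for one hidden layer, assuming a tanh squashing function and squared-error loss; the hidden activations saved in the forward pass are reused to compute gradients:

```python
import math

def forward(x, W1, b1, W2, b2):
    """Forward pass; returns hidden activations (saved for backprop) and output."""
    h = [math.tanh(sum(W1[i][j] * x[j] for j in range(len(x))) + b1[i])
         for i in range(len(b1))]
    y = sum(W2[i] * h[i] for i in range(len(h))) + b2
    return h, y

def backward(x, target, W1, b1, W2, b2):
    """Backward pass for loss 0.5 * (y - target)**2, via the chain rule."""
    h, y = forward(x, W1, b1, W2, b2)
    dy = y - target                                  # dL/dy
    dW2 = [dy * h[i] for i in range(len(h))]
    db2 = dy
    dh = [dy * W2[i] for i in range(len(h))]
    # tanh'(z) = 1 - tanh(z)^2, and h[i] already stores tanh(z_i)
    dz = [dh[i] * (1.0 - h[i] ** 2) for i in range(len(h))]
    dW1 = [[dz[i] * x[j] for j in range(len(x))] for i in range(len(h))]
    db1 = dz
    return dW1, db1, dW2, db2
```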

## Optimization

• Convex objective functions are nice
  • line search, conjugate gradient, second-order methods
• “Efficient” non-convex methods:
  • local optimization
  • bound minimization
• Methods for handling constraints:
  • Lagrange or KKT multipliers: adversarial terms that make constraint satisfaction a game
    • Technical jargon: Lagrange multipliers are for equality constraints, KKT multipliers for inequality constraints
  • Penalty/barrier functions

## Duality

• Generally: a dual is an alternate optimization problem that tells you something about the original, primal, problem
  • Usual scenario #1: the solution to the dual lets you solve for the solution of the primal
  • Usual scenario #2: the solution to the dual lets you solve for an approximate solution of the primal
• Lagrangian dual for constrained optimization:
  • take the constraints from the original problem, write the Lagrangian (or KKT) form
  • solve the objective for the primal variables
  • the remaining optimization over the KKT multipliers is a dual problem
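
For a problem with a single inequality constraint, the recipe above can be written out as a standard sketch:

```latex
\text{Primal: } \min_x f(x) \;\; \text{s.t. } g(x) \le 0,
\qquad
L(x, \lambda) = f(x) + \lambda\, g(x), \;\; \lambda \ge 0
```

```latex
d(\lambda) = \min_x L(x, \lambda),
\qquad
\text{Dual: } \max_{\lambda \ge 0} d(\lambda),
\qquad
d(\lambda) \le f(x^*) \;\; \text{(weak duality)}
```

Solving the inner minimization for the primal variables leaves an optimization over the multiplier alone, which is the dual problem.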

## Support Vector Machines

• Large-margin linear classifier
• Solution found via quadratic optimization
  • Each labeled example is a constraint: the linear classifier’s score must be at least +1 for positive examples (or at most −1 for negative examples)
  • Minimizing the norm of the weight vector is equivalent to maximizing the margin
• Slack: change constraints to include nonnegative slack variables
  • penalize the slack variables
  • Equivalently, express the objective with a hinge loss on misclassification
• Learning-theoretic analysis shows a large margin reduces model complexity (VC dimension)
• Dual SVM is also a quadratic program
  • solve directly for the KKT multipliers on the classification constraints
  • The form of the dual QP allows kernelization, since all input data appears in inner products
  • The dual variables are sparse: nonzero values correspond to the support vectors, the inputs that affect the solution
• SMO: do closed-form optimization of the dual objective on two alphas at a time
• Stochastic: take a gradient step of the SVM objective for a small mini-batch
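
The stochastic approach can be sketched as subgradient descent on the regularized hinge-loss objective (a Pegasos-style sketch; the constant learning rate and regularization strength are illustrative choices):

```python
import random

def svm_sgd(data, labels, lam=0.01, epochs=100, lr=0.1, seed=0):
    """Stochastic subgradient descent on lam/2 * ||w||^2 + mean hinge loss.
    Labels in {+1, -1}; data is a list of feature lists."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0])
    idx = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            x, y = data[i], labels[i]
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1:
                # subgradient of hinge loss is -y*x inside the margin
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
            else:
                w = [wi - lr * lam * wi for wi in w]   # regularizer only
    return w
```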

## Kernels

• A way to take a linear classifier and make it non-linear
  • replace all inner products with kernel functions
  • kernel function: an inner product in some (possibly implicit) vector space
  • ideally efficiently computable without explicitly forming the inner products, especially if the space is high dimensional
• Example kernels
  • polynomial: takes input features and groups them into different polynomial terms
    • can be computed implicitly as (K + 1)^p, where K is the linear kernel (inner product) and p is the degree
  • radial basis function (Gaussian): kernel value is the Gaussian density of point x with y as the mean
    • implies an infinite-dimensional vector space (via a Taylor expansion of the kernel value)
    • easier to think about as the kernel function than as the infinite-dimensional vector space
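
The two example kernels as straightforward functions, using the explicit (x·y + 1)^p form for the polynomial kernel and exp(-γ‖x − y‖²) for the RBF:

```python
import math

def polynomial_kernel(x, y, p=2):
    """(x.y + 1)^p: implicit inner product over all monomials up to degree p."""
    return (sum(xi * yi for xi, yi in zip(x, y)) + 1.0) ** p

def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * ||x - y||^2): inner product in an infinite-dim. space."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

For p = 2 in one dimension, the implicit feature map is (1, √2·x, x²), and the kernel value matches that explicit inner product exactly.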

## Learning Theory

• Formalize analysis; theorems given mathematical assumptions
• Ex. 1: mistake bounds for online learning
  • the perceptron will make a finite number of mistakes if the data is bounded and separable by a margin
• Ex. 2: generalization analysis, risk bounds
  • bound, with high probability, the true risk
  • the bound includes the empirical risk and some measure of model complexity
  • related to bias-variance
• Vapnik-Chervonenkis dimension:
  • how many points can the hypothesis class shatter
    • i.e., label in all possible binary configurations
• Smaller complexity -> smaller gap between empirical risk and true risk
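
Ex. 1 can be stated concretely: if every input has norm at most $R$ and the data is separable with margin $\gamma$, the perceptron makes at most

```latex
\text{mistakes} \le \frac{R^2}{\gamma^2}
```

updates, no matter how the examples are ordered.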

## Principal Components Analysis

• Realign data vectors along a new basis
• New dimensions ordered by amount of variance
• (Usually) truncate to the first few dimensions, removing noise from directions of low variance
• Computed by eigendecomposition of the covariance matrix
  • or, equivalently, singular-value decomposition of the data matrix
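
A minimal NumPy sketch of the eigendecomposition route:

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components.
    Returns (projected data, components)."""
    Xc = X - X.mean(axis=0)                   # center the data
    cov = (Xc.T @ Xc) / (len(X) - 1)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]     # top-k directions by variance
    components = eigvecs[:, order]
    return Xc @ components, components
```

When the data actually lies near a k-dimensional subspace, projecting onto the top-k components and mapping back loses almost nothing.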

## EM for Gaussian Mixture Models

• Mixture model: randomly sample a sub-distribution from a set, then sample from the chosen distribution
• Gaussian mixture model: each sub-distribution is a Gaussian
• Often used for clustering
• Latent variables: which Gaussian each data point came from
  • never observed, so we keep track of it as a random variable with a multinomial distribution
• Latent variables make the log-likelihood non-convex
• Expectation maximization:
  • E-step: given the current Gaussian parameters and prior, estimate the probability of each latent variable
  • M-step: given the current latent-variable probabilities, set the Gaussian parameters to maximize the expected log likelihood under those distributions
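
A minimal 1-D, two-component sketch of the E- and M-steps; initial parameters are passed in, and for simplicity there is no convergence check, just a fixed number of iterations:

```python
import math

def em_gmm_1d(data, mu, sigma, pi, iters=50):
    """EM for a 1-D mixture of two Gaussians.
    mu, sigma, pi: per-component means, std devs, and mixing weights."""
    def gauss(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            joint = [pi[k] * gauss(x, mu[k], sigma[k]) for k in range(2)]
            z = sum(joint)
            resp.append([j / z for j in joint])
        # M-step: re-estimate parameters from responsibility-weighted points
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-6)   # guard against collapse
            pi[k] = nk / len(data)
    return mu, sigma, pi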

## Variational EM and K-Means

• EM can be interpreted as maximizing a lower bound on the true log likelihood
  • First scale each setting in the expectation summation by a variational distribution q divided by itself (so 1.0)
  • The bound comes from applying Jensen’s inequality when we move the log from outside to inside the expectation
• Limiting the q distribution to independent multinomials for each example recovers the standard EM algorithm
• Limiting q to point distributions (and making the Gaussians spherical) recovers k-means
• K-means:
  • E-step: assign each point to its nearest mean
  • M-step: recompute the mean of the points assigned to each cluster
  • Also a bound on the log-likelihood, but with discrete point distributions, so the optimization isn’t as smooth
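
The two k-means steps, sketched for scalar points:

```python
def kmeans(points, means, iters=20):
    """Lloyd's algorithm for scalar points; means holds the initial centers."""
    for _ in range(iters):
        # E-step: assign each point to its nearest mean
        clusters = [[] for _ in means]
        for p in points:
            j = min(range(len(means)), key=lambda k: (p - means[k]) ** 2)
            clusters[j].append(p)
        # M-step: recompute each mean (keep the old mean if a cluster is empty)
        means = [sum(c) / len(c) if c else m
                 for c, m in zip(clusters, means)]
    return means
```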

## Bayesian Networks

• Describe conditional independence relationships with a directed acyclic graph
  • Nodes are variables
  • Edges connect variables
  • The factorized probability distribution includes the conditional probability of each child given its parents
• Explaining away
  • observing a child makes its parents dependent
  • but if the child is unobserved, the parents are independent
• Markov blanket: the variables that need to be observed to make a variable independent of all other variables
  • children, parents
  • parents of children
• Variable elimination algorithm for exact inference
  • dynamic programming method to avoid redundant computation
  • choose a variable to eliminate
  • sum out all settings of the variable over the relevant factors
  • compute a new factor
    • a function of any other variables in the summed-out factors
  • rewrite the probability using the new factor

## Hidden Markov Models

• Special structured Bayes nets
• Used for modeling sequences, often time series
• Hidden states form a Markov chain
  • past is independent of future given the present
• Observed values depend on the hidden state
• Forward-backward inference
  • compute forward messages: probabilities of each state and the evidence so far
  • compute backward messages: probability of future evidence given each state
  • combine and normalize to get marginal probabilities
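
A NumPy sketch of forward-backward with per-step normalization; the normalization keeps the messages numerically stable and cancels when the marginals are renormalized at the end:

```python
import numpy as np

def forward_backward(init, trans, emit, obs):
    """Marginal hidden-state probabilities for an HMM.
    init: (S,) initial distribution; trans[i, j] = P(next=j | cur=i);
    emit[s, o] = P(obs=o | state=s); obs: list of observation indices."""
    T, S = len(obs), len(init)
    alpha = np.zeros((T, S))    # forward messages (normalized each step)
    beta = np.ones((T, S))      # backward messages
    alpha[0] = init * emit[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta        # combine and normalize
    return gamma / gamma.sum(axis=1, keepdims=True)
```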

## Markov Random Fields

• Undirected graph describes the independence structure
• Probability factorizes into potential functions over the cliques of the graph
• Can convert a Bayes net to an MRF, but we lose the independence of parents when the child is unobserved
  • connect all mutual parents (“moralization”)
• Markov blanket in MRFs: all neighbors in the graph
• Collect-and-distribute belief propagation in trees is exact
  • choose a node to be the root
  • compute messages from the leaves, then leaf-parents, etc., until the root has messages from the entire tree
  • compute outgoing messages from the root, then from the root’s children, etc., until the leaves have messages from the root
  • fuse messages into beliefs
• Loopy belief propagation is only an approximation in general, but it works well in practice

## Structured Prediction

• A structured predictor is a function that outputs a complex object. Typically the object is a set of dependent variable states, such as a graph structure, a tree, a path, etc.
• Structured output learning is the task of learning the parameters for a structured predictor.
• Aside from probabilistic graphical models, other frameworks for structured prediction typically involve learning parameters for an optimization, such as learning the weights for a minimum-spanning-tree problem.
• Structured perceptron tries to enforce that the score of the ground-truth structure is better than the score of any other possible output.

## Reinforcement Learning

• Learner receives (1) state, (2) action, (3) resulting state, (4) reward.
  • (Depending on the setting, the learner may also get to choose the action.)
• Key challenge: delayed rewards require the learner to figure out which states and actions led to the reward.
• Most algorithms represent problems as Markov decision processes.
  • Pr(s' | s, a): probability of transitioning to the next state given the current state and action.
  • Allows for randomness in the result of an action.
• A policy is a function that reads in a state and returns an action.
  • We can evaluate a policy by considering the expected reward if the learner follows it.
  • The expected discounted reward scales future rewards by an exponentially decaying weight, allowing us to reason about the expected reward of a learner running infinitely many steps into the future.
• The expected reward of the optimal policy has a recursive form, known as the Bellman equation.
• Q-learning attempts to learn a function approximating the expected discounted future reward given the current state and an action (the q-function).
  • Can be updated by taking a gradient step on the loss, which is the difference between the empirical estimate of the q-function and its current value.
  • The q-function can be approximated by any function class, so modern reinforcement learning is typically done by training a neural network to mimic the q-function.
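
The tabular update can be sketched as follows; the tiny deterministic chain environment below is an illustrative toy, not from the notes:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Tabular Q-learning: move Q[s][a] toward r + gamma * max_a' Q[s'][a']."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def run_chain_episodes(n_states=3, episodes=200):
    """Toy chain: the single action moves right; reward 1 on reaching the end."""
    Q = [[0.0] for _ in range(n_states)]   # one action per state
    for _ in range(episodes):
        s = 0
        while s < n_states - 1:
            s_next = s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            q_learning_update(Q, s, 0, r, s_next)
            s = s_next
    return Q
```

Because the reward is delayed, the value of the first state is learned only through bootstrapping: it converges to gamma times the value of the next state.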

## Deep Learning

• Aside from larger datasets and faster computation (e.g., on GPUs), various techniques have been developed to overcome vanishing gradients and allow training of much deeper models than in previous years.
  • New activation functions like rectified linear units (ReLUs) partially prevent gradients from vanishing by not squashing one side of the function. Many other variations have been proposed and experimented with.
  • Stochastic optimization methods, especially adaptive ones that adjust learning rates per parameter, also help.