Machine Learning Fall 2017 Notes
This high-level outline includes the major takeaway points from the topics we covered so far this semester.
Types of Machine Learning
- Supervised learning
- Learner gets examples of inputs and outputs
- naive Bayes, decision trees, logistic regression, SVM, neural nets, batch perceptron
- Unsupervised learning
- Learner doesn’t get examples of output
- Use other insights to extract knowledge from data.
- Online learning
- Learner gets one example at a time, must make decision on that example, then gets to update model once the label is revealed
- Learner is evaluated while it is learning
- E.g., perceptron
- Parametric model: finite dimensional parameter describing model
- Can still have very high complexity (e.g., large neural networks or rich kernels)
- Non-parametric models: arbitrary complexity models, can have infinite complexity to fit infinite data
- Still usually have parameters to control complexity (otherwise non-parametric models would never work)
- Modularity shows up a lot
- Loss functions for optimization-based methods
- Squashing functions for neural nets
- Kernels for SVM
- Features are exchangeable
- Optimization approach to learning:
- formulate learning as optimizing an objective function
- solve with general-purpose optimizer
- Usually measure performance with loss function
- Tradeoff between bias and variance
- bias: error caused by using the wrong model class
- variance: error caused by randomness in sample
- If we use a really complex model, we can do really well on training data
- low bias, high variance
- our excellent training-time performance is not a good estimate of how we’ll do at test time
- If we use a really simple model, we might do poorly on training data
- high bias, low variance
- good estimate of how we’ll do at test time, but we know we won’t do well
- Regularization to control model complexity
- Cross-validation to estimate true risk, find best tradeoff
- Independence assumption: observations independent given label
- Can substitute various probability functions (like multinomial, Bernoulli, or Gaussian distributions)
- A simple Bayesian network
- Training: greedily split input space (and data set) by criterion
- information gain
- misclassification rate
- Gini index (expected misclassification rate)
- Prediction: recurse down subtree according to decision rule
- faster to train a full tree, then prune away subtrees
- one strategy is to prune to obtain the best held-out validation performance
- Linear classifier
- Logistic function (sigmoid) to squash to zero-one range
- Maximum likelihood training via convex programming
- Stochastic methods work too
- Repeat: randomly sample mini batch of examples, and take gradient step for those examples
- Multiclass version exponentiates linear score and normalizes
- Another linear classifier
- Online update
- Binary class:
- add input features to w if wrong and label is positive
- subtract input features to w if wrong and label is negative
- do nothing if correct
- Multi class:
- If wrong: add input features to correct class weight vector
- If wrong: subtract input features to incorrect class weight vector
- Mistake bound: guaranteed maximum number of updates, scales with margin and input bound
- Each node takes input activations, combines them and squashes
- We studied feed-forward structures
- layers of “units”, fully-connected between layers
- Train by gradient methods on loss function of output
- Back propagation: chain rule of gradients
- save activation values during forward propagation
- use saved activation values to compute gradients
- propagate error gradient backwards through weights
- Modular: choose any differentiable squashing function
- Get non-linearity, but lose convexity
- local solutions are usually good
- Convex objective functions are nice
- (sub)gradient descent
- line search, conjugate gradient, second-order methods
- “Efficient” non-convex methods:
- local optimization
- bound minimization
- Methods for handling constraints:
- Lagrange or KKT multipliers: adversarial terms to make constraint satisfaction a game
- Technical jargon: Lagrange multipliers are for equality constraints, KKT for inequality constraints
- Penalty/barrier functions
- Generally: a dual is an alternate optimization that tells you something about the original, primal, problem
- Usual scenario #1: solution to dual allows you to solve for the solution of the primal
- Usual scenario #2: solution to dual allows you to solve for an approximate solution of the primal
- Lagrangian dual for constrained optimization:
- take constraints from original problem, write Lagrangian (or KKT) form
- solve objective for primal variables
- remaining optimization over KKT multipliers is a dual problem
Support Vector Machines
- Large margin linear classifier
- Solution found via quadratic optimization
- Each labeled example is a constraint: linear classifier score must correctly classify by at least 1.0 (or –1.0)
- Minimizing norm of weight vector is equivalent to maximizing the margin
- Slack: change constraints to have nonnegative slack variables
- penalize slack variable
- Equivalently, express objective with hinge-loss on misclassification
- Learning theoretic analysis shows large margin reduces model complexity (VC dimension)
- Dual SVM is also a quadratic program
- solve directly for KKT multipliers on misclassification constraints
- Form of dual QP allows kernelization, since all input data appears in inner products
- dual variables will be sparse. Nonzero values for the support vectors, which are the inputs that affect the solution
- SMO: do closed-form optimization of dual objective on two alphas at a time
- Stochastic: take gradient step of SVM objective for small mini-batch
- A way to take a linear classifier and make it non-linear
- replace all inner products with kernel functions
- kernel function: inner product in some vector space
- ideally efficiently computable without explicitly doing inner products, especially if high dimensional
- Example kernels
- polynomial: take input features and group them into different polynomial terms
- can be efficiently computed implicitly by taking linear kernel (K + 1).^p
- radial basis function (Gaussian): kernel value is Gaussian density of point x with y as mean
- implies an infinite dimensional vector space (via a Taylor expansion of the kernel value)
- easier to think about as the kernel function than the infinite dimensional vector space
- Formalize analysis, theorems given mathematical assumptions
- Ex. 1: mistake bounds for online learning
- perceptron will make a finite number of mistakes if data is separable by a margin and bounded
- Ex. 2: generalization analysis, risk bounds
- Bound with high probability the true risk
- Bound includes the empirical risk and some measure of model complexity
- Related to bias-variance
- Vapnik-Chervonenkis dimension:
- how many points can the hypothesis class shatter
- label in all possible binary configurations
- Smaller complexity -> smaller gap between empirical risk and true risk
Principal Components Analysis
- Realign data vectors along new basis
- New dimensions ordered by amount of variance
- (Usually) truncate to first few dimensions, removing noise from directions of less variance
- Computed by eigendecomposition of covariance matrix
- Or equivalently, singular-value decomposition of data matrix
EM for Gaussian Mixture Models
- Mixture model: randomly sample from set of sub-distributions, then sample from chosen distribution
- Gaussian mixture model: each sub-distribution is a Gaussian
- Often used for clustering
- Latent variables: which Gaussian each data point came from
- never observed, so we keep track of it as a random variable from a multinomial probability
- Latent variables make log-likelihood non-convex
- Expectation maximization:
- E-step: given current Gaussian parameters and prior, estimate the probability of each latent variable
- M-step: given current latent variable probabilities, set Gaussian parameters to maximize the expected log likelihood under those distributions.
Variational EM and K-Means
- EM can be interpreted as maximizing a lower bound on the true log likelihood
- First scale each setting in the expectation summation by a variational distribution q divided by itself (so 1.0)
- Bound comes from applying Jensen’s inequality when we move the log from outside to inside the expectation
- Limiting q distribution to independent multinomials for each example gets the standard EM algorithm
- Limiting q to point distributions (and making the Gaussians spherical) gets k-means
- E-step: assign all points to their nearest mean
- M-step: recompute mean of points assigned to each cluster
- Also a bound on log-likelihood, but has discrete point distributions, so optimization isn’t as smooth
- Describe conditional independence relationships with a directed acyclic graph
- Nodes are variables
- Edges between variables
- The factorized probability distribution includes the conditional probability of child given its parents
- Explaining away
- observing a child makes the parents dependent
- but if child is unobserved, parents are independent
- Markov blanket: variables that need to be observed to make a variable independent of all other variables
- Children, parents
- Parents of children
- Variable elimination algorithm for exact inference
- dynamic programming method to avoid redundant computation
- choose variable to eliminate
- sum out all settings of variable over relevant factors
- compute new factor
- function of any other variables in summed out factors
- rewrite probability using new factor
- Special structured Bayes nets
- Used for modeling sequences, often time series
- Hidden states form a Markov chain
- past is independent of future given present
- Observed values depend on hidden state
- Forward-backward inference
- compute forward messages: probabilities of state and previous evidence
- compute backward messages: probability of future evidence given state
- combine and normalize to get marginal probabilities
Markov Random Fields
- Undirected graph to describe independence structure
- Probability factorized into potential functions for each clique of the graph
- Can convert Bayes net to MRF, but lose independence of parents if child is unobserved
- connect all mutual parents
- Markov blanket in MRFs: all neighbors in graph
- Collect and distribute belief propagation in trees is exact
- Choose node to be root
- compute messages from leaves, then leaf-parents, etc. until root has all messages from entire tree
- compute outgoing messages from root, then from root-children, etc. until leaf has messages from root
- Fuse messages into beliefs
- Loopy belief propagation is only an approximation in general, but it works well in practice
- A structured predictor is a function that outputs a complex object. Typically the object is a set of dependent variable states, such as a graph structure, a tree, a path, etc.
- Structured output learning is the task of learning the parameters for a structured predictor.
Aside from probabilistic graphical models, other frameworks for structured prediction typically involve learning parameters for an optimization, such as learning the weights for a minimum-spanning-tree problem.
- Structured perceptron tries to enforce that the score of the ground-truth structures are better than the scores for any other possible output.
- Learner receives (1) state, (2) action, (3) result, (4) reward.
- (Depending on setting, the learner may also get to choose the action.)
- Key challenge: delayed rewards require learner to figure out what states and actions led to the reward.
- Most algorithms represent problems as Markov decision processes.
- Pr(s' | s, a): probability of transitioning to next state given the current state and action.
- Allows for randomness as a result of an action.
- A policy is a function that reads in a state and returns an action.
- We can evaluate a policy by considering the expected reward if the learner uses the policy.
- The expected discounted reward scales future rewards by an exponentially decaying weight, allowing us to reason about the expected reward of a learner running for infinite steps into the future.
- The expected reward of the optimal policy has a recursuve form, known as the Bellman equation.
- Q-learning attempts to learn a function approximating the expected discounted future reward given the current state and an action (the q-function).
- Can be updated by taking a gradient step in the loss, which is the difference between the empirical estimate of the q-function and its current value.
- The q-function can be approximated by any other function, so modern reinforcement learning is typically done by training a neural network to mimic the q-function.
- Aside from larger datasets and faster computation (e.g., on GPUs), various techniques have been developed to overcome vanishing gradients to allow training of much deeper models than in previous years.
- New activation functions like rectified linear units (ReLUs) partially prevent gradients from vanishing by not squashing one side of the function. Many other variations have been proposed and experimented with.
- Stochastic optimization methods, especially adaptive ones that adjust learning rates per parameter.
- Dropout regularization disables units randomly on gradient updates.
- Batch normalization normalizes the inputs to each layer, rather than just the input to the entire neural network.
- Convolutional neural networks are structured such that convolutional layers run linear filters over the input grid, structuring the neural network to be spatially coherent.
- Recurrent neural networks output vectors that are fed into itself, allowing the network to retain "memory."
- Long short term memory (LSTM) explicitly organizes this so there is a long-term memory vector that is passed onto the next time step, with modules that decide what parts of the memory to remove and replace with new information.
- RNNs (and LSTMs) are currently good tools for processing text and other sequental data. ConvNets are good tools for processing images, video, and other spatial data.
- Mixing these structures together, we can form models like sequence-to-sequence models, which use a RNN to read in data and output sequences via another RNN.
- Generative adversarial networks (GANs) are trained in a setup that includes a generative model and a discriminative model. The discriminator tries to distinguish between generated examples and real data, and the generator tries to create realistic generated data.