Machine Learning Fall 2017 Notes

This high-level outline includes the major takeaway points from the topics we covered so far this semester.

Types of Machine Learning

Supervised learning
- Learner gets examples of inputs and outputs
- naive Bayes, decision trees, logistic regression, SVM, neural nets, batch perceptron
Unsupervised learning
- Learner doesn’t get examples of output
- Use other insights to extract knowledge from data.
Online learning
- Learner gets one example at a time, must make decision on that example, then gets to update model once the label is revealed
- Learner is evaluated while it is learning
- E.g., perceptron
Parametric model: finite dimensional parameter describing model
- Can still have very high complexity (e.g., large neural networks or rich kernels)
Non-parametric models: arbitrary complexity models, can have infinite complexity to fit infinite data
- Still usually have parameters to control complexity (otherwise non-parametric models would never work)

Modularity shows up a lot
- Loss functions for optimization-based methods
- Squashing functions for neural nets
- Kernels for SVM
- Features are exchangeable
Optimization approach to learning:
- formulate learning as optimizing an objective function
- solve with general-purpose optimizer

Usually measure performance with loss function
Tradeoff between bias and variance
- bias: error caused by using the wrong model class
- variance: error caused by randomness in sample
If we use a really complex model, we can do really well on training data
- low bias, high variance
- our excellent training-time performance is not a good estimate of how we’ll do at test time
If we use a really simple model, we might do poorly on training data
- high bias, low variance
- good estimate of how we’ll do at test time, but we know we won’t do well
Regularization to control model complexity
Cross-validation to estimate true risk, find best tradeoff

Independence assumption: observations independent given label
Can substitute various probability functions (like multinomial, Bernoulli, or Gaussian distributions)
A simple Bayesian network

Training: greedily split input space (and data set) by criterion
- information gain
- misclassification rate
- Gini index (expected misclassification rate)
Prediction: recurse down subtree according to decision rule
Pruning:
- faster to train a full tree, then prune away subtrees
- one strategy is to prune to obtain the best held-out validation performance

Linear classifier
Logistic function (sigmoid) to squash to zero-one range
Maximum likelihood training via convex programming
Stochastic methods work too
- Repeat: randomly sample mini batch of examples, and take gradient step for those examples
Multiclass version exponentiates linear score and normalizes

Another linear classifier
Online update
- Binary class:
  - add input features to w if wrong and label is positive
  - subtract input features to w if wrong and label is negative
  - do nothing if correct
- Multi class:
  - If wrong: add input features to correct class weight vector
  - If wrong: subtract input features to incorrect class weight vector
Mistake bound: guaranteed maximum number of updates, scales with margin and input bound

Each node takes input activations, combines them and squashes
We studied feed-forward structures
- layers of “units”, fully-connected between layers
Train by gradient methods on loss function of output
- e.g., log likelihood
Back propagation: chain rule of gradients
- save activation values during forward propagation
- use saved activation values to compute gradients
- propagate error gradient backwards through weights
Modular: choose any differentiable squashing function
Get non-linearity, but lose convexity
- local solutions are usually good

Convex objective functions are nice
- (sub)gradient descent
- line search, conjugate gradient, second-order methods
“Efficient” non-convex methods:
- local optimization
- bound minimization
Methods for handling constraints:
- Lagrange or KKT multipliers: adversarial terms to make constraint satisfaction a game
  - Technical jargon: Lagrange multipliers are for equality constraints, KKT for inequality constraints
- Penalty/barrier functions

Generally: a dual is an alternate optimization that tells you something about the original, primal, problem
Usual scenario #1: solution to dual allows you to solve for the solution of the primal
Usual scenario #2: solution to dual allows you to solve for an approximate solution of the primal
Lagrangian dual for constrained optimization:
- take constraints from original problem, write Lagrangian (or KKT) form
- solve objective for primal variables
- remaining optimization over KKT multipliers is a dual problem

Large margin linear classifier
Solution found via quadratic optimization
Each labeled example is a constraint: linear classifier score must correctly classify by at least 1.0 (or –1.0)
Minimizing norm of weight vector is equivalent to maximizing the margin
Slack: change constraints to have nonnegative slack variables
- penalize slack variable
- Equivalently, express objective with hinge-loss on misclassification
Learning theoretic analysis shows large margin reduces model complexity (VC dimension)
Dual SVM is also a quadratic program
- solve directly for KKT multipliers on misclassification constraints
- Form of dual QP allows kernelization, since all input data appears in inner products
- dual variables will be sparse. Nonzero values for the support vectors, which are the inputs that affect the solution
SMO: do closed-form optimization of dual objective on two alphas at a time
Stochastic: take gradient step of SVM objective for small mini-batch

A way to take a linear classifier and make it non-linear
replace all inner products with kernel functions
- kernel function: inner product in some vector space
- ideally efficiently computable without explicitly doing inner products, especially if high dimensional
Example kernels
- polynomial: take input features and group them into different polynomial terms
  - can be efficiently computed implicitly by taking linear kernel (K + 1).^p
- radial basis function (Gaussian): kernel value is Gaussian density of point x with y as mean
  - implies an infinite dimensional vector space (via a Taylor expansion of the kernel value)
  - easier to think about as the kernel function than the infinite dimensional vector space

Formalize analysis, theorems given mathematical assumptions
Ex. 1: mistake bounds for online learning
- perceptron will make a finite number of mistakes if data is separable by a margin and bounded
Ex. 2: generalization analysis, risk bounds
- Bound with high probability the true risk
- Bound includes the empirical risk and some measure of model complexity
- Related to bias-variance
- Vapnik-Chervonenkis dimension:
  - how many points can the hypothesis class shatter
  - label in all possible binary configurations
- Smaller complexity -> smaller gap between empirical risk and true risk