Machine Learning Top Algorithms Revision Material
March 1, 2016
1 Decision Trees
When to use decision trees?
• When there is a fixed set of features and each feature takes on a small set of values
• When the target function has a small, discrete set of values
• Decision tree learning is robust to errors in the classification of the training data and can also handle some missing values. Why? Because it considers all available training examples during learning.
Important algorithms and extensions
1. ID3
(a) The hypothesis space for ID3 consists of all possible hypothesis functions, and among the multiple valid hypotheses the algorithm prefers shorter trees over longer trees.
(b) Why shorter trees? Occam's razor: prefer the simplest hypothesis that fits the data. Why?
(c) It is a greedy search algorithm, i.e. the algorithm never backtracks to reconsider a previous choice. Therefore it might end up at a local optimum.
(d) The crux of the algorithm is to specify a way to choose an optimal root node.
(e) The optimal root node is selected by calculating information gain, which measures how well the node/feature separates the training examples w.r.t. the target classification.
(f) For a variable with only binary values, entropy is defined as
-P_{\oplus} \log_2 P_{\oplus} - P_{\ominus} \log_2 P_{\ominus}
Therefore, for a training set with 9 ⊕ and 5 ⊖ values of the target variable, the entropy will be
-\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.940
(g) The information gain of a node with attribute A is defined as
Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
where S is the set of all examples at the root and S_v is the set of examples in the child node corresponding to value v of A. The node with maximum information gain is selected as the root (see the sketch at the end of this section).
2. C4.5
(a) If the training data has significant noise, the tree over-fits.
(b) Over-fitting can be resolved by post-pruning. One successful method is called Rule Post Pruning, and a variant of this algorithm is C4.5.
(c) Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing over-fitting to occur.
(d) Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node. Why?
(e) Prune (generalize) each rule by removing any precondition whose removal improves the rule's estimated accuracy. This is repeated on that rule until the accuracy worsens.
(f) Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
(g) A drawback is that a separate validation set must be kept. To apply the algorithm using the same training set, use a pessimistic estimate of the accuracy.
Other Modifications
• Substitute the most frequently occurring value at that node in case of missing values
• Divide information gains by weights if the features are to be weighted
• Use thresholds for continuous-valued features
• Use a modified version of information gain for choosing the root node, such as the Split Information Ratio
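As referenced above, here is a minimal sketch (not from the source material) of computing entropy and information gain with NumPy. The toy labels reproduce the 9 ⊕ / 5 ⊖ example, and the "wind" attribute is a hypothetical feature made up for illustration.

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum_c p_c * log2(p_c) over the classes present in S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v) for one attribute."""
    gain = entropy(labels)
    for v in np.unique(feature):
        subset = labels[feature == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

# Toy example: 9 positive and 5 negative examples, as in the worked entropy above.
y = np.array([1] * 9 + [0] * 5)
print(round(entropy(y), 3))                       # 0.94
wind = np.array(["weak"] * 8 + ["strong"] * 6)    # hypothetical attribute values
print(round(information_gain(wind, y), 3))        # gain from splitting on 'wind'
```

ID3 would compute this gain for every candidate attribute and split on the one with the largest value.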
References
• Machine Learning - Tom Mitchell, Chapter 3
2 Logistic Regression
When to use logistic regression?
• When the target variable is discrete valued
Algorithm
1. The hypothesis function is of the form (for binary classification)
h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
where g is the logistic (sigmoid) function, which varies between 0 and 1.
2. How did we choose this logistic function? Answer - GLMs (generalized linear models).
3. Let
P(y = 1 \mid x; \theta) = h_\theta(x)
P(y = 0 \mid x; \theta) = 1 - h_\theta(x)
More generally, P(y \mid x; \theta) = h_\theta(x)^y (1 - h_\theta(x))^{1-y}
4. The likelihood, which is the joint probability of the training labels, is
P(\vec{y} \mid X; \theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}
5. We maximize the log of this likelihood function, l(\theta), using batch gradient ascent:
\theta := \theta + \alpha \nabla_\theta l(\theta)
6. For one training example, the derivative of the log likelihood w.r.t. \theta_j is (simple derivation)
\frac{\partial l(\theta)}{\partial \theta_j} = (y - h_\theta(x)) x_j
7. Therefore the update rule becomes (see the sketch after this list)
\theta_j := \theta_j + \alpha (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)}
8. Newton's method can also be used to maximize the log likelihood for faster convergence, but it involves computing the inverse of the Hessian of the log likelihood w.r.t. \theta at each iteration.
9. For multi-class problems, use softmax regression, which can also be derived from GLM theory.
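Below is a minimal sketch (not from the source material) of the batch gradient-ascent update from steps 5-7, using NumPy. The learning rate, iteration count, and toy data are made-up values for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=1000):
    """Batch gradient ascent on the log likelihood:
    theta := theta + alpha * X^T (y - h_theta(X))."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # intercept term x_0 = 1
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta += alpha * X.T @ (y - h)
    return theta

# Toy usage with made-up 1-D inputs.
X = np.array([[0.5], [1.5], [3.0], [4.5]])
y = np.array([0, 0, 1, 1])
theta = fit_logistic(X, y)
X_aug = np.hstack([np.ones((len(X), 1)), X])
print(theta, sigmoid(X_aug @ theta) > 0.5)   # learned parameters and predictions
```

The loop sums the per-example update of step 7 over the whole training set; using one example at a time instead gives the stochastic variant.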
References
• Stanford CS229 Notes 1
3 Linear Regression
When to use linear regression?
• For any regression problem (continuous-valued target)
Algorithm
1. The hypothesis function is of the form
h_\theta(x) = \theta^T x
with x_0 = 1.
2. Intuitively, we minimize the sum of squared errors over the training data to find \theta.
3. The cost function becomes
J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2
and to solve we minimize J(\theta).
4. Why did we choose the least-squares cost function? Answer - Assume the training data is just Gaussian noise around an underlying linear function of the inputs. The log of the probability distribution of the noise, i.e. the log likelihood, gives you the least-squares formula.
5. Use gradient descent to minimize the cost function (see the sketch after this list).
6. Normal equation solution:
\theta = (X^T X)^{-1} X^T \vec{y}
7. Locally weighted linear regression is one modification that accounts for highly varying data without over-fitting.
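As referenced in step 5, here is a minimal sketch (not from the source material) of fitting \theta both by batch gradient descent and by the normal equation. The toy data, learning rate, and iteration count are assumptions for illustration only.

```python
import numpy as np

def fit_gradient_descent(X, y, alpha=0.1, iters=5000):
    """Minimize J(theta) = 1/2 * sum((X theta - y)^2) by batch gradient descent.
    The gradient is averaged over the m examples purely for a stable step size;
    this does not change the minimizer."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        theta -= alpha * X.T @ (X @ theta - y) / m
    return theta

def fit_normal_equation(X, y):
    """Closed-form solution of (X^T X) theta = X^T y, i.e. theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: y = 1 + 2x plus noise, with an intercept column x_0 = 1.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=20)
print(fit_normal_equation(X, y))    # approximately [1, 2]
print(fit_gradient_descent(X, y))   # should closely agree
```

Solving the linear system with np.linalg.solve avoids explicitly inverting X^T X, which is the numerically preferable way to evaluate the normal equation.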
References
• Stanford CS229 Notes 1
4 Nearest Neighbours
5 Gaussian Discriminant Analysis
When to use GDA?
• For classification problems with continuous-valued features
• GDA makes stronger assumptions about the training data than logistic regression. When these assumptions are correct, GDA generally performs better than logistic regression.
Algorithm
1. GDA is a generative algorithm.
2. Intuitive explanation of generative algorithms - First, looking at elephants, we can build a model of what elephants look like. Then, looking at dogs, we can build a separate model of what dogs look like. Finally, to classify a new animal, we match it against the elephant model and against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we saw in the training set.
3. The model is (see the sketch below):
y \sim \mathrm{Bernoulli}(\phi)
x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)
x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)
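A minimal sketch (not from the source material) of fitting this model: the maximum-likelihood estimates of the class prior \phi, the class means \mu_0, \mu_1, and the shared covariance \Sigma, followed by classification by comparing log joint densities. The function names and toy data are made up for illustration.

```python
import numpy as np

def fit_gda(X, y):
    """Maximum-likelihood estimates for phi, mu_0, mu_1 and the shared covariance."""
    phi = y.mean()                                   # estimate of P(y = 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    sigma = centered.T @ centered / len(y)           # shared covariance matrix
    return phi, mu0, mu1, sigma

def predict_gda(X, phi, mu0, mu1, sigma):
    """Pick the class with the larger log joint density log p(x|y) + log p(y).
    Constant terms and log|Sigma| cancel because Sigma is shared."""
    inv = np.linalg.inv(sigma)
    def log_joint(mu, prior):
        d = X - mu
        return -0.5 * np.sum(d @ inv * d, axis=1) + np.log(prior)
    return (log_joint(mu1, phi) > log_joint(mu0, 1 - phi)).astype(int)

# Toy 2-D data: two Gaussian blobs with different means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([3, 3], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gda(X, y)
print((predict_gda(X, *params) == y).mean())         # training accuracy
```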
6 Naive Bayes
7 Bayes Networks
8 Adaboost
9 SVM
10 K-means Clustering
11 Expectation Maximization
12 SVD
13 PCA
14 Random Forests
15 Artificial Neural Networks