Introduction to Statistical Machine Learning
© 2013 Christfried Webers
Statistical Machine Learning Group, NICTA, and College of Engineering and Computer Science, The Australian National University
Canberra, February – June 2013
Overview: Introduction, Linear Algebra, Probability, Linear Regression 1, Linear Regression 2, Linear Classification 1, Linear Classification 2, Neural Networks 1, Neural Networks 2, Kernel Methods, Sparse Kernel Methods, Graphical Models 1, Graphical Models 2, Graphical Models 3, Mixture Models and EM 1, Mixture Models and EM 2, Approximate Inference, Sampling, Principal Component Analysis, Sequential Data 1, Sequential Data 2, Combining Models, Selected Topics, Discussion and Summary
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part VII Classification
Linear Classification 1
Generalised Linear Model, Inference and Decision, Discriminant Functions, Fisher's Linear Discriminant, The Perceptron Algorithm
Classification
Goal: Given input data x, assign it to one of K discrete classes C_k, where k = 1, ..., K. This divides the input space into different regions.
Figure: Length of the petal [in cm] for a given sepal [cm] for iris flowers (Iris Setosa, Iris Versicolor, Iris Virginica).
How to represent binary class labels?
Class labels are no longer real values as in regression, but form a discrete set.
Two classes: t ∈ {0, 1} (t = 1 represents class C_1 and t = 0 represents class C_2).
We can interpret the value of t as the probability of class C_1, with only two values possible for the probability, 0 or 1.
Note: Other conventions to map classes into integers are possible; check the setup.
How to represent multi-class labels?
If there are more than two classes (K > 2), we call it a multi-class setup. Often used: the 1-of-K coding scheme, in which t is a vector of length K whose components are all 0 except for t_j = 1, where j is the index of the class C_j to be encoded.
Example: Given 5 classes {C_1, ..., C_5}, membership in class C_2 is encoded as the target vector t = (0, 1, 0, 0, 0)^T.
Note: Other conventions to map multi-class labels into integers are possible; check the setup.
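As a small illustration of the 1-of-K scheme, a target vector can be built directly. A minimal NumPy sketch; the helper name one_hot and the 0-based class index are assumptions for illustration only:

import numpy as np

def one_hot(class_index, num_classes):
    """Encode class membership as a 1-of-K target vector (0-based class index)."""
    t = np.zeros(num_classes)
    t[class_index] = 1.0
    return t

# Membership in class C_2 out of 5 classes (0-based index 1):
print(one_hot(1, 5))   # [0. 1. 0. 0. 0.]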
Linear Model
Idea: use again a linear model as in regression, where y(x, w) is a linear function of the parameters w:
y(x_n, w) = w^T φ(x_n)
But generally y(x_n, w) ∈ ℝ. Example: which class does y(x, w) = 0.71623 correspond to?
Generalised Linear Model
Apply a mapping f : ℝ → ℤ to the linear model to get the discrete class labels:
y(x_n, w) = f(w^T φ(x_n))
Activation function: f(·)
Link function: f^{-1}(·)
Figure: Example of an activation function, f(z) = sign(z).
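A minimal sketch of such a generalised linear model with f(z) = sign(z). NumPy; the identity-plus-bias basis function phi and the weight values are made-up illustrations, not part of the slides:

import numpy as np

def phi(x):
    """Illustrative basis function: identity features plus a bias element."""
    return np.concatenate(([1.0], x))

def glm_predict(w, x):
    """Generalised linear model y(x) = f(w^T phi(x)) with f(z) = sign(z)."""
    a = w @ phi(x)
    return 1 if a >= 0 else -1

w = np.array([-0.5, 1.0, 2.0])                 # made-up weights (bias first)
print(glm_predict(w, np.array([0.3, 0.2])))    # +1, since -0.5 + 0.3 + 0.4 >= 0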
Three Models for Decision Problems
In increasing order of complexity:

Discriminant Functions
1. Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1. Solve the inference problem of determining the posterior class probabilities p(C_k | x).
2. Use decision theory to assign each new x to one of the classes.

Generative Models
1. Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
2. Also, infer the prior class probabilities p(C_k).
3. Use Bayes' theorem to find the posterior p(C_k | x).
4. Alternatively, model the joint distribution p(x, C_k) directly.
5. Use decision theory to assign each new x to one of the classes.
Two Classes
Definition
A discriminant is a function that maps from an input vector x to one of K classes, denoted by C_k. Consider first two classes (K = 2). Construct a linear function of the inputs x,
y(x) = w^T x + w_0,
such that x is assigned to class C_1 if y(x) ≥ 0, and to class C_2 otherwise.
weight vector w
bias w_0 (sometimes −w_0 is called the threshold)
The decision boundary y(x) = 0 is a (D − 1)-dimensional hyperplane in a D-dimensional input space (decision surface).
w is orthogonal to any vector lying in the decision surface.
Proof: Assume x_A and x_B are two points lying in the decision surface. Then
0 = y(x_A) − y(x_B) = w^T (x_A − x_B)
The normal distance from the origin to the decision surface is
(w^T x) / ‖w‖ = − w_0 / ‖w‖
[Figure: geometry of the linear discriminant in a two-dimensional input space: decision boundary y = 0 separating regions R_1 (y > 0) and R_2 (y < 0), weight vector w, a point x with its orthogonal projection x_⊥ onto the boundary, signed distance y(x)/‖w‖, and offset −w_0/‖w‖ from the origin.]
The value of y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface, r = y(x)/‖w‖.
Decompose x as
x = x_⊥ + r (w / ‖w‖)
Then, using w^T x_⊥ + w_0 = y(x_⊥) = 0,
y(x) = w^T x + w_0 = w^T x_⊥ + w_0 + r (w^T w)/‖w‖ = r ‖w‖
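A quick numerical check of the signed-distance formula. NumPy; the weight vector, bias, and test point are made up for illustration:

import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0     # made-up weight vector and bias
x = np.array([2.0, 1.0])

y = w @ x + w0                          # y(x) = w^T x + w0
r = y / np.linalg.norm(w)               # signed distance from the decision surface
print(y, r)                             # 5.0 1.0   (since ||w|| = 5)

x_perp = x - r * w / np.linalg.norm(w)  # orthogonal projection onto the surface
print(w @ x_perp + w0)                  # ~0.0, so x_perp lies on y(x) = 0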
More compact notation: add an extra dimension to the input space and set its value to x_0 = 1. Also define w̃ = (w_0, w) and x̃ = (1, x). Then
y(x) = w̃^T x̃
The decision surface is now a D-dimensional hyperplane in the (D + 1)-dimensional expanded input space.
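A tiny sketch of this augmented notation. NumPy; the bias, weights, and point are illustrative values:

import numpy as np

w0, w = -5.0, np.array([3.0, 4.0])     # made-up bias and weight vector
x = np.array([2.0, 1.0])

w_tilde = np.concatenate(([w0], w))    # w~ = (w0, w)
x_tilde = np.concatenate(([1.0], x))   # x~ = (1, x)

print(w @ x + w0, w_tilde @ x_tilde)   # both give y(x) = 5.0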
Multi-Class
Number of classes K > 2. Can we combine a number of two-class discriminant functions using K − 1 one-versus-the-rest classifiers?
[Figure: one-versus-the-rest construction with discriminants separating C_1 from "not C_1" and C_2 from "not C_2", giving regions R_1, R_2, R_3; the region marked "?" is ambiguously classified.]
Number of classes K > 2. Can we combine a number of two-class discriminant functions using K(K − 1)/2 one-versus-one classifiers?
[Figure: one-versus-one construction with pairwise discriminants between classes C_1, C_2, C_3 and regions R_1, R_2, R_3; the central region marked "?" is ambiguously classified.]
Number of classes K > 2. Solution: use K linear functions
y_k(x) = w_k^T x + w_{k0},   k = 1, ..., K
Assign input x to class C_k if y_k(x) > y_j(x) for all j ≠ k. The decision boundary between class C_k and class C_j is given by y_k(x) = y_j(x). (A small code sketch follows the figure below.)
[Figure: decision regions R_i, R_j, R_k of the multi-class linear discriminant; the line segment between two points x_A and x_B in R_k lies entirely in R_k, so the regions are convex.]
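A minimal sketch of this K-discriminant rule. NumPy; the random weights are placeholders for illustration, not trained values:

import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 2
W = rng.normal(size=(D, K))     # one weight vector w_k per class (columns)
w0 = rng.normal(size=K)         # one bias w_k0 per class

def classify(x):
    """Assign x to the class C_k with the largest y_k(x) = w_k^T x + w_k0."""
    y = W.T @ x + w0
    return int(np.argmax(y))

print(classify(np.array([0.5, -1.0])))   # index of the winning class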
Least Squares for Classification
Regression with a linear function of the model parameters and minimisation of a sum-of-squares error function resulted in a closed-form solution for the parameter values. Is this also possible for classification?
Given input data x belonging to one of K classes C_k.
Use the 1-of-K binary coding scheme.
Each class is described by its own linear model
y_k(x) = w_k^T x + w_{k0},   k = 1, ..., K
With the conventions
w̃_k = (w_{k0}, w_k^T)^T ∈ ℝ^{D+1}
x̃ = (1, x^T)^T ∈ ℝ^{D+1}
W̃ = (w̃_1, ..., w̃_K) ∈ ℝ^{(D+1)×K}
we get for the (vector-valued) discriminant function
y(x) = W̃^T x̃ ∈ ℝ^K.
For a new input x, the class is then given by the index of the largest component of y(x).
Determine W̃
Given a training set {x_n, t_n} where n = 1, ..., N, and t_n is the class in the 1-of-K coding scheme. Define a matrix T whose n-th row is t_n^T, and a matrix X̃ whose n-th row is x̃_n^T. The sum-of-squares error can now be written as
E_D(W̃) = (1/2) tr{ (X̃W̃ − T)^T (X̃W̃ − T) }
The minimum of E_D(W̃) is reached for
W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃^† T
where X̃^† is the pseudo-inverse of X̃.
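A minimal NumPy sketch of this closed-form solution on made-up data; the data generation and variable names are illustrative assumptions, not part of the slides:

import numpy as np

rng = np.random.default_rng(1)

# Made-up data: N points in D = 2 dimensions, K = 3 classes
N, D, K = 90, 2, 3
labels = rng.integers(0, K, size=N)
X = rng.normal(size=(N, D)) + 3.0 * np.eye(K)[labels, :D]   # shift classes apart

X_tilde = np.hstack([np.ones((N, 1)), X])   # prepend x0 = 1 to every input
T = np.eye(K)[labels]                        # 1-of-K targets, row n is t_n^T

W_tilde = np.linalg.pinv(X_tilde) @ T        # W~ = X~^dagger T

def predict(x):
    """y(x) = W~^T x~; the predicted class is the index of the largest component."""
    y = W_tilde.T @ np.concatenate(([1.0], x))
    return int(np.argmax(y)), y

k, y = predict(X[0])
print(k, labels[0])      # predicted vs. true class of the first training point
print(y.sum())           # components sum to ~1 (but are not probabilities; see below)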
Discriminant Function for Multi-Class
The discriminant function y(x) is therefore
y(x) = W̃^T x̃ = T^T (X̃^†)^T x̃,
where X̃ is given by the training data, and x is the new input.
Interesting property: if for every t_n the same linear constraint a^T t_n + b = 0 holds, then the prediction y(x) will also obey the same constraint,
a^T y(x) + b = 0.
For the 1-of-K coding scheme, the sum of all components of t_n is one, and therefore all components of y(x) will sum to one. BUT: the components are not probabilities, as they are not constrained to the interval (0, 1).
Deficiencies of the Least Squares Approach
[Figure: Magenta curve: decision boundary for the least-squares approach. Green curve: decision boundary for the logistic regression model described later.]
[Figure: another example. Magenta curve: decision boundary for the least-squares approach. Green curve: decision boundary for the logistic regression model described later.]
Fisher’s Linear Discriminant
View linear classification as dimensionality reduction:
y(x) = w^T x
If y ≥ −w_0 then class C_1, otherwise C_2.
But there are many projections from a D-dimensional input space onto one dimension.
Projection always means loss of information. For classification we want to preserve the class separation in one dimension. Can we find a projection which maximally preserves the class separation?
Samples from two classes in a two-dimensional input space and their histograms when projected onto two different one-dimensional spaces.
Fisher's Linear Discriminant - First Try
Given N_1 input data of class C_1 and N_2 input data of class C_2, calculate the centres of the two classes
m_1 = (1/N_1) Σ_{n ∈ C_1} x_n,   m_2 = (1/N_2) Σ_{n ∈ C_2} x_n
Choose w so as to maximise the projection of the class means onto w,
m_1 − m_2 = w^T (m_1 − m_2)
(the left-hand side denotes the difference of the projected class means m_k = w^T m_k).
Problem with non-uniform covariance: when the class covariances are strongly non-diagonal, maximising the separation of the projected class means alone can still leave considerable class overlap in the projection.
Measure also the within-class variance for each class,
s_k^2 = Σ_{n ∈ C_k} (y_n − m_k)^2
where y_n = w^T x_n. Maximise the Fisher criterion
J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2)
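A small sketch evaluating the Fisher criterion for a candidate direction w. NumPy; the two Gaussian classes and the candidate directions are made up for illustration:

import numpy as np

def fisher_criterion(w, X1, X2):
    """J(w) = (m2 - m1)^2 / (s1^2 + s2^2) for the projection y_n = w^T x_n."""
    y1, y2 = X1 @ w, X2 @ w                                   # projected samples
    m1, m2 = y1.mean(), y2.mean()                             # projected class means
    s1, s2 = ((y1 - m1) ** 2).sum(), ((y2 - m2) ** 2).sum()   # within-class variances
    return (m2 - m1) ** 2 / (s1 + s2)

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(100, 2))   # made-up class C_1
X2 = rng.normal([2.0, 1.0], [1.0, 0.3], size=(100, 2))   # made-up class C_2

# Compare two candidate projection directions
print(fisher_criterion(np.array([1.0, 0.5]), X1, X2))
print(fisher_criterion(np.array([0.0, 1.0]), X1, X2))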
The Fisher criterion can be rewritten as
J(w) = (w^T S_B w) / (w^T S_W w)
S_B is the between-class covariance
S_B = (m_2 − m_1)(m_2 − m_1)^T
S_W is the within-class covariance
S_W = Σ_{n ∈ C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n ∈ C_2} (x_n − m_2)(x_n − m_2)^T
The Fisher criterion
J(w) = (w^T S_B w) / (w^T S_W w)
has a maximum for Fisher's linear discriminant
w ∝ S_W^{-1} (m_2 − m_1)
Fisher's linear discriminant is NOT itself a discriminant, but can be used to construct one by choosing a threshold y_0 in the projection space.
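A minimal sketch of the two-class Fisher direction and a thresholded discriminant built from it. NumPy; the data and the midpoint choice of threshold y_0 are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(100, 2))   # made-up class C_1
X2 = rng.normal([2.0, 1.0], [1.0, 0.3], size=(100, 2))   # made-up class C_2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                # class means

# Within-class covariance S_W (unnormalised scatter, as on the slides)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher's linear discriminant: w proportional to S_W^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)

# Build an actual discriminant by thresholding the projection, e.g. at the
# midpoint of the projected class means (one simple choice of y_0)
y0 = 0.5 * (w @ m1 + w @ m2)
in_class_2 = lambda x: (w @ x) > y0
print(in_class_2(np.array([2.0, 1.0])), in_class_2(np.array([0.0, 0.0])))  # True False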
Fisher's Discriminant For Multi-Class
Assume that the dimensionality of the input space D is greater than the number of classes K. Use D' > 1 linear 'features' y_k = w_k^T x and write everything in vector form (no bias involved!): y = W^T x.
The within-class covariance is then the sum of the covariances for all K classes
S_W = Σ_{k=1}^{K} S_k
where
S_k = Σ_{n ∈ C_k} (x_n − m_k)(x_n − m_k)^T,   m_k = (1/N_k) Σ_{n ∈ C_k} x_n
Between-class covariance
S_B = Σ_{k=1}^{K} N_k (m_k − m)(m_k − m)^T,
where m is the total mean of the input data,
m = (1/N) Σ_{n=1}^{N} x_n.
One possible way to define a function of W which is large when the between-class covariance is large and the within-class covariance is small is given by
J(W) = tr{ (W^T S_W W)^{-1} (W^T S_B W) }
The maximum of J(W) is determined by the D' eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
How many linear 'features' can one find with this method? S_B has rank at most K − 1, because it is a sum of K rank-one matrices and there is a global constraint via the total mean m.
The projection onto the subspace spanned by S_B can therefore not have more than K − 1 linear features.
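A sketch of the multi-class construction. NumPy; the function name, the made-up data, and the use of numpy.linalg.eig on the generally non-symmetric matrix S_W^{-1} S_B are illustrative assumptions:

import numpy as np

def fisher_features(X, labels, num_features):
    """Project inputs onto the leading eigenvectors of S_W^{-1} S_B (at most K-1)."""
    D = X.shape[1]
    m = X.mean(axis=0)                                  # total mean
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)                  # within-class scatter
        S_B += len(Xk) * np.outer(mk - m, mk - m)       # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-eigvals.real)                   # largest eigenvalues first
    W = eigvecs[:, order[:num_features]].real
    return X @ W                                        # y = W^T x, row by row

rng = np.random.default_rng(3)
labels = np.repeat(np.arange(3), 50)                    # K = 3 made-up classes
X = rng.normal(size=(150, 4)) + 2.0 * labels[:, None]   # D = 4, classes shifted apart
Y = fisher_features(X, labels, num_features=2)          # at most K - 1 = 2 features
print(Y.shape)                                          # (150, 2)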
The Perceptron Algorithm
Frank Rosenblatt (1928 - 1971), "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms" (Spartan Books, 1962).
The perceptron ("MARK 1") was the first computer which could learn new skills by trial and error.
Two-class model. Create a feature vector φ(x) by a fixed nonlinear transformation of the input x. Generalised linear model
y(x) = f(w^T φ(x))
with φ(x) containing some bias element φ_0(x) = 1.
Nonlinear activation function:
f(a) = +1 if a ≥ 0,  −1 if a < 0
Target coding for the perceptron:
t = +1 if C_1,  −1 if C_2
The Perceptron Algorithm - Error Function
Idea: Minimise the total number of misclassified patterns.
Problem: As a function of w, this is piecewise constant and therefore the gradient is zero almost everywhere.
Better idea: Using the (−1, +1) target coding scheme, we want all patterns to satisfy w^T φ(x_n) t_n > 0.
Perceptron criterion: Add the errors for all patterns belonging to the set of misclassified patterns M,
E_P(w) = − Σ_{n ∈ M} w^T φ(x_n) t_n
Perceptron - Stochastic Gradient Descent
Perceptron criterion (with notation φ_n = φ(x_n)):
E_P(w) = − Σ_{n ∈ M} w^T φ_n t_n
One iteration at step τ:
1. Choose a training pair (x_n, t_n).
2. Update the weight vector w by
w^{(τ+1)} = w^{(τ)} − η ∇E_P(w) = w^{(τ)} + η φ_n t_n
As y(x, w) does not depend on the norm of w, one can set η = 1:
w^{(τ+1)} = w^{(τ)} + φ_n t_n
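A minimal sketch of this learning loop with η = 1 on a linearly separable toy problem. NumPy; the data, feature map, and number of passes are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(4)

# Linearly separable toy data with targets t in {-1, +1}
X = np.vstack([rng.normal([-2.0, -2.0], 0.5, size=(50, 2)),
               rng.normal([+2.0, +2.0], 0.5, size=(50, 2))])
t = np.concatenate([-np.ones(50), np.ones(50)])

def phi(x):
    """Feature vector with bias element phi_0(x) = 1 (identity features otherwise)."""
    return np.concatenate(([1.0], x))

w = np.zeros(3)
for _ in range(100):                          # passes over the data
    for xn, tn in zip(X, t):
        if (w @ phi(xn)) * tn <= 0:           # pattern misclassified
            w = w + phi(xn) * tn              # w <- w + phi_n t_n  (eta = 1)

errors = sum((w @ phi(xn)) * tn <= 0 for xn, tn in zip(X, t))
print(w, errors)                              # errors should be 0 for separable data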
The Perceptron Algorithm - Update 1
Update of the perceptron weights from a misclassified pattern (shown in green):
w^{(τ+1)} = w^{(τ)} + φ_n t_n
[Figure: two-dimensional feature space before and after the update; the decision boundary moves so that the misclassified pattern is classified correctly.]
The Perceptron Algorithm - Update 2
Update of the perceptron weights from another misclassified pattern (shown in green):
w^{(τ+1)} = w^{(τ)} + φ_n t_n
[Figure: two-dimensional feature space before and after the second update.]