Introduction to Statistical Machine Learning
© 2013 Christfried Webers
Statistical Machine Learning Group, NICTA, and College of Engineering and Computer Science, The Australian National University
Canberra, February – June 2013
Overview: Introduction, Linear Algebra, Probability, Linear Regression 1, Linear Regression 2, Linear Classification 1, Linear Classification 2, Neural Networks 1, Neural Networks 2, Kernel Methods, Sparse Kernel Methods, Graphical Models 1, Graphical Models 2, Graphical Models 3, Mixture Models and EM 1, Mixture Models and EM 2, Approximate Inference, Sampling, Principal Component Analysis, Sequential Data 1, Sequential Data 2, Combining Models, Selected Topics, Discussion and Summary
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part VII Classification
Linear Classification 1
Generalised Linear Model, Inference and Decision, Discriminant Functions, Fisher's Linear Discriminant, The Perceptron Algorithm
Classification
Goal: Given input data x, assign it to one of K discrete classes C_k, where k = 1, ..., K. This divides the input space into different regions.
Figure: Length of the petal [in cm] for a given sepal [cm] for iris flowers (Iris Setosa, Iris Versicolor, Iris Virginica).
How to represent binary class labels?
Class labels are no longer real values as in regression, but form a discrete set.
Two classes: t ∈ {0, 1} (t = 1 represents class C_1 and t = 0 represents class C_2).
We can interpret the value of t as the probability of class C_1, with only two values possible for the probability, 0 or 1.
Note: Other conventions to map classes into integers are possible; check the setup.
How to represent multi-class labels?
If there are more than two classes (K > 2), we call it a multi-class setup. Often used: the 1-of-K coding scheme, in which t is a vector of length K whose components are all 0 except for t_j = 1, where j is the index of the class C_j to be encoded.
Example: Given 5 classes {C_1, ..., C_5}, membership in class C_2 is encoded as the target vector t = (0, 1, 0, 0, 0)^T.
Note: Other conventions to map multi-class labels into integers are possible; check the setup.
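As a small illustration of the 1-of-K scheme, a target vector can be built directly. A minimal NumPy sketch; the helper name one_hot and the 0-based class index are assumptions for illustration only:

import numpy as np

def one_hot(class_index, num_classes):
    """Encode class membership as a 1-of-K target vector (0-based class index)."""
    t = np.zeros(num_classes)
    t[class_index] = 1.0
    return t

# Membership in class C_2 out of 5 classes (0-based index 1):
print(one_hot(1, 5))   # [0. 1. 0. 0. 0.]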
Linear Model
Idea: use again a linear model as in regression, where y(x, w) is a linear function of the parameters w:
y(x_n, w) = w^T φ(x_n)
But generally y(x_n, w) ∈ ℝ. Example: which class does y(x, w) = 0.71623 correspond to?
Generalised Linear Model
Apply a mapping f : ℝ → ℤ to the linear model to get the discrete class labels:
y(x_n, w) = f(w^T φ(x_n))
Activation function: f(·)
Link function: f^{-1}(·)
Figure: Example of an activation function, f(z) = sign(z).
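A minimal sketch of such a generalised linear model with f(z) = sign(z). NumPy; the identity-plus-bias basis function phi and the weight values are made-up illustrations, not part of the slides:

import numpy as np

def phi(x):
    """Illustrative basis function: identity features plus a bias element."""
    return np.concatenate(([1.0], x))

def glm_predict(w, x):
    """Generalised linear model y(x) = f(w^T phi(x)) with f(z) = sign(z)."""
    a = w @ phi(x)
    return 1 if a >= 0 else -1

w = np.array([-0.5, 1.0, 2.0])                 # made-up weights (bias first)
print(glm_predict(w, np.array([0.3, 0.2])))    # +1, since -0.5 + 0.3 + 0.4 >= 0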
Three Models for Decision Problems
In increasing order of complexity:

Discriminant Functions
1. Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1. Solve the inference problem of determining the posterior class probabilities p(C_k | x).
2. Use decision theory to assign each new x to one of the classes.

Generative Models
1. Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
2. Also, infer the prior class probabilities p(C_k).
3. Use Bayes' theorem to find the posterior p(C_k | x).
4. Alternatively, model the joint distribution p(x, C_k) directly.
5. Use decision theory to assign each new x to one of the classes.
Two Classes
Definition
A discriminant is a function that maps from an input vector x to one of K classes, denoted by C_k. Consider first two classes (K = 2). Construct a linear function of the inputs x,
y(x) = w^T x + w_0,
such that x is assigned to class C_1 if y(x) ≥ 0, and to class C_2 otherwise.
weight vector w
bias w_0 (sometimes −w_0 is called the threshold)
The decision boundary y(x) = 0 is a (D − 1)-dimensional hyperplane in a D-dimensional input space (decision surface).
w is orthogonal to any vector lying in the decision surface.
Proof: Assume x_A and x_B are two points lying in the decision surface. Then
0 = y(x_A) − y(x_B) = w^T (x_A − x_B)
The normal distance from the origin to the decision surface is
(w^T x) / ‖w‖ = − w_0 / ‖w‖
[Figure: geometry of the linear discriminant in a two-dimensional input space: decision boundary y = 0 separating regions R_1 (y > 0) and R_2 (y < 0), weight vector w, a point x with its orthogonal projection x_⊥ onto the boundary, signed distance y(x)/‖w‖, and offset −w_0/‖w‖ from the origin.]
The value of y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface, r = y(x)/‖w‖.
Decompose x as
x = x_⊥ + r (w / ‖w‖)
Then, using w^T x_⊥ + w_0 = y(x_⊥) = 0,
y(x) = w^T x + w_0 = w^T x_⊥ + w_0 + r (w^T w)/‖w‖ = r ‖w‖
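A quick numerical check of the signed-distance formula. NumPy; the weight vector, bias, and test point are made up for illustration:

import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0     # made-up weight vector and bias
x = np.array([2.0, 1.0])

y = w @ x + w0                          # y(x) = w^T x + w0
r = y / np.linalg.norm(w)               # signed distance from the decision surface
print(y, r)                             # 5.0 1.0   (since ||w|| = 5)

x_perp = x - r * w / np.linalg.norm(w)  # orthogonal projection onto the surface
print(w @ x_perp + w0)                  # ~0.0, so x_perp lies on y(x) = 0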
More compact notation: add an extra dimension to the input space and set its value to x_0 = 1. Also define w̃ = (w_0, w) and x̃ = (1, x). Then
y(x) = w̃^T x̃
The decision surface is now a D-dimensional hyperplane in the (D + 1)-dimensional expanded input space.
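A tiny sketch of this augmented notation. NumPy; the bias, weights, and point are illustrative values:

import numpy as np

w0, w = -5.0, np.array([3.0, 4.0])     # made-up bias and weight vector
x = np.array([2.0, 1.0])

w_tilde = np.concatenate(([w0], w))    # w~ = (w0, w)
x_tilde = np.concatenate(([1.0], x))   # x~ = (1, x)

print(w @ x + w0, w_tilde @ x_tilde)   # both give y(x) = 5.0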
Multi-Class
Number of classes K > 2. Can we combine a number of two-class discriminant functions using K − 1 one-versus-the-rest classifiers?
[Figure: one-versus-the-rest construction with discriminants separating C_1 from "not C_1" and C_2 from "not C_2", giving regions R_1, R_2, R_3; the region marked "?" is ambiguously classified.]
Number of classes K > 2. Can we combine a number of two-class discriminant functions using K(K − 1)/2 one-versus-one classifiers?
[Figure: one-versus-one construction with pairwise discriminants between classes C_1, C_2, C_3 and regions R_1, R_2, R_3; the central region marked "?" is ambiguously classified.]
Number of classes K > 2. Solution: use K linear functions
y_k(x) = w_k^T x + w_{k0},   k = 1, ..., K
Assign input x to class C_k if y_k(x) > y_j(x) for all j ≠ k. The decision boundary between class C_k and class C_j is given by y_k(x) = y_j(x). (A small code sketch follows the figure below.)
[Figure: decision regions R_i, R_j, R_k of the multi-class linear discriminant; the line segment between two points x_A and x_B in R_k lies entirely in R_k, so the regions are convex.]
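A minimal sketch of this K-discriminant rule. NumPy; the random weights are placeholders for illustration, not trained values:

import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 2
W = rng.normal(size=(D, K))     # one weight vector w_k per class (columns)
w0 = rng.normal(size=K)         # one bias w_k0 per class

def classify(x):
    """Assign x to the class C_k with the largest y_k(x) = w_k^T x + w_k0."""
    y = W.T @ x + w0
    return int(np.argmax(y))

print(classify(np.array([0.5, -1.0])))   # index of the winning class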
Least Squares for Classification
Regression with a linear function of the model parameters and minimisation of a sum-of-squares error function resulted in a closed-form solution for the parameter values. Is this also possible for classification?
Given input data x belonging to one of K classes C_k.
Use the 1-of-K binary coding scheme.
Each class is described by its own linear model
y_k(x) = w_k^T x + w_{k0},   k = 1, ..., K
With the conventions
w̃_k = (w_{k0}, w_k^T)^T ∈ ℝ^{D+1}
x̃ = (1, x^T)^T ∈ ℝ^{D+1}
W̃ = (w̃_1, ..., w̃_K) ∈ ℝ^{(D+1)×K}
we get for the (vector-valued) discriminant function
y(x) = W̃^T x̃ ∈ ℝ^K.
For a new input x, the class is then given by the index of the largest component of y(x).
Determine W̃
Given a training set {x_n, t_n} where n = 1, ..., N, and t_n is the class in the 1-of-K coding scheme. Define a matrix T whose n-th row is t_n^T, and a matrix X̃ whose n-th row is x̃_n^T. The sum-of-squares error can now be written as
E_D(W̃) = (1/2) tr{ (X̃W̃ − T)^T (X̃W̃ − T) }
The minimum of E_D(W̃) is reached for
W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃^† T
where X̃^† is the pseudo-inverse of X̃.
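A minimal NumPy sketch of this closed-form solution on made-up data; the data generation and variable names are illustrative assumptions, not part of the slides:

import numpy as np

rng = np.random.default_rng(1)

# Made-up data: N points in D = 2 dimensions, K = 3 classes
N, D, K = 90, 2, 3
labels = rng.integers(0, K, size=N)
X = rng.normal(size=(N, D)) + 3.0 * np.eye(K)[labels, :D]   # shift classes apart

X_tilde = np.hstack([np.ones((N, 1)), X])   # prepend x0 = 1 to every input
T = np.eye(K)[labels]                        # 1-of-K targets, row n is t_n^T

W_tilde = np.linalg.pinv(X_tilde) @ T        # W~ = X~^dagger T

def predict(x):
    """y(x) = W~^T x~; the predicted class is the index of the largest component."""
    y = W_tilde.T @ np.concatenate(([1.0], x))
    return int(np.argmax(y)), y

k, y = predict(X[0])
print(k, labels[0])      # predicted vs. true class of the first training point
print(y.sum())           # components sum to ~1 (but are not probabilities; see below)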
Discriminant Function for Multi-Class
The discriminant function y(x) is therefore
y(x) = W̃^T x̃ = T^T (X̃^†)^T x̃,
where X̃ is given by the training data, and x is the new input.
Interesting property: if for every t_n the same linear constraint a^T t_n + b = 0 holds, then the prediction y(x) will also obey the same constraint,
a^T y(x) + b = 0.
For the 1-of-K coding scheme, the sum of all components of t_n is one, and therefore all components of y(x) will sum to one. BUT: the components are not probabilities, as they are not constrained to the interval (0, 1).
Deficiencies of the Least Squares Approach
[Figure: Magenta curve: decision boundary for the least-squares approach. Green curve: decision boundary for the logistic regression model described later.]
[Figure: another example. Magenta curve: decision boundary for the least-squares approach. Green curve: decision boundary for the logistic regression model described later.]
Fisher’s Linear Discriminant
View linear classification as dimensionality reduction:
y(x) = w^T x
If y ≥ −w_0 then class C_1, otherwise C_2.
But there are many projections from a D-dimensional input space onto one dimension.
Projection always means loss of information. For classification we want to preserve the class separation in one dimension. Can we find a projection which maximally preserves the class separation?
Samples from two classes in a two-dimensional input space and their histograms when projected onto two different one-dimensional spaces.
Fisher's Linear Discriminant - First Try
Given N_1 input data of class C_1 and N_2 input data of class C_2, calculate the centres of the two classes
m_1 = (1/N_1) Σ_{n ∈ C_1} x_n,   m_2 = (1/N_2) Σ_{n ∈ C_2} x_n
Choose w so as to maximise the projection of the class means onto w,
m_1 − m_2 = w^T (m_1 − m_2)
(the left-hand side denotes the difference of the projected class means m_k = w^T m_k).
Problem with non-uniform covariance: when the class covariances are strongly non-diagonal, maximising the separation of the projected class means alone can still leave considerable class overlap in the projection.
Measure also the within-class variance for each class,
s_k^2 = Σ_{n ∈ C_k} (y_n − m_k)^2
where y_n = w^T x_n. Maximise the Fisher criterion
J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2)
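A small sketch evaluating the Fisher criterion for a candidate direction w. NumPy; the two Gaussian classes and the candidate directions are made up for illustration:

import numpy as np

def fisher_criterion(w, X1, X2):
    """J(w) = (m2 - m1)^2 / (s1^2 + s2^2) for the projection y_n = w^T x_n."""
    y1, y2 = X1 @ w, X2 @ w                                   # projected samples
    m1, m2 = y1.mean(), y2.mean()                             # projected class means
    s1, s2 = ((y1 - m1) ** 2).sum(), ((y2 - m2) ** 2).sum()   # within-class variances
    return (m2 - m1) ** 2 / (s1 + s2)

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(100, 2))   # made-up class C_1
X2 = rng.normal([2.0, 1.0], [1.0, 0.3], size=(100, 2))   # made-up class C_2

# Compare two candidate projection directions
print(fisher_criterion(np.array([1.0, 0.5]), X1, X2))
print(fisher_criterion(np.array([0.0, 1.0]), X1, X2))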
The Fisher criterion can be rewritten as
J(w) = (w^T S_B w) / (w^T S_W w)
S_B is the between-class covariance
S_B = (m_2 − m_1)(m_2 − m_1)^T
S_W is the within-class covariance
S_W = Σ_{n ∈ C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n ∈ C_2} (x_n − m_2)(x_n − m_2)^T
The Fisher criterion
J(w) = (w^T S_B w) / (w^T S_W w)
has a maximum for Fisher's linear discriminant
w ∝ S_W^{-1} (m_2 − m_1)
Fisher's linear discriminant is NOT itself a discriminant, but can be used to construct one by choosing a threshold y_0 in the projection space.
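A minimal sketch of the two-class Fisher direction and a thresholded discriminant built from it. NumPy; the data and the midpoint choice of threshold y_0 are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(100, 2))   # made-up class C_1
X2 = rng.normal([2.0, 1.0], [1.0, 0.3], size=(100, 2))   # made-up class C_2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                # class means

# Within-class covariance S_W (unnormalised scatter, as on the slides)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher's linear discriminant: w proportional to S_W^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)

# Build an actual discriminant by thresholding the projection, e.g. at the
# midpoint of the projected class means (one simple choice of y_0)
y0 = 0.5 * (w @ m1 + w @ m2)
in_class_2 = lambda x: (w @ x) > y0
print(in_class_2(np.array([2.0, 1.0])), in_class_2(np.array([0.0, 0.0])))  # True False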
Fisher's Discriminant For Multi-Class
Assume that the dimensionality of the input space D is greater than the number of classes K. Use D' > 1 linear 'features' y_k = w_k^T x and write everything in vector form (no bias involved!): y = W^T x.
The within-class covariance is then the sum of the covariances for all K classes
S_W = Σ_{k=1}^{K} S_k
where
S_k = Σ_{n ∈ C_k} (x_n − m_k)(x_n − m_k)^T,   m_k = (1/N_k) Σ_{n ∈ C_k} x_n
Between-class covariance
S_B = Σ_{k=1}^{K} N_k (m_k − m)(m_k − m)^T,
where m is the total mean of the input data,
m = (1/N) Σ_{n=1}^{N} x_n.
One possible way to define a function of W which is large when the between-class covariance is large and the within-class covariance is small is given by
J(W) = tr{ (W^T S_W W)^{-1} (W^T S_B W) }
The maximum of J(W) is determined by the D' eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
How many linear 'features' can one find with this method? S_B has rank at most K − 1, because it is a sum of K rank-one matrices and there is a global constraint via the total mean m.
The projection onto the subspace spanned by S_B can therefore not have more than K − 1 linear features.
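A sketch of the multi-class construction. NumPy; the function name, the made-up data, and the use of numpy.linalg.eig on the generally non-symmetric matrix S_W^{-1} S_B are illustrative assumptions:

import numpy as np

def fisher_features(X, labels, num_features):
    """Project inputs onto the leading eigenvectors of S_W^{-1} S_B (at most K-1)."""
    D = X.shape[1]
    m = X.mean(axis=0)                                  # total mean
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)                  # within-class scatter
        S_B += len(Xk) * np.outer(mk - m, mk - m)       # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-eigvals.real)                   # largest eigenvalues first
    W = eigvecs[:, order[:num_features]].real
    return X @ W                                        # y = W^T x, row by row

rng = np.random.default_rng(3)
labels = np.repeat(np.arange(3), 50)                    # K = 3 made-up classes
X = rng.normal(size=(150, 4)) + 2.0 * labels[:, None]   # D = 4, classes shifted apart
Y = fisher_features(X, labels, num_features=2)          # at most K - 1 = 2 features
print(Y.shape)                                          # (150, 2)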
The Perceptron Algorithm
Frank Rosenblatt (1928 - 1971), "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms" (Spartan Books, 1962).
The perceptron ("MARK 1") was the first computer which could learn new skills by trial and error.
Two-class model. Create a feature vector φ(x) by a fixed nonlinear transformation of the input x. Generalised linear model
y(x) = f(w^T φ(x))
with φ(x) containing some bias element φ_0(x) = 1.
Nonlinear activation function:
f(a) = +1 if a ≥ 0,  −1 if a < 0
Target coding for the perceptron:
t = +1 if C_1,  −1 if C_2
The Perceptron Algorithm - Error Function
Idea: Minimise the total number of misclassified patterns.
Problem: As a function of w, this is piecewise constant and therefore the gradient is zero almost everywhere.
Better idea: Using the (−1, +1) target coding scheme, we want all patterns to satisfy w^T φ(x_n) t_n > 0.
Perceptron criterion: Add the errors for all patterns belonging to the set of misclassified patterns M,
E_P(w) = − Σ_{n ∈ M} w^T φ(x_n) t_n
Perceptron - Stochastic Gradient Descent
Perceptron criterion (with notation φ_n = φ(x_n)):
E_P(w) = − Σ_{n ∈ M} w^T φ_n t_n
One iteration at step τ:
1. Choose a training pair (x_n, t_n).
2. Update the weight vector w by
w^{(τ+1)} = w^{(τ)} − η ∇E_P(w) = w^{(τ)} + η φ_n t_n
As y(x, w) does not depend on the norm of w, one can set η = 1:
w^{(τ+1)} = w^{(τ)} + φ_n t_n
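A minimal sketch of this learning loop with η = 1 on a linearly separable toy problem. NumPy; the data, feature map, and number of passes are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(4)

# Linearly separable toy data with targets t in {-1, +1}
X = np.vstack([rng.normal([-2.0, -2.0], 0.5, size=(50, 2)),
               rng.normal([+2.0, +2.0], 0.5, size=(50, 2))])
t = np.concatenate([-np.ones(50), np.ones(50)])

def phi(x):
    """Feature vector with bias element phi_0(x) = 1 (identity features otherwise)."""
    return np.concatenate(([1.0], x))

w = np.zeros(3)
for _ in range(100):                          # passes over the data
    for xn, tn in zip(X, t):
        if (w @ phi(xn)) * tn <= 0:           # pattern misclassified
            w = w + phi(xn) * tn              # w <- w + phi_n t_n  (eta = 1)

errors = sum((w @ phi(xn)) * tn <= 0 for xn, tn in zip(X, t))
print(w, errors)                              # errors should be 0 for separable data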
The Perceptron Algorithm - Update 1
Update of the perceptron weights from a misclassified pattern (shown in green):
w^{(τ+1)} = w^{(τ)} + φ_n t_n
[Figure: two-dimensional feature space before and after the update; the decision boundary moves so that the misclassified pattern is classified correctly.]
The Perceptron Algorithm - Update 2
Update of the perceptron weights from another misclassified pattern (shown in green):
w^{(τ+1)} = w^{(τ)} + φ_n t_n
[Figure: two-dimensional feature space before and after the second update.]