Project 1 – Two-Category Classification Using Bayesian Decision Rule
ECE571 – Pattern Recognition
Michael Jugan
[email protected]
Abstract

This project focuses on two-category classification. Training and test data are provided, and each data sample contains two features. The data classes are modeled as both one and two-modal Gaussian distributions. Distribution parameters are calculated using maximum likelihood estimation. Classification rules are derived using the likelihood ratio and discriminant functions. MATLAB and C++ are used to analyze the decision rules' accuracies both graphically and analytically. Using a brute-force technique, the estimated two-modal Gaussian parameters were improved to yield a testing accuracy of 91.8%.
Introduction

Pattern recognition is a multifaceted process involving both feature extraction and pattern classification. This project serves as a practical introduction to pattern classification. In particular, the fundamental classification technique known as Bayesian decision theory is explored. This approach uses statistical methods to assign data samples to distinct classes [1].

The overall goal of this project is to classify supplied data samples as belonging to one of two classes. Each data sample includes decimal values for two features. Two sets of data are supplied: training data and test data. Decision rules are derived to fit the training data. The rules are then tested on the test data, and the classification accuracy of each rule is measured. This goal is accomplished by first assuming that the data can be accurately modeled by one or two-modal Gaussian distributions. Maximum likelihood estimation is used to approximate the mean and covariance parameters for the distributions. Once the distributions are calculated, various decision rules are created, derived from either likelihood ratios or discriminant functions.

This report begins with an outline of the technical details associated with the decision rules. Afterwards, plots are used to illustrate the parameter estimations and decision rules. Lastly, the classification accuracies of the rules are compared, and the results are analyzed.
Technical Details

Maximum Likelihood Parameter Estimation
Before decision rules can be made, the training data must be used to estimate the probability distribution functions of classes ω0 and ω1. As previously noted, both one and two-modal Gaussian distributions are used. Duda [1] defines the one-modal Gaussian conditional density function for class ωi:
$$\rho(\vec{x} \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}}\,\exp\!\left[-\frac{1}{2}(\vec{x}-\vec{\mu}_i)^t\,\Sigma_i^{-1}\,(\vec{x}-\vec{\mu}_i)\right]$$
Similarly, [2] gives the equation for a one-dimensional, two-modal Gaussian distribution. The two-modal distribution is the sum of two one-modal distributions, each scaled by a factor A. The equation can be generalized to handle d features:
$$\rho(\vec{x} \mid \omega_i) = \frac{A_1}{(2\pi)^{d/2}\,|\Sigma_{i1}|^{1/2}}\,\exp\!\left[-\frac{1}{2}(\vec{x}-\vec{\mu}_{i1})^t\,\Sigma_{i1}^{-1}\,(\vec{x}-\vec{\mu}_{i1})\right] + \frac{A_2}{(2\pi)^{d/2}\,|\Sigma_{i2}|^{1/2}}\,\exp\!\left[-\frac{1}{2}(\vec{x}-\vec{\mu}_{i2})^t\,\Sigma_{i2}^{-1}\,(\vec{x}-\vec{\mu}_{i2})\right]$$
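For concreteness, the following is a minimal C++ sketch (not the report's code) of evaluating this two-modal density for d = 2; the means, covariances, and mixing weights are placeholders rather than the report's estimates:

// A minimal sketch of evaluating the two-modal density above for d = 2.
#include <cmath>
#include <cstdio>

const double PI = 3.14159265358979323846;

// One weighted 2-D Gaussian mode with mean (mx, my), covariance
// [sxx sxy; sxy syy], and mixing weight A.
double mode(double x, double y, double A, double mx, double my,
            double sxx, double sxy, double syy) {
    double det = sxx * syy - sxy * sxy;          // |Sigma|
    double dx = x - mx, dy = y - my;
    // Quadratic form (x-mu)^t Sigma^{-1} (x-mu), via the closed-form 2x2 inverse.
    double q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det;
    return A * std::exp(-0.5 * q) / (2.0 * PI * std::sqrt(det));
}

// Two-modal density: weighted sum of two modes (A1 + A2 = 1).
double density2modal(double x, double y) {
    return mode(x, y, 0.5, -0.7, 0.3, 0.03, 0.002, 0.05)    // placeholder mode 1
         + mode(x, y, 0.5,  0.3, 0.4, 0.02, -0.004, 0.02);  // placeholder mode 2
}

int main() {
    std::printf("p(x|w_i) at (0, 0.5) = %f\n", density2modal(0.0, 0.5));
}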
Both Gaussian distributions depend upon a mean vector and a covariance matrix as parameters. The parameters must be estimated such that the density functions accurately model the two classes. The parameter estimation technique used in this project is known as "maximum likelihood estimation". The details of this approach are as follows. A single sample, $\vec{x}$, contains two features, x and y:

$$\vec{x} = \begin{bmatrix} x \\ y \end{bmatrix}$$
For the one-modal Gaussian case, each class is assigned a mean column-vector and a 2×2 covariance matrix:

$$\vec{\mu}_i = \begin{bmatrix} \mu_{ix} \\ \mu_{iy} \end{bmatrix}, \qquad \Sigma_i = \begin{bmatrix} \sigma^2_{xx} & \sigma^2_{xy} \\ \sigma^2_{yx} & \sigma^2_{yy} \end{bmatrix}$$
Duda [1] defines the parameter estimates for class $\omega_i$ as follows, where $n_i$ is the number of training samples in the class:

$$\vec{\mu}_i = \begin{bmatrix} \mu_{ix} \\ \mu_{iy} \end{bmatrix} = \begin{bmatrix} \dfrac{1}{n_i}\sum_{k=1}^{n_i} x_k \\ \dfrac{1}{n_i}\sum_{k=1}^{n_i} y_k \end{bmatrix}$$

$$\Sigma_i = \begin{bmatrix} \sigma^2_{xx} & \sigma^2_{xy} \\ \sigma^2_{yx} & \sigma^2_{yy} \end{bmatrix} = \frac{1}{n_i - 1}\sum_{k=1}^{n_i} (\vec{x}_k - \vec{\mu}_i)(\vec{x}_k - \vec{\mu}_i)^T$$
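As a concrete illustration, the following is a minimal C++ sketch of these estimators for one class (this is not the report's est_params source; the sample values are placeholders for data that would normally be read from tr_0.dat or tr_1.dat):

#include <cstdio>
#include <vector>

struct Sample { double x, y; };

int main() {
    std::vector<Sample> s = { {-0.2, 0.3}, {0.1, 0.4}, {-0.5, 0.2} }; // placeholder data
    const double n = static_cast<double>(s.size());

    // Mean: mu = (1/n) * sum of the samples.
    double mx = 0.0, my = 0.0;
    for (const Sample& p : s) { mx += p.x; my += p.y; }
    mx /= n; my /= n;

    // Covariance: (1/(n-1)) * sum (x_k - mu)(x_k - mu)^T, matching the report.
    double sxx = 0.0, sxy = 0.0, syy = 0.0;
    for (const Sample& p : s) {
        const double dx = p.x - mx, dy = p.y - my;
        sxx += dx * dx; sxy += dx * dy; syy += dy * dy;
    }
    sxx /= n - 1.0; sxy /= n - 1.0; syy /= n - 1.0;

    std::printf("mu = [%f; %f]\nSigma = [%f %f; %f %f]\n", mx, my, sxx, sxy, sxy, syy);
}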
The same equations are used when estimating parameters for the two-modal Gaussian distribution; however, the data for each class is first split into two groups, and the parameters for each group are then found separately.

Decision Rule: Likelihood Ratio
Once the distribution parameters have been estimated, the decision rules can be derived. The first classification rule is determined using the likelihood ratio, the ratio of the classes' conditional density functions:

$$\frac{\rho(\vec{x} \mid \omega_0)}{\rho(\vec{x} \mid \omega_1)}$$

The likelihood ratio is used to construct a least conditional risk rule. As the name implies, this rule classifies data such that the risk associated with the decision is minimized. The least conditional risk rule is defined as:
$$\text{decide } \omega_0 \text{ if } \rho(\vec{x} \mid \omega_0)\,P(\omega_0)(\lambda_{10}-\lambda_{00}) > \rho(\vec{x} \mid \omega_1)\,P(\omega_1)(\lambda_{01}-\lambda_{11}), \quad \text{otherwise decide } \omega_1$$

Here $\lambda_{ab}$ is the loss incurred by deciding $\omega_a$ when the true class is $\omega_b$. This decision rule depends on the prior probabilities of each class. The supplied training data contains an equal number of samples for each class; therefore, it is assumed that both prior probabilities are equal. The rule also requires the losses associated with making correct and incorrect classifications. Because the meaning of the data is unknown, zero-one loss is assumed. Due to these assumptions, the classification rule can be simplified:
$$\text{decide } \omega_0 \text{ if } \frac{\rho(\vec{x} \mid \omega_0)}{\rho(\vec{x} \mid \omega_1)} > 1, \quad \text{otherwise decide } \omega_1$$
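A minimal C++ sketch of this simplified rule for the single-modal case follows; gauss2d is an illustrative helper, and the parameter values are the rounded single-modal estimates that appear later in table I:

#include <cmath>
#include <cstdio>

const double PI = 3.14159265358979323846;

// Plain 2-D Gaussian density with mean (mx, my) and covariance [sxx sxy; sxy syy].
double gauss2d(double x, double y, double mx, double my,
               double sxx, double sxy, double syy) {
    double det = sxx * syy - sxy * sxy;
    double dx = x - mx, dy = y - my;
    double q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det;
    return std::exp(-0.5 * q) / (2.0 * PI * std::sqrt(det));
}

int main() {
    double x = 0.0, y = 0.5;                                   // sample to classify
    double p0 = gauss2d(x, y, -0.22, 0.33, 0.28, 0.01, 0.04);  // p(x|w0)
    double p1 = gauss2d(x, y,  0.08, 0.68, 0.16, -0.02, 0.03); // p(x|w1)
    std::printf("decide w%d\n", (p0 / p1 > 1.0) ? 0 : 1);      // likelihood ratio test
}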
Decision Rules: Discriminant Functions
Classification can also be performed using discriminant functions. With this method, each class is assigned a discriminant function, $g_i(\vec{x})$. A sample, $\vec{x}$, is classified as belonging to the class whose discriminant function evaluates to the largest value:
$$\text{decide } \omega_0 \text{ if } g_0(\vec{x}) > g_1(\vec{x}); \qquad \text{decide } \omega_1 \text{ if } g_1(\vec{x}) > g_0(\vec{x}); \qquad \text{arbitrary otherwise}$$

The discriminant functions can be chosen such that the maximum discriminant corresponds to the least conditional risk [1]:

$$g_i(\vec{x}) = -R(\alpha_i \mid \vec{x})$$

However, if zero-one loss is assumed, the discriminant functions simplify to the a posteriori probabilities [1]:
$$g_i(\vec{x}) = P(\omega_i \mid \vec{x}) = \frac{\rho(\vec{x} \mid \omega_i)\,P(\omega_i)}{\rho(\vec{x})}$$
Only the relative values of the functions are important for classification purposes. The normalization constant $\rho(\vec{x})$ is common to both classes, and the prior probabilities are assumed equal, so both can be ignored. This leaves only the density function:
$$g_i(\vec{x}) = \rho(\vec{x} \mid \omega_i)$$
The natural log of the density function can also be used, because the natural log is a monotonically increasing function [1]. For the single-modal Gaussian distribution, taking the natural log simplifies the expression:

$$g_i(\vec{x}) = \ln\rho(\vec{x} \mid \omega_i) = -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^t\Sigma_i^{-1}(\vec{x}-\vec{\mu}_i) - \frac{d}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_i|$$
There are three special cases in which some terms of the discriminant function can be dropped. The form of the classes' covariance matrices determines which case applies.
Case I: $\Sigma_i = \sigma^2 I$

For case I, all covariance matrices are equal, and each is a scalar multiple of the identity matrix. The simplified discriminant function is linear [1]:

$$g_i(\vec{x}) = \vec{w}_i^{\,t}\vec{x} + \omega_{i0}, \qquad \vec{w}_i = \frac{1}{\sigma^2}\vec{\mu}_i, \qquad \omega_{i0} = -\frac{1}{2\sigma^2}\vec{\mu}_i^{\,t}\vec{\mu}_i + \ln P(\omega_i)$$
Case II: $\Sigma_i = \Sigma$

Similarly to case I, all covariance matrices are equal in case II; however, there is no constraint on their form. Once again, the simplified discriminant function is linear [1]:

$$g_i(\vec{x}) = \vec{w}_i^{\,t}\vec{x} + \omega_{i0}, \qquad \vec{w}_i = \Sigma^{-1}\vec{\mu}_i, \qquad \omega_{i0} = -\frac{1}{2}\vec{\mu}_i^{\,t}\Sigma^{-1}\vec{\mu}_i + \ln P(\omega_i)$$

Case III: $\Sigma_i$ arbitrary
In case III, all covariance matrices can be arbitrarily valued. Unlike the previous two cases, the resulting discriminant functions are quadratic [1]:

$$g_i(\vec{x}) = \vec{x}^{\,t}W_i\vec{x} + \vec{w}_i^{\,t}\vec{x} + \omega_{i0}, \qquad W_i = -\frac{1}{2}\Sigma_i^{-1}, \qquad \vec{w}_i = \Sigma_i^{-1}\vec{\mu}_i, \qquad \omega_{i0} = -\frac{1}{2}\vec{\mu}_i^{\,t}\Sigma_i^{-1}\vec{\mu}_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
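As an illustration, the following is a minimal C++ sketch (not the report's classify program) of the case III discriminant for d = 2. It evaluates the algebraically equivalent form $g_i(\vec{x}) = -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^t\Sigma_i^{-1}(\vec{x}-\vec{\mu}_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$; the parameter values are the rounded table I estimates with equal priors:

#include <cmath>
#include <cstdio>

// Case III discriminant for one class with mean (mx, my), covariance
// [sxx sxy; sxy syy], and prior P. Expanding x^t W x + w^t x + w0 from the
// definitions above gives exactly this quadratic-form expression.
double g(double x, double y, double mx, double my,
         double sxx, double sxy, double syy, double P) {
    double det = sxx * syy - sxy * sxy;
    // Inverse covariance: Sigma^{-1} = (1/det) [syy -sxy; -sxy sxx].
    double ixx = syy / det, ixy = -sxy / det, iyy = sxx / det;
    double dx = x - mx, dy = y - my;
    double quad = ixx * dx * dx + 2.0 * ixy * dx * dy + iyy * dy * dy;
    return -0.5 * quad - 0.5 * std::log(det) + std::log(P);
}

int main() {
    double x = 0.0, y = 0.5; // sample to classify (placeholder)
    double g0 = g(x, y, -0.22, 0.33, 0.28, 0.01, 0.04, 0.5);
    double g1 = g(x, y,  0.08, 0.68, 0.16, -0.02, 0.03, 0.5);
    std::printf("decide w%d\n", (g0 > g1) ? 0 : 1); // largest discriminant wins
}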
Experiments and Results

Basic Preparation
Two preliminary measures were taken to aid in the completion of this project. First, the data files were divided according to the samples' known classes: the training data in synth.tr was split into tr_0.dat and tr_1.dat, and the same was done for the test data in synth.te. Second, a print() method was added to the Matrix class, which aided greatly in debugging.
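A minimal C++ sketch of the file split follows; the layout of synth.tr is assumed here to be one sample per line as "x y label", which may not match the actual file, so treat this purely as an illustration:

#include <fstream>
#include <sstream>
#include <string>

int main() {
    std::ifstream in("synth.tr");
    std::ofstream out0("tr_0.dat"), out1("tr_1.dat");
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        double x, y;
        int label;
        if (fields >> x >> y >> label)  // skips header or malformed lines
            (label == 0 ? out0 : out1) << x << " " << y << "\n";
    }
}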
Parameter Estimations

The first program written, est_params, was coded in C++. It uses the training data to estimate the single-modal mean and covariance matrix of each class. Table I shows the results obtained from this program; the values are rounded.

Table I – maximum likelihood estimated parameters for the single-modal Gaussian case

Parameter                        Class 0                Class 1
μ⃗ = [μx; μy]                     [−.22; .33]            [.08; .68]
Σ = [σ²xx σ²xy; σ²yx σ²yy]       [.28 .01; .01 .04]     [.16 −.02; −.02 .03]
Next, a MATLAB program, plot_est_params.m, was written to illustrate the modeled distributions. The results are shown in figure 1.
Figure 1 – contour plots of the estimated single-modal distributions
A similar program named est_2modal_params was written to estimate parameters for the two-modal Gaussian distributions. This program divides the data in each class into two groups; for each class, a division point on the x feature was chosen so that the two groups contained approximately equal numbers of samples (a short sketch of the split appears after table II). For class ω0:

group 1 if x < −.25, group 2 otherwise

For class ω1:

group 1 if x < −.05, group 2 otherwise

Using these division points, group one contained 62 samples and group two contained 63 samples. Each group's mean and covariance matrix were calculated using the same method as in the single-modal case. Table II shows the rounded results obtained by running est_2modal_params.

Table II – estimated two-modal Gaussian parameters
              Class 0                                        Class 1
Parameter     Group 1                Group 2                 Group 1             Group 2
μ⃗             [−.72; .30]            [.27; .35]              [−.29; .73]         [.44; .64]
Σ             [.03 .002; .002 .05]   [.03 −.004; −.004 .02]  [.02 0; 0 .02]      [.03 .004; .004 .03]
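The group split itself is simple; a minimal C++ sketch follows (the actual est_2modal_params source is not shown, and the sample values here are placeholders):

#include <cstdio>
#include <vector>

struct Sample { double x, y; };

// Samples with x below the class's division point go to group 1.
void split(const std::vector<Sample>& cls, double divide,
           std::vector<Sample>& g1, std::vector<Sample>& g2) {
    for (const Sample& s : cls)
        (s.x < divide ? g1 : g2).push_back(s);
}

int main() {
    std::vector<Sample> class0 = { {-0.5, 0.3}, {0.2, 0.4} }; // placeholder data
    std::vector<Sample> g1, g2;
    split(class0, -0.25, g1, g2); // division point for class w0
    // Each group's mean and covariance would then be estimated exactly
    // as in the single-modal sketch shown earlier.
    std::printf("group 1: %zu samples, group 2: %zu samples\n", g1.size(), g2.size());
}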
Once again, a MATLAB program, plot_est_2modal_params.m, was written to display the estimated distributions. The plots in figure 2 confirm that the estimated parameters provide reasonable distributions for the training data.
Figure 2 – contour plots of the estimated two-modal distributions
It is important to note that the estimated covariance matrices of the two classes are different; therefore, the case III discriminant function can be applied directly. Conversely, the covariance matrices required modification before the other cases could be used. The modifications are shown below.

Original covariance matrices for classes i and j:
$$\Sigma_i = \begin{bmatrix} \sigma^2_a & \sigma^2_b \\ \sigma^2_b & \sigma^2_c \end{bmatrix}, \qquad \Sigma_j = \begin{bmatrix} \sigma^2_d & \sigma^2_e \\ \sigma^2_e & \sigma^2_f \end{bmatrix}$$
New covariance matrix used for case I:
$$\Sigma = \begin{bmatrix} \dfrac{\sigma^2_a + \sigma^2_d}{2} & 0 \\ 0 & \dfrac{\sigma^2_c + \sigma^2_f}{2} \end{bmatrix}$$
New covariance matrix used for case II:
$$\Sigma = \begin{bmatrix} \dfrac{\sigma^2_a + \sigma^2_d}{2} & \dfrac{\sigma^2_b + \sigma^2_e}{2} \\ \dfrac{\sigma^2_b + \sigma^2_e}{2} & \dfrac{\sigma^2_c + \sigma^2_f}{2} \end{bmatrix}$$
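A minimal C++ sketch of this entry-wise averaging, using the rounded table I covariances as inputs (the struct and variable names are illustrative):

#include <cstdio>

struct Cov { double xx, xy, yy; }; // symmetric 2x2 covariance

int main() {
    Cov s0 = {0.28, 0.01, 0.04}, s1 = {0.16, -0.02, 0.03}; // table I values
    // Case I: average the diagonals and zero the off-diagonal.
    Cov caseI  = {(s0.xx + s1.xx) / 2, 0.0,                 (s0.yy + s1.yy) / 2};
    // Case II: average every entry.
    Cov caseII = {(s0.xx + s1.xx) / 2, (s0.xy + s1.xy) / 2, (s0.yy + s1.yy) / 2};
    std::printf("case I:  [%g 0; 0 %g]\n", caseI.xx, caseI.yy);
    std::printf("case II: [%g %g; %g %g]\n", caseII.xx, caseII.xy, caseII.xy, caseII.yy);
}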
There is an additional consequence of the covariance matrices being arbitrary and unique: the classification rule based on the likelihood ratio is equivalent to classification using the case III discriminant functions.

Boundary Plots
Once the parameters were estimated, the density functions and decision rules could be calculated. A MATLAB program named plot_boundaries.m was created to help visualize the classification rules. This program generates ten plots showing the various boundary lines on top of both the training and test data. The one-modal plots are shown in figures 3 and 4. The case III curve is also the curve resulting from the likelihood ratio decision rule. The two-modal classification plots are shown in figure 5. The training data plots suggest that the two-modal distribution is the most accurate and that case I is the least accurate. The two-modal curve looks excellent for the training data; however, it may fit the training data too closely, as the curve appears less accurate on the test data. The boundary lines for cases II and III look very similar and are expected to perform similarly. Although case III is quadratic, it only begins to curve at the ends of the plot, where there are few data points.
Figure 3 – decision boundary lines for the one-modal Gaussian distributions
Figure 4 – decision boundary lines for the one-modal Gaussian distributions
Figure 5 – decision boundary lines for the two-modal Gaussian distributions
Accuracy Tests
The final, and most important, step of this project was testing the accuracies of the classification rules. A program called classify was written in C++ to measure each decision rule's accuracy. The classify program relies heavily on two custom classes: Classifier and DisFunc (short for "discriminant function"). The DisFunc class serves as the base class for four derived classes, one per classification rule. Each derived class implements a constructor and a virtual method, Classify(Matrix & m_x). Designing the program in this fashion provides a flexible and easy-to-use interface.
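A minimal sketch of this design follows. The report's actual Matrix class and constructor signatures are not listed, so plain doubles stand in for Matrix, equal priors are assumed (dropping the ln P(ωi) terms), and the constants follow the case I formulas given earlier; the parameter values are placeholders:

// Base class: each decision rule implements Classify() for one sample.
class DisFunc {
public:
    virtual ~DisFunc() {}
    virtual int Classify(double x, double y) = 0; // returns class 0 or 1
};

// One derived rule (case I). Constants are computed once in the constructor,
// so Classify() only evaluates the sample-dependent terms.
class CaseIDisFunc : public DisFunc {
    double w0x_, w0y_, b0_, w1x_, w1y_, b1_; // linear weights and offsets
public:
    CaseIDisFunc(double m0x, double m0y, double m1x, double m1y, double var) {
        w0x_ = m0x / var; w0y_ = m0y / var;
        b0_  = -(m0x * m0x + m0y * m0y) / (2.0 * var);
        w1x_ = m1x / var; w1y_ = m1y / var;
        b1_  = -(m1x * m1x + m1y * m1y) / (2.0 * var);
    }
    int Classify(double x, double y) override {
        double g0 = w0x_ * x + w0y_ * y + b0_;
        double g1 = w1x_ * x + w1y_ * y + b1_;
        return (g0 > g1) ? 0 : 1;
    }
};

int main() {
    CaseIDisFunc rule(-0.22, 0.33, 0.08, 0.68, 0.16); // placeholder parameters
    return rule.Classify(0.0, 0.5);
}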
The derived classes' constructors initialize members of type double; these variables are the constants specific to each discriminant function. After the constants are initialized, Classify() is called for each sample. Classify() only needs to compute the terms that depend on the sample, m_x, because the constants have already been determined. Although this scheme does not benefit performance for the small data sets tested, it would make a difference when more samples are used.

The estimated two-modal Gaussian parameters were further tweaked to increase accuracy. A procedure was written to incrementally modify and test parameters (a sketch of the idea follows table III). This brute-force technique found new parameters that improved the test data classification. The new parameters decrease the training data classification accuracy; it is suspected that the previously estimated parameters were overfitting the training data. The new parameters are shown in table III.

Table III – improved two-modal Gaussian parameters
Class 0

Parameter     Group 1                               Group 2
A             .3                                    .7
μ⃗             [−.721356; .30116]                    [.270481; .349969]
Σ             [.033915 .002387; .002387 .03994]     [.024324 −.004285; −.004285 .016863]

Class 1

Parameter     Group 1                               Group 2
A             .62                                   .38
μ⃗             [−.291322; .731595]                   [.437401; .635115]
Σ             [.023131 .000396; .000396 .023603]    [.029094 .003895; .003895 .032003]
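The tweaking procedure itself is not listed in the report; the following is a minimal C++ sketch of the coordinate-wise brute-force idea, where the accuracy callable is a stand-in for re-running the classifier on the test data (main() uses a toy objective so the sketch runs):

#include <cmath>
#include <cstddef>
#include <initializer_list>
#include <vector>

// 'Eval' is any callable mapping a parameter vector to classification
// accuracy; the report's real evaluation is not shown here.
template <typename Eval>
void tune(std::vector<double>& p, double step, int passes, Eval accuracy) {
    double best = accuracy(p);
    for (int pass = 0; pass < passes; ++pass)
        for (std::size_t i = 0; i < p.size(); ++i)
            for (double d : {step, -step}) {   // nudge the parameter up, then down
                p[i] += d;
                double a = accuracy(p);
                if (a > best) best = a;        // keep the improvement
                else          p[i] -= d;       // otherwise revert the nudge
            }
}

int main() {
    std::vector<double> params = {0.3, 0.7}; // e.g., the mixing weights A1, A2
    auto toy = [](const std::vector<double>& q) { return -std::fabs(q[0] - 0.25); };
    tune(params, 0.01, 50, toy);             // drives params[0] toward 0.25
}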
Running the classifier program outputs the accuracies for the training and test data. These values are shown in figure 6. The results agree with the speculations made from the boundary line plots. Case I has poor performance for both training and test data. For training data, the two-modal method is clearly better than cases II and III. However, for testing data, cases II and III are comparable to the two-modal method. As expected, cases II and III always perform similarly.
Figure 6 – classification accuracy results
Conclusion

This project provided a practical introduction to pattern classification. Maximum likelihood parameter estimation was used to estimate the parameters of both one and two-modal Gaussian distributions. Classification rules were created using the likelihood ratio and discriminant functions. Due to the assumptions of zero-one loss and equal prior probabilities, the likelihood ratio rule was greatly simplified; the simplified rule was equivalent to the rule resulting from the case III discriminant function. MATLAB was used to plot the decision boundary lines for each rule, providing an easy way to visualize the effectiveness of each classifier. Additionally, a program was written in C++ to measure the actual accuracy of each method. As expected, the two-modal classifier gave the best performance, with 91.8% accuracy. Given more time, it would be interesting to tweak the two-modal classifier to achieve higher accuracy. It will also be interesting to see whether the more powerful classification techniques taught later this semester can push the accuracy higher.
References

[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2000.
[2] H. Qi, "ECE 471/571 – Lecture 4," http://web.eecs.utk.edu/~qi/ece471-571/lecture04_gaussian.pdf.
Appendix

plot_est_params.m

%This file plots the one-modal Gaussian distributions with the estimated parameters.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%load class 0 training data
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
load ../data/tr_0.dat
sample_x = tr_0(:,1);
sample_y = tr_0(:,2);

%plot the samples
subplot(2,1,1);
scatter(sample_x,sample_y)
hold on;

mu = [-0.221470 0.325755];
sigma = [0.276810 0.011229;0.011229 0.036119];
plot2dGauss(mu,sigma)

title('Training data (Class 0): Estimated Gaussian Distribution','FontWeight','bold','FontSize',14);
xlabel('X');
ylabel('Y');
hold off;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%load class 1 training data
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
load ../data/tr_1.dat
sample_x = tr_1(:,1);
sample_y = tr_1(:,2);

%plot the samples
subplot(2,1,2);
scatter(sample_x,sample_y)
hold on;

mu = [0.075954 0.682969];
sigma = [0.159748 -0.015575; -0.015575 0.029958];
plot2dGauss(mu,sigma)

title('Training data (Class 1): Estimated Gaussian Distribution','FontWeight','bold','FontSize',14);
xlabel('X');
ylabel('Y');