Solutions to Selected Problems In: Pattern Classification by Duda, Hart, Stork

John L. Weatherwax∗
∗ [email protected]

February 24, 2008
Problem Solutions

Chapter 2 (Bayesian Decision Theory)

Problem 11 (randomized rules)

Part (a): Let $R(x)$ be the average risk given the measurement $x$. Since this risk depends on the action taken and on the probability of taking that action, it can be expressed as
\[
R(x) = \sum_{i=1}^{m} R(a_i|x) \, P(a_i|x) .
\]
The total risk (over all possible $x$) is then the average of $R(x)$, or
\[
R = \int_{\Omega_x} \left( \sum_{i=1}^{m} R(a_i|x) \, P(a_i|x) \right) p(x) \, dx .
\]
Part (b): One would pick
\[
P(a_i|x) = \begin{cases} 1 & i = \arg\min_j R(a_j|x) \\ 0 & \text{otherwise.} \end{cases}
\]
Thus for every $x$ we pick the action that minimizes the local risk. This amounts to a deterministic rule, i.e. there is no benefit to using randomness.
Part (c): Yes: if a given deterministic decision rule is suboptimal, randomizing some of its decisions might further improve performance, although exactly how this should be done would be difficult to say.
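As a small numerical illustration of part (b), the following MATLAB fragment (a sketch with made-up conditional risks, not part of the original solution) checks that, for a fixed $x$, no randomized rule achieves a smaller average risk than placing all probability on the minimizing action:

R = [0.8 0.3 0.5];                     % hypothetical conditional risks R(a_i|x) at a fixed x
best = min(R);                         % risk of the deterministic arg-min rule
for trial = 1:5
    P = rand(1, 3);  P = P / sum(P);   % a random probability vector over the actions
    fprintf('randomized risk = %.3f >= deterministic risk = %.3f\n', R * P', best);
end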
Problem 13

Let
\[
\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j, \quad i, j = 1, 2, \ldots, c \\ \lambda_r & i = c+1 \\ \lambda_s & \text{otherwise.} \end{cases}
\]
Since the risk is the expected loss, we have
\[
R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j) \, P(\omega_j|x), \quad i = 1, 2, \ldots, c \tag{1}
\]
\[
R(\alpha_{c+1}|x) = \lambda_r . \tag{2}
\]
Now for $i = 1, 2, \ldots, c$ the risk can be simplified as
\[
R(\alpha_i|x) = \lambda_s \sum_{j=1,\, j \neq i}^{c} P(\omega_j|x) = \lambda_s \left( 1 - P(\omega_i|x) \right) .
\]
Our decision rule is to choose the action with the smallest risk. This means that we pick the reject option if and only if
\[
\lambda_r < \lambda_s \left( 1 - P(\omega_i|x) \right) \quad \forall\, i = 1, 2, \ldots, c .
\]
This is equivalent to
\[
P(\omega_i|x) < 1 - \frac{\lambda_r}{\lambda_s} \quad \forall\, i = 1, 2, \ldots, c .
\]
In other words, if all posterior probabilities are below the threshold $1 - \lambda_r/\lambda_s$ we should reject. Otherwise we pick the action $i$ such that
\[
\lambda_s \left( 1 - P(\omega_i|x) \right) < \lambda_s \left( 1 - P(\omega_j|x) \right) \quad \forall\, j = 1, 2, \ldots, c, \; j \neq i .
\]
This is equivalent to
\[
P(\omega_i|x) > P(\omega_j|x) \quad \forall\, j \neq i ,
\]
i.e. we select the class with the largest posterior probability. In summary, we should reject if and only if
\[
P(\omega_i|x) < 1 - \frac{\lambda_r}{\lambda_s} \quad \forall\, i = 1, 2, \ldots, c . \tag{3}
\]
Note that if $\lambda_r = 0$ there is no loss for rejecting (rejecting is as costless as classifying correctly) and we always reject. If $\lambda_r > \lambda_s$, the loss from rejecting is greater than the loss from misclassifying and we should never reject. This can be seen from the above since $\lambda_r/\lambda_s > 1$, so $1 - \lambda_r/\lambda_s < 0$ and equation 3 can never be satisfied.
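A minimal MATLAB sketch of the resulting decision rule, using hypothetical values for the losses and posteriors (none of these numbers come from the problem itself), might look as follows:

lambda_r = 0.15;  lambda_s = 1.0;        % reject loss and substitution (error) loss
post = [0.40 0.35 0.25];                 % posteriors P(omega_i|x) for c = 3 classes
if max(post) < 1 - lambda_r / lambda_s   % equation (3): every posterior is below the threshold
    fprintf('reject\n');
else
    [~, i] = max(post);                  % otherwise pick the class with the largest posterior
    fprintf('choose class %d\n', i);
end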
Problem 31 (the probability of error)

Part (a): We can assume without loss of generality that $\mu_1 < \mu_2$. If this is not the case initially, relabel the two classes so that it is. We begin by recalling the probability of error $P_e$,
\[
P_e = P(\hat{H} = H_1 | H_2) P(H_2) + P(\hat{H} = H_2 | H_1) P(H_1)
    = \int_{-\infty}^{\xi} p(x|H_2) \, dx \; P(H_2) + \int_{\xi}^{\infty} p(x|H_1) \, dx \; P(H_1) .
\]
Here $\xi$ is the decision boundary between the two classes, i.e. we classify $x$ as class $H_1$ when $x < \xi$ and as class $H_2$ when $x > \xi$, and $P(H_i)$ are the priors for the two classes. We begin by finding the Bayes decision boundary $\xi$ under the minimum error criterion. From the discussion in the book, the decision boundary is given by a likelihood ratio test: we decide class $H_2$ if
\[
\frac{p(x|H_2)}{p(x|H_1)} \geq \frac{P(H_1)}{P(H_2)} \left( \frac{C_{21} - C_{11}}{C_{12} - C_{22}} \right) ,
\]
and classify the point as a member of $H_1$ otherwise. The decision boundary $\xi$ is the point at which this inequality becomes an equality. If we use a minimum probability of error criterion the costs are given by $C_{ij} = 1 - \delta_{ij}$, and when $P(H_1) = P(H_2) = \frac{1}{2}$ the boundary condition reduces to
\[
p(\xi|H_1) = p(\xi|H_2) .
\]
If each conditional distribution is Gaussian, $p(x|H_i) = \mathcal{N}(\mu_i, \sigma_i^2)$, then the above expression becomes
\[
\frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left\{ -\frac{1}{2} \frac{(\xi - \mu_1)^2}{\sigma_1^2} \right\}
= \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left\{ -\frac{1}{2} \frac{(\xi - \mu_2)^2}{\sigma_2^2} \right\} .
\]
If we assume (for simplicity) that $\sigma_1 = \sigma_2 = \sigma$ the above simplifies to $(\xi - \mu_1)^2 = (\xi - \mu_2)^2$, or on taking the square root of both sides, $\xi - \mu_1 = \pm(\xi - \mu_2)$. Taking the plus sign gives the contradiction $\mu_1 = \mu_2$. Taking the minus sign and solving for $\xi$, we find our decision threshold
\[
\xi = \frac{1}{2} (\mu_1 + \mu_2) .
\]
Again since we have uniform priors ($P(H_1) = P(H_2) = \frac{1}{2}$), the error probability is given by
\[
2 P_e = \int_{-\infty}^{\xi} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_2)^2}{\sigma^2} \right\} dx
      + \int_{\xi}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_1)^2}{\sigma^2} \right\} dx ,
\]
or
\[
2\sqrt{2\pi}\,\sigma P_e = \int_{-\infty}^{\xi} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_2)^2}{\sigma^2} \right\} dx
                         + \int_{\xi}^{\infty} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_1)^2}{\sigma^2} \right\} dx .
\]
To evaluate the first integral use the substitution $v = \frac{x - \mu_2}{\sqrt{2}\sigma}$ so $dv = \frac{dx}{\sqrt{2}\sigma}$, and in the second use $v = \frac{x - \mu_1}{\sqrt{2}\sigma}$ so $dv = \frac{dx}{\sqrt{2}\sigma}$. This then gives
\[
2\sqrt{2\pi}\,\sigma P_e = \sigma\sqrt{2} \int_{-\infty}^{\frac{\mu_1 - \mu_2}{2\sqrt{2}\sigma}} e^{-v^2} \, dv
                         + \sigma\sqrt{2} \int_{-\frac{\mu_1 - \mu_2}{2\sqrt{2}\sigma}}^{\infty} e^{-v^2} \, dv .
\]
Since $\mu_1 < \mu_2$, the upper limit of the first integral tends to $-\infty$ and the lower limit of the second tends to $+\infty$ as $\frac{|\mu_1 - \mu_2|}{\sigma} \to +\infty$, so both integrals (and hence $P_e$) tend to zero. This can happen if $\mu_1$ and $\mu_2$ get far apart or $\sigma$ shrinks to zero; in each case the classification problem gets easier.
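Writing the remaining integrals in terms of the complementary error function gives $P_e = \frac{1}{2} \mathrm{erfc}\!\left( \frac{\mu_2 - \mu_1}{2\sqrt{2}\sigma} \right)$, which the following MATLAB sketch (with arbitrarily chosen parameter values, not taken from the problem) checks against a Monte Carlo estimate:

mu1 = 0;  mu2 = 2;  sigma = 1;  N = 1e5;           % arbitrary example values
xi = (mu1 + mu2) / 2;                              % Bayes threshold for equal priors
x1 = mu1 + sigma * randn(N, 1);                    % samples from class H1
x2 = mu2 + sigma * randn(N, 1);                    % samples from class H2
Pe_mc = 0.5 * mean(x1 > xi) + 0.5 * mean(x2 < xi); % empirical error rate
Pe_cf = 0.5 * erfc((mu2 - mu1) / (2 * sqrt(2) * sigma));
fprintf('Monte Carlo Pe = %.4f, closed form Pe = %.4f\n', Pe_mc, Pe_cf);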
Problem 32 (probability of error in higher dimensions)

As in Problem 31, the classification decision boundary is now the set of points $x$ that satisfy
\[
\|x - \mu_1\|^2 = \|x - \mu_2\|^2 .
\]
Expanding each side of this expression we find
\[
\|x\|^2 - 2 x^t \mu_1 + \|\mu_1\|^2 = \|x\|^2 - 2 x^t \mu_2 + \|\mu_2\|^2 .
\]
Canceling the common $\|x\|^2$ from each side and grouping the terms linear in $x$, we have
\[
x^t (\mu_1 - \mu_2) = \frac{1}{2} \left( \|\mu_1\|^2 - \|\mu_2\|^2 \right) ,
\]
which is the equation of a hyperplane.
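A short MATLAB check (a sketch with randomly generated vectors, not part of the original solution) confirms that the hyperplane condition above splits the space exactly as comparing squared distances to the two means does:

d = 3;
mu1 = randn(d, 1);  mu2 = randn(d, 1);  x = randn(d, 1);
lhs = x' * (mu1 - mu2);
rhs = 0.5 * (norm(mu1)^2 - norm(mu2)^2);
side_plane = sign(lhs - rhs);                           % +1 when x falls on the mu1 side of the plane
side_dist  = sign(norm(x - mu2)^2 - norm(x - mu1)^2);   % +1 when x is closer to mu1
fprintf('hyperplane side = %+d, nearest-mean side = %+d\n', side_plane, side_dist);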
Problem 34 (error bounds on non-Gaussian data)

Part (a): Since the Bhattacharyya bound is a specialization of the Chernoff bound, we begin by considering the Chernoff bound. Specifically, using the bound $\min(a, b) \leq a^{\beta} b^{1-\beta}$, for $a > 0$, $b > 0$, and $0 \leq \beta \leq 1$, one can show that
\[
P(\text{error}) \leq P(\omega_1)^{\beta} P(\omega_2)^{1-\beta} \, e^{-k(\beta)} \quad \text{for } 0 \leq \beta \leq 1 .
\]
Here the expression $k(\beta)$ is given by
\[
k(\beta) = \frac{\beta(1-\beta)}{2} (\mu_2 - \mu_1)^t \left[ \beta \Sigma_1 + (1-\beta) \Sigma_2 \right]^{-1} (\mu_2 - \mu_1)
         + \frac{1}{2} \ln \frac{\left| \beta \Sigma_1 + (1-\beta) \Sigma_2 \right|}{\left| \Sigma_1 \right|^{\beta} \left| \Sigma_2 \right|^{1-\beta}} .
\]
To compute the Bhattacharyya bound we assign $\beta = 1/2$. For a two-class scalar problem with symmetric means and equal variances, that is $\mu_1 = -\mu$, $\mu_2 = +\mu$, and $\sigma_1 = \sigma_2 = \sigma$, the expression for $k(\beta)$ above becomes
\[
k(\beta) = \frac{\beta(1-\beta)}{2} (2\mu) \left[ \beta \sigma^2 + (1-\beta) \sigma^2 \right]^{-1} (2\mu)
         + \frac{1}{2} \ln \frac{\left| \beta \sigma^2 + (1-\beta) \sigma^2 \right|}{\sigma^{2\beta} \sigma^{2(1-\beta)}}
         = \frac{2 \beta (1-\beta) \mu^2}{\sigma^2} ,
\]
since the logarithmic term vanishes when $\sigma_1 = \sigma_2 = \sigma$.
Since the above expression is valid for all $\beta$ in the range $0 \leq \beta \leq 1$, we can obtain the tightest possible bound by computing the maximum of $k(\beta)$ (since the bound is dominated by the factor $e^{-k(\beta)}$). To evaluate this maximum of $k(\beta)$ we take the derivative with respect to $\beta$ and set the result equal to zero, obtaining
\[
\frac{d}{d\beta} \left( \frac{2 \beta (1-\beta) \mu^2}{\sigma^2} \right) = 0 ,
\]
which gives $1 - 2\beta = 0$, or $\beta = 1/2$. Thus in this case the Chernoff bound equals the Bhattacharyya bound. When $\beta = 1/2$ we have the following bound on our error probability
\[
P(\text{error}) \leq \sqrt{P(\omega_1) P(\omega_2)} \; e^{-\frac{\mu^2}{2\sigma^2}} .
\]
To further simplify things, if we assume that our variance is such that $\sigma^2 = \mu^2$, and that our priors are uniform ($P(\omega_1) = P(\omega_2) = 1/2$), the above becomes
\[
P(\text{error}) \leq \frac{1}{2} e^{-1/2} \approx 0.3033 .
\]
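The following MATLAB sketch evaluates the bound for the symmetric scalar case above ($\mu = \sigma = 1$, uniform priors) and compares it with the true Bayes error, which for this symmetric case is $\Phi(-\mu/\sigma)$; the numerical values are just the illustrative choices already made in the text:

mu = 1;  sigma = 1;                                % so that sigma^2 = mu^2, as in the text
beta = linspace(0, 1, 101);
k = 2 * beta .* (1 - beta) * mu^2 / sigma^2;       % k(beta) for the symmetric scalar case
bound = sqrt(0.5 * 0.5) * exp(-k);                 % P(w1)^beta P(w2)^(1-beta) = 1/2 for uniform priors
[bmin, idx] = min(bound);
fprintf('tightest Chernoff bound = %.4f at beta = %.2f\n', bmin, beta(idx));
Pe_true = 0.5 * erfc(mu / (sigma * sqrt(2)));      % true Bayes error Phi(-mu/sigma)
fprintf('true Bayes error        = %.4f\n', Pe_true);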
Chapter 3 (Maximum Likelihood and Bayesian Estimation)

Problem 34

In the problem statement we are told that to operate on a data structure of "size" $n$ requires $f(n)$ numeric calculations. Assuming that each calculation requires $10^{-9}$ seconds of time to compute, we can process a data structure of size $n$ in $10^{-9} f(n)$ seconds. If we want our results finished by a time $T$, then the largest data structure that we can process must satisfy
\[
10^{-9} f(n) \leq T , \quad \text{or} \quad n \leq f^{-1}(10^9 T) .
\]
Converting the given range of times into seconds we find
\[
T = \{ 1 \text{ sec}, \; 1 \text{ hour}, \; 1 \text{ day}, \; 1 \text{ year} \} = \{ 1 \text{ sec}, \; 3600 \text{ sec}, \; 86400 \text{ sec}, \; 31536000 \text{ sec} \} .
\]
Using a very naive algorithm we can compute the largest $n$ such that $f(n) \leq 10^9 T$ by simply incrementing $n$ repeatedly. This is done in the Matlab script prob_34_chap_3.m and the results are displayed in Table XXX.
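The referenced script is not reproduced here; a minimal sketch of the same naive search (using $f(n) = n!$ purely as a stand-in example, since the actual complexity functions come from the problem statement) might look like:

f = @(n) factorial(n);                  % stand-in example for the complexity function f(n)
T = [1, 3600, 86400, 31536000];         % 1 sec, 1 hour, 1 day, 1 year, in seconds
for ti = 1:length(T)
    budget = 1e9 * T(ti);               % number of calculations allowed at 10^-9 sec each
    n = 1;
    while f(n + 1) <= budget            % increment n until f(n+1) exceeds the budget
        n = n + 1;
    end
    fprintf('T = %9d sec: largest n with f(n) <= 10^9 T is %d\n', T(ti), n);
end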
Problem 35 (recursive vs. direct calculation of means and covariances)

Part (a): To compute $\hat{\mu}_n$ we have to add $n$ $d$-dimensional vectors, resulting in $nd$ computations. We then have to divide each component by the scalar $n$, resulting in another $n$ computations. Thus in total we have
\[
nd + n = (1 + d) n
\]
calculations. To compute the sample covariance matrix $C_n$ we have to first compute the differences $x_k - \hat{\mu}_n$, requiring $d$ subtractions for each vector $x_k$. Since we must do this for each of the $n$ vectors $x_k$, we have a total of $nd$ calculations required to compute all of the $x_k - \hat{\mu}_n$ difference vectors. We then have to compute the outer products $(x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)'$, which require $d^2$ calculations each (in a naive implementation). Since we have $n$ such outer products, computing them all requires $n d^2$ operations. We next have to add all of these $n$ outer products together, resulting in another set of $(n-1) d^2$ operations. Finally, dividing by the scalar $n - 1$ requires another $d^2$ operations. Thus in total we have
\[
nd + n d^2 + (n-1) d^2 + d^2 = 2 n d^2 + nd
\]
calculations.
Part (b): We can derive recursive formulas for $\hat{\mu}_n$ and $C_n$ in the following way. First, for $\hat{\mu}_n$ we note that
\[
\hat{\mu}_{n+1} = \frac{1}{n+1} \sum_{k=1}^{n+1} x_k
               = \frac{1}{n+1} x_{n+1} + \frac{n}{n+1} \left( \frac{1}{n} \sum_{k=1}^{n} x_k \right)
               = \frac{1}{n+1} x_{n+1} + \frac{n}{n+1} \hat{\mu}_n .
\]
Writing $\frac{n}{n+1}$ as $\frac{n+1-1}{n+1} = 1 - \frac{1}{n+1}$, the above becomes
\[
\hat{\mu}_{n+1} = \frac{1}{n+1} x_{n+1} + \hat{\mu}_n - \frac{1}{n+1} \hat{\mu}_n
               = \hat{\mu}_n + \frac{1}{n+1} \left( x_{n+1} - \hat{\mu}_n \right) ,
\]
as expected. To derive the recursive update formula for the covariance matrix $C_n$ we begin by recalling the definition of $C_{n+1}$,
\[
C_{n+1} = \frac{1}{n} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_{n+1})(x_k - \hat{\mu}_{n+1})' .
\]
Now introducing the recursive expression for the mean $\hat{\mu}_{n+1}$ computed above, we have
\[
\begin{aligned}
C_{n+1} &= \frac{1}{n} \sum_{k=1}^{n+1} \left( x_k - \hat{\mu}_n - \frac{1}{n+1}(x_{n+1} - \hat{\mu}_n) \right) \left( x_k - \hat{\mu}_n - \frac{1}{n+1}(x_{n+1} - \hat{\mu}_n) \right)' \\
        &= \frac{1}{n} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)'
         - \frac{1}{n(n+1)} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' \\
        &\quad - \frac{1}{n(n+1)} \sum_{k=1}^{n+1} (x_{n+1} - \hat{\mu}_n)(x_k - \hat{\mu}_n)'
         + \frac{1}{n(n+1)^2} \sum_{k=1}^{n+1} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' .
\end{aligned}
\]
Since $\hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k$, we see that
\[
\hat{\mu}_n - \frac{1}{n} \sum_{k=1}^{n} x_k = 0 \quad \text{or} \quad \frac{1}{n} \sum_{k=1}^{n} (\hat{\mu}_n - x_k) = 0 .
\]
Using this fact, the second and third terms above can be greatly simplified. For example,
\[
\sum_{k=1}^{n+1} (x_k - \hat{\mu}_n) = \sum_{k=1}^{n} (x_k - \hat{\mu}_n) + (x_{n+1} - \hat{\mu}_n) = x_{n+1} - \hat{\mu}_n .
\]
Using this identity, and the fact that the summand of the fourth term does not depend on $k$, we have $C_{n+1}$ given by
\[
\begin{aligned}
C_{n+1} &= \frac{n-1}{n} \left( \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)' \right)
         + \frac{1}{n} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' \\
        &\quad - \frac{2}{n(n+1)} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'
         + \frac{1}{n(n+1)} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' \\
        &= \frac{n-1}{n} C_n + \left( \frac{1}{n} - \frac{1}{n(n+1)} \right) (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' \\
        &= \frac{n-1}{n} C_n + \frac{1}{n} \left( 1 - \frac{1}{n+1} \right) (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' \\
        &= \frac{n-1}{n} C_n + \frac{1}{n+1} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' ,
\end{aligned}
\]
which is the result in the book.

Part (c): To compute $\hat{\mu}_n$ using the recurrence formulation, we have $d$ subtractions to compute $x_n - \hat{\mu}_{n-1}$, and the scalar division requires another $d$ operations. Finally, the addition of $\hat{\mu}_{n-1}$ requires another $d$ operations, giving in total $3d$ operations, to be compared with the $(d+1)n$ operations in the non-recursive case.

To compute $C_n$ recursively we have $d$ computations for the difference $x_n - \hat{\mu}_{n-1}$, then $d^2$ operations to compute the outer product $(x_n - \hat{\mu}_{n-1})(x_n - \hat{\mu}_{n-1})'$, another $d^2$ operations to scale this outer product by $\frac{1}{n}$, another $d^2$ operations to multiply $C_{n-1}$ by $\frac{n-2}{n-1}$, and finally $d^2$ operations to add the two resulting matrices. Thus in total we have
\[
d + d^2 + d^2 + d^2 + d^2 = 4 d^2 + d ,
\]
to be compared with $2 n d^2 + nd$ in the non-recursive case. For large $n$ the non-recursive calculation is much more computationally intensive.
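A short MATLAB check of both recursions against MATLAB's own mean and cov (a sketch on synthetic data, assuming the $1/(n-1)$-normalized sample covariance used above):

d = 3;  N = 500;
X = randn(N, d);                       % rows are the samples x_k'
mu = X(1, :)';                         % mu_hat_1 = x_1
C  = zeros(d, d);                      % the C_1 term is annihilated by the (n-1)/n factor at n = 1
for n = 1:N-1
    xnew = X(n + 1, :)';
    C  = ((n - 1) / n) * C + (1 / (n + 1)) * (xnew - mu) * (xnew - mu)';
    mu = mu + (xnew - mu) / (n + 1);   % the covariance update above must use the old mu
end
fprintf('max mean error = %g, max cov error = %g\n', ...
        max(abs(mu - mean(X)')), max(max(abs(C - cov(X)))));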
Chapter 6 (Multilayer Neural Networks)

Problem 39 (derivatives of matrix inner products)

Part (a): We present a slightly different method to prove this here. We desire the gradient (or derivative) of the function $\phi \equiv x^T K x$. The $i$-th component of this derivative is given by
\[
\frac{\partial \phi}{\partial x_i} = \frac{\partial}{\partial x_i} \left( x^T K x \right) = e_i^T K x + x^T K e_i ,
\]
where $e_i$ is the $i$-th elementary basis vector of $\mathbb{R}^n$, i.e. it has a 1 in the $i$-th position and zeros everywhere else. Now since $x^T K e_i$ is a scalar it equals its own transpose,
\[
x^T K e_i = \left( x^T K e_i \right)^T = e_i^T K^T x ,
\]
and the above becomes
\[
\frac{\partial \phi}{\partial x_i} = e_i^T K x + e_i^T K^T x = e_i^T \left( K + K^T \right) x .
\]
Since multiplying by $e_i^T$ on the left selects the $i$-th row of the expression to its right, the full gradient is
\[
\nabla \phi = \left( K + K^T \right) x ,
\]
as requested in the text. Note that this expression can also be proved easily by writing each term in component notation, as suggested in the text.
Part (b): If $K$ is symmetric, then since $K^T = K$ the above expression simplifies to
\[
\nabla \phi = 2 K x ,
\]
as claimed in the book.
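A quick finite-difference check of the gradient formula in MATLAB (a sketch with a random matrix and point, not part of the original solution):

n = 4;
K = randn(n);  x = randn(n, 1);
phi  = @(z) z' * K * z;                 % the quadratic form phi(x) = x^T K x
grad = (K + K') * x;                    % the claimed gradient
g_fd = zeros(n, 1);  h = 1e-6;
for i = 1:n
    e = zeros(n, 1);  e(i) = 1;
    g_fd(i) = (phi(x + h * e) - phi(x - h * e)) / (2 * h);   % central difference
end
fprintf('max |analytic - finite difference| = %g\n', max(abs(grad - g_fd)));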
Chapter 9 (Algorithm-Independent Machine Learning)

Problem 9 (sums of binomial coefficients)

Part (a): We first recall the binomial theorem,
\[
(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k} .
\]
If we let $x = 1$ and $y = 1$ then $x + y = 2$ and the sum above becomes
\[
2^n = \sum_{k=0}^{n} \binom{n}{k} ,
\]
which is the stated identity. As an aside, note also that if we let $x = -1$ and $y = 1$ then $x + y = 0$ and we obtain another interesting identity involving the binomial coefficients,
\[
0 = \sum_{k=0}^{n} \binom{n}{k} (-1)^k .
\]
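A quick numerical check of the first identity in MATLAB (a sketch, using a few arbitrary values of n):

for n = [5 10 15]
    s = sum(arrayfun(@(k) nchoosek(n, k), 0:n));    % sum over k of C(n,k)
    fprintf('n = %2d: sum of binomial coefficients = %6d, 2^n = %6d\n', n, s, 2^n);
end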
Problem 23 (the average of the leave one out means)

We define the leave one out mean $\mu_{(i)}$ as
\[
\mu_{(i)} = \frac{1}{n-1} \sum_{j \neq i} x_j .
\]
This can obviously be computed for every $i$. Now we want to consider the mean of the leave one out means. To do so, first notice that $\mu_{(i)}$ can be written as
\[
\mu_{(i)} = \frac{1}{n-1} \sum_{j \neq i} x_j
          = \frac{1}{n-1} \left( \sum_{j=1}^{n} x_j - x_i \right)
          = \frac{n}{n-1} \hat{\mu} - \frac{x_i}{n-1} .
\]
With this we see that the mean of all of the leave one out means is given by
\[
\frac{1}{n} \sum_{i=1}^{n} \mu_{(i)}
= \frac{1}{n} \sum_{i=1}^{n} \left( \frac{n}{n-1} \hat{\mu} - \frac{x_i}{n-1} \right)
= \frac{n}{n-1} \hat{\mu} - \frac{1}{n-1} \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)
= \frac{n}{n-1} \hat{\mu} - \frac{1}{n-1} \hat{\mu} = \hat{\mu} ,
\]
as expected.
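A quick MATLAB check on random data (a sketch; the data x and its length n are arbitrary here):

x = randn(10, 1);  n = length(x);
mu_loo = (sum(x) - x) / (n - 1);            % all n leave-one-out means at once
fprintf('difference of the two means = %g\n', mean(mu_loo) - mean(x));   % ~ round-off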
Problem 40 (maximum likelihood estimation with a binomial random variable)

Since $k$ has a binomial distribution with parameters $(n', p)$ we have
\[
P(k) = \binom{n'}{k} p^k (1 - p)^{n' - k} .
\]
The $p$ that maximizes this expression is found by taking the derivative of the above with respect to $p$, setting the resulting expression equal to zero, and solving for $p$. This derivative is given by
\[
\frac{d}{dp} P(k) = \binom{n'}{k} k p^{k-1} (1 - p)^{n' - k} + \binom{n'}{k} p^k (n' - k)(1 - p)^{n' - k - 1} (-1) ,
\]
which when set equal to zero and solved for $p$ gives $p = \frac{k}{n'}$, the empirical counting estimate of the probability of success, as we were asked to show.
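A quick numerical confirmation in MATLAB (a sketch; the values of n' and k are arbitrary) that the likelihood is maximized at p = k/n':

nprime = 20;  k = 7;                                        % arbitrary example values
p = linspace(0.001, 0.999, 9999);
L = nchoosek(nprime, k) * p.^k .* (1 - p).^(nprime - k);    % the binomial likelihood P(k)
[~, idx] = max(L);
fprintf('numerical argmax p = %.4f, k/n'' = %.4f\n', p(idx), k / nprime);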