Name : Date :
Review Questions Please select the correct answer and explain your choices. An answer not justified is not considered. Consider the problem of predicting how well a student does in her second year of college/university, given how well they did in their first year. Specifically, let x be equal to the number of "A" grades (including A-. A and A+ grades) that a student receives in their first year of college (freshmen year). We would like to predict the value of y, which we define as the number of "A" grades they get in their second year. Questions 1 through 4 will use the following training set of a small sample of different students' performances. Here each row is one training example. Recall that in linear regression, our hypothesis is hθ ( x)= x)= θ 0 + θ 1x, and we use m to denote the number of training examples.
x
y
5
4
3
4
0
1
4
3
Question 1:
For the training set given above, what is the value of m? Please write below your answer
Question 2:
For this question, continue to assume that we are using the training set given above. Recall our definition of the cost function was J was J (θ 0, θ 1)=
∑ is J (0,1)? (0,1)? ℎ − . What is J
Question 3 :
Let f be some function so that f (θ 0, θ 1) outputs a number. For this problem, f is some arbitrary/unknown smooth function (not necessarily the cost function of linear regression, so f may have local optima). Suppose we use gradient descent to try to minimize f (θ 0, θ 1) as a function of θ 0 and θ 1. Which of the following statements are true? (Check all that apply and explain .) If θ 0 and θ 1 are initialized at the global minimum, the one iteration will not change their values. Even if the learning rate α is very large, every iteration of gradient descent will decrease the value of f (θ 0, θ 1). Setting the learning rate α to be very small is not harmful, and can only speed up the convergence of gradient descent. If the learning rate is too small, then gradient descent may take a very long time to converge.
Question 4
Suppose that for some linear regression problem (say, predicting housing prices as in the lecture), we have some training set, and for our training set we managed to find some θ 0, θ 1 such that J (θ 0, θ 1)=0. Which of the statements below must then be true? (Check all that apply and explain.) This is not possible: By the definition of J (θ 0, θ 1), it is not possible for there to exist θ 0 and θ 1 so that J (θ 0, θ 1)=0 For these values of θ 0 and θ 1 that satisfy J (θ 0, θ 1)=0, we have that hθ ( x(i))= y(i) for every training example ( x(i), y(i)) For this to be true, we must have y(i)=0 for every value of i=1,2,…,m. Gradient descent is likely to get stuck at a local minimum and fail to find the global minimum.
Question 5:
Suppose m=4 students have taken some class, and the class had a midterm exam and a final exam. You have collected a dataset of their scores on the two exams, which is as follows: midterm exam
(midterm exam) 2
final exam
89
7921
96
72
5184
74
94
8836
87
69
4761
78
You'd like to use polynomial regression to predict a student's final exam score from their midterm exam score. Concretely, suppose you want to fit a model of the form hθ ( x)= θ 0+ θ 1 x 1+ θ 2 x2, where x1 is the midterm score and x2 is (midterm score)2. Further, you plan to use both feature scaling (dividing by the "max-min", or range, of a feature) and mean normalization. What is the normalized feature x1 (1)? (Hint: midterm = 89, final = 96 is training example 1.)
Question 6:
You run gradient descent for 15 iterations with α=0.3 and compute J (θ ) after each iteration. You find that the value of J (θ ) increases over time. Based on this, which of the following conclusions seems most plausible? α=0.3 is an effective choice of learning rate. Rather than use the current value of α, it'd be more promising to try a smaller value of α (say α=0.1). Rather than use the current value of α, it'd be more promising to try a larger value of α (say α=1.0).
Question 7:
Suppose you have m=14 training examples with n=3 features (excluding the additional allones feature for the intercept term, which you should add). The normal equation is θ =( X T X )−1 X T y. For the given values of m and n, what are the dimensions of θ , X , and y in this equation? X is 14×3, y is 14×1, θ is 3×3 X is 14×3, y is 14×1, θ is 3×1 X is 14×4, y is 14×1, θ is 4×1 X is 14×4, y is 14×4, θ is 4×4
Question 8:
Suppose you have a dataset with m=1000000 examples and n=200000 features for each example. You want to use multivariate linear regression to fit the parameters θ to our data. Should you prefer gradient descent or the normal equation? The normal equation, since gradient descent might be unable to find the opti mal θ . Gradient descent, since it will always converge to the optimal θ . The normal equation, since it provides an efficie nt way to directly find the solution. Gradient descent, since ( X T X )−1 will be very slow to compute in the normal equation.
Question 9 :
Which of the following are reasons for using feature scaling? It speeds up gradient descent by making each iteration of gradient descent less expensive to compute. It prevents the matrix X T X (used in the normal equation) from being non-invertable (singular/degenerate). It is necessary to prevent the normal equation from getting stuck in local optima. It speeds up gradient descent by making it require fewer iterations to get to a good solution.
Question 10:
Suppose that you have trained a logistic regression class ifier, and it outputs on a new example x a prediction hθ ( x) = 0.7. This means (check all that appl y): Our estimate for P ( y=0| x;θ ) is 0.7. Our estimate for P ( y=1| x;θ ) is 0.3. Our estimate for P ( y=0| x;θ ) is 0.3. Our estimate for P ( y=1| x;θ ) is 0.7.
Question 11
Suppose you train a logistic classifier hθ ( x)= g (θ 0+ θ 1 x1+ θ 2 x2). Suppose θ 0=6,θ 1=0,θ 2=−1. Which of the following figures represents the deci sion boundary found by your classifier?
Question 12
Suppose you have the following training set, and fit a logistic regressi on classifier hθ ( x)= g (θ 0+ θ 1 x1+ θ 2 x2). x1
x2
y
1
0.5
0
1
1.5
0
2
1
1
3
1
0
Which of the following are true? Check all that apply. J (θ ) will be a convex function, so gradient descent should converge to the global minimum. Adding polynomial features (e.g., instead using hθ ( x)= g (θ 0+ θ 1 x1+ θ 2 x2+θ 3 x12 +θ 4 x1 x2+ θ 5 x22) ) could increase how well we can fit the training data. If we train gradient descent for enough iterations, for some examples x(i) in the training set it is possible to obtain hθ ( x(i))>1. Because the positive and negative examples cannot be se parated using a straight line, linear regression will perform as well as logistic regression on this data.
Question 13
For logistic regression, the gradient is given by ∂∂θjJ (θ )=
. Which of
these is a correct gradient descent update for logistic regression with a learning rate of α? Check all that apply. θ j:= θ j −α1/m∑ (hθ ( x(i))− y(i) ) x j(i) (simultaneously update for all j). θ :=θ −α1/m∑ (1/(1+e−θT x(i))− y(i) ) x(i) θ j:= θ j −α1/m∑ (θ T x− y(i) ) x j(i) (simultaneously update for all j). θ :=θ −α1/m∑ (θ T x− y(i) ) x(i)
Question 14
Which of the following statements are true? Check all that apply. The sigmoid function g ( z )=1/(1+e− z ) is never greater than one (>1). Since we train one classifier when there are two classes, we train two classifiers when there are three classes (and we do one-vs-all classification). The cost function J (θ ) for logistic regression trained with m≥1 examples is always greater than or equal to zero. None of the above