Probability and Statistics Cheat Sheet
Copyright © Matthias Vallentin, 2010

This cheat sheet integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you send me an email. The most recent version of this document is available at http://cs.berkeley.edu/~mavam/probstat.pdf. To reproduce, please contact me.
Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-based Confidence Interval
  11.3 Empirical Distribution Function
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
22 Math
  22.1 Orthogonal Functions
  22.2 Series
  22.3 Combinatorics
1 Distribution Overview

1.1 Discrete Distributions

For each distribution: CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s).

Uniform{a, ..., b}
  F_X(x) = 0 for x < a;  (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b;  1 for x > b
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2    V[X] = ((b − a + 1)² − 1)/12    M_X(s) = (e^{as} − e^{−(b+1)s}) / (s(b − a))

Bernoulli(p)
  F_X(x) = 0 for x < 0;  1 − p for 0 ≤ x < 1;  1 for x ≥ 1
  f_X(x) = p^x (1 − p)^{1−x}
  E[X] = p    V[X] = p(1 − p)    M_X(s) = 1 − p + pe^s

Binomial(n, p)
  F_X(x) = I_{1−p}(n − x, x + 1)  (regularized incomplete beta function)
  f_X(x) = C(n, x) p^x (1 − p)^{n−x}
  E[X] = np    V[X] = np(1 − p)    M_X(s) = (1 − p + pe^s)^n

Multinomial(n, p)
  f_X(x) = (n!/(x_1! ··· x_k!)) p_1^{x_1} ··· p_k^{x_k},  Σ_{i=1}^k x_i = n
  E[X_i] = np_i    V[X_i] = np_i(1 − p_i)    M_X(s) = (Σ_{i=1}^k p_i e^{s_i})^n

Hypergeometric(N, m, n)
  F_X(x) ≈ Φ((x − np)/√(np(1 − p)))
  f_X(x) = C(m, x) C(N − m, n − x) / C(N, n)
  E[X] = nm/N    V[X] = nm(N − n)(N − m) / (N²(N − 1))    M_X(s): N/A

NegativeBinomial(r, p)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = C(x + r − 1, r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p    V[X] = r(1 − p)/p²    M_X(s) = (p / (1 − (1 − p)e^s))^r

Geometric(p)
  F_X(x) = 1 − (1 − p)^x,  x ∈ N⁺
  f_X(x) = p(1 − p)^{x−1}
  E[X] = 1/p    V[X] = (1 − p)/p²    M_X(s) = pe^s / (1 − (1 − p)e^s)

Poisson(λ)
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i/i!
  f_X(x) = λ^x e^{−λ}/x!
  E[X] = λ    V[X] = λ    M_X(s) = e^{λ(e^s − 1)}

[Figure: PMFs of the discrete uniform, Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), Geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]
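The tabulated moments can be sanity-checked numerically; a minimal sketch, assuming scipy is available (scipy is not part of the original sheet), checking the Binomial row:

```python
from scipy import stats

# Check the Binomial(n, p) row: E[X] = np, V[X] = np(1 - p).
n, p = 40, 0.3
X = stats.binom(n, p)
print(X.mean(), n * p)            # 12.0  12.0
print(X.var(), n * p * (1 - p))   # 8.4   8.4
print(X.pmf(10))                  # C(40, 10) p^10 (1 - p)^30
```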
1.2 Continuous Distributions

Uniform(a, b)
  F_X(x) = 0 for x < a;  (x − a)/(b − a) for a < x < b;  1 for x > b
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2    V[X] = (b − a)²/12    M_X(s) = (e^{sb} − e^{sa}) / (s(b − a))

Normal(µ, σ²)
  F_X(x) = Φ(x) = ∫_{−∞}^x φ(t) dt
  f_X(x) = φ(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))
  E[X] = µ    V[X] = σ²    M_X(s) = exp(µs + σ²s²/2)

Log-Normal(µ, σ²)
  F_X(x) = 1/2 + (1/2) erf((ln x − µ)/√(2σ²))
  f_X(x) = (1/(x√(2πσ²))) exp(−(ln x − µ)²/(2σ²))
  E[X] = e^{µ + σ²/2}    V[X] = (e^{σ²} − 1) e^{2µ + σ²}

Multivariate Normal(µ, Σ)
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀ Σ^{−1} (x − µ))
  E[X] = µ    V[X] = Σ    M_X(s) = exp(µᵀs + (1/2) sᵀΣs)

Chi-square(k)
  F_X(x) = (1/Γ(k/2)) γ(k/2, x/2)
  f_X(x) = (1/(2^{k/2} Γ(k/2))) x^{k/2 − 1} e^{−x/2}
  E[X] = k    V[X] = 2k    M_X(s) = (1 − 2s)^{−k/2},  s < 1/2

Exponential(β)
  F_X(x) = 1 − e^{−x/β}
  f_X(x) = (1/β) e^{−x/β}
  E[X] = β    V[X] = β²    M_X(s) = 1/(1 − βs)  (s < 1/β)

Gamma(α, β)
  F_X(x) = γ(α, x/β)/Γ(α)
  f_X(x) = (1/(Γ(α) β^α)) x^{α−1} e^{−x/β}
  E[X] = αβ    V[X] = αβ²    M_X(s) = (1/(1 − βs))^α  (s < 1/β)

InverseGamma(α, β)
  F_X(x) = Γ(α, β/x)/Γ(α)
  f_X(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1)  (α > 1)    V[X] = β²/((α − 1)²(α − 2))  (α > 2)    M_X(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet(α)
  f_X(x) = (Γ(Σ_{i=1}^k α_i) / Π_{i=1}^k Γ(α_i)) Π_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i / Σ_{i=1}^k α_i    V[X_i] = E[X_i](1 − E[X_i]) / (Σ_{i=1}^k α_i + 1)

Beta(α, β)
  F_X(x) = I_x(α, β)
  f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}
  E[X] = α/(α + β)    V[X] = αβ/((α + β)²(α + β + 1))    M_X(s) = 1 + Σ_{k=1}^∞ (Π_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull(λ, k)
  F_X(x) = 1 − e^{−(x/λ)^k}
  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k)    V[X] = λ²Γ(1 + 2/k) − µ²    M_X(s) = Σ_{n=0}^∞ (s^n λ^n/n!) Γ(1 + n/k)

Pareto(x_m, α)
  F_X(x) = 1 − (x_m/x)^α,  x ≥ x_m
  f_X(x) = α x_m^α / x^{α+1},  x ≥ x_m
  E[X] = αx_m/(α − 1)  (α > 1)    V[X] = x_m² α/((α − 1)²(α − 2))  (α > 2)    M_X(s) = α(−x_m s)^α Γ(−α, −x_m s)  (s < 0)

[Figure: PDFs of the continuous uniform, Normal, Log-Normal, χ², Exponential, Gamma, InverseGamma, Beta, Weibull, and Pareto distributions for several parameter values.]
2 Probability Theory

Definitions
• Sample space Ω
• Outcome (point or element) ω ∈ Ω
• Event A ⊆ Ω
• σ-algebra A
  1. ∅ ∈ A
  2. A₁, A₂, ... ∈ A ⇒ ∪_{i=1}^∞ A_i ∈ A
  3. A ∈ A ⇒ ¬A ∈ A
• Probability distribution P
  1. P[A] ≥ 0 for every A
  2. P[Ω] = 1
  3. P[∪_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i] for disjoint A_i
• Probability space (Ω, A, P)

Properties
• P[∅] = 0
• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
• P[¬A] = 1 − P[A]
• P[B] = P[A ∩ B] + P[¬A ∩ B]
• ¬(∪_n A_n) = ∩_n ¬A_n  and  ¬(∩_n A_n) = ∪_n ¬A_n  (DeMorgan)
• P[∪_n A_n] = 1 − P[∩_n ¬A_n]
• P[A ∪ B] = P[A] + P[B] − P[A ∩ B]  ⇒  P[A ∪ B] ≤ P[A] + P[B]
• P[A ∪ B] = P[A ∩ ¬B] + P[¬A ∩ B] + P[A ∩ B]
• P[A ∩ ¬B] = P[A] − P[A ∩ B]
• A₁ ⊂ ··· ⊂ A_n ∧ A = ∪_{n=1}^∞ A_n ⇒ lim P[A_n] = P[A]
• A₁ ⊃ ··· ⊃ A_n ∧ A = ∩_{n=1}^∞ A_n ⇒ lim P[A_n] = P[A]

Independence
A ⊥⊥ B ⟺ P[A ∩ B] = P[A] P[B]

Law of Total Probability
P[B] = Σ_{i=1}^n P[B | A_i] P[A_i]  where Ω = ⊔_{i=1}^n A_i

Bayes' Theorem
P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^n P[B | A_j] P[A_j]  where Ω = ⊔_{i=1}^n A_i

Inclusion-Exclusion Principle
|∪_{i=1}^n A_i| = Σ_{r=1}^n (−1)^{r−1} Σ_{i₁ < ··· < i_r} |A_{i₁} ∩ ··· ∩ A_{i_r}|

3 Random Variables

Random Variable
X : Ω → R

Probability Mass Function (PMF)
f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF)
P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF)
F_X : R → [0, 1],  F_X(x) = P[X ≤ x]
1. Nondecreasing: x₁ < x₂ ⇒ F(x₁) ≤ F(x₂)
2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
3. Right-continuous: lim_{y↓x} F(y) = F(x)

Conditional density
P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy,  a ≤ b
f_{Y|X}(y | x) = f(x, y) / f_X(x)

Independence
X ⊥⊥ Y ⟺ f_{X,Y}(x, y) = f_X(x) f_Y(y)
3.1 Transformations

Transformation function: Z = ϕ(X)

Discrete
f_Z(z) = P[ϕ(X) = z] = P[{x : ϕ(x) = z}] = P[X ∈ ϕ^{−1}(z)] = Σ_{x ∈ ϕ^{−1}(z)} f(x)

Continuous
F_Z(z) = P[ϕ(X) ≤ z] = ∫_{A_z} f(x) dx  with A_z = {x : ϕ(x) ≤ z}

Special case if ϕ strictly monotone
f_Z(z) = f_X(ϕ^{−1}(z)) |d/dz ϕ^{−1}(z)| = f_X(x) |dx/dz| = f_X(x) / |J|

The Rule of the Lazy Statistician
E[Z] = ∫ ϕ(x) dF_X(x)
E[I_A(X)] = ∫ I_A(x) dF_X(x) = P[X ∈ A]

Convolution
• Z := X + Y:  f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx  (= ∫_0^z f_{X,Y}(x, z − x) dx if X, Y ≥ 0)
• Z := |X − Y|:  f_Z(z) = 2 ∫_0^∞ f_{X,Y}(x, z + x) dx
• Z := X/Y:  f_Z(z) = ∫_{−∞}^∞ |x| f_{X,Y}(x, xz) dx  (= ∫ x f_X(x) f_Y(xz) dx if X ⊥⊥ Y)

4 Expectation

Expectation
E[X] = µ_X = ∫ x dF_X(x) = Σ_x x f_X(x)  (X discrete)  or  ∫ x f_X(x) dx  (X continuous)

Properties
• E[XY] = ∫∫ xy f_{X,Y}(x, y) dF_X(x) dF_Y(y)
• In general E[ϕ(X)] ≠ ϕ(E[X])  (cf. Jensen inequality)
• P[X ≥ Y] = 1 ⇒ E[X] ≥ E[Y];  P[X = Y] = 1 ⇒ E[X] = E[Y]
• E[X] = Σ_{x=1}^∞ P[X ≥ x]  (X ∈ N)

Sample mean
X̄_n = (1/n) Σ_{i=1}^n X_i

Conditional Expectation
• E[Y | X = x] = ∫ y f(y | x) dy
• E[X] = E[E[X | Y]]
• E[ϕ(X, Y) | X = x] = ∫_{−∞}^∞ ϕ(x, y) f_{Y|X}(y | x) dy
• E[ϕ(Y, Z) | X = x] = ∫∫_{−∞}^∞ ϕ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
• E[Y + Z | X] = E[Y | X] + E[Z | X]
• E[ϕ(X)Y | X] = ϕ(X) E[Y | X]
• E[Y | X] = c ⇒ Cov[X, Y] = 0

5 Variance

Variance
• V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i]  iff X_i ⊥⊥ X_j

Standard deviation
sd[X] = √V[X] = σ_X

Covariance
• Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• Cov[X, a] = 0
• Cov[X, X] = V[X]
• Cov[Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j] = Σ_{i=1}^n Σ_{j=1}^m Cov[X_i, Y_j]

Correlation
ρ[X, Y] = Cov[X, Y] / √(V[X] V[Y])

Independence
X ⊥⊥ Y ⇒ ρ[X, Y] = 0 ⟺ Cov[X, Y] = 0 ⟺ E[XY] = E[X] E[Y]

Sample variance
S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²

Conditional Variance
• V[Y | X] = E[(Y − E[Y | X])² | X] = E[Y² | X] − E[Y | X]²
• V[Y] = E[V[Y | X]] + V[E[Y | X]]

6 Inequalities

Cauchy-Schwarz  E[XY]² ≤ E[X²] E[Y²]
Markov  P[ϕ(X) ≥ t] ≤ E[ϕ(X)] / t
Chebyshev  P[|X − E[X]| ≥ t] ≤ V[X] / t²
Chernoff  P[X ≥ (1 + δ)µ] ≤ (e^δ / (1 + δ)^{1+δ})^µ,  δ > −1
Jensen  E[ϕ(X)] ≥ ϕ(E[X]),  ϕ convex
7 Distribution Relationships

Binomial
• X_i ∼ Bernoulli(p) ⇒ Σ_{i=1}^n X_i ∼ Bin(n, p)
• X ∼ Bin(n, p), Y ∼ Bin(m, p), X ⊥⊥ Y ⇒ X + Y ∼ Bin(n + m, p)
• lim_{n→∞} Bin(n, p) = N(np, np(1 − p))  (n large, p far from 0 and 1)

Negative Binomial
• X ∼ NBin(1, p) = Geometric(p)
• X ∼ NBin(r, p) = Σ_{i=1}^r Geometric(p)
• X_i ∼ NBin(r_i, p) ⇒ Σ_i X_i ∼ NBin(Σ_i r_i, p)
• X ∼ NBin(r, p), Y ∼ Bin(s + r, p) ⇒ P[X ≤ s] = P[Y ≥ r]

Poisson
• X_i ∼ Poisson(λ_i) ∧ X_i ⊥⊥ X_j ⇒ Σ_{i=1}^n X_i ∼ Poisson(Σ_{i=1}^n λ_i)
• X_i ∼ Poisson(λ_i) ∧ X_i ⊥⊥ X_j ⇒ X_i | Σ_{j=1}^n X_j ∼ Bin(Σ_{j=1}^n X_j, λ_i / Σ_{j=1}^n λ_j)

Exponential
• X_i ∼ Exp(β) ∧ X_i ⊥⊥ X_j ⇒ Σ_{i=1}^n X_i ∼ Gamma(n, β)
• Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal
• X ∼ N(µ, σ²) ⇒ (X − µ)/σ ∼ N(0, 1)
• X ∼ N(µ, σ²) ∧ Z = aX + b ⇒ Z ∼ N(aµ + b, a²σ²)
• X ∼ N(µ₁, σ₁²) ∧ Y ∼ N(µ₂, σ₂²) ∧ X ⊥⊥ Y ⇒ X + Y ∼ N(µ₁ + µ₂, σ₁² + σ₂²)
• X_i ∼ N(µ_i, σ_i²) ⇒ Σ_i X_i ∼ N(Σ_i µ_i, Σ_i σ_i²)
• P[a < X ≤ b] = Φ((b − µ)/σ) − Φ((a − µ)/σ)
• Φ(−x) = 1 − Φ(x),  φ′(x) = −xφ(x),  φ″(x) = (x² − 1)φ(x)
• Upper quantile of N(0, 1): z_α = Φ^{−1}(1 − α)

Gamma (distribution)
• X ∼ Gamma(α, β) ⟺ X/β ∼ Gamma(α, 1)
• Gamma(α, β) ∼ Σ_{i=1}^α Exp(β)
• X_i ∼ Gamma(α_i, β) ∧ X_i ⊥⊥ X_j ⇒ Σ_i X_i ∼ Gamma(Σ_i α_i, β)
• Γ(α)/λ^α = ∫_0^∞ x^{α−1} e^{−λx} dx

Gamma (function)
• Ordinary: Γ(s) = ∫_0^∞ t^{s−1} e^{−t} dt
• Γ(n) = (n − 1)!  (n ∈ N)
• Γ(1/2) = √π

Beta (distribution)
• f_X(x) = (1/B(α, β)) x^{α−1}(1 − x)^{β−1} = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
• E[X^k] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[X^{k−1}]
• Beta(1, 1) ∼ Unif(0, 1)

Beta (function)
• Ordinary: B(x, y) = B(y, x) = ∫_0^1 t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y)
• Incomplete: B(x; a, b) = ∫_0^x t^{a−1}(1 − t)^{b−1} dt
• Regularized incomplete (a, b ∈ N):
  I_x(a, b) = B(x; a, b)/B(a, b) = Σ_{j=a}^{a+b−1} ((a + b − 1)!/(j!(a + b − 1 − j)!)) x^j (1 − x)^{a+b−1−j}
• I_0(a, b) = 0,  I_1(a, b) = 1
• I_x(a, b) = 1 − I_{1−x}(b, a)

8 Probability and Moment Generating Functions

• G_X(t) = E[t^X],  |t| < 1
• M_X(t) = G_X(e^t) = E[e^{Xt}] = E[Σ_{i=0}^∞ (Xt)^i/i!] = Σ_{i=0}^∞ (E[X^i]/i!) t^i
• P[X = 0] = G_X(0)
• P[X = 1] = G′_X(0)
• P[X = i] = G_X^{(i)}(0)/i!
• E[X] = G′_X(1⁻)
• E[X^k] = M_X^{(k)}(0)
• E[X!/(X − k)!] = G_X^{(k)}(1⁻)

9 Multivariate Distributions

9.1 Standard Bivariate Normal
Let X, Z ∼ N(0, 1) with X ⊥⊥ Z and Y = ρX + √(1 − ρ²) Z.

Joint density
f(x, y) = (1/(2π√(1 − ρ²))) exp(−(x² + y² − 2ρxy)/(2(1 − ρ²)))

Conditionals
(Y | X = x) ∼ N(ρx, 1 − ρ²)  and  (X | Y = y) ∼ N(ρy, 1 − ρ²)

Independence
X ⊥⊥ Y ⟺ ρ = 0

9.2 Bivariate Normal
Let X ∼ N(µ_x, σ_x²) and Y ∼ N(µ_y, σ_y²).

f(x, y) = (1/(2πσ_x σ_y √(1 − ρ²))) exp(−z/(2(1 − ρ²)))
z = ((x − µ_x)/σ_x)² + ((y − µ_y)/σ_y)² − 2ρ((x − µ_x)/σ_x)((y − µ_y)/σ_y)

Conditional mean and variance
E[X | Y] = E[X] + ρ (σ_X/σ_Y)(Y − E[Y])
V[X | Y] = σ_X √(1 − ρ²)

9.3 Multivariate Normal

Covariance matrix Σ (precision matrix Σ^{−1})
Σ = [ V[X₁] ··· Cov[X₁, X_k] ; ⋮ ⋱ ⋮ ; Cov[X_k, X₁] ··· V[X_k] ]

If X ∼ N(µ, Σ),
f(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀ Σ^{−1}(x − µ))

Properties
• Z ∼ N(0, 1) ∧ X = µ + Σ^{1/2} Z ⇒ X ∼ N(µ, Σ)
• X ∼ N(µ, Σ) ⇒ Σ^{−1/2}(X − µ) ∼ N(0, 1)
• X ∼ N(µ, Σ) ⇒ AX ∼ N(Aµ, AΣAᵀ)
• X ∼ N(µ, Σ) ∧ a a vector of length k ⇒ aᵀX ∼ N(aᵀµ, aᵀΣa)
10 Convergence

Let X₁, X₂, ... be a sequence of rv's and let X be another rv. Let F_n denote the cdf of X_n and let F denote the cdf of X.

Types of Convergence
1. In distribution (weakly, in law): X_n →D X
   lim_{n→∞} F_n(t) = F(t)  ∀t where F is continuous
2. In probability: X_n →P X
   (∀ε > 0) lim_{n→∞} P[|X_n − X| > ε] = 0
3. Almost surely (strongly): X_n →as X
   P[lim_{n→∞} X_n = X] = P[ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)] = 1
4. In quadratic mean (L²): X_n →qm X
   lim_{n→∞} E[(X_n − X)²] = 0

Relationships
• X_n →qm X ⇒ X_n →P X ⇒ X_n →D X
• X_n →as X ⇒ X_n →P X
• X_n →D X ∧ (∃c ∈ R) P[X = c] = 1 ⇒ X_n →P X
• X_n →P X ∧ Y_n →P Y ⇒ X_n + Y_n →P X + Y
• X_n →qm X ∧ Y_n →qm Y ⇒ X_n + Y_n →qm X + Y
• X_n →P X ∧ Y_n →P Y ⇒ X_n Y_n →P XY
• X_n →P X ⇒ ϕ(X_n) →P ϕ(X)
• X_n →D X ⇒ ϕ(X_n) →D ϕ(X)

Slutzky's Theorem
• X_n →D X and Y_n →P c ⇒ X_n + Y_n →D X + c
• X_n →D X and Y_n →P c ⇒ X_n Y_n →D cX
• In general: X_n →D X and Y_n →D Y does not imply X_n + Y_n →D X + Y

10.1 Law of Large Numbers (LLN)
Let X₁, ..., X_n be a sequence of iid rv's with E[X₁] = µ.
Weak (WLLN):  X̄_n →P µ  as n → ∞
Strong (SLLN):  X̄_n →as µ  as n → ∞

10.2 Central Limit Theorem (CLT)
Let X₁, ..., X_n be a sequence of iid rv's with E[X₁] = µ and V[X₁] = σ² < ∞.

Z_n := (X̄_n − µ)/√(V[X̄_n]) = √n (X̄_n − µ)/σ →D Z,  where Z ∼ N(0, 1)
lim_{n→∞} P[Z_n ≤ z] = Φ(z),  z ∈ R

CLT notations
Z_n ≈ N(0, 1)
X̄_n ≈ N(µ, σ²/n)
X̄_n − µ ≈ N(0, σ²/n)
√n(X̄_n − µ) ≈ N(0, σ²)
√n(X̄_n − µ)/σ ≈ N(0, 1)

Continuity Correction
P[X̄_n ≤ x] ≈ Φ((x + 1/2 − µ)/(σ/√n))
P[X̄_n ≥ x] ≈ 1 − Φ((x − 1/2 − µ)/(σ/√n))
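A minimal Monte Carlo sketch of the CLT approximation above, assuming numpy is available (not part of the original sheet); sample size, distribution, and seed are arbitrary choices for illustration:

```python
import numpy as np
from math import erf, sqrt

# Standardized sample means Z_n of an Exponential sample (mu = 1, sigma = 1)
# should be approximately N(0, 1) for moderate n.
rng = np.random.default_rng(0)
n, reps = 50, 10_000
mu, sigma = 1.0, 1.0

samples = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma   # Z_n for each replication

phi_1 = 0.5 * (1 + erf(1 / sqrt(2)))                   # Phi(1) ~ 0.8413
print("empirical P[Z_n <= 1]:", (z <= 1).mean(), " Phi(1):", phi_1)
```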
11 Statistical Inference

Let X₁, ..., X_n iid ∼ F if not otherwise noted.

11.1 Point Estimation
• Point estimator θ̂_n of θ is a rv: θ̂_n = g(X₁, ..., X_n)
• bias(θ̂_n) = E[θ̂_n] − θ
• Consistency: θ̂_n →P θ
• Sampling distribution: F(θ̂_n)
• Standard error: se(θ̂_n) = √V[θ̂_n]
• Mean squared error: mse = E[(θ̂_n − θ)²] = bias(θ̂_n)² + V[θ̂_n]
• lim_{n→∞} bias(θ̂_n) = 0 ∧ lim_{n→∞} se(θ̂_n) = 0 ⇒ θ̂_n is consistent
• Asymptotic normality: (θ̂_n − θ)/se →D N(0, 1)
• Slutzky's Theorem often lets us replace se(θ̂_n) by some (weakly) consistent estimator σ̂_n.

11.2 Normal-based Confidence Interval
Suppose θ̂_n ≈ N(θ, sê²). Let z_{α/2} = Φ^{−1}(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α, where Z ∼ N(0, 1). Then
C_n = θ̂_n ± z_{α/2} sê

11.3 Empirical Distribution Function
Empirical Distribution Function (ECDF)
F̂_n(x) = (1/n) Σ_{i=1}^n I(X_i ≤ x),  where I(X_i ≤ x) = 1 if X_i ≤ x and 0 if X_i > x

Properties (for any fixed x)
• E[F̂_n(x)] = F(x)
• V[F̂_n(x)] = F(x)(1 − F(x))/n
• F̂_n(x) →P F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality (X₁, ..., X_n ∼ F)
P[sup_x |F(x) − F̂_n(x)| > ε] ≤ 2e^{−2nε²}

Nonparametric 1 − α confidence band for F
ε_n = √((1/(2n)) log(2/α))
L(x) = max{F̂_n(x) − ε_n, 0},  U(x) = min{F̂_n(x) + ε_n, 1}
P[L(x) ≤ F(x) ≤ U(x) ∀x] ≥ 1 − α

11.4 Statistical Functionals
• Statistical functional: T(F)
• Plug-in estimator of θ = T(F): θ̂_n = T(F̂_n)
• Linear functional: T(F) = ∫ ϕ(x) dF_X(x)
• Plug-in estimator for a linear functional: T(F̂_n) = ∫ ϕ(x) dF̂_n(x) = (1/n) Σ_{i=1}^n ϕ(X_i)
• Often: T(F̂_n) ≈ N(T(F), sê²) ⇒ T(F̂_n) ± z_{α/2} sê
• p-th quantile: F^{−1}(p) = inf{x : F(x) ≥ p}
• µ̂ = X̄_n
• σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²
• κ̂ = (1/n) Σ_{i=1}^n (X_i − µ̂)³ / σ̂³
• ρ̂ = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / √(Σ_{i=1}^n (X_i − X̄_n)² Σ_{i=1}^n (Y_i − Ȳ_n)²)
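A minimal sketch, assuming numpy is available, of the ECDF with its DKW confidence band (the standard-normal sample and α = 0.05 are arbitrary illustration choices):

```python
import numpy as np

def ecdf_band(x, alpha=0.05):
    """ECDF F_hat_n and a 1 - alpha DKW band, evaluated at the sorted data points."""
    x = np.sort(np.asarray(x))
    n = x.size
    f_hat = np.arange(1, n + 1) / n                 # F_hat_n(x_(i))
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))      # epsilon_n from the DKW inequality
    return x, f_hat, np.clip(f_hat - eps, 0, 1), np.clip(f_hat + eps, 0, 1)

rng = np.random.default_rng(1)
x, f_hat, lo, hi = ecdf_band(rng.normal(size=200))
print(f_hat[:5], lo[:5], hi[:5])
```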
12 Parametric Inference

12.1 Method of Moments
j-th moment
α_j(θ) = E[X^j] = ∫ x^j dF_X(x)

j-th sample moment
α̂_j = (1/n) Σ_{i=1}^n X_i^j

Method of Moments estimator (MoM): solve the system
α₁(θ) = α̂₁,  α₂(θ) = α̂₂,  ...,  α_k(θ) = α̂_k

Properties of the MoM estimator
• θ̂_n exists with probability tending to 1
• Consistency: θ̂_n →P θ
• Asymptotic normality: √n(θ̂_n − θ) →D N(0, Σ)
  where Σ = g E[YYᵀ] gᵀ,  Y = (X, X², ..., X^k)ᵀ,  g = (g₁, ..., g_k) and g_j = ∂/∂θ α_j^{−1}(θ)

12.2 Maximum Likelihood
Likelihood: L_n : Θ → [0, ∞)
L_n(θ) = Π_{i=1}^n f(X_i; θ)

Log-likelihood
ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum Likelihood Estimator (MLE)
L_n(θ̂_n) = sup_θ L_n(θ)

Score function
s(X; θ) = ∂/∂θ log f(X; θ)

Fisher Information
I(θ) = V_θ[s(X; θ)],  I_n(θ) = n I(θ)

Fisher Information (exponential family)
I(θ) = E_θ[−∂/∂θ s(X; θ)]

Observed Fisher Information
I_n^{obs}(θ) = −∂²/∂θ² Σ_{i=1}^n log f(X_i; θ)

Properties of the MLE
• Consistency: θ̂_n →P θ
• Equivariance: θ̂_n is the MLE ⇒ ϕ(θ̂_n) is the MLE of ϕ(θ)
• Asymptotic normality:
  1. se ≈ √(1/I_n(θ)),  (θ̂_n − θ)/se →D N(0, 1)
  2. sê ≈ √(1/I_n(θ̂_n)),  (θ̂_n − θ)/sê →D N(0, 1)
• Asymptotic optimality, i.e., smallest variance for large samples. If θ̃_n is any other estimator, the asymptotic relative efficiency is are(θ̃_n, θ̂_n) = V[θ̂_n]/V[θ̃_n] ≤ 1.
• Approximately the Bayes estimator

12.2.1 Delta Method
If τ = ϕ(θ) where ϕ is differentiable and ϕ′(θ) ≠ 0:
(τ̂_n − τ)/sê(τ̂) →D N(0, 1)
where τ̂ = ϕ(θ̂_n) is the MLE of τ and sê(τ̂) = |ϕ′(θ̂_n)| sê(θ̂_n)
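A minimal sketch, assuming numpy is available, of the MLE and its Fisher-information-based standard error for a Bernoulli(p) sample (the closed forms p̂ = X̄_n and I_n(p) = n/(p(1 − p)) are standard; the data and seed are illustration choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=500)

n = x.size
p_hat = x.mean()                              # MLE maximizes sum_i log f(x_i; p)
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)     # sqrt(1 / I_n(p_hat))

# Normal-based 95% confidence interval (z_{alpha/2} ~= 1.96).
print(p_hat, (p_hat - 1.96 * se_hat, p_hat + 1.96 * se_hat))
```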
12.3 Multiparameter Models
Let θ = (θ₁, ..., θ_k) and θ̂ = (θ̂₁, ..., θ̂_k) be the MLE. Define
H_jj = ∂²ℓ_n/∂θ_j²,  H_jk = ∂²ℓ_n/∂θ_j ∂θ_k

Fisher Information Matrix
I_n(θ) = − [ E_θ[H₁₁] ··· E_θ[H₁_k] ; ⋮ ⋱ ⋮ ; E_θ[H_k1] ··· E_θ[H_kk] ]

Under appropriate regularity conditions
(θ̂ − θ) ≈ N(0, J_n)  with J_n(θ) = I_n^{−1}(θ)
Further, if θ̂_j is the j-th component of θ̂, then
(θ̂_j − θ_j)/sê_j →D N(0, 1)
where sê_j² = J_n(j, j) and Cov[θ̂_j, θ̂_k] = J_n(j, k)

12.3.1 Multiparameter Delta Method
Let τ = ϕ(θ₁, ..., θ_k) be a function and let the gradient of ϕ be
∇ϕ = (∂ϕ/∂θ₁, ..., ∂ϕ/∂θ_k)ᵀ
Suppose ∇ϕ evaluated at θ = θ̂ is not 0 and let τ̂ = ϕ(θ̂). Then
(τ̂ − τ)/sê(τ̂) →D N(0, 1)
where sê(τ̂) = √((∇̂ϕ)ᵀ Ĵ_n (∇̂ϕ)),  Ĵ_n = J_n(θ̂) and ∇̂ϕ = ∇ϕ evaluated at θ = θ̂

13 Hypothesis Testing

H₀ : θ ∈ Θ₀  versus  H₁ : θ ∈ Θ₁

Definitions
• Null hypothesis H₀
• Alternative hypothesis H₁
• Simple hypothesis: θ = θ₀
• Composite hypothesis: θ > θ₀ or θ < θ₀
• Two-sided test: H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀
• One-sided test: H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀
• Critical value c
• Test statistic T
• Rejection region R = {x : T(x) > c}
• Power function β(θ) = P_θ[X ∈ R]
• Power of a test: 1 − P[Type II error] = 1 − β = inf_{θ∈Θ₁} β(θ)
• Test size: α = P[Type I error] = sup_{θ∈Θ₀} β(θ)

              Retain H₀            Reject H₀
  H₀ true    correct              Type I error (α)
  H₁ true    Type II error (β)    correct (power)

p-value
• p-value = sup_{θ∈Θ₀} P_θ[T(X) ≥ T(x)] = inf{α : T(x) ∈ R_α}
• Under H₀, the p-value is 1 − F_θ(T(X)) since T(X) ∼ F_θ

p-value        evidence
< 0.01         very strong evidence against H₀
0.01 – 0.05    strong evidence against H₀
0.05 – 0.1     weak evidence against H₀
> 0.1          little or no evidence against H₀

Wald Test
• Two-sided test
• Reject H₀ when |W| > z_{α/2},  where W = (θ̂ − θ₀)/sê
• P[|W| > z_{α/2}] → α

Likelihood Ratio Test (LRT)
• T(X) = sup_{θ∈Θ} L_n(θ) / sup_{θ∈Θ₀} L_n(θ) = L_n(θ̂_n)/L_n(θ̂_{n,0})
• λ(X) = 2 log T(X) →D χ²_{r−q}  (r = dim Θ, q = dim Θ₀),  where χ²_k = Σ_{i=1}^k Z_i² with Z₁, ..., Z_k iid ∼ N(0, 1)
• p-value = P_{θ₀}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT
• Let p̂_n = (X₁/n, ..., X_k/n) be the MLE
• T(X) = L_n(p̂_n)/L_n(p₀) = Π_{j=1}^k (p̂_j/p₀_j)^{X_j}
• λ(X) = 2 Σ_{j=1}^k X_j log(p̂_j/p₀_j) →D χ²_{k−1}
• The approximate size-α LRT rejects H₀ when λ(X) ≥ χ²_{k−1,α}

Pearson χ² Test
• T = Σ_{j=1}^k (X_j − E[X_j])²/E[X_j],  where E[X_j] = np₀_j under H₀
• T →D χ²_{k−1}
• p-value = P[χ²_{k−1} > T(x)]
• Faster convergence to χ²_{k−1} than the LRT, hence preferable for small n

Independence Testing
• I rows, J columns, X a multinomial sample over the n = I·J cells
• MLEs unconstrained: p̂_ij = X_ij/n
• MLEs under H₀: p̂⁰_ij = p̂_i· p̂_·j = (X_i·/n)(X_·j/n)
• LRT: λ = 2 Σ_{i=1}^I Σ_{j=1}^J X_ij log(X_ij n/(X_i· X_·j))
• Pearson χ²: T = Σ_{i=1}^I Σ_{j=1}^J (X_ij − E[X_ij])²/E[X_ij]
• LRT and Pearson →D χ²_ν,  where ν = (I − 1)(J − 1)

14 Bayesian Inference

Bayes' Theorem
f(θ | xⁿ) = f(xⁿ | θ) f(θ) / ∫ f(xⁿ | θ) f(θ) dθ ∝ L_n(θ) f(θ)

Definitions
• Xⁿ = (X₁, ..., X_n),  xⁿ = (x₁, ..., x_n)
• Prior density f(θ)
• Likelihood f(xⁿ | θ): joint density of the data. In particular, Xⁿ iid ⇒ f(xⁿ | θ) = Π_{i=1}^n f(x_i | θ) = L_n(θ)
• Posterior density f(θ | xⁿ)
• Normalizing constant c_n = f(xⁿ) = ∫ f(x | θ) f(θ) dθ
• Kernel: part of a density that depends on θ
• Posterior mean θ̄_n = ∫ θ f(θ | xⁿ) dθ = ∫ θ L_n(θ) f(θ) dθ / ∫ L_n(θ) f(θ) dθ

14.1 Credible Intervals

1 − α posterior interval
P[θ ∈ (a, b) | xⁿ] = ∫_a^b f(θ | xⁿ) dθ = 1 − α

1 − α equal-tail credible interval
∫_{−∞}^a f(θ | xⁿ) dθ = ∫_b^∞ f(θ | xⁿ) dθ = α/2

1 − α highest posterior density (HPD) region R_n
1. P[θ ∈ R_n] = 1 − α
2. R_n = {θ : f(θ | xⁿ) > k} for some k
R_n unimodal ⇒ R_n is an interval

14.2 Function of Parameters
Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ}.
Posterior CDF for τ:  H(τ | xⁿ) = P[ϕ(θ) ≤ τ | xⁿ] = ∫_A f(θ | xⁿ) dθ
Posterior density:  h(τ | xⁿ) = H′(τ | xⁿ)
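A minimal sketch, assuming numpy is available, of an equal-tail credible interval computed from posterior draws; here a Beta(5, 15) sample simply stands in for draws from f(θ | xⁿ):

```python
import numpy as np

rng = np.random.default_rng(3)
posterior_draws = rng.beta(5, 15, size=100_000)   # stand-in for draws from f(theta | x^n)

alpha = 0.05
a, b = np.quantile(posterior_draws, [alpha / 2, 1 - alpha / 2])
print(f"equal-tail {1 - alpha:.0%} credible interval: ({a:.3f}, {b:.3f})")
```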
14.3 Priors

Choice
• Subjective Bayesianism: the prior should incorporate as much detail as possible of the researcher's a priori knowledge — via prior elicitation.
• Objective Bayesianism: the prior should incorporate as little detail as possible (non-informative prior).
• Robust Bayesianism: consider various priors and determine the sensitivity of our inferences to changes in the prior.

Types
• Flat: f(θ) ∝ constant
• Proper: ∫_{−∞}^∞ f(θ) dθ = 1
• Improper: ∫_{−∞}^∞ f(θ) dθ = ∞
• Jeffreys' prior (transformation-invariant): f(θ) ∝ √I(θ),  f(θ) ∝ √det(I(θ))
• Conjugate: f(θ) and f(θ | xⁿ) belong to the same parametric family

14.3.1 Conjugate Priors

Discrete likelihood — conjugate prior — posterior hyperparameters:
• Bernoulli(p) — Beta(α, β) — α + Σ_{i=1}^n x_i,  β + n − Σ_{i=1}^n x_i
• Binomial(p) — Beta(α, β) — α + Σ_{i=1}^n x_i,  β + Σ_{i=1}^n N_i − Σ_{i=1}^n x_i
• Negative Binomial(p) — Beta(α, β) — α + rn,  β + Σ_{i=1}^n x_i
• Geometric(p) — Beta(α, β) — α + n,  β + Σ_{i=1}^n x_i
• Poisson(λ) — Gamma(α, β) — α + Σ_{i=1}^n x_i,  β + n
• Multinomial(p) — Dirichlet(α) — α + Σ_{i=1}^n x^{(i)}

Continuous likelihood (subscript c denotes a known constant) — conjugate prior — posterior hyperparameters:
• Uniform(0, θ) — Pareto(x_m, k) — max{x_(n), x_m},  k + n
• Exponential(λ) — Gamma(α, β) — α + n,  β + Σ_{i=1}^n x_i
• Normal(µ, σ_c²) — Normal(µ₀, σ₀²) — (µ₀/σ₀² + Σ_i x_i/σ_c²)/(1/σ₀² + n/σ_c²),  (1/σ₀² + n/σ_c²)^{−1}
• Normal(µ_c, σ²) — Scaled Inverse Chi-square(ν, σ₀²) — ν + n,  (νσ₀² + Σ_i (x_i − µ_c)²)/(ν + n)
• Normal(µ, σ²) — Normal-scaled Inverse Gamma(λ, ν, α, β) — (νλ + nx̄)/(ν + n),  ν + n,  α + n/2,  β + (1/2) Σ_i (x_i − x̄)² + nν(x̄ − λ)²/(2(n + ν))
• MVN(µ, Σ_c) — MVN(µ₀, Σ₀) — (Σ₀^{−1} + nΣ_c^{−1})^{−1}(Σ₀^{−1}µ₀ + nΣ_c^{−1}x̄),  (Σ₀^{−1} + nΣ_c^{−1})^{−1}
• MVN(µ_c, Σ) — InverseWishart(κ, Ψ) — n + κ,  Ψ + Σ_i (x_i − µ_c)(x_i − µ_c)ᵀ
• Pareto(x_mc, k) — Gamma(α, β) — α + n,  β + Σ_i log(x_i/x_mc)
• Pareto(x_m, k_c) — Pareto(x₀, k₀) — x₀,  k₀ − k_c n  (where k₀ > k_c n)
• Gamma(α_c, β) — Gamma(α₀, β₀) — α₀ + nα_c,  β₀ + Σ_i x_i

14.4 Bayesian Testing
If H₀ : θ ∈ Θ₀:

Prior probability  P[H₀] = ∫_{Θ₀} f(θ) dθ
Posterior probability  P[H₀ | xⁿ] = ∫_{Θ₀} f(θ | xⁿ) dθ

Marginal likelihood
f(xⁿ | H_i) = ∫_Θ f(xⁿ | θ, H_i) f(θ | H_i) dθ

Posterior odds (of H_i relative to H_j)
P[H_i | xⁿ]/P[H_j | xⁿ] = (f(xⁿ | H_i)/f(xⁿ | H_j)) × (P[H_i]/P[H_j]),
i.e., posterior odds = Bayes factor BF_ij × prior odds

log₁₀ BF₁₀    BF₁₀         evidence
0 – 0.5       1 – 3.2      Weak
0.5 – 1       3.2 – 10     Moderate
1 – 2         10 – 100     Strong
> 2           > 100        Decisive

p* = (p/(1 − p)) BF₁₀ / (1 + (p/(1 − p)) BF₁₀),  where p = P[H₁] and p* = P[H₁ | xⁿ]
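A minimal sketch, assuming numpy is available, of the first conjugate update in the table (Bernoulli likelihood, Beta prior); the prior hyperparameters, data, and seed are arbitrary illustration choices:

```python
import numpy as np

# Beta(alpha, beta) prior + Bernoulli(p) data  ->  Beta(alpha + sum x_i, beta + n - sum x_i) posterior.
rng = np.random.default_rng(4)
x = rng.binomial(1, 0.3, size=50)

alpha0, beta0 = 2.0, 2.0
alpha_n = alpha0 + x.sum()
beta_n = beta0 + x.size - x.sum()

posterior_mean = alpha_n / (alpha_n + beta_n)
print(f"posterior: Beta({alpha_n:.0f}, {beta_n:.0f}), mean = {posterior_mean:.3f}")
```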
15 Exponential Family

Scalar parameter
f_X(x | θ) = h(x) exp{η(θ)T(x) − A(θ)} = h(x) g(θ) exp{η(θ)T(x)}

Vector parameter
f_X(x | θ) = h(x) exp{Σ_{i=1}^s η_i(θ)T_i(x) − A(θ)} = h(x) exp{η(θ)·T(x) − A(θ)} = h(x) g(θ) exp{η(θ)·T(x)}

Natural form
f_X(x | η) = h(x) exp{η·T(x) − A(η)} = h(x) g(η) exp{η·T(x)} = h(x) g(η) exp{ηᵀT(x)}

16 Sampling Methods

16.1 The Bootstrap
Let T_n = g(X₁, ..., X_n) be a statistic.
1. Estimate V_F[T_n] with V_{F̂_n}[T_n].
2. Approximate V_{F̂_n}[T_n] using simulation:
  (a) Repeat the following B times to get T*_{n,1}, ..., T*_{n,B}, an iid sample from the sampling distribution implied by F̂_n:
     i. Sample uniformly X₁*, ..., X_n* ∼ F̂_n.
     ii. Compute T_n* = g(X₁*, ..., X_n*).
  (b) Then
     v̂_boot = V̂_{F̂_n} = (1/B) Σ_{b=1}^B (T*_{n,b} − (1/B) Σ_{r=1}^B T*_{n,r})²

16.1.1 Bootstrap Confidence Intervals

Normal-based interval
T_n ± z_{α/2} sê_boot

Pivotal interval
1. Location parameter θ = T(F)
2. Pivot R_n = θ̂_n − θ
3. Let H(r) = P[R_n ≤ r] be the cdf of R_n
4. Let R*_{n,b} = θ̂*_{n,b} − θ̂_n. Approximate H using the bootstrap:
   Ĥ(r) = (1/B) Σ_{b=1}^B I(R*_{n,b} ≤ r)
5. Let θ*_β denote the β sample quantile of (θ̂*_{n,1}, ..., θ̂*_{n,B})
6. Let r*_β denote the β sample quantile of (R*_{n,1}, ..., R*_{n,B}), i.e., r*_β = θ*_β − θ̂_n
7. Then, an approximate 1 − α confidence interval is C_n = (â, b̂) with
   â = θ̂_n − Ĥ^{−1}(1 − α/2) = θ̂_n − r*_{1−α/2} = 2θ̂_n − θ*_{1−α/2}
   b̂ = θ̂_n − Ĥ^{−1}(α/2) = θ̂_n − r*_{α/2} = 2θ̂_n − θ*_{α/2}

16.2 Rejection Sampling
Setup
• We can easily sample from g(θ)
• We want to sample from h(θ), but it is difficult
• We know h(θ) up to a proportionality constant: h(θ) = k(θ)/∫ k(θ) dθ
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ

Algorithm
1. Draw θ_cand ∼ g(θ)
2. Generate u ∼ Unif(0, 1)
3. Accept θ_cand if u ≤ k(θ_cand)/(M g(θ_cand))
4. Repeat until B values of θ_cand have been accepted

Example
• We can easily sample from the prior g(θ) = f(θ)
• Target is the posterior with h(θ) ∝ k(θ) = f(xⁿ | θ) f(θ)
• Envelope condition: f(xⁿ | θ) ≤ f(xⁿ | θ̂_n) = L_n(θ̂_n) ≡ M
• Algorithm
  1. Draw θ_cand ∼ f(θ)
  2. Generate u ∼ Unif(0, 1)
  3. Accept θ_cand if u ≤ L_n(θ_cand)/L_n(θ̂_n)

16.3 Importance Sampling
Sample from an importance function g rather than the target density h.
Algorithm to obtain an approximation to E[q(θ) | xⁿ]:
1. Sample from the prior θ₁, ..., θ_B iid ∼ f(θ)
2. For each i = 1, ..., B, calculate w_i = L_n(θ_i) / Σ_{i=1}^B L_n(θ_i)
3. E[q(θ) | xⁿ] ≈ Σ_{i=1}^B q(θ_i) w_i
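A minimal sketch, assuming numpy is available, of the bootstrap variance estimate and the normal-based interval from 16.1.1 for the sample median (data, B, and seed are arbitrary illustration choices):

```python
import numpy as np

def bootstrap_se(x, stat, B=2000, seed=0):
    """Bootstrap standard error of stat(x): resample from F_hat_n, i.e. from x with replacement."""
    rng = np.random.default_rng(seed)
    n = x.size
    t_star = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    return t_star.std(ddof=1), t_star

rng = np.random.default_rng(5)
x = rng.exponential(size=100)

t_n = np.median(x)
se_boot, t_star = bootstrap_se(x, np.median)
# Normal-based interval T_n +/- z_{alpha/2} se_boot; percentile quantities also come from t_star.
print(t_n, (t_n - 1.96 * se_boot, t_n + 1.96 * se_boot))
```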
17 Decision Theory

Definitions
• Decision rule: synonymous with an estimator θ̂
• Action a ∈ A: possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
• Loss function L: consequences of taking action a when the true state is θ, or the discrepancy between θ and θ̂;  L : Θ × A → [−k, ∞).

Loss functions
• Squared error loss: L(θ, a) = (θ − a)²
• Linear loss: L(θ, a) = K₁(θ − a) if a − θ < 0;  K₂(a − θ) if a − θ ≥ 0
• Absolute error loss: L(θ, a) = |θ − a|  (linear loss with K₁ = K₂)
• L_p loss: L(θ, a) = |θ − a|^p
• Zero-one loss: L(θ, a) = 0 if a = θ;  1 if a ≠ θ

17.1 Risk
Posterior risk
r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]

(Frequentist) risk
R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]

Bayes risk
r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
r(f, θ̂) = E_θ[E_{X|θ}[L(θ, θ̂(X))]] = E_θ[R(θ, θ̂)]
r(f, θ̂) = E_X[E_{θ|X}[L(θ, θ̂(X))]] = E_X[r(θ̂ | X)]

17.2 Admissibility
θ̂′ dominates θ̂ if
∀θ : R(θ, θ̂′) ≤ R(θ, θ̂)  and  ∃θ : R(θ, θ̂′) < R(θ, θ̂)
θ̂ is inadmissible if some other estimator dominates it; otherwise it is admissible.

17.3 Bayes Rule
Bayes rule (or Bayes estimator)
• r(f, θ̂) = inf_θ̃ r(f, θ̃)
• If θ̂(x) minimizes the posterior risk r(θ̂ | x) for every x, then it minimizes the Bayes risk r(f, θ̂) = ∫ r(θ̂ | x) f(x) dx.

Bayes rules under specific loss functions
• Squared error loss: posterior mean
• Absolute error loss: posterior median
• Zero-one loss: posterior mode

17.4 Minimax Rules
Maximum risk
R̄(θ̂) = sup_θ R(θ, θ̂),  R̄(a) = sup_θ R(θ, a)

Minimax rule
sup_θ R(θ, θ̂) = inf_θ̃ sup_θ R(θ, θ̃) = inf_θ̃ R̄(θ̃)

θ̂ = Bayes rule ∧ ∃c : R(θ, θ̂) = c  ⇒  θ̂ is minimax

Least favorable prior
θ̂_f = Bayes rule ∧ R(θ, θ̂_f) ≤ r(f, θ̂_f) ∀θ  ⇒  θ̂_f is minimax and f is least favorable.
18 Linear Regression

Definitions
• Response variable Y
• Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression
Model
Y_i = β₀ + β₁X_i + ε_i,  E[ε_i | X_i] = 0,  V[ε_i | X_i] = σ²

Fitted line
r̂(x) = β̂₀ + β̂₁x

Predicted (fitted) values
Ŷ_i = r̂(X_i)

Residuals
ε̂_i = Y_i − Ŷ_i = Y_i − (β̂₀ + β̂₁X_i)

Residual sums of squares (rss)
rss(β̂₀, β̂₁) = Σ_{i=1}^n ε̂_i²

Least squares estimates
β̂ᵀ = (β̂₀, β̂₁)ᵀ minimizes rss(β₀, β₁):
β̂₀ = Ȳ_n − β̂₁X̄_n
β̂₁ = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / Σ_{i=1}^n (X_i − X̄_n)²

E[β̂ | Xⁿ] = (β₀, β₁)ᵀ
V[β̂ | Xⁿ] = (σ²/(n s_X²)) [ (1/n) Σ_{i=1}^n X_i²   −X̄_n ;  −X̄_n   1 ]
sê(β̂₀) = (σ̂/(s_X √n)) √((1/n) Σ_{i=1}^n X_i²)
sê(β̂₁) = σ̂/(s_X √n)
where s_X² = (1/n) Σ_{i=1}^n (X_i − X̄_n)² and σ̂² = (1/(n − 2)) Σ_{i=1}^n ε̂_i²  (an unbiased estimate of σ²).

Further properties
• Consistency: β̂₀ →P β₀ and β̂₁ →P β₁
• Asymptotic normality:
  (β̂₀ − β₀)/sê(β̂₀) →D N(0, 1)  and  (β̂₁ − β₁)/sê(β̂₁) →D N(0, 1)
• Approximate 1 − α confidence intervals for β₀ and β₁:
  β̂₀ ± z_{α/2} sê(β̂₀)  and  β̂₁ ± z_{α/2} sê(β̂₁)
• The Wald test for H₀ : β₁ = 0 vs. H₁ : β₁ ≠ 0: reject H₀ if |W| > z_{α/2} where W = β̂₁/sê(β̂₁).

Likelihood
L = Π_{i=1}^n f(X_i, Y_i) = Π_{i=1}^n f_X(X_i) × Π_{i=1}^n f_{Y|X}(Y_i | X_i) = L₁ × L₂
L₁ = Π_{i=1}^n f_X(X_i)
L₂ = Π_{i=1}^n f_{Y|X}(Y_i | X_i) ∝ σ^{−n} exp{−(1/(2σ²)) Σ_i (Y_i − (β₀ + β₁X_i))²}
Under the assumption of Normality, the least squares estimator is also the MLE, with
σ̂²_mle = (1/n) Σ_{i=1}^n ε̂_i²

18.2 Prediction
Observe X = x* of the covariate and want to predict the outcome Y*.
Ŷ* = β̂₀ + β̂₁x*
V[Ŷ*] = V[β̂₀] + x*²V[β̂₁] + 2x* Cov[β̂₀, β̂₁]

Prediction interval
ξ̂_n² = σ̂² (Σ_{i=1}^n (X_i − x*)² / (n Σ_j (X_j − X̄)²) + 1)
Ŷ* ± z_{α/2} ξ̂_n

18.3 Multiple Regression
Y = Xβ + ε,  where
X = [ X₁₁ ··· X₁k ; ⋮ ⋱ ⋮ ; X_n1 ··· X_nk ],  β = (β₁, ..., β_k)ᵀ,  ε = (ε₁, ..., ε_n)ᵀ

Likelihood
L(µ, Σ) = (2πσ²)^{−n/2} exp{−(1/(2σ²)) rss},  rss = ||Y − Xβ||²

If the (k × k) matrix XᵀX is invertible,
β̂ = (XᵀX)^{−1}XᵀY
V[β̂ | Xⁿ] = σ²(XᵀX)^{−1}
β̂ ≈ N(β, σ²(XᵀX)^{−1})

Estimated regression function
r̂(x) = Σ_{j=1}^k β̂_j x_j

Unbiased estimate for σ²
σ̂² = (1/(n − k)) Σ_{i=1}^n ε̂_i²,  ε̂ = Xβ̂ − Y

1 − α confidence interval
β̂_j ± z_{α/2} sê(β̂_j)

Hypothesis testing: H₀ : β_j = 0 vs. H₁ : β_j ≠ 0  ∀j ∈ J

18.4 Model Selection
Consider predicting a new observation Y* for covariates X* and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance

Procedure (see the regression sketch after this subsection)
1. Assign a score to each model
2. Search through all models to find the one with the highest score

Mean squared prediction error (mspe)
mspe = E[(Ŷ(S) − Y*)²]

Training error
R̂_tr(S) = Σ_{i=1}^n (Ŷ_i(S) − Y_i)²

R²
R²(S) = 1 − rss(S)/tss = 1 − R̂_tr(S)/tss,  where tss = Σ_{i=1}^n (Y_i − Ȳ)²

The training error is a downward-biased estimate of the prediction risk R(S):
E[R̂_tr(S)] < R(S)
bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σ_{i=1}^n Cov[Ŷ_i, Y_i]

Adjusted R²
R̄²(S) = 1 − ((n − 1)/(n − k)) rss/tss

Mallow's C_p statistic
R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty

Akaike Information Criterion (AIC)
AIC(S) = ℓ_n(β̂_S, σ̂_S²) − k

Bayesian Information Criterion (BIC)
BIC(S) = ℓ_n(β̂_S, σ̂_S²) − (k/2) log n

Validation and training
R̂_V(S) = Σ_{i=1}^m (Ŷ_i*(S) − Y_i*)²,  m = |{validation data}|, often n/4 or n/2

Leave-one-out cross-validation
R̂_CV(S) = Σ_{i=1}^n (Y_i − Ŷ_{(i)})²,  where Ŷ_{(i)} is the prediction for Y_i with the i-th observation left out
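A minimal sketch, assuming numpy is available, of the least squares fit with standard errors from V[β̂ | Xⁿ] = σ̂²(XᵀX)^{−1}; the simulated data and seed are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

X = np.column_stack([np.ones(n), x])            # design matrix with intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 2)            # unbiased estimate of sigma^2
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)  # V[beta_hat | X^n]
se = np.sqrt(np.diag(cov_beta))

print(beta_hat, se)                             # estimates ~ (1, 2) and their standard errors
```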
19 Non-parametric Function Estimation

19.1 Density Estimation
Estimate f(x), where P[X ∈ A] = ∫_A f(x) dx.

Integrated square error (ISE)
L(f, f̂_n) = ∫ (f(x) − f̂_n(x))² dx = J(h) + ∫ f²(x) dx

Frequentist risk
R(f, f̂_n) = E[L(f, f̂_n)] = ∫ b²(x) dx + ∫ v(x) dx
b(x) = E[f̂_n(x)] − f(x),  v(x) = V[f̂_n(x)]

19.1.1 Histograms

Definitions
• Number of bins m
• Binwidth h = 1/m
• Bin B_j has ν_j observations
• Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du

Histogram estimator
f̂_n(x) = Σ_{j=1}^m (p̂_j/h) I(x ∈ B_j)

E[f̂_n(x)] = p_j/h
V[f̂_n(x)] = p_j(1 − p_j)/(nh²)
R(f̂_n, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
h* = (1/n^{1/3}) (6/∫ (f′(u))² du)^{1/3}
R*(f̂_n, f) ≈ C/n^{2/3},  C = (3/4)^{2/3} (∫ (f′(u))² du)^{1/3}

Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i)

19.1.2 Kernel Density Estimator (KDE)

Kernel K
• K(x) ≥ 0
• ∫ K(x) dx = 1
• ∫ xK(x) dx = 0
• ∫ x²K(x) dx ≡ σ²_K > 0

KDE
f̂_n(x) = (1/n) Σ_{i=1}^n (1/h) K((x − X_i)/h)

R(f, f̂_n) ≈ (1/4)(hσ_K)⁴ ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx
h* = c₁^{−2/5} c₂^{−1/5} c₃^{−1/5} n^{−1/5},  where c₁ = σ²_K,  c₂ = ∫ K²(x) dx,  c₃ = ∫ (f″(x))² dx
R*(f, f̂_n) = c₄/n^{4/5},  c₄ = (5/4)(σ²_K)^{2/5} (∫ K²(x) dx)^{4/5} (∫ (f″(x))² dx)^{1/5}

Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i)
         ≈ (1/(hn²)) Σ_{i=1}^n Σ_{j=1}^n K*((X_i − X_j)/h) + (2/(nh)) K(0)
K*(x) = K^{(2)}(x) − 2K(x),  K^{(2)}(x) = ∫ K(x − y) K(y) dy

Epanechnikov kernel
K(x) = (3/(4√5))(1 − x²/5) for |x| < √5,  0 otherwise

19.2 Non-parametric Regression
Estimate r(x), where r(x) = E[Y | X = x]. Consider pairs of points (x₁, Y₁), ..., (x_n, Y_n).

k-nearest neighbor estimator
r̂(x) = (1/k) Σ_{i : x_i ∈ N_k(x)} Y_i,  where N_k(x) = {k values of x₁, ..., x_n closest to x}

Nadaraya-Watson kernel estimator
r̂(x) = Σ_{i=1}^n w_i(x) Y_i
w_i(x) = K((x − x_i)/h) / Σ_{j=1}^n K((x − x_j)/h)

R(r̂_n, r) ≈ (h⁴/4) (∫ x²K(x) dx)² ∫ (r″(x) + 2r′(x) f′(x)/f(x))² dx + (σ² ∫ K²(x) dx)/(nh) ∫ (1/f(x)) dx
h* ≈ c₁ n^{−1/5}
R*(r̂_n, r) ≈ c₂ n^{−4/5}

Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = Σ_{i=1}^n (Y_i − r̂_{(−i)}(x_i))² = Σ_{i=1}^n ((Y_i − r̂(x_i)) / (1 − K(0)/Σ_{j=1}^n K((x_i − x_j)/h)))²

19.3 Smoothing Using Orthogonal Functions
Approximation
r(x) = Σ_{j=1}^∞ β_j φ_j(x) ≈ Σ_{j=1}^J β_j φ_j(x)

Multivariate regression form
Y = Φβ + η,  where η_i = ε_i and
Φ = [ φ₀(x₁) ··· φ_J(x₁) ; ⋮ ⋱ ⋮ ; φ₀(x_n) ··· φ_J(x_n) ]

Least squares estimator
β̂ = (ΦᵀΦ)^{−1}ΦᵀY

Cross-validation estimate of E[J(h)]
R̂_CV(J) = Σ_{i=1}^n (Y_i − Σ_{j=1}^J φ_j(x_i) β̂_{j,(−i)})²
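A minimal sketch of the KDE formula above, assuming numpy is available; the Gaussian kernel and the normal-reference bandwidth h ≈ 1.06 σ̂ n^{−1/5} (a common rule of thumb, not stated in the sheet) are assumptions for illustration:

```python
import numpy as np

def kde_gaussian(x_grid, data, h):
    """f_hat_n(x) = (1/n) sum_i (1/h) K((x - X_i)/h) with a Gaussian kernel K."""
    z = (x_grid[:, None] - data[None, :]) / h
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / h

rng = np.random.default_rng(7)
data = rng.normal(size=300)
grid = np.linspace(-4, 4, 200)
h = 1.06 * data.std(ddof=1) * data.size ** (-1 / 5)   # assumed rule-of-thumb bandwidth
f_hat = kde_gaussian(grid, data, h)
print(f_hat.max())   # should be near the N(0,1) mode, ~0.4
```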
20 Stochastic Processes

Stochastic process
{X_t : t ∈ T},  T = {0, 1, ...} = Z (discrete)  or  [0, ∞) (continuous)
• Notations: X_t, X(t)
• State space X
• Index set T

20.1 Markov Chains
Markov chain: {X_n : n ∈ T} with
P[X_n = x | X₀, ..., X_{n−1}] = P[X_n = x | X_{n−1}]  ∀n ∈ T, x ∈ X

Transition probabilities
p_ij ≡ P[X_{n+1} = j | X_n = i]
n-step: p_ij(n) ≡ P[X_{m+n} = j | X_m = i]

Transition matrix P (n-step: P_n)
• (i, j) element is p_ij
• p_ij ≥ 0
• Σ_j p_ij = 1

Chapman-Kolmogorov
p_ij(m + n) = Σ_k p_ik(m) p_kj(n)
P_{m+n} = P_m P_n
P_n = P × ··· × P = Pⁿ

Marginal probability
µ_n = (µ_n(1), ..., µ_n(N)),  where µ_n(i) = P[X_n = i]
µ_n = µ₀ Pⁿ  (µ₀ = initial distribution)

20.2 Poisson Processes
Poisson process
• {X_t : t ∈ [0, ∞)} — number of events up to and including time t
• X₀ = 0
• Independent increments: ∀t₀ < ··· < t_n : X_{t₁} − X_{t₀} ⊥⊥ ··· ⊥⊥ X_{t_n} − X_{t_{n−1}}
• Intensity function λ(t)
  – P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
  – P[X_{t+h} − X_t ≥ 2] = o(h)
• X_{s+t} − X_s ∼ Poisson(m(s + t) − m(s)),  where m(t) = ∫_0^t λ(s) ds

Homogeneous Poisson process
λ(t) ≡ λ ⇒ X_t ∼ Poisson(λt),  λ > 0

Waiting times
W_t := time at which X_t occurs
W_t ∼ Gamma(t, 1/λ)

Interarrival times
S_t = W_{t+1} − W_t
S_t ∼ Exp(1/λ)
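A minimal sketch, assuming numpy is available, of a homogeneous Poisson process simulated through its exponential interarrival times (λ, the time horizon, and the seed are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(8)
lam, t_max = 2.0, 10.0

interarrivals = rng.exponential(scale=1 / lam, size=100)   # S_t ~ Exp(1/lambda), mean 1/lambda
event_times = np.cumsum(interarrivals)                      # waiting times W_t
n_events = np.searchsorted(event_times, t_max)              # X_{t_max}

print(n_events, "events by t =", t_max, "(E[X_t] =", lam * t_max, ")")
```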
21 Time Series

Mean function
µ_{X_t} = E[X_t] = ∫ x f_t(x) dx

Autocovariance function
γ_X(s, t) = E[(X_s − µ_s)(X_t − µ_t)] = E[X_s X_t] − µ_s µ_t
γ_X(t, t) = E[(X_t − µ_t)²] = V[X_t]

Autocorrelation function (ACF)
ρ(s, t) = Cov[X_s, X_t]/√(V[X_s] V[X_t]) = γ(s, t)/√(γ(s, s)γ(t, t))

Cross-covariance function (CCV)
γ_XY(s, t) = E[(X_s − µ_{X_s})(Y_t − µ_{Y_t})]

Cross-correlation function (CCF)
ρ_XY(s, t) = γ_XY(s, t)/√(γ_X(s, s)γ_Y(t, t))

Backshift operator
B^k(X_t) = X_{t−k}

Difference operator
∇^d = (1 − B)^d

White noise W_t
• E[W_t] = 0  ∀t ∈ T
• V[W_t] = σ²  ∀t ∈ T
• γ_W(s, t) = 0  for s ≠ t,  s, t ∈ T

Random walk
X_t = δt + Σ_{j=1}^t W_j,  where the constant δ is the drift

Linear process
X_t = µ + Σ_{j=−∞}^∞ ψ_j W_{t−j},  where Σ_{j=−∞}^∞ |ψ_j| < ∞

21.1 Stationary Time Series
Strictly stationary
P[X_{t₁} ≤ c₁, ..., X_{t_k} ≤ c_k] = P[X_{t₁+h} ≤ c₁, ..., X_{t_k+h} ≤ c_k]  ∀k ∈ N, t_k, c_k, h ∈ Z

Weakly stationary
• E[X_t²] < ∞  ∀t ∈ Z
• E[X_t] = m  ∀t ∈ Z
• γ_X(s, t) = γ_X(s + r, t + r)  ∀r, s, t ∈ Z

Autocovariance function (stationary case)
• γ(h) = E[(X_{t+h} − µ)(X_t − µ)]
• γ(0) = E[(X_t − µ)²]
• γ(0) ≥ 0
• γ(0) ≥ |γ(h)|
• γ(h) = γ(−h)  ∀h ∈ Z

Autocorrelation function (ACF)
ρ(h) = γ(t + h, t)/√(γ(t + h, t + h)γ(t, t)) = γ(h)/γ(0)

Jointly stationary time series
γ_XY(h) = E[(X_{t+h} − µ_X)(Y_t − µ_Y)]
ρ_XY(h) = γ_XY(h)/√(γ_X(0)γ_Y(0))

21.2 Estimation of Correlation
Sample autocovariance function
γ̂(h) = (1/n) Σ_{t=1}^{n−h} (X_{t+h} − X̄)(X_t − X̄)

Sample autocorrelation function
ρ̂(h) = γ̂(h)/γ̂(0)

Sample cross-covariance function
γ̂_XY(h) = (1/n) Σ_{t=1}^{n−h} (X_{t+h} − X̄)(Y_t − Ȳ)

Sample cross-correlation function
ρ̂_XY(h) = γ̂_XY(h)/√(γ̂_X(0)γ̂_Y(0))

Properties
• σ_{ρ̂_X(h)} = 1/√n  if X_t is white noise
• σ_{ρ̂_XY(h)} = 1/√n  if X_t or Y_t is white noise

21.3 Non-Stationary Time Series
Classical decomposition model
X_t = µ_t + S_t + W_t
• µ_t = trend
• S_t = seasonal component
• W_t = random noise term

21.3.1 Detrending
Least squares: fit a trend model, e.g., µ_t = β₀ + β₁t, by least squares and subtract it.

Moving average
• The low-pass filter V_t is a symmetric moving average M_t with a_j = 1/(2k + 1):
  V_t = (1/(2k + 1)) Σ_{j=−k}^k X_{t−j}
• If (1/(2k + 1)) Σ_{j=−k}^k W_{t−j} ≈ 0, a linear trend function µ_t = β₀ + β₁t passes without distortion.

Differencing
• µ_t = β₀ + β₁t ⇒ ∇X_t = β₁ + ∇W_t

21.4 ARIMA Models
Autoregressive polynomial
φ(z) = 1 − φ₁z − ··· − φ_p z^p,  z ∈ C ∧ φ_p ≠ 0

Autoregressive operator
φ(B) = 1 − φ₁B − ··· − φ_p B^p

AR(p) (autoregressive model of order p)
X_t = φ₁X_{t−1} + ··· + φ_p X_{t−p} + W_t  ⟺  φ(B)X_t = W_t

AR(1)
• X_t = φ^k X_{t−k} + Σ_{j=0}^{k−1} φ^j W_{t−j}  → (k → ∞, |φ| < 1)  Σ_{j=0}^∞ φ^j W_{t−j}  (linear process)
• E[X_t] = Σ_{j=0}^∞ φ^j E[W_{t−j}] = 0
• γ(h) = Cov[X_{t+h}, X_t] = σ²_W φ^h/(1 − φ²)
• ρ(h) = γ(h)/γ(0) = φ^h
• ρ(h) = φ ρ(h − 1),  h = 1, 2, ...

Moving average polynomial
θ(z) = 1 + θ₁z + ··· + θ_q z^q,  z ∈ C ∧ θ_q ≠ 0

MA(q) (moving average model of order q)
X_t = W_t + θ₁W_{t−1} + ··· + θ_q W_{t−q}  ⟺  X_t = θ(B)W_t

E[X_t] = Σ_{j=0}^q θ_j E[W_{t−j}] = 0
γ(h) = Cov[X_{t+h}, X_t] = σ²_W Σ_{j=0}^{q−h} θ_j θ_{j+h}  for 0 ≤ h ≤ q;  0 for h > q

MA(1)
X_t = W_t + θW_{t−1}
γ(h) = (1 + θ²)σ²_W for h = 0;  θσ²_W for h = 1;  0 for h > 1
ρ(h) = θ/(1 + θ²) for h = 1;  0 for h > 1

ARMA(p, q)
X_t = φ₁X_{t−1} + ··· + φ_p X_{t−p} + W_t + θ₁W_{t−1} + ··· + θ_q W_{t−q}
φ(B)X_t = θ(B)W_t

Partial autocorrelation function (PACF)
φ_hh = corr(X_h − X_h^{h−1}, X₀ − X₀^{h−1}),  h ≥ 2
φ₁₁ = corr(X₁, X₀) = ρ(1)
where X_i^{h−1} denotes the regression of X_i on {X_{h−1}, X_{h−2}, ..., X₁}

ARIMA(p, d, q)
∇^d X_t = (1 − B)^d X_t is ARMA(p, q):  φ(B)(1 − B)^d X_t = θ(B)W_t

Exponentially Weighted Moving Average (EWMA)
X_t = X_{t−1} + W_t − λW_{t−1}
X_t = Σ_{j=1}^∞ (1 − λ)λ^{j−1} X_{t−j} + W_t  when |λ| < 1

21.4.1 Causality and Invertibility
ARMA(p, q) is causal (future-independent) ⟺ ∃{ψ_j} : Σ_{j=0}^∞ |ψ_j| < ∞ such that
X_t = Σ_{j=0}^∞ ψ_j W_{t−j} = ψ(B)W_t

ARMA(p, q) is invertible ⟺ ∃{π_j} : Σ_{j=0}^∞ |π_j| < ∞ such that
π(B)X_t = Σ_{j=0}^∞ π_j X_{t−j} = W_t

Properties
• ARMA(p, q) causal ⟺ roots of φ(z) lie outside the unit circle
  ψ(z) = Σ_{j=0}^∞ ψ_j z^j = θ(z)/φ(z),  |z| ≤ 1
• ARMA(p, q) invertible ⟺ roots of θ(z) lie outside the unit circle
  π(z) = Σ_{j=0}^∞ π_j z^j = φ(z)/θ(z),  |z| ≤ 1
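A minimal sketch, assuming numpy is available, that simulates an AR(1) process and compares the sample ACF with the theoretical ρ(h) = φ^h (φ, n, and the seed are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(9)
phi, n = 0.7, 5000

w = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]          # X_t = phi X_{t-1} + W_t

def sample_acf(x, h):
    """rho_hat(h) = gamma_hat(h) / gamma_hat(0), with gamma_hat(h) = (1/n) sum_{t} (X_{t+h}-Xbar)(X_t-Xbar)."""
    xbar = x.mean()
    gamma0 = np.sum((x - xbar) ** 2) / x.size
    gamma_h = np.sum((x[h:] - xbar) * (x[:-h] - xbar)) / x.size
    return gamma_h / gamma0

for h in (1, 2, 3):
    print(h, round(sample_acf(x, h), 3), round(phi ** h, 3))
```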
22 Math

22.1 Orthogonal Functions
L² space
L²(a, b) = {f : [a, b] → R, ∫_a^b f(x)² dx < ∞}

Inner product
⟨f, g⟩ = ∫ f(x)g(x) dx

Norm
||f|| = √(∫ f²(x) dx)

Orthogonality (for a sequence of functions φ_i)
∫ φ_i(x)φ_j(x) dx = 0  ∀i ≠ j
∫ φ_j²(x) dx = 1  ∀j

An orthogonal sequence φ₁, φ₂, ... is complete if the only function that is orthogonal to each φ_j is the zero function. Then φ₁, φ₂, ... form an orthogonal basis in L²:
f ∈ L² ⇒ f(x) = Σ_{j=1}^∞ β_j φ_j(x),  where β_j = ∫_a^b f(x)φ_j(x) dx

Parseval's relation
||f||² ≡ ∫ f²(x) dx = Σ_{j=1}^∞ β_j²

Cosine basis
φ₀(x) = 1,  φ_j(x) = √2 cos(jπx)  ∀j ≥ 1

Legendre polynomials (x ∈ [−1, 1])
P₀(x) = 1,  P₁(x) = x, ...

22.2 Series

Finite
• Σ_{k=1}^n k = n(n + 1)/2
• Σ_{k=1}^n (2k − 1) = n²
• Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6
• Σ_{k=1}^n k³ = (n(n + 1)/2)²
• Σ_{k=0}^{n−1} r^k = (1 − r^n)/(1 − r)
• Binomial theorem: Σ_{k=0}^n C(n, k) a^{n−k} b^k = (a + b)^n
• Σ_{l=0}^n C(n, l)² = C(2n, n)
• Vandermonde's identity: Σ_{k=0}^l C(m, k) C(n, l − k) = C(m + n, l)

Infinite (|p| < 1)
• Σ_{k=0}^∞ p^k = 1/(1 − p)
• Σ_{k=0}^∞ k p^{k−1} = d/dp Σ_{k=0}^∞ p^k = 1/(1 − p)²
• Σ_{k=0}^∞ C(r + k − 1, k) x^k = (1 − x)^{−r},  r ∈ N
• Σ_{k=0}^∞ C(α, k) p^k = (1 + p)^α,  |p| < 1, α ∈ C
22.3 Combinatorics

Sampling k out of n
                 w/o replacement                                       w/ replacement
  ordered        n^(k) = Π_{i=0}^{k−1}(n − i) = n!/(n − k)!             n^k
  unordered      C(n, k) = n^(k)/k! = n!/(k!(n − k)!)                   C(n − 1 + k, k) = C(n − 1 + k, n − 1)

Stirling numbers, 2nd kind
{n k} = k{n−1 k} + {n−1 k−1},  1 ≤ k ≤ n
{n 0} = 1 if n = 0,  0 else

Partitions
P_{n+k,k} = Σ_{i=1}^k P_{n,i},  P_{n,k} = 0 for k > n,  P_{n,0} = 0,  P_{0,0} = 1

Balls and Urns (|B| = n balls, |U| = m urns, f : B → U; D = distinguishable, ¬D = indistinguishable)
• B : D, U : D — f arbitrary: m^n;  f injective: m!/(m − n)! if m ≥ n, else 0;  f surjective: m! {n m};  f bijective: n! if m = n, else 0
• B : ¬D, U : D — f arbitrary: C(n + m − 1, n);  f injective: C(m, n);  f surjective: C(n − 1, m − 1);  f bijective: 1 if m = n, else 0
• B : D, U : ¬D — f arbitrary: Σ_{k=1}^m {n k};  f injective: 1 if m ≥ n, else 0;  f surjective: {n m};  f bijective: 1 if m = n, else 0
• B : ¬D, U : ¬D — f arbitrary: Σ_{k=1}^m P_{n,k};  f injective: 1 if m ≥ n, else 0;  f surjective: P_{n,m};  f bijective: 1 if m = n, else 0

[Figure: chart of univariate distribution relationships, created by Leemis and McQueston [2].]

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.