Probability and Statistics Cheat Sheet
Copyright © Matthias Vallentin, 2010

This cheat sheet integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you send me an email. The most recent version of this document is available at http://cs.berkeley.edu/~mavam/probstat.pdf. To reproduce, please contact me.
Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-based Confidence Interval
  11.3 Empirical Distribution Function
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
22 Math
  22.1 Orthogonal Functions
  22.2 Series
  22.3 Combinatorics
1 Distribution Overview

1.1 Discrete Distributions

For each distribution: CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s).

Uniform{a, ..., b}
  F_X(x) = 0 for x < a;  (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b;  1 for x > b
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2    V[X] = ((b − a + 1)² − 1)/12    M_X(s) = (e^{as} − e^{−(b+1)s}) / (s(b − a))

Bernoulli(p)
  F_X(x) = 0 for x < 0;  1 − p for 0 ≤ x < 1;  1 for x ≥ 1
  f_X(x) = p^x (1 − p)^{1−x}
  E[X] = p    V[X] = p(1 − p)    M_X(s) = 1 − p + pe^s

Binomial(n, p)
  F_X(x) = I_{1−p}(n − x, x + 1)  (regularized incomplete beta function)
  f_X(x) = C(n, x) p^x (1 − p)^{n−x}
  E[X] = np    V[X] = np(1 − p)    M_X(s) = (1 − p + pe^s)^n

Multinomial(n, p)
  f_X(x) = (n!/(x_1! ··· x_k!)) p_1^{x_1} ··· p_k^{x_k},  Σ_{i=1}^k x_i = n
  E[X_i] = np_i    V[X_i] = np_i(1 − p_i)    M_X(s) = (Σ_{i=1}^k p_i e^{s_i})^n

Hypergeometric(N, m, n)
  F_X(x) ≈ Φ((x − np)/√(np(1 − p)))
  f_X(x) = C(m, x) C(N − m, n − x) / C(N, n)
  E[X] = nm/N    V[X] = nm(N − n)(N − m) / (N²(N − 1))    M_X(s): N/A

NegativeBinomial(r, p)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = C(x + r − 1, r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p    V[X] = r(1 − p)/p²    M_X(s) = (p / (1 − (1 − p)e^s))^r

Geometric(p)
  F_X(x) = 1 − (1 − p)^x,  x ∈ N⁺
  f_X(x) = p(1 − p)^{x−1}
  E[X] = 1/p    V[X] = (1 − p)/p²    M_X(s) = pe^s / (1 − (1 − p)e^s)

Poisson(λ)
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i/i!
  f_X(x) = λ^x e^{−λ}/x!
  E[X] = λ    V[X] = λ    M_X(s) = e^{λ(e^s − 1)}

[Figure: PMFs of the discrete uniform, Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), Geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]
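The tabulated moments can be sanity-checked numerically; a minimal sketch, assuming scipy is available (scipy is not part of the original sheet), checking the Binomial row:

```python
from scipy import stats

# Check the Binomial(n, p) row: E[X] = np, V[X] = np(1 - p).
n, p = 40, 0.3
X = stats.binom(n, p)
print(X.mean(), n * p)            # 12.0  12.0
print(X.var(), n * p * (1 - p))   # 8.4   8.4
print(X.pmf(10))                  # C(40, 10) p^10 (1 - p)^30
```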
1.2 Continuous Distributions

Uniform(a, b)
  F_X(x) = 0 for x < a;  (x − a)/(b − a) for a < x < b;  1 for x > b
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2    V[X] = (b − a)²/12    M_X(s) = (e^{sb} − e^{sa}) / (s(b − a))

Normal(µ, σ²)
  F_X(x) = Φ(x) = ∫_{−∞}^x φ(t) dt
  f_X(x) = φ(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))
  E[X] = µ    V[X] = σ²    M_X(s) = exp(µs + σ²s²/2)

Log-Normal(µ, σ²)
  F_X(x) = 1/2 + (1/2) erf((ln x − µ)/√(2σ²))
  f_X(x) = (1/(x√(2πσ²))) exp(−(ln x − µ)²/(2σ²))
  E[X] = e^{µ + σ²/2}    V[X] = (e^{σ²} − 1) e^{2µ + σ²}

Multivariate Normal(µ, Σ)
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀ Σ^{−1} (x − µ))
  E[X] = µ    V[X] = Σ    M_X(s) = exp(µᵀs + (1/2) sᵀΣs)

Chi-square(k)
  F_X(x) = (1/Γ(k/2)) γ(k/2, x/2)
  f_X(x) = (1/(2^{k/2} Γ(k/2))) x^{k/2 − 1} e^{−x/2}
  E[X] = k    V[X] = 2k    M_X(s) = (1 − 2s)^{−k/2},  s < 1/2

Exponential(β)
  F_X(x) = 1 − e^{−x/β}
  f_X(x) = (1/β) e^{−x/β}
  E[X] = β    V[X] = β²    M_X(s) = 1/(1 − βs)  (s < 1/β)

Gamma(α, β)
  F_X(x) = γ(α, x/β)/Γ(α)
  f_X(x) = (1/(Γ(α) β^α)) x^{α−1} e^{−x/β}
  E[X] = αβ    V[X] = αβ²    M_X(s) = (1/(1 − βs))^α  (s < 1/β)

InverseGamma(α, β)
  F_X(x) = Γ(α, β/x)/Γ(α)
  f_X(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1)  (α > 1)    V[X] = β²/((α − 1)²(α − 2))  (α > 2)    M_X(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet(α)
  f_X(x) = (Γ(Σ_{i=1}^k α_i) / Π_{i=1}^k Γ(α_i)) Π_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i / Σ_{i=1}^k α_i    V[X_i] = E[X_i](1 − E[X_i]) / (Σ_{i=1}^k α_i + 1)

Beta(α, β)
  F_X(x) = I_x(α, β)
  f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}
  E[X] = α/(α + β)    V[X] = αβ/((α + β)²(α + β + 1))    M_X(s) = 1 + Σ_{k=1}^∞ (Π_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull(λ, k)
  F_X(x) = 1 − e^{−(x/λ)^k}
  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k)    V[X] = λ²Γ(1 + 2/k) − µ²    M_X(s) = Σ_{n=0}^∞ (s^n λ^n/n!) Γ(1 + n/k)

Pareto(x_m, α)
  F_X(x) = 1 − (x_m/x)^α,  x ≥ x_m
  f_X(x) = α x_m^α / x^{α+1},  x ≥ x_m
  E[X] = αx_m/(α − 1)  (α > 1)    V[X] = x_m² α/((α − 1)²(α − 2))  (α > 2)    M_X(s) = α(−x_m s)^α Γ(−α, −x_m s)  (s < 0)

[Figure: PDFs of the continuous uniform, Normal, Log-Normal, χ², Exponential, Gamma, InverseGamma, Beta, Weibull, and Pareto distributions for several parameter values.]
2 Probability Theory

Definitions
• Sample space Ω
• Outcome (point or element) ω ∈ Ω
• Event A ⊆ Ω
• σ-algebra A
  1. ∅ ∈ A
  2. A₁, A₂, ... ∈ A ⇒ ∪_{i=1}^∞ A_i ∈ A
  3. A ∈ A ⇒ ¬A ∈ A
• Probability distribution P
  1. P[A] ≥ 0 for every A
  2. P[Ω] = 1
  3. P[∪_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i] for disjoint A_i
• Probability space (Ω, A, P)

Properties
• P[∅] = 0
• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
• P[¬A] = 1 − P[A]
• P[B] = P[A ∩ B] + P[¬A ∩ B]
• ¬(∪_n A_n) = ∩_n ¬A_n  and  ¬(∩_n A_n) = ∪_n ¬A_n  (DeMorgan)
• P[∪_n A_n] = 1 − P[∩_n ¬A_n]
• P[A ∪ B] = P[A] + P[B] − P[A ∩ B]  ⇒  P[A ∪ B] ≤ P[A] + P[B]
• P[A ∪ B] = P[A ∩ ¬B] + P[¬A ∩ B] + P[A ∩ B]
• P[A ∩ ¬B] = P[A] − P[A ∩ B]
• A₁ ⊂ ··· ⊂ A_n ∧ A = ∪_{n=1}^∞ A_n ⇒ lim P[A_n] = P[A]
• A₁ ⊃ ··· ⊃ A_n ∧ A = ∩_{n=1}^∞ A_n ⇒ lim P[A_n] = P[A]

Independence
A ⊥⊥ B ⟺ P[A ∩ B] = P[A] P[B]

Law of Total Probability
P[B] = Σ_{i=1}^n P[B | A_i] P[A_i]  where Ω = ⊔_{i=1}^n A_i

Bayes' Theorem
P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^n P[B | A_j] P[A_j]  where Ω = ⊔_{i=1}^n A_i

Inclusion-Exclusion Principle
|∪_{i=1}^n A_i| = Σ_{r=1}^n (−1)^{r−1} Σ_{i₁ < ··· < i_r} |A_{i₁} ∩ ··· ∩ A_{i_r}|

3 Random Variables

Random Variable
X : Ω → R

Probability Mass Function (PMF)
f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF)
P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF)
F_X : R → [0, 1],  F_X(x) = P[X ≤ x]
1. Nondecreasing: x₁ < x₂ ⇒ F(x₁) ≤ F(x₂)
2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
3. Right-continuous: lim_{y↓x} F(y) = F(x)

Conditional density
P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy,  a ≤ b
f_{Y|X}(y | x) = f(x, y) / f_X(x)

Independence
X ⊥⊥ Y ⟺ f_{X,Y}(x, y) = f_X(x) f_Y(y)
3.1 Transformations

Transformation function: Z = ϕ(X)

Discrete
f_Z(z) = P[ϕ(X) = z] = P[{x : ϕ(x) = z}] = P[X ∈ ϕ^{−1}(z)] = Σ_{x ∈ ϕ^{−1}(z)} f(x)

Continuous
F_Z(z) = P[ϕ(X) ≤ z] = ∫_{A_z} f(x) dx  with A_z = {x : ϕ(x) ≤ z}

Special case if ϕ strictly monotone
f_Z(z) = f_X(ϕ^{−1}(z)) |d/dz ϕ^{−1}(z)| = f_X(x) |dx/dz| = f_X(x) / |J|

The Rule of the Lazy Statistician
E[Z] = ∫ ϕ(x) dF_X(x)
E[I_A(X)] = ∫ I_A(x) dF_X(x) = P[X ∈ A]

Convolution
• Z := X + Y:  f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx  (= ∫_0^z f_{X,Y}(x, z − x) dx if X, Y ≥ 0)
• Z := |X − Y|:  f_Z(z) = 2 ∫_0^∞ f_{X,Y}(x, z + x) dx
• Z := X/Y:  f_Z(z) = ∫_{−∞}^∞ |x| f_{X,Y}(x, xz) dx  (= ∫ x f_X(x) f_Y(xz) dx if X ⊥⊥ Y)

4 Expectation

Expectation
E[X] = µ_X = ∫ x dF_X(x) = Σ_x x f_X(x)  (X discrete)  or  ∫ x f_X(x) dx  (X continuous)

Properties
• E[XY] = ∫∫ xy f_{X,Y}(x, y) dF_X(x) dF_Y(y)
• In general E[ϕ(X)] ≠ ϕ(E[X])  (cf. Jensen inequality)
• P[X ≥ Y] = 1 ⇒ E[X] ≥ E[Y];  P[X = Y] = 1 ⇒ E[X] = E[Y]
• E[X] = Σ_{x=1}^∞ P[X ≥ x]  (X ∈ N)

Sample mean
X̄_n = (1/n) Σ_{i=1}^n X_i

Conditional Expectation
• E[Y | X = x] = ∫ y f(y | x) dy
• E[X] = E[E[X | Y]]
• E[ϕ(X, Y) | X = x] = ∫_{−∞}^∞ ϕ(x, y) f_{Y|X}(y | x) dy
• E[ϕ(Y, Z) | X = x] = ∫∫_{−∞}^∞ ϕ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
• E[Y + Z | X] = E[Y | X] + E[Z | X]
• E[ϕ(X)Y | X] = ϕ(X) E[Y | X]
• E[Y | X] = c ⇒ Cov[X, Y] = 0

5 Variance

Variance
• V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i]  iff X_i ⊥⊥ X_j

Standard deviation
sd[X] = √V[X] = σ_X

Covariance
• Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• Cov[X, a] = 0
• Cov[X, X] = V[X]
• Cov[Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j] = Σ_{i=1}^n Σ_{j=1}^m Cov[X_i, Y_j]

Correlation
ρ[X, Y] = Cov[X, Y] / √(V[X] V[Y])

Independence
X ⊥⊥ Y ⇒ ρ[X, Y] = 0 ⟺ Cov[X, Y] = 0 ⟺ E[XY] = E[X] E[Y]

Sample variance
S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²

Conditional Variance
• V[Y | X] = E[(Y − E[Y | X])² | X] = E[Y² | X] − E[Y | X]²
• V[Y] = E[V[Y | X]] + V[E[Y | X]]

6 Inequalities

Cauchy-Schwarz  E[XY]² ≤ E[X²] E[Y²]
Markov  P[ϕ(X) ≥ t] ≤ E[ϕ(X)] / t
Chebyshev  P[|X − E[X]| ≥ t] ≤ V[X] / t²
Chernoff  P[X ≥ (1 + δ)µ] ≤ (e^δ / (1 + δ)^{1+δ})^µ,  δ > −1
Jensen  E[ϕ(X)] ≥ ϕ(E[X]),  ϕ convex
7 Distribution Relationships

Binomial
• X_i ∼ Bernoulli(p) ⇒ Σ_{i=1}^n X_i ∼ Bin(n, p)
• X ∼ Bin(n, p), Y ∼ Bin(m, p), X ⊥⊥ Y ⇒ X + Y ∼ Bin(n + m, p)
• lim_{n→∞} Bin(n, p) = N(np, np(1 − p))  (n large, p far from 0 and 1)

Negative Binomial
• X ∼ NBin(1, p) = Geometric(p)
• X ∼ NBin(r, p) = Σ_{i=1}^r Geometric(p)
• X_i ∼ NBin(r_i, p) ⇒ Σ_i X_i ∼ NBin(Σ_i r_i, p)
• X ∼ NBin(r, p), Y ∼ Bin(s + r, p) ⇒ P[X ≤ s] = P[Y ≥ r]

Poisson
• X_i ∼ Poisson(λ_i) ∧ X_i ⊥⊥ X_j ⇒ Σ_{i=1}^n X_i ∼ Poisson(Σ_{i=1}^n λ_i)
• X_i ∼ Poisson(λ_i) ∧ X_i ⊥⊥ X_j ⇒ X_i | Σ_{j=1}^n X_j ∼ Bin(Σ_{j=1}^n X_j, λ_i / Σ_{j=1}^n λ_j)

Exponential
• X_i ∼ Exp(β) ∧ X_i ⊥⊥ X_j ⇒ Σ_{i=1}^n X_i ∼ Gamma(n, β)
• Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal
• X ∼ N(µ, σ²) ⇒ (X − µ)/σ ∼ N(0, 1)
• X ∼ N(µ, σ²) ∧ Z = aX + b ⇒ Z ∼ N(aµ + b, a²σ²)
• X ∼ N(µ₁, σ₁²) ∧ Y ∼ N(µ₂, σ₂²) ∧ X ⊥⊥ Y ⇒ X + Y ∼ N(µ₁ + µ₂, σ₁² + σ₂²)
• X_i ∼ N(µ_i, σ_i²) ⇒ Σ_i X_i ∼ N(Σ_i µ_i, Σ_i σ_i²)
• P[a < X ≤ b] = Φ((b − µ)/σ) − Φ((a − µ)/σ)
• Φ(−x) = 1 − Φ(x),  φ′(x) = −xφ(x),  φ″(x) = (x² − 1)φ(x)
• Upper quantile of N(0, 1): z_α = Φ^{−1}(1 − α)

Gamma (distribution)
• X ∼ Gamma(α, β) ⟺ X/β ∼ Gamma(α, 1)
• Gamma(α, β) ∼ Σ_{i=1}^α Exp(β)
• X_i ∼ Gamma(α_i, β) ∧ X_i ⊥⊥ X_j ⇒ Σ_i X_i ∼ Gamma(Σ_i α_i, β)
• Γ(α)/λ^α = ∫_0^∞ x^{α−1} e^{−λx} dx

Gamma (function)
• Ordinary: Γ(s) = ∫_0^∞ t^{s−1} e^{−t} dt
• Γ(n) = (n − 1)!  (n ∈ N)
• Γ(1/2) = √π

Beta (distribution)
• f_X(x) = (1/B(α, β)) x^{α−1}(1 − x)^{β−1} = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
• E[X^k] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[X^{k−1}]
• Beta(1, 1) ∼ Unif(0, 1)

Beta (function)
• Ordinary: B(x, y) = B(y, x) = ∫_0^1 t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y)
• Incomplete: B(x; a, b) = ∫_0^x t^{a−1}(1 − t)^{b−1} dt
• Regularized incomplete (a, b ∈ N):
  I_x(a, b) = B(x; a, b)/B(a, b) = Σ_{j=a}^{a+b−1} ((a + b − 1)!/(j!(a + b − 1 − j)!)) x^j (1 − x)^{a+b−1−j}
• I_0(a, b) = 0,  I_1(a, b) = 1
• I_x(a, b) = 1 − I_{1−x}(b, a)

8 Probability and Moment Generating Functions

• G_X(t) = E[t^X],  |t| < 1
• M_X(t) = G_X(e^t) = E[e^{Xt}] = E[Σ_{i=0}^∞ (Xt)^i/i!] = Σ_{i=0}^∞ (E[X^i]/i!) t^i
• P[X = 0] = G_X(0)
• P[X = 1] = G′_X(0)
• P[X = i] = G_X^{(i)}(0)/i!
• E[X] = G′_X(1⁻)
• E[X^k] = M_X^{(k)}(0)
• E[X!/(X − k)!] = G_X^{(k)}(1⁻)

9 Multivariate Distributions

9.1 Standard Bivariate Normal
Let X, Z ∼ N(0, 1) with X ⊥⊥ Z and Y = ρX + √(1 − ρ²) Z.

Joint density
f(x, y) = (1/(2π√(1 − ρ²))) exp(−(x² + y² − 2ρxy)/(2(1 − ρ²)))

Conditionals
(Y | X = x) ∼ N(ρx, 1 − ρ²)  and  (X | Y = y) ∼ N(ρy, 1 − ρ²)

Independence
X ⊥⊥ Y ⟺ ρ = 0

9.2 Bivariate Normal
Let X ∼ N(µ_x, σ_x²) and Y ∼ N(µ_y, σ_y²).

f(x, y) = (1/(2πσ_x σ_y √(1 − ρ²))) exp(−z/(2(1 − ρ²)))
z = ((x − µ_x)/σ_x)² + ((y − µ_y)/σ_y)² − 2ρ((x − µ_x)/σ_x)((y − µ_y)/σ_y)

Conditional mean and variance
E[X | Y] = E[X] + ρ (σ_X/σ_Y)(Y − E[Y])
V[X | Y] = σ_X √(1 − ρ²)

9.3 Multivariate Normal

Covariance matrix Σ (precision matrix Σ^{−1})
Σ = [ V[X₁] ··· Cov[X₁, X_k] ; ⋮ ⋱ ⋮ ; Cov[X_k, X₁] ··· V[X_k] ]

If X ∼ N(µ, Σ),
f(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀ Σ^{−1}(x − µ))

Properties
• Z ∼ N(0, 1) ∧ X = µ + Σ^{1/2} Z ⇒ X ∼ N(µ, Σ)
• X ∼ N(µ, Σ) ⇒ Σ^{−1/2}(X − µ) ∼ N(0, 1)
• X ∼ N(µ, Σ) ⇒ AX ∼ N(Aµ, AΣAᵀ)
• X ∼ N(µ, Σ) ∧ a a vector of length k ⇒ aᵀX ∼ N(aᵀµ, aᵀΣa)
10 Convergence

Let X₁, X₂, ... be a sequence of rv's and let X be another rv. Let F_n denote the cdf of X_n and let F denote the cdf of X.

Types of Convergence
1. In distribution (weakly, in law): X_n →D X
   lim_{n→∞} F_n(t) = F(t)  ∀t where F is continuous
2. In probability: X_n →P X
   (∀ε > 0) lim_{n→∞} P[|X_n − X| > ε] = 0
3. Almost surely (strongly): X_n →as X
   P[lim_{n→∞} X_n = X] = P[ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)] = 1
4. In quadratic mean (L²): X_n →qm X
   lim_{n→∞} E[(X_n − X)²] = 0

Relationships
• X_n →qm X ⇒ X_n →P X ⇒ X_n →D X
• X_n →as X ⇒ X_n →P X
• X_n →D X ∧ (∃c ∈ R) P[X = c] = 1 ⇒ X_n →P X
• X_n →P X ∧ Y_n →P Y ⇒ X_n + Y_n →P X + Y
• X_n →qm X ∧ Y_n →qm Y ⇒ X_n + Y_n →qm X + Y
• X_n →P X ∧ Y_n →P Y ⇒ X_n Y_n →P XY
• X_n →P X ⇒ ϕ(X_n) →P ϕ(X)
• X_n →D X ⇒ ϕ(X_n) →D ϕ(X)

Slutzky's Theorem
• X_n →D X and Y_n →P c ⇒ X_n + Y_n →D X + c
• X_n →D X and Y_n →P c ⇒ X_n Y_n →D cX
• In general: X_n →D X and Y_n →D Y does not imply X_n + Y_n →D X + Y

10.1 Law of Large Numbers (LLN)
Let X₁, ..., X_n be a sequence of iid rv's with E[X₁] = µ.
Weak (WLLN):  X̄_n →P µ  as n → ∞
Strong (SLLN):  X̄_n →as µ  as n → ∞

10.2 Central Limit Theorem (CLT)
Let X₁, ..., X_n be a sequence of iid rv's with E[X₁] = µ and V[X₁] = σ² < ∞.

Z_n := (X̄_n − µ)/√(V[X̄_n]) = √n (X̄_n − µ)/σ →D Z,  where Z ∼ N(0, 1)
lim_{n→∞} P[Z_n ≤ z] = Φ(z),  z ∈ R

CLT notations
Z_n ≈ N(0, 1)
X̄_n ≈ N(µ, σ²/n)
X̄_n − µ ≈ N(0, σ²/n)
√n(X̄_n − µ) ≈ N(0, σ²)
√n(X̄_n − µ)/σ ≈ N(0, 1)

Continuity Correction
P[X̄_n ≤ x] ≈ Φ((x + 1/2 − µ)/(σ/√n))
P[X̄_n ≥ x] ≈ 1 − Φ((x − 1/2 − µ)/(σ/√n))
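A minimal Monte Carlo sketch of the CLT approximation above, assuming numpy is available (not part of the original sheet); sample size, distribution, and seed are arbitrary choices for illustration:

```python
import numpy as np
from math import erf, sqrt

# Standardized sample means Z_n of an Exponential sample (mu = 1, sigma = 1)
# should be approximately N(0, 1) for moderate n.
rng = np.random.default_rng(0)
n, reps = 50, 10_000
mu, sigma = 1.0, 1.0

samples = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma   # Z_n for each replication

phi_1 = 0.5 * (1 + erf(1 / sqrt(2)))                   # Phi(1) ~ 0.8413
print("empirical P[Z_n <= 1]:", (z <= 1).mean(), " Phi(1):", phi_1)
```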
11 Statistical Inference

Let X₁, ..., X_n iid ∼ F if not otherwise noted.

11.1 Point Estimation
• Point estimator θ̂_n of θ is a rv: θ̂_n = g(X₁, ..., X_n)
• bias(θ̂_n) = E[θ̂_n] − θ
• Consistency: θ̂_n →P θ
• Sampling distribution: F(θ̂_n)
• Standard error: se(θ̂_n) = √V[θ̂_n]
• Mean squared error: mse = E[(θ̂_n − θ)²] = bias(θ̂_n)² + V[θ̂_n]
• lim_{n→∞} bias(θ̂_n) = 0 ∧ lim_{n→∞} se(θ̂_n) = 0 ⇒ θ̂_n is consistent
• Asymptotic normality: (θ̂_n − θ)/se →D N(0, 1)
• Slutzky's Theorem often lets us replace se(θ̂_n) by some (weakly) consistent estimator σ̂_n.

11.2 Normal-based Confidence Interval
Suppose θ̂_n ≈ N(θ, sê²). Let z_{α/2} = Φ^{−1}(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α, where Z ∼ N(0, 1). Then
C_n = θ̂_n ± z_{α/2} sê

11.3 Empirical Distribution Function
Empirical Distribution Function (ECDF)
F̂_n(x) = (1/n) Σ_{i=1}^n I(X_i ≤ x),  where I(X_i ≤ x) = 1 if X_i ≤ x and 0 if X_i > x

Properties (for any fixed x)
• E[F̂_n(x)] = F(x)
• V[F̂_n(x)] = F(x)(1 − F(x))/n
• F̂_n(x) →P F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality (X₁, ..., X_n ∼ F)
P[sup_x |F(x) − F̂_n(x)| > ε] ≤ 2e^{−2nε²}

Nonparametric 1 − α confidence band for F
ε_n = √((1/(2n)) log(2/α))
L(x) = max{F̂_n(x) − ε_n, 0},  U(x) = min{F̂_n(x) + ε_n, 1}
P[L(x) ≤ F(x) ≤ U(x) ∀x] ≥ 1 − α

11.4 Statistical Functionals
• Statistical functional: T(F)
• Plug-in estimator of θ = T(F): θ̂_n = T(F̂_n)
• Linear functional: T(F) = ∫ ϕ(x) dF_X(x)
• Plug-in estimator for a linear functional: T(F̂_n) = ∫ ϕ(x) dF̂_n(x) = (1/n) Σ_{i=1}^n ϕ(X_i)
• Often: T(F̂_n) ≈ N(T(F), sê²) ⇒ T(F̂_n) ± z_{α/2} sê
• p-th quantile: F^{−1}(p) = inf{x : F(x) ≥ p}
• µ̂ = X̄_n
• σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²
• κ̂ = (1/n) Σ_{i=1}^n (X_i − µ̂)³ / σ̂³
• ρ̂ = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / √(Σ_{i=1}^n (X_i − X̄_n)² Σ_{i=1}^n (Y_i − Ȳ_n)²)
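A minimal sketch, assuming numpy is available, of the ECDF with its DKW confidence band (the standard-normal sample and α = 0.05 are arbitrary illustration choices):

```python
import numpy as np

def ecdf_band(x, alpha=0.05):
    """ECDF F_hat_n and a 1 - alpha DKW band, evaluated at the sorted data points."""
    x = np.sort(np.asarray(x))
    n = x.size
    f_hat = np.arange(1, n + 1) / n                 # F_hat_n(x_(i))
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))      # epsilon_n from the DKW inequality
    return x, f_hat, np.clip(f_hat - eps, 0, 1), np.clip(f_hat + eps, 0, 1)

rng = np.random.default_rng(1)
x, f_hat, lo, hi = ecdf_band(rng.normal(size=200))
print(f_hat[:5], lo[:5], hi[:5])
```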
12 Parametric Inference

12.1 Method of Moments
j-th moment
α_j(θ) = E[X^j] = ∫ x^j dF_X(x)

j-th sample moment
α̂_j = (1/n) Σ_{i=1}^n X_i^j

Method of Moments estimator (MoM): solve the system
α₁(θ) = α̂₁,  α₂(θ) = α̂₂,  ...,  α_k(θ) = α̂_k

Properties of the MoM estimator
• θ̂_n exists with probability tending to 1
• Consistency: θ̂_n →P θ
• Asymptotic normality: √n(θ̂_n − θ) →D N(0, Σ)
  where Σ = g E[YYᵀ] gᵀ,  Y = (X, X², ..., X^k)ᵀ,  g = (g₁, ..., g_k) and g_j = ∂/∂θ α_j^{−1}(θ)

12.2 Maximum Likelihood
Likelihood: L_n : Θ → [0, ∞)
L_n(θ) = Π_{i=1}^n f(X_i; θ)

Log-likelihood
ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum Likelihood Estimator (MLE)
L_n(θ̂_n) = sup_θ L_n(θ)

Score function
s(X; θ) = ∂/∂θ log f(X; θ)

Fisher Information
I(θ) = V_θ[s(X; θ)],  I_n(θ) = n I(θ)

Fisher Information (exponential family)
I(θ) = E_θ[−∂/∂θ s(X; θ)]

Observed Fisher Information
I_n^{obs}(θ) = −∂²/∂θ² Σ_{i=1}^n log f(X_i; θ)

Properties of the MLE
• Consistency: θ̂_n →P θ
• Equivariance: θ̂_n is the MLE ⇒ ϕ(θ̂_n) is the MLE of ϕ(θ)
• Asymptotic normality:
  1. se ≈ √(1/I_n(θ)),  (θ̂_n − θ)/se →D N(0, 1)
  2. sê ≈ √(1/I_n(θ̂_n)),  (θ̂_n − θ)/sê →D N(0, 1)
• Asymptotic optimality, i.e., smallest variance for large samples. If θ̃_n is any other estimator, the asymptotic relative efficiency is are(θ̃_n, θ̂_n) = V[θ̂_n]/V[θ̃_n] ≤ 1.
• Approximately the Bayes estimator

12.2.1 Delta Method
If τ = ϕ(θ) where ϕ is differentiable and ϕ′(θ) ≠ 0:
(τ̂_n − τ)/sê(τ̂) →D N(0, 1)
where τ̂ = ϕ(θ̂_n) is the MLE of τ and sê(τ̂) = |ϕ′(θ̂_n)| sê(θ̂_n)
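A minimal sketch, assuming numpy is available, of the MLE and its Fisher-information-based standard error for a Bernoulli(p) sample (the closed forms p̂ = X̄_n and I_n(p) = n/(p(1 − p)) are standard; the data and seed are illustration choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=500)

n = x.size
p_hat = x.mean()                              # MLE maximizes sum_i log f(x_i; p)
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)     # sqrt(1 / I_n(p_hat))

# Normal-based 95% confidence interval (z_{alpha/2} ~= 1.96).
print(p_hat, (p_hat - 1.96 * se_hat, p_hat + 1.96 * se_hat))
```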
12.3 Multiparameter Models
Let θ = (θ₁, ..., θ_k) and θ̂ = (θ̂₁, ..., θ̂_k) be the MLE. Define
H_jj = ∂²ℓ_n/∂θ_j²,  H_jk = ∂²ℓ_n/∂θ_j ∂θ_k

Fisher Information Matrix
I_n(θ) = − [ E_θ[H₁₁] ··· E_θ[H₁_k] ; ⋮ ⋱ ⋮ ; E_θ[H_k1] ··· E_θ[H_kk] ]

Under appropriate regularity conditions
(θ̂ − θ) ≈ N(0, J_n)  with J_n(θ) = I_n^{−1}(θ)
Further, if θ̂_j is the j-th component of θ̂, then
(θ̂_j − θ_j)/sê_j →D N(0, 1)
where sê_j² = J_n(j, j) and Cov[θ̂_j, θ̂_k] = J_n(j, k)

12.3.1 Multiparameter Delta Method
Let τ = ϕ(θ₁, ..., θ_k) be a function and let the gradient of ϕ be
∇ϕ = (∂ϕ/∂θ₁, ..., ∂ϕ/∂θ_k)ᵀ
Suppose ∇ϕ evaluated at θ = θ̂ is not 0 and let τ̂ = ϕ(θ̂). Then
(τ̂ − τ)/sê(τ̂) →D N(0, 1)
where sê(τ̂) = √((∇̂ϕ)ᵀ Ĵ_n (∇̂ϕ)),  Ĵ_n = J_n(θ̂) and ∇̂ϕ = ∇ϕ evaluated at θ = θ̂

13 Hypothesis Testing

H₀ : θ ∈ Θ₀  versus  H₁ : θ ∈ Θ₁

Definitions
• Null hypothesis H₀
• Alternative hypothesis H₁
• Simple hypothesis: θ = θ₀
• Composite hypothesis: θ > θ₀ or θ < θ₀
• Two-sided test: H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀
• One-sided test: H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀
• Critical value c
• Test statistic T
• Rejection region R = {x : T(x) > c}
• Power function β(θ) = P_θ[X ∈ R]
• Power of a test: 1 − P[Type II error] = 1 − β = inf_{θ∈Θ₁} β(θ)
• Test size: α = P[Type I error] = sup_{θ∈Θ₀} β(θ)

              Retain H₀            Reject H₀
  H₀ true    correct              Type I error (α)
  H₁ true    Type II error (β)    correct (power)

p-value
• p-value = sup_{θ∈Θ₀} P_θ[T(X) ≥ T(x)] = inf{α : T(x) ∈ R_α}
• Under H₀, the p-value is 1 − F_θ(T(X)) since T(X) ∼ F_θ

p-value        evidence
< 0.01         very strong evidence against H₀
0.01 – 0.05    strong evidence against H₀
0.05 – 0.1     weak evidence against H₀
> 0.1          little or no evidence against H₀

Wald Test
• Two-sided test
• Reject H₀ when |W| > z_{α/2},  where W = (θ̂ − θ₀)/sê
• P[|W| > z_{α/2}] → α

Likelihood Ratio Test (LRT)
• T(X) = sup_{θ∈Θ} L_n(θ) / sup_{θ∈Θ₀} L_n(θ) = L_n(θ̂_n)/L_n(θ̂_{n,0})
• λ(X) = 2 log T(X) →D χ²_{r−q}  (r = dim Θ, q = dim Θ₀),  where χ²_k = Σ_{i=1}^k Z_i² with Z₁, ..., Z_k iid ∼ N(0, 1)
• p-value = P_{θ₀}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT
• Let p̂_n = (X₁/n, ..., X_k/n) be the MLE
• T(X) = L_n(p̂_n)/L_n(p₀) = Π_{j=1}^k (p̂_j/p₀_j)^{X_j}
• λ(X) = 2 Σ_{j=1}^k X_j log(p̂_j/p₀_j) →D χ²_{k−1}
• The approximate size-α LRT rejects H₀ when λ(X) ≥ χ²_{k−1,α}

Pearson χ² Test
• T = Σ_{j=1}^k (X_j − E[X_j])²/E[X_j],  where E[X_j] = np₀_j under H₀
• T →D χ²_{k−1}
• p-value = P[χ²_{k−1} > T(x)]
• Faster convergence to χ²_{k−1} than the LRT, hence preferable for small n

Independence Testing
• I rows, J columns, X a multinomial sample over the n = I·J cells
• MLEs unconstrained: p̂_ij = X_ij/n
• MLEs under H₀: p̂⁰_ij = p̂_i· p̂_·j = (X_i·/n)(X_·j/n)
• LRT: λ = 2 Σ_{i=1}^I Σ_{j=1}^J X_ij log(X_ij n/(X_i· X_·j))
• Pearson χ²: T = Σ_{i=1}^I Σ_{j=1}^J (X_ij − E[X_ij])²/E[X_ij]
• LRT and Pearson →D χ²_ν,  where ν = (I − 1)(J − 1)

14 Bayesian Inference

Bayes' Theorem
f(θ | xⁿ) = f(xⁿ | θ) f(θ) / ∫ f(xⁿ | θ) f(θ) dθ ∝ L_n(θ) f(θ)

Definitions
• Xⁿ = (X₁, ..., X_n),  xⁿ = (x₁, ..., x_n)
• Prior density f(θ)
• Likelihood f(xⁿ | θ): joint density of the data. In particular, Xⁿ iid ⇒ f(xⁿ | θ) = Π_{i=1}^n f(x_i | θ) = L_n(θ)
• Posterior density f(θ | xⁿ)
• Normalizing constant c_n = f(xⁿ) = ∫ f(x | θ) f(θ) dθ
• Kernel: part of a density that depends on θ
• Posterior mean θ̄_n = ∫ θ f(θ | xⁿ) dθ = ∫ θ L_n(θ) f(θ) dθ / ∫ L_n(θ) f(θ) dθ

14.1 Credible Intervals

1 − α posterior interval
P[θ ∈ (a, b) | xⁿ] = ∫_a^b f(θ | xⁿ) dθ = 1 − α

1 − α equal-tail credible interval
∫_{−∞}^a f(θ | xⁿ) dθ = ∫_b^∞ f(θ | xⁿ) dθ = α/2

1 − α highest posterior density (HPD) region R_n
1. P[θ ∈ R_n] = 1 − α
2. R_n = {θ : f(θ | xⁿ) > k} for some k
R_n unimodal ⇒ R_n is an interval

14.2 Function of Parameters
Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ}.
Posterior CDF for τ:  H(τ | xⁿ) = P[ϕ(θ) ≤ τ | xⁿ] = ∫_A f(θ | xⁿ) dθ
Posterior density:  h(τ | xⁿ) = H′(τ | xⁿ)
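A minimal sketch, assuming numpy is available, of an equal-tail credible interval computed from posterior draws; here a Beta(5, 15) sample simply stands in for draws from f(θ | xⁿ):

```python
import numpy as np

rng = np.random.default_rng(3)
posterior_draws = rng.beta(5, 15, size=100_000)   # stand-in for draws from f(theta | x^n)

alpha = 0.05
a, b = np.quantile(posterior_draws, [alpha / 2, 1 - alpha / 2])
print(f"equal-tail {1 - alpha:.0%} credible interval: ({a:.3f}, {b:.3f})")
```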
14.3 Priors

Choice
• Subjective Bayesianism: the prior should incorporate as much detail as possible of the researcher's a priori knowledge — via prior elicitation.
• Objective Bayesianism: the prior should incorporate as little detail as possible (non-informative prior).
• Robust Bayesianism: consider various priors and determine the sensitivity of our inferences to changes in the prior.

Types
• Flat: f(θ) ∝ constant
• Proper: ∫_{−∞}^∞ f(θ) dθ = 1
• Improper: ∫_{−∞}^∞ f(θ) dθ = ∞
• Jeffreys' prior (transformation-invariant): f(θ) ∝ √I(θ),  f(θ) ∝ √det(I(θ))
• Conjugate: f(θ) and f(θ | xⁿ) belong to the same parametric family

14.3.1 Conjugate Priors

Discrete likelihood — conjugate prior — posterior hyperparameters:
• Bernoulli(p) — Beta(α, β) — α + Σ_{i=1}^n x_i,  β + n − Σ_{i=1}^n x_i
• Binomial(p) — Beta(α, β) — α + Σ_{i=1}^n x_i,  β + Σ_{i=1}^n N_i − Σ_{i=1}^n x_i
• Negative Binomial(p) — Beta(α, β) — α + rn,  β + Σ_{i=1}^n x_i
• Geometric(p) — Beta(α, β) — α + n,  β + Σ_{i=1}^n x_i
• Poisson(λ) — Gamma(α, β) — α + Σ_{i=1}^n x_i,  β + n
• Multinomial(p) — Dirichlet(α) — α + Σ_{i=1}^n x^{(i)}

Continuous likelihood (subscript c denotes a known constant) — conjugate prior — posterior hyperparameters:
• Uniform(0, θ) — Pareto(x_m, k) — max{x_(n), x_m},  k + n
• Exponential(λ) — Gamma(α, β) — α + n,  β + Σ_{i=1}^n x_i
• Normal(µ, σ_c²) — Normal(µ₀, σ₀²) — (µ₀/σ₀² + Σ_i x_i/σ_c²)/(1/σ₀² + n/σ_c²),  (1/σ₀² + n/σ_c²)^{−1}
• Normal(µ_c, σ²) — Scaled Inverse Chi-square(ν, σ₀²) — ν + n,  (νσ₀² + Σ_i (x_i − µ_c)²)/(ν + n)
• Normal(µ, σ²) — Normal-scaled Inverse Gamma(λ, ν, α, β) — (νλ + nx̄)/(ν + n),  ν + n,  α + n/2,  β + (1/2) Σ_i (x_i − x̄)² + nν(x̄ − λ)²/(2(n + ν))
• MVN(µ, Σ_c) — MVN(µ₀, Σ₀) — (Σ₀^{−1} + nΣ_c^{−1})^{−1}(Σ₀^{−1}µ₀ + nΣ_c^{−1}x̄),  (Σ₀^{−1} + nΣ_c^{−1})^{−1}
• MVN(µ_c, Σ) — InverseWishart(κ, Ψ) — n + κ,  Ψ + Σ_i (x_i − µ_c)(x_i − µ_c)ᵀ
• Pareto(x_mc, k) — Gamma(α, β) — α + n,  β + Σ_i log(x_i/x_mc)
• Pareto(x_m, k_c) — Pareto(x₀, k₀) — x₀,  k₀ − k_c n  (where k₀ > k_c n)
• Gamma(α_c, β) — Gamma(α₀, β₀) — α₀ + nα_c,  β₀ + Σ_i x_i

14.4 Bayesian Testing
If H₀ : θ ∈ Θ₀:

Prior probability  P[H₀] = ∫_{Θ₀} f(θ) dθ
Posterior probability  P[H₀ | xⁿ] = ∫_{Θ₀} f(θ | xⁿ) dθ

Marginal likelihood
f(xⁿ | H_i) = ∫_Θ f(xⁿ | θ, H_i) f(θ | H_i) dθ

Posterior odds (of H_i relative to H_j)
P[H_i | xⁿ]/P[H_j | xⁿ] = (f(xⁿ | H_i)/f(xⁿ | H_j)) × (P[H_i]/P[H_j]),
i.e., posterior odds = Bayes factor BF_ij × prior odds

log₁₀ BF₁₀    BF₁₀         evidence
0 – 0.5       1 – 3.2      Weak
0.5 – 1       3.2 – 10     Moderate
1 – 2         10 – 100     Strong
> 2           > 100        Decisive

p* = (p/(1 − p)) BF₁₀ / (1 + (p/(1 − p)) BF₁₀),  where p = P[H₁] and p* = P[H₁ | xⁿ]
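A minimal sketch, assuming numpy is available, of the first conjugate update in the table (Bernoulli likelihood, Beta prior); the prior hyperparameters, data, and seed are arbitrary illustration choices:

```python
import numpy as np

# Beta(alpha, beta) prior + Bernoulli(p) data  ->  Beta(alpha + sum x_i, beta + n - sum x_i) posterior.
rng = np.random.default_rng(4)
x = rng.binomial(1, 0.3, size=50)

alpha0, beta0 = 2.0, 2.0
alpha_n = alpha0 + x.sum()
beta_n = beta0 + x.size - x.sum()

posterior_mean = alpha_n / (alpha_n + beta_n)
print(f"posterior: Beta({alpha_n:.0f}, {beta_n:.0f}), mean = {posterior_mean:.3f}")
```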
15 Exponential Family

Scalar parameter
f_X(x | θ) = h(x) exp{η(θ)T(x) − A(θ)} = h(x) g(θ) exp{η(θ)T(x)}

Vector parameter
f_X(x | θ) = h(x) exp{Σ_{i=1}^s η_i(θ)T_i(x) − A(θ)} = h(x) exp{η(θ)·T(x) − A(θ)} = h(x) g(θ) exp{η(θ)·T(x)}

Natural form
f_X(x | η) = h(x) exp{η·T(x) − A(η)} = h(x) g(η) exp{η·T(x)} = h(x) g(η) exp{ηᵀT(x)}

16 Sampling Methods

16.1 The Bootstrap
Let T_n = g(X₁, ..., X_n) be a statistic.
1. Estimate V_F[T_n] with V_{F̂_n}[T_n].
2. Approximate V_{F̂_n}[T_n] using simulation:
  (a) Repeat the following B times to get T*_{n,1}, ..., T*_{n,B}, an iid sample from the sampling distribution implied by F̂_n:
     i. Sample uniformly X₁*, ..., X_n* ∼ F̂_n.
     ii. Compute T_n* = g(X₁*, ..., X_n*).
  (b) Then
     v̂_boot = V̂_{F̂_n} = (1/B) Σ_{b=1}^B (T*_{n,b} − (1/B) Σ_{r=1}^B T*_{n,r})²

16.1.1 Bootstrap Confidence Intervals

Normal-based interval
T_n ± z_{α/2} sê_boot

Pivotal interval
1. Location parameter θ = T(F)
2. Pivot R_n = θ̂_n − θ
3. Let H(r) = P[R_n ≤ r] be the cdf of R_n
4. Let R*_{n,b} = θ̂*_{n,b} − θ̂_n. Approximate H using the bootstrap:
   Ĥ(r) = (1/B) Σ_{b=1}^B I(R*_{n,b} ≤ r)
5. Let θ*_β denote the β sample quantile of (θ̂*_{n,1}, ..., θ̂*_{n,B})
6. Let r*_β denote the β sample quantile of (R*_{n,1}, ..., R*_{n,B}), i.e., r*_β = θ*_β − θ̂_n
7. Then, an approximate 1 − α confidence interval is C_n = (â, b̂) with
   â = θ̂_n − Ĥ^{−1}(1 − α/2) = θ̂_n − r*_{1−α/2} = 2θ̂_n − θ*_{1−α/2}
   b̂ = θ̂_n − Ĥ^{−1}(α/2) = θ̂_n − r*_{α/2} = 2θ̂_n − θ*_{α/2}

16.2 Rejection Sampling
Setup
• We can easily sample from g(θ)
• We want to sample from h(θ), but it is difficult
• We know h(θ) up to a proportionality constant: h(θ) = k(θ)/∫ k(θ) dθ
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ

Algorithm
1. Draw θ_cand ∼ g(θ)
2. Generate u ∼ Unif(0, 1)
3. Accept θ_cand if u ≤ k(θ_cand)/(M g(θ_cand))
4. Repeat until B values of θ_cand have been accepted

Example
• We can easily sample from the prior g(θ) = f(θ)
• Target is the posterior with h(θ) ∝ k(θ) = f(xⁿ | θ) f(θ)
• Envelope condition: f(xⁿ | θ) ≤ f(xⁿ | θ̂_n) = L_n(θ̂_n) ≡ M
• Algorithm
  1. Draw θ_cand ∼ f(θ)
  2. Generate u ∼ Unif(0, 1)
  3. Accept θ_cand if u ≤ L_n(θ_cand)/L_n(θ̂_n)

16.3 Importance Sampling
Sample from an importance function g rather than the target density h.
Algorithm to obtain an approximation to E[q(θ) | xⁿ]:
1. Sample from the prior θ₁, ..., θ_B iid ∼ f(θ)
2. For each i = 1, ..., B, calculate w_i = L_n(θ_i) / Σ_{i=1}^B L_n(θ_i)
3. E[q(θ) | xⁿ] ≈ Σ_{i=1}^B q(θ_i) w_i
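A minimal sketch, assuming numpy is available, of the bootstrap variance estimate and the normal-based interval from 16.1.1 for the sample median (data, B, and seed are arbitrary illustration choices):

```python
import numpy as np

def bootstrap_se(x, stat, B=2000, seed=0):
    """Bootstrap standard error of stat(x): resample from F_hat_n, i.e. from x with replacement."""
    rng = np.random.default_rng(seed)
    n = x.size
    t_star = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    return t_star.std(ddof=1), t_star

rng = np.random.default_rng(5)
x = rng.exponential(size=100)

t_n = np.median(x)
se_boot, t_star = bootstrap_se(x, np.median)
# Normal-based interval T_n +/- z_{alpha/2} se_boot; percentile quantities also come from t_star.
print(t_n, (t_n - 1.96 * se_boot, t_n + 1.96 * se_boot))
```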
17 Decision Theory

Definitions
• Decision rule: synonymous with an estimator θ̂
• Action a ∈ A: possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
• Loss function L: consequences of taking action a when the true state is θ, or the discrepancy between θ and θ̂;  L : Θ × A → [−k, ∞).

Loss functions
• Squared error loss: L(θ, a) = (θ − a)²
• Linear loss: L(θ, a) = K₁(θ − a) if a − θ < 0;  K₂(a − θ) if a − θ ≥ 0
• Absolute error loss: L(θ, a) = |θ − a|  (linear loss with K₁ = K₂)
• L_p loss: L(θ, a) = |θ − a|^p
• Zero-one loss: L(θ, a) = 0 if a = θ;  1 if a ≠ θ

17.1 Risk
Posterior risk
r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]

(Frequentist) risk
R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]

Bayes risk
r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
r(f, θ̂) = E_θ[E_{X|θ}[L(θ, θ̂(X))]] = E_θ[R(θ, θ̂)]
r(f, θ̂) = E_X[E_{θ|X}[L(θ, θ̂(X))]] = E_X[r(θ̂ | X)]

17.2 Admissibility
θ̂′ dominates θ̂ if
∀θ : R(θ, θ̂′) ≤ R(θ, θ̂)  and  ∃θ : R(θ, θ̂′) < R(θ, θ̂)
θ̂ is inadmissible if some other estimator dominates it; otherwise it is admissible.

17.3 Bayes Rule
Bayes rule (or Bayes estimator)
• r(f, θ̂) = inf_θ̃ r(f, θ̃)
• If θ̂(x) minimizes the posterior risk r(θ̂ | x) for every x, then it minimizes the Bayes risk r(f, θ̂) = ∫ r(θ̂ | x) f(x) dx.

Bayes rules under specific loss functions
• Squared error loss: posterior mean
• Absolute error loss: posterior median
• Zero-one loss: posterior mode

17.4 Minimax Rules
Maximum risk
R̄(θ̂) = sup_θ R(θ, θ̂),  R̄(a) = sup_θ R(θ, a)

Minimax rule
sup_θ R(θ, θ̂) = inf_θ̃ sup_θ R(θ, θ̃) = inf_θ̃ R̄(θ̃)

θ̂ = Bayes rule ∧ ∃c : R(θ, θ̂) = c  ⇒  θ̂ is minimax

Least favorable prior
θ̂_f = Bayes rule ∧ R(θ, θ̂_f) ≤ r(f, θ̂_f) ∀θ  ⇒  θ̂_f is minimax and f is least favorable.
18 Linear Regression

Definitions
• Response variable Y
• Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression
Model
Y_i = β₀ + β₁X_i + ε_i,  E[ε_i | X_i] = 0,  V[ε_i | X_i] = σ²

Fitted line
r̂(x) = β̂₀ + β̂₁x

Predicted (fitted) values
Ŷ_i = r̂(X_i)

Residuals
ε̂_i = Y_i − Ŷ_i = Y_i − (β̂₀ + β̂₁X_i)

Residual sums of squares (rss)
rss(β̂₀, β̂₁) = Σ_{i=1}^n ε̂_i²

Least squares estimates
β̂ᵀ = (β̂₀, β̂₁)ᵀ minimizes rss(β₀, β₁):
β̂₀ = Ȳ_n − β̂₁X̄_n
β̂₁ = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / Σ_{i=1}^n (X_i − X̄_n)²

E[β̂ | Xⁿ] = (β₀, β₁)ᵀ
V[β̂ | Xⁿ] = (σ²/(n s_X²)) [ (1/n) Σ_{i=1}^n X_i²   −X̄_n ;  −X̄_n   1 ]
sê(β̂₀) = (σ̂/(s_X √n)) √((1/n) Σ_{i=1}^n X_i²)
sê(β̂₁) = σ̂/(s_X √n)
where s_X² = (1/n) Σ_{i=1}^n (X_i − X̄_n)² and σ̂² = (1/(n − 2)) Σ_{i=1}^n ε̂_i²  (an unbiased estimate of σ²).

Further properties
• Consistency: β̂₀ →P β₀ and β̂₁ →P β₁
• Asymptotic normality:
  (β̂₀ − β₀)/sê(β̂₀) →D N(0, 1)  and  (β̂₁ − β₁)/sê(β̂₁) →D N(0, 1)
• Approximate 1 − α confidence intervals for β₀ and β₁:
  β̂₀ ± z_{α/2} sê(β̂₀)  and  β̂₁ ± z_{α/2} sê(β̂₁)
• The Wald test for H₀ : β₁ = 0 vs. H₁ : β₁ ≠ 0: reject H₀ if |W| > z_{α/2} where W = β̂₁/sê(β̂₁).

Likelihood
L = Π_{i=1}^n f(X_i, Y_i) = Π_{i=1}^n f_X(X_i) × Π_{i=1}^n f_{Y|X}(Y_i | X_i) = L₁ × L₂
L₁ = Π_{i=1}^n f_X(X_i)
L₂ = Π_{i=1}^n f_{Y|X}(Y_i | X_i) ∝ σ^{−n} exp{−(1/(2σ²)) Σ_i (Y_i − (β₀ + β₁X_i))²}
Under the assumption of Normality, the least squares estimator is also the MLE, with
σ̂²_mle = (1/n) Σ_{i=1}^n ε̂_i²

18.2 Prediction
Observe X = x* of the covariate and want to predict the outcome Y*.
Ŷ* = β̂₀ + β̂₁x*
V[Ŷ*] = V[β̂₀] + x*²V[β̂₁] + 2x* Cov[β̂₀, β̂₁]

Prediction interval
ξ̂_n² = σ̂² (Σ_{i=1}^n (X_i − x*)² / (n Σ_j (X_j − X̄)²) + 1)
Ŷ* ± z_{α/2} ξ̂_n

18.3 Multiple Regression
Y = Xβ + ε,  where
X = [ X₁₁ ··· X₁k ; ⋮ ⋱ ⋮ ; X_n1 ··· X_nk ],  β = (β₁, ..., β_k)ᵀ,  ε = (ε₁, ..., ε_n)ᵀ

Likelihood
L(µ, Σ) = (2πσ²)^{−n/2} exp{−(1/(2σ²)) rss},  rss = ||Y − Xβ||²

If the (k × k) matrix XᵀX is invertible,
β̂ = (XᵀX)^{−1}XᵀY
V[β̂ | Xⁿ] = σ²(XᵀX)^{−1}
β̂ ≈ N(β, σ²(XᵀX)^{−1})

Estimated regression function
r̂(x) = Σ_{j=1}^k β̂_j x_j

Unbiased estimate for σ²
σ̂² = (1/(n − k)) Σ_{i=1}^n ε̂_i²,  ε̂ = Xβ̂ − Y

1 − α confidence interval
β̂_j ± z_{α/2} sê(β̂_j)

Hypothesis testing: H₀ : β_j = 0 vs. H₁ : β_j ≠ 0  ∀j ∈ J

18.4 Model Selection
Consider predicting a new observation Y* for covariates X* and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance

Procedure (see the regression sketch after this subsection)
1. Assign a score to each model
2. Search through all models to find the one with the highest score

Mean squared prediction error (mspe)
mspe = E[(Ŷ(S) − Y*)²]

Training error
R̂_tr(S) = Σ_{i=1}^n (Ŷ_i(S) − Y_i)²

R²
R²(S) = 1 − rss(S)/tss = 1 − R̂_tr(S)/tss,  where tss = Σ_{i=1}^n (Y_i − Ȳ)²

The training error is a downward-biased estimate of the prediction risk R(S):
E[R̂_tr(S)] < R(S)
bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σ_{i=1}^n Cov[Ŷ_i, Y_i]

Adjusted R²
R̄²(S) = 1 − ((n − 1)/(n − k)) rss/tss

Mallow's C_p statistic
R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty

Akaike Information Criterion (AIC)
AIC(S) = ℓ_n(β̂_S, σ̂_S²) − k

Bayesian Information Criterion (BIC)
BIC(S) = ℓ_n(β̂_S, σ̂_S²) − (k/2) log n

Validation and training
R̂_V(S) = Σ_{i=1}^m (Ŷ_i*(S) − Y_i*)²,  m = |{validation data}|, often n/4 or n/2

Leave-one-out cross-validation
R̂_CV(S) = Σ_{i=1}^n (Y_i − Ŷ_{(i)})²,  where Ŷ_{(i)} is the prediction for Y_i with the i-th observation left out
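A minimal sketch, assuming numpy is available, of the least squares fit with standard errors from V[β̂ | Xⁿ] = σ̂²(XᵀX)^{−1}; the simulated data and seed are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

X = np.column_stack([np.ones(n), x])            # design matrix with intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 2)            # unbiased estimate of sigma^2
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)  # V[beta_hat | X^n]
se = np.sqrt(np.diag(cov_beta))

print(beta_hat, se)                             # estimates ~ (1, 2) and their standard errors
```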
19 Non-parametric Function Estimation

19.1 Density Estimation
Estimate f(x), where P[X ∈ A] = ∫_A f(x) dx.

Integrated square error (ISE)
L(f, f̂_n) = ∫ (f(x) − f̂_n(x))² dx = J(h) + ∫ f²(x) dx

Frequentist risk
R(f, f̂_n) = E[L(f, f̂_n)] = ∫ b²(x) dx + ∫ v(x) dx
b(x) = E[f̂_n(x)] − f(x),  v(x) = V[f̂_n(x)]

19.1.1 Histograms

Definitions
• Number of bins m
• Binwidth h = 1/m
• Bin B_j has ν_j observations
• Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du

Histogram estimator
f̂_n(x) = Σ_{j=1}^m (p̂_j/h) I(x ∈ B_j)

E[f̂_n(x)] = p_j/h
V[f̂_n(x)] = p_j(1 − p_j)/(nh²)
R(f̂_n, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
h* = (1/n^{1/3}) (6/∫ (f′(u))² du)^{1/3}
R*(f̂_n, f) ≈ C/n^{2/3},  C = (3/4)^{2/3} (∫ (f′(u))² du)^{1/3}

Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i)

19.1.2 Kernel Density Estimator (KDE)

Kernel K
• K(x) ≥ 0
• ∫ K(x) dx = 1
• ∫ xK(x) dx = 0
• ∫ x²K(x) dx ≡ σ²_K > 0

KDE
f̂_n(x) = (1/n) Σ_{i=1}^n (1/h) K((x − X_i)/h)

R(f, f̂_n) ≈ (1/4)(hσ_K)⁴ ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx
h* = c₁^{−2/5} c₂^{−1/5} c₃^{−1/5} n^{−1/5},  where c₁ = σ²_K,  c₂ = ∫ K²(x) dx,  c₃ = ∫ (f″(x))² dx
R*(f, f̂_n) = c₄/n^{4/5},  c₄ = (5/4)(σ²_K)^{2/5} (∫ K²(x) dx)^{4/5} (∫ (f″(x))² dx)^{1/5}

Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i)
         ≈ (1/(hn²)) Σ_{i=1}^n Σ_{j=1}^n K*((X_i − X_j)/h) + (2/(nh)) K(0)
K*(x) = K^{(2)}(x) − 2K(x),  K^{(2)}(x) = ∫ K(x − y) K(y) dy

Epanechnikov kernel
K(x) = (3/(4√5))(1 − x²/5) for |x| < √5,  0 otherwise

19.2 Non-parametric Regression
Estimate r(x), where r(x) = E[Y | X = x]. Consider pairs of points (x₁, Y₁), ..., (x_n, Y_n).

k-nearest neighbor estimator
r̂(x) = (1/k) Σ_{i : x_i ∈ N_k(x)} Y_i,  where N_k(x) = {k values of x₁, ..., x_n closest to x}

Nadaraya-Watson kernel estimator
r̂(x) = Σ_{i=1}^n w_i(x) Y_i
w_i(x) = K((x − x_i)/h) / Σ_{j=1}^n K((x − x_j)/h)

R(r̂_n, r) ≈ (h⁴/4) (∫ x²K(x) dx)² ∫ (r″(x) + 2r′(x) f′(x)/f(x))² dx + (σ² ∫ K²(x) dx)/(nh) ∫ (1/f(x)) dx
h* ≈ c₁ n^{−1/5}
R*(r̂_n, r) ≈ c₂ n^{−4/5}

Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = Σ_{i=1}^n (Y_i − r̂_{(−i)}(x_i))² = Σ_{i=1}^n ((Y_i − r̂(x_i)) / (1 − K(0)/Σ_{j=1}^n K((x_i − x_j)/h)))²

19.3 Smoothing Using Orthogonal Functions
Approximation
r(x) = Σ_{j=1}^∞ β_j φ_j(x) ≈ Σ_{j=1}^J β_j φ_j(x)

Multivariate regression form
Y = Φβ + η,  where η_i = ε_i and
Φ = [ φ₀(x₁) ··· φ_J(x₁) ; ⋮ ⋱ ⋮ ; φ₀(x_n) ··· φ_J(x_n) ]

Least squares estimator
β̂ = (ΦᵀΦ)^{−1}ΦᵀY

Cross-validation estimate of E[J(h)]
R̂_CV(J) = Σ_{i=1}^n (Y_i − Σ_{j=1}^J φ_j(x_i) β̂_{j,(−i)})²
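A minimal sketch of the KDE formula above, assuming numpy is available; the Gaussian kernel and the normal-reference bandwidth h ≈ 1.06 σ̂ n^{−1/5} (a common rule of thumb, not stated in the sheet) are assumptions for illustration:

```python
import numpy as np

def kde_gaussian(x_grid, data, h):
    """f_hat_n(x) = (1/n) sum_i (1/h) K((x - X_i)/h) with a Gaussian kernel K."""
    z = (x_grid[:, None] - data[None, :]) / h
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / h

rng = np.random.default_rng(7)
data = rng.normal(size=300)
grid = np.linspace(-4, 4, 200)
h = 1.06 * data.std(ddof=1) * data.size ** (-1 / 5)   # assumed rule-of-thumb bandwidth
f_hat = kde_gaussian(grid, data, h)
print(f_hat.max())   # should be near the N(0,1) mode, ~0.4
```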
20 Stochastic Processes

Stochastic process
{X_t : t ∈ T},  T = {0, 1, ...} = Z (discrete)  or  [0, ∞) (continuous)
• Notations: X_t, X(t)
• State space X
• Index set T

20.1 Markov Chains
Markov chain: {X_n : n ∈ T} with
P[X_n = x | X₀, ..., X_{n−1}] = P[X_n = x | X_{n−1}]  ∀n ∈ T, x ∈ X

Transition probabilities
p_ij ≡ P[X_{n+1} = j | X_n = i]
n-step: p_ij(n) ≡ P[X_{m+n} = j | X_m = i]

Transition matrix P (n-step: P_n)
• (i, j) element is p_ij
• p_ij ≥ 0
• Σ_j p_ij = 1

Chapman-Kolmogorov
p_ij(m + n) = Σ_k p_ik(m) p_kj(n)
P_{m+n} = P_m P_n
P_n = P × ··· × P = Pⁿ

Marginal probability
µ_n = (µ_n(1), ..., µ_n(N)),  where µ_n(i) = P[X_n = i]
µ_n = µ₀ Pⁿ  (µ₀ = initial distribution)

20.2 Poisson Processes
Poisson process
• {X_t : t ∈ [0, ∞)} — number of events up to and including time t
• X₀ = 0
• Independent increments: ∀t₀ < ··· < t_n : X_{t₁} − X_{t₀} ⊥⊥ ··· ⊥⊥ X_{t_n} − X_{t_{n−1}}
• Intensity function λ(t)
  – P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
  – P[X_{t+h} − X_t ≥ 2] = o(h)
• X_{s+t} − X_s ∼ Poisson(m(s + t) − m(s)),  where m(t) = ∫_0^t λ(s) ds

Homogeneous Poisson process
λ(t) ≡ λ ⇒ X_t ∼ Poisson(λt),  λ > 0

Waiting times
W_t := time at which X_t occurs
W_t ∼ Gamma(t, 1/λ)

Interarrival times
S_t = W_{t+1} − W_t
S_t ∼ Exp(1/λ)
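A minimal sketch, assuming numpy is available, of a homogeneous Poisson process simulated through its exponential interarrival times (λ, the time horizon, and the seed are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(8)
lam, t_max = 2.0, 10.0

interarrivals = rng.exponential(scale=1 / lam, size=100)   # S_t ~ Exp(1/lambda), mean 1/lambda
event_times = np.cumsum(interarrivals)                      # waiting times W_t
n_events = np.searchsorted(event_times, t_max)              # X_{t_max}

print(n_events, "events by t =", t_max, "(E[X_t] =", lam * t_max, ")")
```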
21 Time Series

Mean function
µ_{X_t} = E[X_t] = ∫ x f_t(x) dx

Autocovariance function
γ_X(s, t) = E[(X_s − µ_s)(X_t − µ_t)] = E[X_s X_t] − µ_s µ_t
γ_X(t, t) = E[(X_t − µ_t)²] = V[X_t]

Autocorrelation function (ACF)
ρ(s, t) = Cov[X_s, X_t]/√(V[X_s] V[X_t]) = γ(s, t)/√(γ(s, s)γ(t, t))

Cross-covariance function (CCV)
γ_XY(s, t) = E[(X_s − µ_{X_s})(Y_t − µ_{Y_t})]

Cross-correlation function (CCF)
ρ_XY(s, t) = γ_XY(s, t)/√(γ_X(s, s)γ_Y(t, t))

Backshift operator
B^k(X_t) = X_{t−k}

Difference operator
∇^d = (1 − B)^d

White noise W_t
• E[W_t] = 0  ∀t ∈ T
• V[W_t] = σ²  ∀t ∈ T
• γ_W(s, t) = 0  for s ≠ t,  s, t ∈ T

Random walk
X_t = δt + Σ_{j=1}^t W_j,  where the constant δ is the drift

Linear process
X_t = µ + Σ_{j=−∞}^∞ ψ_j W_{t−j},  where Σ_{j=−∞}^∞ |ψ_j| < ∞

21.1 Stationary Time Series
Strictly stationary
P[X_{t₁} ≤ c₁, ..., X_{t_k} ≤ c_k] = P[X_{t₁+h} ≤ c₁, ..., X_{t_k+h} ≤ c_k]  ∀k ∈ N, t_k, c_k, h ∈ Z

Weakly stationary
• E[X_t²] < ∞  ∀t ∈ Z
• E[X_t] = m  ∀t ∈ Z
• γ_X(s, t) = γ_X(s + r, t + r)  ∀r, s, t ∈ Z

Autocovariance function (stationary case)
• γ(h) = E[(X_{t+h} − µ)(X_t − µ)]
• γ(0) = E[(X_t − µ)²]
• γ(0) ≥ 0
• γ(0) ≥ |γ(h)|
• γ(h) = γ(−h)  ∀h ∈ Z

Autocorrelation function (ACF)
ρ(h) = γ(t + h, t)/√(γ(t + h, t + h)γ(t, t)) = γ(h)/γ(0)

Jointly stationary time series
γ_XY(h) = E[(X_{t+h} − µ_X)(Y_t − µ_Y)]
ρ_XY(h) = γ_XY(h)/√(γ_X(0)γ_Y(0))

21.2 Estimation of Correlation
Sample autocovariance function
γ̂(h) = (1/n) Σ_{t=1}^{n−h} (X_{t+h} − X̄)(X_t − X̄)

Sample autocorrelation function
ρ̂(h) = γ̂(h)/γ̂(0)

Sample cross-covariance function
γ̂_XY(h) = (1/n) Σ_{t=1}^{n−h} (X_{t+h} − X̄)(Y_t − Ȳ)

Sample cross-correlation function
ρ̂_XY(h) = γ̂_XY(h)/√(γ̂_X(0)γ̂_Y(0))

Properties
• σ_{ρ̂_X(h)} = 1/√n  if X_t is white noise
• σ_{ρ̂_XY(h)} = 1/√n  if X_t or Y_t is white noise

21.3 Non-Stationary Time Series
Classical decomposition model
X_t = µ_t + S_t + W_t
• µ_t = trend
• S_t = seasonal component
• W_t = random noise term

21.3.1 Detrending
Least squares: fit a trend model, e.g., µ_t = β₀ + β₁t, by least squares and subtract it.

Moving average
• The low-pass filter V_t is a symmetric moving average M_t with a_j = 1/(2k + 1):
  V_t = (1/(2k + 1)) Σ_{j=−k}^k X_{t−j}
• If (1/(2k + 1)) Σ_{j=−k}^k W_{t−j} ≈ 0, a linear trend function µ_t = β₀ + β₁t passes without distortion.

Differencing
• µ_t = β₀ + β₁t ⇒ ∇X_t = β₁ + ∇W_t

21.4 ARIMA Models
Autoregressive polynomial
φ(z) = 1 − φ₁z − ··· − φ_p z^p,  z ∈ C ∧ φ_p ≠ 0

Autoregressive operator
φ(B) = 1 − φ₁B − ··· − φ_p B^p

AR(p) (autoregressive model of order p)
X_t = φ₁X_{t−1} + ··· + φ_p X_{t−p} + W_t  ⟺  φ(B)X_t = W_t

AR(1)
• X_t = φ^k X_{t−k} + Σ_{j=0}^{k−1} φ^j W_{t−j}  → (k → ∞, |φ| < 1)  Σ_{j=0}^∞ φ^j W_{t−j}  (linear process)
• E[X_t] = Σ_{j=0}^∞ φ^j E[W_{t−j}] = 0
• γ(h) = Cov[X_{t+h}, X_t] = σ²_W φ^h/(1 − φ²)
• ρ(h) = γ(h)/γ(0) = φ^h
• ρ(h) = φ ρ(h − 1),  h = 1, 2, ...

Moving average polynomial
θ(z) = 1 + θ₁z + ··· + θ_q z^q,  z ∈ C ∧ θ_q ≠ 0

MA(q) (moving average model of order q)
X_t = W_t + θ₁W_{t−1} + ··· + θ_q W_{t−q}  ⟺  X_t = θ(B)W_t

E[X_t] = Σ_{j=0}^q θ_j E[W_{t−j}] = 0
γ(h) = Cov[X_{t+h}, X_t] = σ²_W Σ_{j=0}^{q−h} θ_j θ_{j+h}  for 0 ≤ h ≤ q;  0 for h > q

MA(1)
X_t = W_t + θW_{t−1}
γ(h) = (1 + θ²)σ²_W for h = 0;  θσ²_W for h = 1;  0 for h > 1
ρ(h) = θ/(1 + θ²) for h = 1;  0 for h > 1

ARMA(p, q)
X_t = φ₁X_{t−1} + ··· + φ_p X_{t−p} + W_t + θ₁W_{t−1} + ··· + θ_q W_{t−q}
φ(B)X_t = θ(B)W_t

Partial autocorrelation function (PACF)
φ_hh = corr(X_h − X_h^{h−1}, X₀ − X₀^{h−1}),  h ≥ 2
φ₁₁ = corr(X₁, X₀) = ρ(1)
where X_i^{h−1} denotes the regression of X_i on {X_{h−1}, X_{h−2}, ..., X₁}

ARIMA(p, d, q)
∇^d X_t = (1 − B)^d X_t is ARMA(p, q):  φ(B)(1 − B)^d X_t = θ(B)W_t

Exponentially Weighted Moving Average (EWMA)
X_t = X_{t−1} + W_t − λW_{t−1}
X_t = Σ_{j=1}^∞ (1 − λ)λ^{j−1} X_{t−j} + W_t  when |λ| < 1

21.4.1 Causality and Invertibility
ARMA(p, q) is causal (future-independent) ⟺ ∃{ψ_j} : Σ_{j=0}^∞ |ψ_j| < ∞ such that
X_t = Σ_{j=0}^∞ ψ_j W_{t−j} = ψ(B)W_t

ARMA(p, q) is invertible ⟺ ∃{π_j} : Σ_{j=0}^∞ |π_j| < ∞ such that
π(B)X_t = Σ_{j=0}^∞ π_j X_{t−j} = W_t

Properties
• ARMA(p, q) causal ⟺ roots of φ(z) lie outside the unit circle
  ψ(z) = Σ_{j=0}^∞ ψ_j z^j = θ(z)/φ(z),  |z| ≤ 1
• ARMA(p, q) invertible ⟺ roots of θ(z) lie outside the unit circle
  π(z) = Σ_{j=0}^∞ π_j z^j = φ(z)/θ(z),  |z| ≤ 1
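A minimal sketch, assuming numpy is available, that simulates an AR(1) process and compares the sample ACF with the theoretical ρ(h) = φ^h (φ, n, and the seed are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(9)
phi, n = 0.7, 5000

w = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]          # X_t = phi X_{t-1} + W_t

def sample_acf(x, h):
    """rho_hat(h) = gamma_hat(h) / gamma_hat(0), with gamma_hat(h) = (1/n) sum_{t} (X_{t+h}-Xbar)(X_t-Xbar)."""
    xbar = x.mean()
    gamma0 = np.sum((x - xbar) ** 2) / x.size
    gamma_h = np.sum((x[h:] - xbar) * (x[:-h] - xbar)) / x.size
    return gamma_h / gamma0

for h in (1, 2, 3):
    print(h, round(sample_acf(x, h), 3), round(phi ** h, 3))
```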
22 Math

22.1 Orthogonal Functions
L² space
L²(a, b) = {f : [a, b] → R, ∫_a^b f(x)² dx < ∞}

Inner product
⟨f, g⟩ = ∫ f(x)g(x) dx

Norm
||f|| = √(∫ f²(x) dx)

Orthogonality (for a sequence of functions φ_i)
∫ φ_i(x)φ_j(x) dx = 0  ∀i ≠ j
∫ φ_j²(x) dx = 1  ∀j

An orthogonal sequence φ₁, φ₂, ... is complete if the only function that is orthogonal to each φ_j is the zero function. Then φ₁, φ₂, ... form an orthogonal basis in L²:
f ∈ L² ⇒ f(x) = Σ_{j=1}^∞ β_j φ_j(x),  where β_j = ∫_a^b f(x)φ_j(x) dx

Parseval's relation
||f||² ≡ ∫ f²(x) dx = Σ_{j=1}^∞ β_j²

Cosine basis
φ₀(x) = 1,  φ_j(x) = √2 cos(jπx)  ∀j ≥ 1

Legendre polynomials (x ∈ [−1, 1])
P₀(x) = 1,  P₁(x) = x, ...

22.2 Series

Finite
• Σ_{k=1}^n k = n(n + 1)/2
• Σ_{k=1}^n (2k − 1) = n²
• Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6
• Σ_{k=1}^n k³ = (n(n + 1)/2)²
• Σ_{k=0}^{n−1} r^k = (1 − r^n)/(1 − r)
• Binomial theorem: Σ_{k=0}^n C(n, k) a^{n−k} b^k = (a + b)^n
• Σ_{l=0}^n C(n, l)² = C(2n, n)
• Vandermonde's identity: Σ_{k=0}^l C(m, k) C(n, l − k) = C(m + n, l)

Infinite (|p| < 1)
• Σ_{k=0}^∞ p^k = 1/(1 − p)
• Σ_{k=0}^∞ k p^{k−1} = d/dp Σ_{k=0}^∞ p^k = 1/(1 − p)²
• Σ_{k=0}^∞ C(r + k − 1, k) x^k = (1 − x)^{−r},  r ∈ N
• Σ_{k=0}^∞ C(α, k) p^k = (1 + p)^α,  |p| < 1, α ∈ C
22.3 Combinatorics

Sampling k out of n
                 w/o replacement                                       w/ replacement
  ordered        n^(k) = Π_{i=0}^{k−1}(n − i) = n!/(n − k)!             n^k
  unordered      C(n, k) = n^(k)/k! = n!/(k!(n − k)!)                   C(n − 1 + k, k) = C(n − 1 + k, n − 1)

Stirling numbers, 2nd kind
{n k} = k{n−1 k} + {n−1 k−1},  1 ≤ k ≤ n
{n 0} = 1 if n = 0,  0 else

Partitions
P_{n+k,k} = Σ_{i=1}^k P_{n,i},  P_{n,k} = 0 for k > n,  P_{n,0} = 0,  P_{0,0} = 1

Balls and Urns (|B| = n balls, |U| = m urns, f : B → U; D = distinguishable, ¬D = indistinguishable)
• B : D, U : D — f arbitrary: m^n;  f injective: m!/(m − n)! if m ≥ n, else 0;  f surjective: m! {n m};  f bijective: n! if m = n, else 0
• B : ¬D, U : D — f arbitrary: C(n + m − 1, n);  f injective: C(m, n);  f surjective: C(n − 1, m − 1);  f bijective: 1 if m = n, else 0
• B : D, U : ¬D — f arbitrary: Σ_{k=1}^m {n k};  f injective: 1 if m ≥ n, else 0;  f surjective: {n m};  f bijective: 1 if m = n, else 0
• B : ¬D, U : ¬D — f arbitrary: Σ_{k=1}^m P_{n,k};  f injective: 1 if m ≥ n, else 0;  f surjective: P_{n,m};  f bijective: 1 if m = n, else 0

[Figure: chart of univariate distribution relationships, created by Leemis and McQueston [2].]

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.