Chapter 1 Introduction to Biostatistics

Statistics
Statistics is a field of study concerned with (1) the collection, organization, summarization, and analysis of data; and (2) the drawing of inferences about a body of data when only a part of the data is observed.

Statistic
A characteristic, or value, derived from sample data.

Data
The raw material of statistics is data. We may define data as numbers. The two kinds of numbers that we use in statistics are:
1. numbers that result from the taking of a measurement
2. numbers that result from the process of counting.

Sources of Data
1. Routinely kept records
2. Surveys
3. Experiments
4. External sources

Biostatistics
When the data analyzed are derived from the biological sciences and medicine, we use the term biostatistics.

Variable
A characteristic that takes on different values in different persons, places, or things. A variable is any quality, characteristic, or constituent of a person or thing that can be measured. A variable is any measured characteristic or attribute that differs for different subjects.
Copyright © Dr. Win Khaing (2007)

Quantitative Variable
A quantitative variable is one that can be measured in the usual sense. Measurements made on quantitative variables convey information regarding amount. Eg., height of adult males, weight of preschool children
Alternatively, a quantitative variable is one whose values can be counted or measured numerically.
Qualitative Variable
A qualitative variable is one that cannot be measured in the usual sense. Many characteristics can only be categorized. Measurements made on qualitative variables convey information regarding attributes. Eg., ethnicity of a person.
Alternatively, a qualitative variable is a characteristic that cannot be measured numerically and can only be categorized.
Random Variable
When the values obtained arise as a result of chance factors, so that they cannot be exactly predicted in advance, the variable is called a random variable. Eg., adult height

Discrete Random Variable
A discrete random variable is characterized by gaps or interruptions in the values that it can assume. These gaps or interruptions indicate the absence of values between particular values that the variable can assume. Eg., number of daily admissions to a hospital

Continuous Random Variable
A continuous random variable does not possess the gaps or interruptions characteristic of a discrete random variable. A continuous random variable can assume any value within a specified relevant interval of values assumed by the variable. Eg., height, weight, head circumference

Population
A population of entities is the largest collection of entities in which we have an interest at a particular time. A population of values is the largest collection of values of a random variable in which we have an interest at a particular time.
Sample
A sample may be defined simply as a part of a population. A sample is a selected subset of a population.

Finite Population
If a population of values consists of a fixed number of these values, the population is said to be finite.

Infinite Population
If a population of values consists of an endless succession of values, the population is said to be infinite.

Measurement
Measurement may be defined as the assignment of numbers to objects or events according to a set of rules.

Measurement Scales

1. The Nominal Scale
This is the lowest measurement scale. It consists of "naming" observations or classifying them into various mutually exclusive and collectively exhaustive categories.
Eg., Male – Female, Well – Sick, Child – Adult
2. The Ordinal Scale
Whenever observations are not only different from category to category but can be ranked according to some criterion, they are said to be measured on an ordinal scale.
Eg. SE Status – low, medium, high ; Intelligence – above average, average, below average
3. The Interval Scale
It is a more sophisticated scale than the nominal or ordinal scales.
On this scale it is not only possible to order measurements; the distance between any two measurements is also known.
The selected zero point is not necessarily a true zero in that it does not have to indicate a total absence of the quantity being measured.
Eg., Temperature – "zero degrees" does not indicate a lack of heat.
The interval scale, unlike the nominal and ordinal scales, is a truly quantitative scale.
4. The Ratio Scale
It is the highest level of measurement.
This scale is characterized by the fact that equality of ratios as well as equality of intervals may be determined.
Fundamental to the ratio scale is a true zero point.
Eg., Weight, Height, Length
Statistical Inference
Statistical inference is the procedure by which we reach a conclusion about a population on the basis of the information contained in a sample that has been drawn from that population.

Simple Random Sample
If a sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected, that sample is called a simple random sample.

Sampling with replacement
When sampling with replacement is employed, every member of the population is available at each draw.

Sampling without replacement
In sampling without replacement, a drawn member is not returned, so a given member can appear in the sample only once.
Chapter 2 Descriptive Statistics

Descriptive Statistics
Descriptive statistics are methods for organizing and summarizing a set of data that help us to describe the attributes of a group or population. Descriptive statistics are a means of organizing and summarizing observations, which provide us with an overview of the general features of a set of data.

Raw Data
Measurements that have not been organized, summarized, or otherwise manipulated are called raw data.

The Ordered Array
A first step in organizing data is the preparation of an ordered array. An ordered array is a listing of the values of a collection (either population or sample) in order of magnitude from the smallest value to the largest value.

Class Intervals
To group a set of observations we select a set of contiguous, non-overlapping intervals such that each value in the set of observations can be placed in one, and only one, of the intervals. These intervals are usually referred to as class intervals.
A commonly followed rule of thumb: use no fewer than six intervals and no more than 15.
o Fewer than six – information contained in the data is lost.
o More than 15 – the data have not been summarized enough.
Sturges's Rule
To decide how many class intervals are needed, we may use the formula given by Sturges's rule:

k = 1 + 3.322 (log10 n)

where,
k = number of class intervals
n = number of values in the data set
The answer obtained by applying Sturges's rule should not be regarded as final, but should be considered as a guide only.
The number of class intervals specified by the rule should be increased or decreased for convenience and clear presentation.
Width of Class Interval

w = R / k

where,
R = the difference between the smallest and the largest observation in the data set
k = number of class intervals
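As a sketch of how these two formulas work together, the following Python snippet applies Sturges's rule and the width formula; the sample size and range used (n = 100, smallest = 10, largest = 90) are made-up values for illustration only:

```python
import math

def sturges_k(n):
    """Sturges's rule: k = 1 + 3.322 * log10(n), rounded to a whole number."""
    return round(1 + 3.322 * math.log10(n))

def class_width(smallest, largest, k):
    """Width of each class interval: w = R / k, where R is the range."""
    return (largest - smallest) / k

n = 100                       # number of values in the data set
k = sturges_k(n)              # 1 + 3.322 * 2 = 7.644, rounded to 8
w = class_width(10, 90, k)    # R = 80, so w = 80 / 8 = 10.0
print(k, w)                   # → 8 10.0
```

As the text notes, the k given by the rule is only a guide; it is usually adjusted to give convenient, readable interval boundaries.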
Statistic
A descriptive measure computed from the data of a sample is called a statistic.

Parameter
A descriptive measure computed from the data of a population is called a parameter.

Measures of Central Tendency
1. Mean (Arithmetic Mean)
2. Median
3. Mode

Mean
The mean is obtained by adding all the values in a population or sample and dividing by the number of values that are added.

Formula

Population mean: μ = (Σ xi) / N, the sum taken over i = 1 to N

Sample mean: x̄ = (Σ xi) / n, the sum taken over i = 1 to n

where,
xi = a typical value of a random variable
N = number of values in the population
n = number of values in the sample
Properties of the Mean
1. Uniqueness. For a given set of data there is one and only one arithmetic mean.
2. Simplicity. The arithmetic mean is easily understood and easy to compute.
3. Since each and every value in a set of data enters into the computation of the mean, extreme values have an influence on the mean and can distort it. The mean is extremely sensitive to unusual values.
Median

Daniel: The median of a finite set of values is that value which divides the set into two equal parts such that the number of values equal to or greater than the median is equal to the number of values equal to or less than the median.

Pagano: The median of a finite set of values is that value which divides the set into two equal parts such that, when all the values have been arranged in order of magnitude, half the values are greater than or equal to the median, whereas the other half are less than or equal to it.

If the number of values is odd, the median will be the middle value when all values have been arranged in order of magnitude. When the number of values is even, there is no single middle value; instead there are two middle values. In this case the median is taken to be the mean of these two middle values, when all values have been arranged in the order of their magnitudes.

Median = the ((n + 1) / 2)th ordered observation.
Properties of the Median
1. Uniqueness. There is only one median for a given set of data.
2. Simplicity. The median is easy to calculate.
3. It is not as drastically affected by extreme values as is the mean.
Median is said to be robust; it is much less sensitive to unusual data points.
Mode
The mode of a set of values is that value which occurs most frequently.
If all the values are different there is no mode.
A set of values may have more than one mode.
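The three measures can be computed with Python's standard statistics module; the data set below is made up to keep the arithmetic easy to follow:

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9]              # hypothetical ordered array, n = 7 (odd)

mean = sum(data) / len(data)              # 37 / 7 ≈ 5.2857
median = statistics.median(data)          # the ((7 + 1) / 2)th = 4th ordered value: 5
mode = statistics.mode(data)              # most frequent value: 3

# With an even n there are two middle values; the median is their mean
median_even = statistics.median([2, 3, 3, 5, 7, 8])   # (3 + 5) / 2 = 4.0

print(round(mean, 4), median, mode, median_even)      # → 5.2857 5 3 4.0
```

Replacing the 9 with an extreme value such as 90 would shift the mean markedly while leaving the median at 5, illustrating why the median is called robust.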
Measures of Dispersion
1. Range
2. Variance
3. Standard Deviation

Range
The range is the difference between the largest and smallest value in a set of observations.

R = xL − xS

Advantage – simplicity of its computation.
Disadvantage – because it takes into account only two values, it is a poor measure of dispersion; the usefulness of the range is limited.
The Variance

Population variance:

σ² = Σ (xi − μ)² / N, the sum taken over i = 1 to N

Sample variance:
The sample variance is the sum of squared deviations of the values from their mean, divided by the sample size minus 1.

s² = Σ (xi − x̄)² / (n − 1), the sum taken over i = 1 to n

where,
s² = sample variance
n = number of values in the sample
xi = a typical value of the random variable
x̄ = sample mean
Alternative Variance Formulas

s² = [ Σ xi² − (Σ xi)² / n ] / (n − 1)

σ² = [ Σ xi² − (Σ xi)² / N ] / N
Standard Deviation
The square root of the variance is called the standard deviation.

s = √s² = √[ Σ (xi − x̄)² / (n − 1) ]
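A small sketch of both the definitional and the alternative ("computational") variance formulas, together with the standard deviation, using a made-up sample of five values:

```python
import math

data = [4, 8, 6, 2, 10]        # hypothetical sample, n = 5
n = len(data)
xbar = sum(data) / n           # sample mean = 30 / 5 = 6.0

# Definitional formula: sum of squared deviations over n - 1
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)        # 40 / 4 = 10.0

# Alternative formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)
s2_alt = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

s = math.sqrt(s2)              # standard deviation

print(s2, s2_alt, round(s, 4))   # → 10.0 10.0 3.1623
```

Both formulas give the same answer; the alternative form only avoids computing each deviation explicitly.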
The Coefficient of Variation
It expresses the standard deviation as a percentage of the mean.

C.V. = (s / x̄)(100)

Advantages
When one desires to compare the dispersion in two sets of data, comparing the two standard deviations may lead to fallacious results because the units of measurement may differ. Even when the same unit of measurement is used, the two means may be quite different.
The CV expresses the standard deviation as a percentage of the mean; it measures relative variation rather than absolute variation.
Since the unit of measurement cancels out in computing the CV, we can use the CV to compare variability independent of the scale of measurement. Eg., weight in lb vs. kg.
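The unit-cancellation point can be checked directly: converting the same made-up weights from lb to kg leaves the CV unchanged (the conversion factor 2.2 lb/kg is approximate):

```python
import math

def cv(data):
    """Coefficient of variation: (s / xbar) * 100, using the sample SD."""
    n = len(data)
    xbar = sum(data) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
    return s / xbar * 100

weights_lb = [110.0, 132.0, 154.0, 176.0]       # hypothetical weights in lb
weights_kg = [w / 2.2 for w in weights_lb]      # the same weights in kg

# Same relative variation in both units
print(round(cv(weights_lb), 2), round(cv(weights_kg), 2))   # → 19.86 19.86
```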
Percentile
Given a set of n observations x1, x2, …, xn, the pth percentile P is the value of X such that p percent or less of the observations are less than P and (100 − p) percent or less of the observations are greater than P.

First quartile (or) 25th percentile:
Q1 = the ((n + 1) / 4)th ordered observation

Second quartile (or) middle quartile (or) 50th percentile (or) median:
Q2 = the (2(n + 1) / 4)th = ((n + 1) / 2)th ordered observation

Third quartile (or) 75th percentile:
Q3 = the (3(n + 1) / 4)th ordered observation
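A minimal sketch of the quartile-position formulas; the made-up data set has n = 11 so that each position (n + 1)/4, 2(n + 1)/4, 3(n + 1)/4 is a whole number (fractional positions would call for interpolation between adjacent ordered observations):

```python
def quartile_position(n, q):
    """1-indexed position of the q-th quartile: q * (n + 1) / 4."""
    return q * (n + 1) / 4

data = sorted([6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36])   # n = 11, made-up values
n = len(data)

q1 = data[int(quartile_position(n, 1)) - 1]   # 3rd ordered observation
q2 = data[int(quartile_position(n, 2)) - 1]   # 6th ordered observation (the median)
q3 = data[int(quartile_position(n, 3)) - 1]   # 9th ordered observation

print(q1, q2, q3)   # → 15 40 43
```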
Interquartile Range
The interquartile range (IQR) is the difference between the third and first quartiles.

IQR = Q3 − Q1

Measures of Central Tendency – Computed from Grouped Data

1. Mean computed from Grouped Data

x̄ = Σ (mi fi) / Σ fi, the sums taken over i = 1 to k

where,
k = the number of class intervals
mi = the midpoint of the ith class interval
fi = the frequency of the ith class interval
2. Median computed from Grouped Data

Median = Li + (j / fi)(Ui − Li)

where,
Li = the true lower limit of the interval containing the median
Ui = the true upper limit of the interval containing the median
j = the number of observations still lacking to reach the median, after the lower limit of the interval containing the median has been reached
fi = the frequency of the interval containing the median
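Both grouped-data formulas can be sketched against a hypothetical frequency table (the intervals and frequencies below are invented for illustration):

```python
# Hypothetical frequency table: true limits 10-20, 20-30, 30-40, 40-50
intervals = [(10, 20), (20, 30), (30, 40), (40, 50)]
mids  = [15, 25, 35, 45]    # m_i, midpoints of the k = 4 class intervals
freqs = [ 2,  5,  8,  5]    # f_i, frequencies; n = 20

# Grouped mean: sum(m_i * f_i) / sum(f_i)
mean = sum(m * f for m, f in zip(mids, freqs)) / sum(freqs)   # 660 / 20

# Grouped median: L_i + (j / f_i) * (U_i - L_i)
half = sum(freqs) / 2       # need to reach the 10th observation
cum = 0
for (low, high), f in zip(intervals, freqs):
    if cum + f >= half:     # this interval contains the median
        j = half - cum      # observations still lacking after reaching `low`
        median = low + (j / f) * (high - low)
        break
    cum += f

print(mean, median)   # → 33.0 33.75
```

Here the median interval is 30-40: two intervals supply 7 observations, so j = 3 of the interval's 8 observations are still needed, giving 30 + (3/8)(10) = 33.75.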
Chapter 3 Some Basic Probability Concepts

Probability
Probability is the relative possibility, chance, or likelihood that an event will occur.

Event
An event is a collection of one or more outcomes of an experiment.

Outcome
An outcome is a particular result of an experiment.

Experiment
An experiment is a process that leads to the occurrence of one of several possible observations.

Mutually exclusive events
Two events are mutually exclusive if they cannot occur simultaneously. The occurrence of any one event means that none of the others can occur at the same time.

Independent events
Two events are independent if the occurrence of one event has no effect on the probability of the occurrence of the other event.

TWO Views of Probability
1. Subjective Probability
is personalistic
measures the confidence that a particular individual has in the truth of a particular proposition
not fully accepted by statisticians
2. Objective Probability

(a) Classical Probability
If an event can occur in N mutually exclusive and equally likely ways, and if m of these possess a trait, E, the probability of the occurrence of E is equal to m / N.

P(E) = m / N

P(occurrence of E) = (no. of favorable outcomes) / (no. of all possible outcomes)
(b) Relative Frequency Probability
If some process is repeated a large number of times, n, and if some resulting event with the characteristics E occurs m times, the relative frequency of occurrence of E, m/n, will be approximately equal to the probability of E.
P(E) = m / n
Elementary Properties of Probability
1. Given some process (or experiment) with n mutually exclusive outcomes (called events), E1, E2, …, En, the probability of any event Ei is assigned a nonnegative number.

P(Ei) ≥ 0

2. The sum of the probabilities of the mutually exclusive outcomes is equal to 1. (Property of exhaustiveness)

P(E1) + P(E2) + … + P(En) = 1

3. For mutually exclusive events E1 and E2, the probability of the occurrence of either E1 or E2 is equal to the sum of their individual probabilities.

P(E1 or E2) = P(E1) + P(E2)

For events E1 and E2 that are not mutually exclusive, the probability that event E1 occurs, or event E2 occurs, or both occur is equal to the probability that event E1 occurs, plus the probability that event E2 occurs, minus the probability that the events occur simultaneously.

P(E1 or E2) = P(E1) + P(E2) − P(E1 and E2)

Rules of Probability

1. Multiplication Rule
If two events are independent,
P(A ∩ B) = P(A) × P(B)

If two events are not independent,

P(A ∩ B) = P(A) × P(B | A)   (or)   P(A ∩ B) = P(B) × P(A | B)

2. Addition Rule
If two events are mutually exclusive,

P(A ∪ B) = P(A) + P(B)

If two events are not mutually exclusive,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

3. Complementary Rule

P(Ā) = 1 − P(A)   (or)   P(A) = 1 − P(not A)

Types of Probability

1. Marginal Probability
The probability in which one of the marginal totals is used as the numerator and the grand total as the denominator.

P = Marginal Total / Grand Total

2. Joint Probability
The probability that a subject picked at random from a group of subjects possesses two characteristics at the same time.

P(A and B) = (no. of occurrences possessing A and B) / Grand Total

If independent:
P(A ∩ B) = P(A) × P(B); the joint probability is the product of the marginal probabilities.

If not independent:
P(A ∩ B) = P(A) × P(B | A); the joint probability is the product of a marginal and a conditional probability.

Note: If P(A ∩ B) = P(A) × P(B), the events are independent.
If P(A ∩ B) ≠ P(A) × P(B), the events are not independent.
3. Conditional Probability
The probability of an event occurring given that another event has occurred.

P(A | B) = (no. of occurrences possessing A and B) / Marginal Total (B) = P(A ∩ B) / P(B)

(or)

P(B | A) = (no. of occurrences possessing A and B) / Marginal Total (A) = P(A ∩ B) / P(A)

Conditional = Joint / Marginal
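The marginal, joint, and conditional relationships above can be illustrated with a hypothetical 2×2 table (all counts invented): 100 smokers of whom 20 have a disease, and 100 non-smokers of whom 10 have it, for a grand total of 200 subjects.

```python
# Hypothetical counts: A = smoker, B = disease
n_AB = 20     # subjects possessing both A and B
n_A = 100     # row (marginal) total for A
n_B = 30      # column (marginal) total for B
N = 200       # grand total

p_A = n_A / N           # marginal probability: 0.5
p_B = n_B / N           # marginal probability: 0.15
p_AB = n_AB / N         # joint probability: 0.1

p_B_given_A = p_AB / p_A    # conditional = joint / marginal: 0.2

print(p_A, p_B, p_AB, p_B_given_A)

# Independence check: here P(A and B) != P(A) * P(B), so A and B are not independent
print(p_AB == p_A * p_B)   # → False
```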
Statistical independence
Two random events A and B are statistically independent if and only if
P(A ∩ B) = P(A) × P(B)

Thus, if A and B are independent, then their joint probability can be expressed as the simple product of their individual probabilities.
In other words, if A and B are independent, then the conditional probability of A, given B is simply the individual probability of A alone; likewise, the probability of B given A is simply the probability of B alone.
P(A | B) = P(A)   and   P(B | A) = P(B)
Mutual exclusivity
Two events A and B are mutually exclusive if and only if
P(A ∩ B) = 0

as long as

P(A) ≠ 0 and P(B) ≠ 0

Then

P(A | B) = 0   and   P(B | A) = 0
In other words, the probability of A happening, given that B happens, is nil since A and B cannot both happen in the same situation; likewise, the probability of B happening, given that A happens, is also nil.
Chapter 4 Probability Distributions

Probability Distributions of Discrete Variables
1. Binomial Distribution
2. Poisson Distribution

Probability Distribution of Continuous Variables
3. Normal Distribution

Binomial Distribution (Swiss mathematician James Bernoulli; Bernoulli trial)

The Bernoulli Process – A sequence of Bernoulli trials forms a Bernoulli process under the following conditions.
1. Each trial results in one of two possible, mutually exclusive, outcomes. One of the possible outcomes is denoted a success, and the other is denoted a failure.
2. The probability of a success, denoted by p, remains constant from trial to trial. The probability of a failure, 1 − p, is denoted by q.
3. The trials are independent; that is, the outcome of any particular trial is not affected by the outcome of any other trial.

A binomial experiment has a fixed number of independent trials, each of which can have only two possible outcomes, and the probability of each outcome remains constant from trial to trial.

Formula,
P(x) = nCx p^x q^(n−x) = [ n! / (x! (n − x)!) ] p^x (1 − p)^(n−x)

where
p = probability of success
q (or) 1 − p = probability of failure
x = number of occurrences of an event
n = number of trials
P(X = x): use the binomial table. P(X ≤ x) or P(X ≥ x) can be easily calculated using the binomial table, if the probability of the event (p) and the number of trials (n) are known.
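The binomial formula, mean, and variance can be sketched directly with Python's math.comb; the values n = 5 and p = 0.5 below are illustrative only:

```python
import math

def binom_pmf(x, n, p):
    """P(x) = nCx * p**x * (1 - p)**(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 5, 0.5
print(binom_pmf(2, n, p))          # → 0.3125

mean = n * p                       # np = 2.5
var = n * p * (1 - p)              # np(1 - p) = 1.25
print(mean, var)                   # → 2.5 1.25

# Table identity for p > 0.5: P(X = x | n, p) matches P(X = n - x | n, 1 - p)
print(math.isclose(binom_pmf(4, 5, 0.7), binom_pmf(1, 5, 0.3)))   # → True
```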
Using the Binomial Table when p > 0.5

P(X = x | n, p > 0.5) = P(X = n − x | n, 1 − p)
P(X ≤ x | n, p > 0.5) = P(X ≥ n − x | n, 1 − p)
P(X ≥ x | n, p > 0.5) = P(X ≤ n − x | n, 1 − p)

Binomial Parameters

Mean of the binomial distribution: μ = np
Variance of the binomial distribution: σ² = np(1 − p)

Poisson Distribution (French mathematician Siméon Denis Poisson)

Poisson Process
1. The occurrences of the events are independent. The occurrence of an event in an interval of space or time has no effect on the probability of a second occurrence of the event in the same, or any other, interval.
2. Theoretically, an infinite number of occurrences of the event must be possible in the interval.
3. The probability of a single occurrence of the event in a given interval is proportional to the length of the interval.
4. In any infinitesimally small portion of the interval, the probability of more than one occurrence of the event is negligible.

Poisson probabilities are useful when there is a large number of independent trials with a small probability of success on a single trial and the events occur over a period of time.

Formula,
P(x) = (λ^x e^(−λ)) / x!

where
λ = average number of occurrences of the random event in the interval
e = a constant (2.7183)
x = number of occurrences
P(X = x): use the Poisson distribution table.
P(X ≤ x) or P(X ≥ x) can be easily calculated using the Poisson table, if the average number of occurrences (λ) is known.
An interesting feature of the Poisson distribution is the fact that the mean and variance are equal.
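The Poisson formula, and the mean-equals-variance feature, can be sketched in Python with a made-up rate of λ = 3 events per interval:

```python
import math

def poisson_pmf(x, lam):
    """P(x) = lam**x * e**(-lam) / x!"""
    return lam ** x * math.exp(-lam) / math.factorial(x)

lam = 3.0    # average number of occurrences in the interval
print(round(poisson_pmf(2, lam), 4))   # → 0.224

# The probabilities over all possible counts sum to 1
total = sum(poisson_pmf(x, lam) for x in range(100))
print(round(total, 6))                  # → 1.0

# Mean and variance computed from the pmf both come out to lam
mean = sum(x * poisson_pmf(x, lam) for x in range(100))
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in range(100))
print(round(mean, 6), round(var, 6))    # → 3.0 3.0
```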
Normal Distribution (Gaussian Distribution)

Characteristics of the Normal Distribution
1. It is a symmetrical, bell-shaped curve.
2. It is symmetrical about its mean. The curve on either side of the mean is a mirror image of the other side.
3. The mean, the median and the mode are all equal.
4. The total area under the curve above the x-axis is one square unit.
5. Mean ± 1 SD covers 68 % of the total area under the curve.
Mean ± 2 SD covers 95 % of the total area under the curve.
Mean ± 3 SD covers 99.7 % of the total area under the curve.
6. The normal distribution is completely determined by the parameters μ and σ. Different values of μ shift the graph of the distribution along the x-axis. Different values of σ determine the degree of flatness or peakedness of the graph of the distribution.
The Standard Normal Distribution (Unit Normal Distribution)
The standard normal distribution is the normal distribution with a mean of zero and a standard deviation of one.
Mean = 0
Standard Deviation = 1
Z-score

z = (x − μ) / σ
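A sketch of the z-score and of the 68-95-99.7 rule, using the standard normal CDF built from math.erf; the example values x = 130, μ = 100, σ = 15 are made up:

```python
import math

def z_score(x, mu, sigma):
    """z = (x - mu) / sigma."""
    return (x - mu) / sigma

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(z_score(130, 100, 15))   # → 2.0

# Area under the curve within mean ± k SD, matching the rule above
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 4))   # → 1 0.6827, 2 0.9545, 3 0.9973
```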
Chapter 5 Some Important Sampling Distributions

Sampling Distribution
The distribution of all possible values that can be assumed by some statistic, computed from samples of the same size randomly drawn from the same population, is called the sampling distribution of that statistic.

Sampling distributions serve two purposes:
1. they allow us to answer probability questions about sample statistics
2. they provide the necessary theory for making statistical inference procedures valid.

Central Limit Theorem
Given a population of any nonnormal functional form with a mean μ and finite variance σ², the sampling distribution of x̄, computed from samples of size n from this population, will have mean μ and variance σ²/n and will be approximately normally distributed when the sample size is large.
z = (x̄ − μ) / (σ / √n)

Sampling Distribution of the Sample Mean

1. When sampling is from a normally distributed population with a known population variance, the distribution of the sample mean will possess the following properties:
a) the mean of the sampling distribution (μx̄) will be equal to the mean of the population (μ) from which the samples were drawn.
b) the variance of the sampling distribution (σ²x̄) will be equal to the variance of the population divided by the sample size (σ²/n).
The square root of the variance of the sampling distribution = the standard error of the mean: σx̄ = σ / √n.
c) the sampling distribution of the sample mean is normal.
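The standard-error result can be checked by simulation; the sketch below draws repeated samples from a uniform population (a deliberately nonnormal choice, with σ² = 1/12; the sample size and repetition count are made up) and compares the observed spread of the sample means with σ/√n:

```python
import random
import statistics

random.seed(0)

n = 25                      # sample size
reps = 5000                 # number of samples drawn

# Population: uniform on [0, 1), so sigma^2 = 1/12
means = [statistics.mean(random.random() for _ in range(n))
         for _ in range(reps)]

se_theory = (1 / 12) ** 0.5 / n ** 0.5    # sigma / sqrt(n)
se_observed = statistics.stdev(means)     # spread of the sampling distribution

print(round(se_theory, 4))                    # → 0.0577
print(abs(se_observed - se_theory) < 0.005)   # → True
```

A histogram of `means` would also look approximately normal, as the central limit theorem predicts.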
2. When sampling is from a nonnormally distributed population with a known population variance, the distribution of the sample mean will possess the following properties:
a) μx̄ = μ
b) σx̄ = σ / √n when n / N ≤ 0.05; otherwise

σx̄ = (σ / √n) √( (N − n) / (N − 1) )

c) the sampling distribution of the sample mean is approximately normal.

Distribution of the Difference between two Sample Means
z = [ (x̄1 − x̄2) − (μ1 − μ2) ] / √( σ1²/n1 + σ2²/n2 )
Distribution of the Sample Proportion

z = (p̂ − p) / √( p(1 − p) / n )
Distribution of the Difference between two Sample Proportions

z = [ (p̂1 − p̂2) − (p1 − p2) ] / √( p1(1 − p1)/n1 + p2(1 − p2)/n2 )
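Both z statistics for proportions can be written as small helper functions; the sample figures below (p̂ = 0.6 from n = 100 when p = 0.5, and a two-sample comparison) are made up for illustration:

```python
import math

def z_proportion(p_hat, p, n):
    """z = (p_hat - p) / sqrt(p * (1 - p) / n)."""
    return (p_hat - p) / math.sqrt(p * (1 - p) / n)

def z_two_proportions(p1_hat, p2_hat, p1, p2, n1, n2):
    """z for the difference between two sample proportions."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return ((p1_hat - p2_hat) - (p1 - p2)) / se

# One sample: standard error = sqrt(0.5 * 0.5 / 100) = 0.05, so z = 0.1 / 0.05
print(round(z_proportion(0.6, 0.5, 100), 4))   # → 2.0

# Two samples whose observed difference equals the hypothesized difference give z = 0
print(z_two_proportions(0.6, 0.5, 0.6, 0.5, 50, 50))   # → 0.0
```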