








Complete and schematic biostatistics cheat sheet
Typology: Cheat Sheet
Population: the entire collection of objects or individuals about which information is desired.
➔ It is usually easier to take a sample
◆ Sample: the part of the population that is selected for analysis
◆ Watch out for:
● A limited sample size that might not be representative of the population
◆ Simple Random Sampling: every possible sample of a certain size has the same chance of being selected
Observational Study: there can always be lurking variables affecting the results
➔ e.g., a strong positive association between shoe size and intelligence for boys
➔ should never be used to show causation
Experimental Study: lurking variables can be controlled; can give good evidence for causation
Descriptive Statistics Part I ➔ Summary Measures
➔ Mean: arithmetic average of the data values
◆ Highly susceptible to extreme values (outliers); it is pulled towards them
◆ The mean can never be larger than the max or smaller than the min, but it can equal the max/min
➔ Median: in an ordered array, the median is the middle number
◆ Not affected by extreme values
➔ Quartiles: split the ranked data into 4 equal groups
◆ Box and Whisker Plot
◆ Disadvantages: Ignores the way in which data are distributed; sensitive to outliers
➔ Interquartile Range (IQR) = 3rd quartile − 1st quartile
◆ Not used that much
◆ Not affected by outliers
➔ Variance: the average squared distance from the mean
$$s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
◆ squaring the deviations gets rid of negative values
◆ units are squared
➔ Standard Deviation: shows variation about the mean
$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
◆ highly affected by outliers
◆ has the same units as the original data
◆ finance: a horrible measure of risk (trampoline example)
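A minimal sketch of the summary measures, using Python's standard `statistics` module. The data values are made up for illustration, and the quartiles use the module's default method, which may differ slightly from a textbook's hand rule.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 40]      # made-up values; 40 is an extreme value
no_outlier = data[:-1]             # same data without the extreme value

mean = statistics.mean(data)
median = statistics.median(data)
variance = statistics.variance(data)   # sample variance: divides by n - 1
stdev = statistics.stdev(data)         # square root of the sample variance

# Quartiles (default "exclusive" method) and the interquartile range
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# The mean moves a lot when the extreme value is dropped; the median barely moves
mean_shift = mean - statistics.mean(no_outlier)
median_shift = median - statistics.median(no_outlier)
```

Comparing `mean_shift` with `median_shift` shows why the mean is called susceptible to outliers and the median is not.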
Descriptive Statistics Part II Linear Transformations
➔ Linear transformations change the center and spread of data
➔ Average(a+bX) = a+b[Average(X)]
➔ Effects of Linear Transformations:
◆ mean_new = a + b*mean
◆ median_new = a + b*median
◆ stdev_new = |b|*stdev
◆ IQR_new = |b|*IQR
➔ Z-score: the new data set will have mean 0 and variance 1
$$z = \frac{X - \bar{X}}{S_X}$$
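The transformation rules can be checked numerically. This sketch uses a made-up Celsius-to-Fahrenheit conversion (a = 32, b = 1.8) as the linear transformation and then standardizes the original data into z-scores.

```python
import statistics

temps_c = [10.0, 15.0, 20.0, 25.0, 30.0]   # made-up Celsius readings
a, b = 32.0, 1.8                           # Fahrenheit = a + b * Celsius
temps_f = [a + b * x for x in temps_c]

mean_c, sd_c = statistics.mean(temps_c), statistics.stdev(temps_c)
mean_f, sd_f = statistics.mean(temps_f), statistics.stdev(temps_f)

# z-score: subtract the mean, divide by the standard deviation
z_scores = [(x - mean_c) / sd_c for x in temps_c]
z_mean = statistics.mean(z_scores)
z_var = statistics.variance(z_scores)
```

Here `mean_f` should match `a + b * mean_c`, `sd_f` should match `abs(b) * sd_c`, and the z-scores should have mean 0 and variance 1 up to rounding.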
Empirical Rule
➔ Only for mound-shaped data
➔ Approx. 95% of data is in the interval: $(\bar{x} - 2s_x,\ \bar{x} + 2s_x) = \bar{x} \pm 2s_x$
➔ only use if you just have the mean and std. dev.
Chebyshev's Rule
➔ Use for any set of data and for any number k greater than 1 (1.2, 1.3, etc.)
➔ At least $1 - \frac{1}{k^2}$ of the data falls within k standard deviations of the mean
➔ (Ex) for k = 2 (2 standard deviations), at least 75% of data falls within 2 standard deviations
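A small sketch of Chebyshev's bound, 1 − 1/k², as a function. The function name is made up for illustration.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum fraction of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

bound_k2 = chebyshev_lower_bound(2)   # the k = 2 case from the notes
bound_k3 = chebyshev_lower_bound(3)
```

Because the bound holds for any data set, it is looser than the Empirical Rule's 95% for mound-shaped data at k = 2.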
Detecting Outliers
➔ Classic Outlier Detection
◆ doesn't always work
◆ $|z| = \left| \frac{X - \bar{X}}{S_X} \right| \geq 2$
➔ The Boxplot Rule
◆ Value X is an outlier if: $X < Q_1 - 1.5(Q_3 - Q_1)$ or $X > Q_3 + 1.5(Q_3 - Q_1)$
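Both outlier rules can be sketched in a few lines. The data are made up, the function names are illustrative, and the quartiles come from the default method of `statistics.quantiles`, so borderline cases may differ from quartiles worked out by hand.

```python
import statistics

def z_outliers(values, threshold=2.0):
    """Classic rule: flag values whose |z| is at least the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs((x - mean) / sd) >= threshold]

def boxplot_outliers(values):
    """Boxplot rule: flag values more than 1.5 * IQR outside Q1 or Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

data = [2, 4, 4, 5, 7, 9, 40]   # made-up values
flagged_by_z = z_outliers(data)
flagged_by_box = boxplot_outliers(data)
```

The z-score rule can miss outliers because the outlier itself inflates the mean and standard deviation used to compute z, which is one reason it "doesn't always work".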
Skewness
➔ measures the degree of asymmetry exhibited by data
◆ negative values = skewed left
◆ positive values = skewed right
◆ if |skewness| < 0.8, don't need to transform the data
Measurements of Association
➔ Covariance
◆ Covariance > 0 = larger x, larger y
◆ Covariance < 0 = larger x, smaller y
$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
◆ Units = units of x × units of y
◆ The sign of the covariance is only +, −, or 0 (the value itself can be any number)
➔ Correlation: measures the strength of a linear relationship between two variables
◆ $r_{xy} = \frac{\text{covariance}_{xy}}{(\text{std. dev.}_x)(\text{std. dev.}_y)}$
◆ correlation is between −1 and 1
◆ Sign: direction of relationship
◆ Absolute value: strength of relationship (−0.6 is a stronger relationship than +0.4)
◆ Correlation doesn't imply causation
◆ The correlation of a variable with itself is one
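A sketch of sample covariance and correlation written out by hand, assuming the usual n − 1 divisor (the same one used for the sample variance). The data and variable names are made up.

```python
import math

def covariance(xs, ys):
    """Sample covariance with an n - 1 divisor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))   # Cov(X, X) is the variance of X
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

hours = [1, 2, 3, 4, 5]            # made-up study hours
score = [52, 60, 61, 70, 77]       # made-up exam scores

cov_hs = covariance(hours, score)
r_hs = correlation(hours, score)
r_self = correlation(hours, hours)  # a variable with itself
```

Unlike the covariance, `r_hs` has no units, so it can be compared across pairs of variables.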
Combining Data Sets (for Z = aX + bY)
➔ Mean(Z) = $\bar{Z} = a\bar{X} + b\bar{Y}$
➔ Var(Z) = $s_z^2 = a^2\,Var(X) + b^2\,Var(Y) + 2ab\,Cov(X, Y)$
Portfolios ➔ Return on a portfolio:
◆ weights add up to 1 ◆ return = mean ◆ risk = std. deviation
➔ Variance of return of portfolio
◆ Risk(variance) is reduced when stocks are negatively correlated. (when there's a negative covariance)
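A sketch of a two-stock portfolio, assuming the combining formula Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X, Y) with the weights playing the role of a and b. The function name and all numbers are made up for illustration.

```python
import math

def portfolio_stats(w_a, mean_a, sd_a, mean_b, sd_b, cov_ab):
    """Return (expected return, risk) for weights w_a and 1 - w_a."""
    w_b = 1 - w_a                      # weights add up to 1
    expected_return = w_a * mean_a + w_b * mean_b
    variance = (w_a**2 * sd_a**2 + w_b**2 * sd_b**2
                + 2 * w_a * w_b * cov_ab)
    return expected_return, math.sqrt(variance)   # risk = std. deviation

# Same two stocks, once with negative and once with positive covariance
ret_neg, risk_neg = portfolio_stats(0.5, 0.08, 0.20, 0.08, 0.20, cov_ab=-0.02)
ret_pos, risk_pos = portfolio_stats(0.5, 0.08, 0.20, 0.08, 0.20, cov_ab=0.02)
```

The expected return is the same in both cases, but the risk is lower when the covariance is negative.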
Probability
➔ measure of uncertainty
➔ all outcomes have to be exhaustive (all options possible) and mutually exclusive (no 2 outcomes can occur at the same time)
➔ Combining Random Variables
◆ If X and Y are independent:
Var(X + Y) = Var(X) + Var(Y)
◆ If X and Y are dependent:
E(X + Y) = E(X) + E(Y)
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
➔ Covariance: Cov(X, Y) = E(XY) − E(X)E(Y)
➔ If X and Y are independent, Cov(X, Y) = 0
Binomial Distribution
➔ doing something n times
➔ only 2 outcomes: success or failure
➔ trials are independent of each other
➔ probability remains constant
1.) All Failures: $P(\text{all failures}) = (1 - p)^n$
2.) All Successes: $P(\text{all successes}) = p^n$
3.) At least one success: $P(\text{at least 1 success}) = 1 - (1 - p)^n$
4.) At least one failure: $P(\text{at least 1 failure}) = 1 - p^n$
5.) Binomial Distribution Formula for x = exact value: $P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$
6.) Mean (Expectation): $E(X) = np$
7.) Variance and Standard Dev.: $Var(X) = np(1 - p)$, $SD(X) = \sqrt{np(1 - p)}$
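A sketch of the binomial quantities using the standard formula P(X = x) = C(n, x) p^x (1 − p)^(n − x), with mean np and variance np(1 − p). The values n = 10 and p = 0.3 are made up.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.3                         # made-up number of trials and success probability

p_all_failures = (1 - p) ** n
p_all_successes = p**n
p_at_least_one_success = 1 - p_all_failures
p_at_least_one_failure = 1 - p_all_successes

mean = n * p
variance = n * p * (1 - p)
std_dev = math.sqrt(variance)

# The probabilities over x = 0..n should add up to 1
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))
```

The "all failures" case is the exact-value formula at x = 0, so `binomial_pmf(0, n, p)` should agree with `p_all_failures`.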
Binomial Example
Continuous Probability Distributions ➔ the probability that a continuous random variable X will assume any particular value is 0 ➔ Density Curves ◆ Area under the curve is the probability that any range of values will occur. ◆ Total area = 1
Uniform Distribution
◆ X ~ Unif(a, b)
Uniform Example
➔ Mean for uniform distribution: $\frac{a + b}{2}$
➔ Variance for unif. distribution: $\frac{(b - a)^2}{12}$
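A sketch for X ~ Unif(a, b): the density is flat, so the probability of a range is its length divided by (b − a). The function name and the 0 to 30 example are made up.

```python
def uniform_prob(c, d, a, b):
    """P(c <= X <= d) for X ~ Unif(a, b): area of a rectangle under the flat density."""
    lo, hi = max(c, a), min(d, b)       # clip the range to [a, b]
    return max(hi - lo, 0) / (b - a)

a, b = 0, 30                            # made-up endpoints, e.g. minutes of waiting
mean = (a + b) / 2
variance = (b - a) ** 2 / 12

p_10_to_20 = uniform_prob(10, 20, a, b)
p_single_value = uniform_prob(5, 5, a, b)   # a single point has probability 0
```

`p_single_value` illustrates the earlier point that a continuous random variable takes any particular value with probability 0.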
Normal Distribution
➔ governed by 2 parameters: μ (the mean) and σ (the standard deviation)
Standardize Normal Distribution:
$$Z = \frac{X - \mu}{\sigma}$$
➔ Z-score is the number of standard deviations the related X is from its mean
➔ Z < some value: will just be the probability found on the table
➔ Z > some value: will be (1 − probability) found on the table
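A sketch of standardizing and looking up probabilities, with `statistics.NormalDist` standing in for the printed Z table. The values μ = 100, σ = 15 and x = 130 are made up.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0      # made-up population mean and standard deviation
x = 130.0

z = (x - mu) / sigma         # number of standard deviations x is from the mean

standard_normal = NormalDist()          # mean 0, standard deviation 1
p_below = standard_normal.cdf(z)        # P(Z < z): the table value
p_above = 1 - p_below                   # P(Z > z): 1 minus the table value

# Same probability without standardizing first
p_below_direct = NormalDist(mu, sigma).cdf(x)
```

Standardizing first and using the N(0, 1) curve gives the same answer as working with N(μ, σ) directly, which is why one table is enough.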
Normal Distribution Example
Sums of Normals
Sums of Normals Example:
➔ Cov(X,Y) = 0 b/c they're independent
Central Limit Theorem
➔ as n increases, $\bar{x}$ should get closer to μ (population mean)
➔ mean($\bar{x}$) = μ
➔ variance($\bar{x}$) = $\sigma^2 / n$
➔ $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
◆ if population is normally distributed, n can be any value
◆ any population, n needs to be ≥ 30
➔ $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
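A sketch of the Central Limit Theorem Z formula applied to a sample mean. The function name and the numbers are made up for illustration.

```python
import math
from statistics import NormalDist

def sample_mean_z(x_bar, mu, sigma, n):
    """Standardize a sample mean using the standard error sigma / sqrt(n)."""
    return (x_bar - mu) / (sigma / math.sqrt(n))

# Made-up example: population mean 50, standard deviation 10, sample of 100
z = sample_mean_z(x_bar=52.0, mu=50.0, sigma=10.0, n=100)
p_mean_above_52 = 1 - NormalDist().cdf(z)

# The same 2-point gap is far less unusual with a small sample
z_small_n = sample_mean_z(x_bar=52.0, mu=50.0, sigma=10.0, n=4)
```

The larger n is, the smaller σ/√n becomes, so the same distance from μ produces a larger Z.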
Confidence Intervals = tell us how good our estimate is
➔ Want high confidence, narrow interval
➔ As confidence increases, interval width also increases
A. One Sample Proportion
➔ $\hat{p} = \frac{\text{number of successes in sample}}{n}$
➔ We are thus 95% confident that the true population proportion is in the interval…
➔ We are assuming that n is large, $n\hat{p} > 5$, and our sample size is less than 10% of the population size.
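The interval formula itself did not survive in these notes. As an assumption, this sketch uses the standard large-sample interval p̂ ± z·√(p̂(1 − p̂)/n); the function name and the counts are made up.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion (assumed formula)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Made-up sample: 120 successes out of 400
low_95, high_95 = proportion_ci(120, 400)
low_99, high_99 = proportion_ci(120, 400, confidence=0.99)
```

The 99% interval comes out wider than the 95% one, matching the note that the interval grows as confidence increases.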
One Sample Hypothesis Tests
2. Test Statistic Approach (Population Mean)
3. Test Statistic Approach (Population Proportion)
4. P-Values
➔ a number between 0 and 1
➔ the larger the p-value, the more consistent the data is with the null
➔ the smaller the p-value, the more consistent the data is with the alternative
➔ If P is low (less than 0.05), H0 must go: reject the null hypothesis
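The test-statistic formulas are missing from these notes. As an assumption, this sketch uses the usual large-sample statistic z = (x̄ − μ0)/(s/√n) and a two-sided p-value; the function name and the numbers are made up.

```python
import math
from statistics import NormalDist

def z_test_mean(x_bar, mu_0, s, n):
    """Two-sided large-sample test of H0: population mean = mu_0 (assumed form)."""
    z = (x_bar - mu_0) / (s / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # area in both tails
    return z, p_value

# Made-up sample: mean 103 from 64 observations with s = 12, testing mu_0 = 100
z, p_value = z_test_mean(x_bar=103.0, mu_0=100.0, s=12.0, n=64)
reject_null = p_value < 0.05    # "if P is low, H0 must go"
```

A p-value just under 0.05 corresponds to |z| just above 1.96, so the two decision rules agree.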
Two Sample Hypothesis Tests
➔ Test Statistic for Two Proportions
2. Comparing Two Means (large independent samples, n > 30)
➔ Calculating Confidence Interval
➔ Test Statistic for Two Means
Matched Pairs ➔ Two samples are DEPENDENT Example:
Assumptions of Simple Linear Regression 1. We model the AVERAGE of something rather than something itself
2.
◆ As ε (noise) gets bigger, it’s harder to find the line
➔ $S_e^2 = \frac{SSE}{n - 2}$
➔ $S_e^2$ is our estimate of $\sigma^2$
➔ $S_e = \sqrt{S_e^2}$ is our estimate of σ
➔ 95% of the Y values should lie within
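A sketch of where SSE and S_e come from. The least-squares formulas for b0 and b1 are not shown in these notes, so the standard ones are assumed here. The data and the function name are made up.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = b0 + b1 * x, plus SSE and S_e (assumed standard formulas)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    b0 = mean_y - b1 * mean_x
    # SSE: sum of squared residuals (observed minus fitted)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))      # estimate of sigma
    return b0, b1, sse, s_e

xs = [1, 2, 3, 4, 5]                    # made-up data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, sse, s_e = fit_line(xs, ys)
```

The divisor is n − 2 because two quantities, b0 and b1, were estimated from the data before measuring the leftover noise.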
Example of Prediction Intervals:
Standard Errors for b1 and b0
➔ standard errors increase when noise increases
➔ s_b0: amount of uncertainty in our estimate of β0 (small s good, large s bad)
➔ s_b1: amount of uncertainty in our estimate of β1
Confidence Intervals for b1 and b0
➔ n small → bad
➔ s_e big → bad
➔ s_x^2 small → bad (want the x's spread out for a better guess)
Regression Hypothesis Testing (always a two-sided test)
➔ want to test whether the slope (β1) is needed in our model
➔ H0: β1 = 0 (don't need x)
Ha: β1 ≠ 0 (need x)
➔ Need X in the model if:
a. 0 isn't in the confidence interval
b. |t| > 1.96
c. P-value < 0.05
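The formulas for the slope's standard error and interval are missing from these notes. As an assumption, this sketch uses the standard s_b1 = S_e / √Σ(x − x̄)², the interval b1 ± 1.96·s_b1, and t = b1 / s_b1. The data and names are made up, with n = 40 so the large-sample cutoff of 1.96 applies.

```python
import math

def slope_inference(xs, ys, crit=1.96):
    """Return slope b1, its t statistic, and a large-sample 95% interval (assumed formulas)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    s_b1 = s_e / math.sqrt(sxx)         # smaller when the x's are spread out
    t = b1 / s_b1
    return b1, t, (b1 - crit * s_b1, b1 + crit * s_b1)

xs = list(range(1, 41))
wiggle = [1 if i % 2 == 0 else -1 for i in range(40)]    # deterministic "noise"
ys_trend = [3 + 2 * x + w for x, w in zip(xs, wiggle)]   # y really depends on x
ys_flat = [5 + w for w in wiggle]                        # y does not depend on x

b1_trend, t_trend, ci_trend = slope_inference(xs, ys_trend)
b1_flat, t_flat, ci_flat = slope_inference(xs, ys_flat)
```

For `ys_trend` the interval excludes 0 and |t| is far above 1.96, so x is needed. For `ys_flat` the interval contains 0 and |t| is small, so x is not needed.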
Test Statistic for Slope/Y-intercept
➔ can only be used if n > 30
➔ if n < 30, use p-values
Multiple Regression
➔ Variable Importance:
◆ higher t-value, lower p-value = variable is more important
◆ lower t-value, higher p-value = variable is less important (or not needed)
Adjusted R-squared
➔ k = # of X's
➔ Adj. R-squared will decrease as you add junk x variables
➔ Adj. R-squared will only increase if the x you add in is very useful
➔ want Adj. R-squared to go up and Se to be low for a better model
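The adjusted R-squared formula is missing from these notes. As an assumption, this sketch uses the standard definition 1 − (1 − R²)(n − 1)/(n − k − 1). The R² values are made up to show both behaviors.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared for n observations and k X variables (assumed standard formula)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than X variables plus one")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

n = 50
base = adjusted_r_squared(0.800, n, k=3)          # model with 3 X's
with_junk = adjusted_r_squared(0.801, n, k=4)     # junk X: R-squared barely moves
with_useful = adjusted_r_squared(0.900, n, k=4)   # useful X: R-squared jumps
```

Plain R-squared never goes down when a variable is added. The penalty for k is what lets the adjusted version fall for junk variables.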
The Overall F Test
➔ Always want to reject F test (reject null hypothesis)
➔ Look at p-value (if < 0.05, reject null)
➔ H0: β1 = β2 = β3 = ... = βk = 0 (don't need any X's)
Ha: at least one βj ≠ 0 (need at least 1 X)
➔ If no x variables needed, then SSR = 0 and SST = SSE
Modeling Regression Backward Stepwise Regression
Dummy Variables ➔ An indicator variable that takes on a value of 0 or 1 , allow intercepts to change
Interaction Terms ➔ allow the slopes to change ➔ interaction between 2 or more x variables that will affect the Y variable
How to Create Dummy Variables (Nominal Variables)
➔ If C is the number of categories, create (C − 1) dummy variables for describing the variable
➔ One category is always the “baseline”, which is included in the intercept
Recoding Dummy Variables
Example: How many hockey sticks sold in the summer (original equation, summer is the baseline)
hockey = 100 + 10 Wtr − 20 Spr + 30 Fall
Write the equation for how many hockey sticks sold in the winter (winter is the baseline)
hockey = 110 + 20 Fall − 30 Spr − 10 Summer
➔ always need to get the same exact values from the original equation
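A quick check of the recoding example, assuming the equations hockey = 100 + 10 Wtr − 20 Spr + 30 Fall (summer baseline) and hockey = 110 + 20 Fall − 30 Spr − 10 Summer (winter baseline). The function names are made up. Each dummy is 1 for its own season and 0 otherwise.

```python
SEASONS = ("Summer", "Winter", "Spring", "Fall")

def hockey_summer_baseline(season):
    """Original equation: summer is absorbed into the intercept."""
    wtr, spr, fall = season == "Winter", season == "Spring", season == "Fall"
    return 100 + 10 * wtr - 20 * spr + 30 * fall

def hockey_winter_baseline(season):
    """Recoded equation: winter is absorbed into the intercept."""
    fall, spr, summer = season == "Fall", season == "Spring", season == "Summer"
    return 110 + 20 * fall - 30 * spr - 10 * summer

predictions = {s: (hockey_summer_baseline(s), hockey_winter_baseline(s))
               for s in SEASONS}
```

Each new coefficient is the old prediction for that season minus the new baseline's prediction. For example, Fall is 130 − 110 = 20.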