Midterm Coverage: Textbook
- All of Chapter 1 (except normal quantile plots on p.65-68)
- Sections 2.1-2.4
- Sections 3.1-3.2
- Sections 4.1-4.5
- Sections 5.1-5.2

Midterm Study Guide

1.1 Graphical Displays of Data
- Two types of variables: categorical or quantitative
- Graphs
  # Pie charts
    + area represents percentage
    + percentages must add up to 1 or 100%
  # Bar graphs
    + percentages may not add up to 1 or 100%
  # Histograms
    + it is the area that represents percentage
    + symmetric, right-skewed, left-skewed, number of modes
    + position of mean and median
    + outliers
  # Stemplots
    + how to make a stemplot?
    + back-to-back stemplot?
  # Boxplots
  # Time plots
- When to use which graph?

1.2 Numerical Descriptions of Data
- Mean vs. Median
- Five-number summary
- IQR
- 1.5 IQR rule
- Boxplot, modified boxplot
- When to use which numerical summary?
  # If unimodal, symmetric, no outliers, use "mean + SD"
  # If unimodal, skewed (w/ or w/o outliers), use 5-number summary, boxplots
  # If multimodal (i.e. clustered), use histograms or stemplots
- Effects of linear transformation on mean and SD

1.3 Normal Distributions
- 68-95-99.7% Rule
- Using the standard normal table and normal calculations
- Inverse normal calculations
- Skip normal quantile plots (p.65-68)

2.1 Scatter Plot
- How to read information for one variable in a scatter plot
- Form, direction, strength of a relationship
- Are there outliers or clusters?
- Points in different categories can be marked with different colors or symbols
- Use side-by-side boxplots to display the relationship between one numerical variable and one categorical variable

2.2 Correlation r
- r does not distinguish between x and y
- r ranges from -1 to +1. When will r be -1 or 1?
- r has no units
- shifting x or y has no effect on r
- scaling x or y has no effect on the magnitude of r; at most it changes the sign
- When is it not appropriate to use r to describe the strength of a relationship?
  # nonlinear relationships, outliers, or clusters

2.3-2.4 Regression
- Equation of the regression line, slope, intercept
- Use the regression line to predict the response, numerically and graphically
- Residual = observed y - predicted y = vertical (signed) distance from a point to the regression line
- The regression line always passes through the point of means
- There are two regression lines; the roles of response and explanatory variable are not interchangeable
- Read R output of lm(y ~ x)
- Residuals always sum to zero
- Residuals have zero correlation with the explanatory variable
- Residuals have zero correlation with the predicted responses y-hat
- The SD of the residuals is sqrt(1 - r^2) times the SD of the response
- The mean of the predicted responses is the same as the mean of the response
- The SD of the predicted responses is |r| times the SD of the response
- r^2 is the fraction of variation in the response explained by the explanatory variable
- Residual plot
  # Good residual plot: evenly spread around the zero line
  # Bad signs in a residual plot: nonlinearity, heteroscedasticity (unequal spread around the zero line). What are their implications?
- Identification of outliers and influential observations
- Correlation or regression doesn't imply causation
- Skip log transformation of variables (Example 2.13 on p.90)
- Skip Section 2.5

3.0-3.1 Observational Studies and Experiments
- Difference between an observational study and an experiment
- What is a confounding variable?
- Given a study or an experiment, identify possible confounding variables
- Completely randomized design
- Single-blind, double-blind, placebo
- Randomized Block Design
- Matched Pairs Design

3.2 Sampling Design
- 4 Keywords: population, sample, parameter, statistic
- Bad sampling methods: convenience sampling, voluntary response sampling
- Better Sampling Designs
  # Simple Random Sampling
  # Stratified Sampling
  # Cluster Sampling
  # Multistage Cluster Sampling
- Given a description of a sampling method, classify its sampling design
- Problems in sampling
  # Undercoverage
  # Non-response bias
  # Response bias: wording of questions, design of questionnaire, attitude of interviewer
- Skip 3.3-3.4

4.1, 4.2, 4.5 Probability
- Probability rules
- Conditional Probability
- General Multiplication Rule
- Independence of Events
- The Rule of Total Probability
- Bayes' Rule (IMPORTANT!!!)

4.3-4.4 Random Variables
- Based on the description of a problem, find the distribution of a (discrete) random variable
- Mean
- Variance (2 formulas)
- Properties of Mean and Variance
  # E(X + c) = E(X) + c, E(cX) = cE(X)
  # Var(X + c) = Var(X)
  # Var(cX) = c^2 Var(X), SD(cX) = |c| SD(X)
  # E(X + Y) = E(X) + E(Y) (always valid)
  # Var(X + Y) = Var(X) + Var(Y) when X and Y are independent
- Sums and Means of i.i.d. Random Variables

5.1 The Sampling Distribution for a Sample Mean
- Statistical Model of Simple Random Sampling
  # Observations are nearly i.i.d. when the population is much larger than the sample
- CLT

5.2 Sampling Distributions for Counts and Proportions
- Binomial formula
- When is it all right to use the binomial formula?
- CLT
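
For the binomial formula and CLT items in 5.2, a short Python sketch may help when checking hand calculations. It is only an illustration under stated assumptions, not part of the course material: the function names (binomial_pmf, binomial_cdf, normal_cdf, normal_approx_cdf) and the example values n = 100, p = 0.5, k = 45 are made up here, and the normal approximation below uses no continuity correction.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Binomial formula: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k: int, n: int, p: float) -> float:
    """Exact P(X <= k), summing the binomial formula over 0..k."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


def normal_cdf(z: float) -> float:
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_approx_cdf(k: int, n: int, p: float) -> float:
    """CLT-based approximation of P(X <= k) using mean np and SD sqrt(np(1-p)).

    Assumption: no continuity correction is applied.
    """
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return normal_cdf((k - mu) / sigma)


# Hypothetical example values, chosen only for illustration.
n, p, k = 100, 0.5, 45
print("exact  P(X <= k):", binomial_cdf(k, n, p))
print("approx P(X <= k):", normal_approx_cdf(k, n, p))
```

Comparing the two printed values for different n and p gives a feel for when the normal approximation is reasonable; it tends to be poor when np or n(1 - p) is small.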