Once we’ve acquired data with multiple variables, one very important question is how the variables are related. For example, we could ask for the relationship between people’s weights and heights, or study time and test scores, or two animal populations. Regression is a set of techniques for estimating relationships, and we’ll focus on them for the next two chapters.
In this chapter, we’ll focus on finding one of the simplest types of relationship: linear. This process is unsurprisingly called linear regression, and it has many applications. For example, we can relate the force for stretching a spring and the distance that the spring stretches (Hooke’s law, shown in Figure 3.1a), or explain how many transistors the semiconductor industry can pack into a circuit over time (Moore’s law, shown in Figure 3.1b).
Despite its simplicity, linear regression is an incredibly powerful tool for analyzing data. While we’ll focus on the basics in this chapter, the next chapter will show how just a few small tweaks and extensions can enable more complex analyses.
Figure 3.1: Examples of where a line fit explains physical phenomena and engineering feats.^1 (a) In classical mechanics, one could empirically verify Hooke’s law by dangling a mass with a spring and seeing how much the spring is stretched; the fitted line y = −0.044 + 35 · x relates the force on the spring (Newtons) to the amount of stretch (mm). (b) In the semiconductor industry, Moore’s law is an observation that the number of transistors on an integrated circuit doubles roughly every two years.
^1 The Moore’s law image is by Wgsimon (own work) [CC-BY-SA-3.0 or GFDL], via Wikimedia Commons.
But just because fitting a line is easy doesn’t mean that it always makes sense. Let’s take another look at Anscombe’s quartet to underscore this point.
Recall Anscombe’s Quartet: 4 datasets with very similar statistical properties under a simple quantitative analysis, but that look very different. Here they are again, but this time with linear regression lines fitted to each one:
(Scatterplots of the four Anscombe datasets, each shown with its fitted regression line.)
For all 4 of them, the slope of the regression line is 0.500 (to three decimal places) and the intercept is 3.00 (to two decimal places). This just goes to show: visualizing data can often reveal patterns that are hidden by pure numeric analysis!
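To see this numerically, here is a minimal sketch in Python (numpy assumed; the data are the commonly published Anscombe values, typed in by hand) that fits a line to each of the four datasets and prints the nearly identical slopes and intercepts:

    import numpy as np

    # Commonly published Anscombe quartet values (datasets I-III share the same x).
    x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
    y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
    y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
    y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
    x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
    y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

    datasets = {"I": (x123, y1), "II": (x123, y2), "III": (x123, y3), "IV": (x4, y4)}
    for name, (x, y) in datasets.items():
        slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line fit
        print(f"dataset {name}: slope = {slope:.3f}, intercept = {intercept:.2f}")

The printed fits are essentially identical, even though scatterplots of the four datasets look nothing alike.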
We begin with simple linear regression in which there are only two variables of interest (e.g., weight and height, or force used and distance stretched). After developing intuition for this setting, we’ll then turn our attention to multiple linear regression, where there are more variables.
Disclaimer: While some of the equations in this chapter might be a little intimidating, it’s important to keep in mind that as a user of statistics, the most important thing is to understand their uses and limitations. Toward this end, make sure not to get bogged down in the details of the equations, but instead focus on understanding how they fit into the big picture.
We’re going to fit a line y = β0 + β1x to our data. Here, x is called the independent variable or predictor variable, and y is called the dependent variable or response variable.
Before we talk about how to do the fit, let’s take a closer look at the important quantities from the fit. (As an aside, the slope of the fitted line in Figure 3.1a is actually the inverse of the spring constant.)
Figure 3.2: An illustration of correlation strength. Each plot shows data with a particular correlation coefficient r. Values farther from 0 indicate a stronger relationship than values closer to 0. Negative values indicate an inverse relationship, while positive values indicate a direct relationship.
The least-squares estimates of the slope and intercept are

    \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = r \, \frac{s_y}{s_x}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},

where x̄, ȳ, s_x and s_y are the sample means and standard deviations for x values and y values, respectively, and r is the correlation coefficient, defined as

    r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right).
By examining the expression β̂1 = r · (s_y / s_x) for the estimated slope, we see that since the sample standard deviations s_x and s_y are positive quantities, the correlation coefficient r, which is always between −1 and 1, measures how strongly x and y are related and whether the trend is positive or negative. Figure 3.2 illustrates different correlation strengths.
The square of the correlation coefficient, r^2, is always between 0 and 1 and is called the coefficient of determination. As we’ll see later, it is also equal to the proportion of the total variability that’s explained by a linear model.
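To make these formulas concrete, here is a minimal sketch in Python (numpy assumed; the spring-like data are made up for illustration) that computes r from the definition above, checks it against np.corrcoef, and uses it to obtain the slope, intercept, and r^2:

    import numpy as np

    # Hypothetical spring data: force applied (x, Newtons) and stretch (y, mm).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([36.0, 70.0, 106.0, 139.0, 176.0])
    n = len(x)

    x_bar, y_bar = x.mean(), y.mean()
    s_x, s_y = x.std(ddof=1), y.std(ddof=1)        # sample standard deviations

    # Correlation coefficient from the definition above.
    r = np.sum((x - x_bar) / s_x * (y - y_bar) / s_y) / (n - 1)
    assert np.isclose(r, np.corrcoef(x, y)[0, 1])  # matches numpy's built-in value

    beta1_hat = r * s_y / s_x                      # estimated slope
    beta0_hat = y_bar - beta1_hat * x_bar          # estimated intercept
    print(f"slope = {beta1_hat:.2f}, intercept = {beta0_hat:.2f}, r^2 = {r**2:.3f}")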
As an extremely crucial remark, correlation does not imply causation! We devote the entire next page to this point, which is one of the most common sources of error in interpreting statistics.
Just because there’s a strong correlation between two variables doesn’t mean there is necessarily a causal relationship between them. For example, drowning deaths and ice-cream sales are strongly correlated, but that’s because both are affected by the season (summer vs. winter). In general, there are several possible cases, as illustrated below:
Figure 3.3: Different explanations for correlation between two variables; arrows represent causation.
(a) Causal link: even if there is a causal link between x and y, correlation alone cannot tell us whether y causes x or x causes y.
(b) Hidden cause: a hidden variable z causes both x and y, creating the correlation.
(c) Confounding factor: a hidden variable z and x both affect y, so the results also depend on the value of z.
(d) Coincidence: the correlation just happened by chance (e.g., the strong correlation between sun cycles and the number of Republicans in Congress, shown in (e)).
(e) The number of Republican senators in Congress (red) and the sunspot number (blue, before 1986) / inverted sunspot number (blue, after 1986). This figure comes from http://www.realclimate.org/index.php/archives/2007/05/fun-with-correlations/.
Figure 3.4: The test statistic for the correlation coefficient r for n = 10 (blue) and n = 100 (green).
which is also t-distributed with n − 2 degrees of freedom. The standard error is
    s_{\hat{\beta}_0} = \hat{\sigma} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}},

and σ̂ is given by Equation (3.9).
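As a sketch of how these quantities could be computed by hand (made-up data; scipy is assumed for the t quantile; in practice your statistics software reports all of this directly), here is the standard error of the intercept together with a 95% confidence interval:

    import numpy as np
    from scipy import stats

    # Made-up data for illustration.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.3, 2.8, 4.1, 4.5, 5.8, 6.2, 7.1, 8.0])
    n = len(x)

    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    resid = y - (beta0 + beta1 * x)
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))   # residual standard error

    s_beta0 = sigma_hat * np.sqrt(1.0 / n + x.mean() ** 2 / np.sum((x - x.mean()) ** 2))
    t_crit = stats.t.ppf(0.975, df=n - 2)               # 95% two-sided critical value
    print(f"beta0_hat = {beta0:.3f} +/- {t_crit * s_beta0:.3f}")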
For the correlation coefficient r, our test statistic is the standardized correlation
    t_r = r \sqrt{\frac{n - 2}{1 - r^2}},

which is t-distributed with n − 2 degrees of freedom. Figure 3.4 plots t_r against r.
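A similar sketch for the correlation test (again with made-up data) computes t_r directly and checks the resulting two-sided p-value against scipy.stats.pearsonr:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 10
    x = rng.normal(size=n)
    y = 0.5 * x + rng.normal(size=n)               # weakly related, made-up data

    r = np.corrcoef(x, y)[0, 1]
    t_r = r * np.sqrt((n - 2) / (1 - r ** 2))      # standardized correlation
    p = 2 * stats.t.sf(abs(t_r), df=n - 2)         # two-sided p-value, n-2 dof

    r_check, p_check = stats.pearsonr(x, y)        # should agree with the above
    print(f"t_r = {t_r:.3f}, p = {p:.4f} (pearsonr gives p = {p_check:.4f})")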
Let’s look at the prediction at a particular value x*, which we’ll call ŷ(x*). In particular:

    \hat{y}(x^*) = \hat{\beta}_0 + \hat{\beta}_1 x^*.

We can do this even if x* wasn’t in our original dataset.
Let’s introduce some notation that will help us distinguish between predicting the line versus predicting a particular point generated from the model. From the probabilistic model given by Equation (3.1), we can similarly write how y is generated for the new point x∗:
    y(x^*) = \underbrace{\beta_0 + \beta_1 x^*}_{\text{defined as } \mu(x^*)} + \varepsilon, \qquad (3.13)

where ε ∼ N(0, σ^2).
Then it turns out that the standard error s_μ̂ for estimating μ(x*) (i.e., the mean of the line at the point x*) using ŷ(x*) is:

    s_{\hat{\mu}} = \hat{\sigma} \sqrt{\frac{1}{n} + \underbrace{\frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}_{\text{distance from ``comfortable prediction region''}}}.
This makes sense because if we’re trying to predict for a point that’s far from the mean, then we should be less sure, and our prediction should have more variance. To compute the standard error for estimating a particular point y(x∗) and not just its mean μ(x∗), we’d also need to factor in the extra noise term ε in Equation (3.13):
    s_{\hat{y}} = \hat{\sigma} \sqrt{\underbrace{1}_{\text{added}} + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}.
While both of these quantities have the same value when computed from the data, when analyzing them we have to remember that they’re different random variables: ŷ has more variation because of the extra ε.
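The following sketch (made-up data) computes both standard errors at a new point x_star; s_y is always larger than s_mu because of the extra noise term, and both grow as x_star moves away from x̄:

    import numpy as np

    # Made-up data and a new point to predict at.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
    x_star = 8.0     # note: outside the observed range, so this is extrapolation

    n = len(x)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    resid = y - (beta0 + beta1 * x)
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))

    # "Leverage" term: 1/n plus the scaled squared distance of x_star from x-bar.
    leverage = 1.0 / n + (x_star - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    s_mu = sigma_hat * np.sqrt(leverage)          # std. error for the mean mu(x*)
    s_y = sigma_hat * np.sqrt(1.0 + leverage)     # std. error for a new point y(x*)

    print(f"prediction = {beta0 + beta1 * x_star:.2f}, s_mu = {s_mu:.2f}, s_y = {s_y:.2f}")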
Interpolation vs. extrapolation
As a reminder, everything here crucially depends on the probabilistic model given by Equation (3.1) being true. In practice, when we do prediction for some value of x we haven’t seen before, we need to be very careful. Predicting y for a value of x that is within the interval of points that we saw in the original data (the data that we fit our model with) is called interpolation. Predicting y for a value of x that’s outside the range of values we actually saw for x in the original data is called extrapolation.
For real datasets, even if a linear fit seems appropriate, we need to be extremely careful about extrapolation, which can often lead to false predictions!
One may ask: why not just use multiple linear regression and fit an extremely high-degree polynomial to our data? While the model then would be much richer, one runs the risk of overfitting, where the model is so rich that it ends up fitting to the noise! We illustrate this with an example; it’s also illustrated by a song^4.
Using too many features or too complex a model can often lead to overfitting. Suppose we want to fit a model to the points in panel (a) below. If we fit a linear model, it might look like panel (b). But the fit isn’t perfect. What if we use our newly acquired multiple regression powers to fit a 6th-order polynomial to these points? The result is shown in panel (c). While our errors are definitely smaller than they were with the linear model, the new model is far too complex, and will likely go wrong for values too far outside the range.
(a) A set of points with a simple linear relationship.
(b) The same set of points with a linear fit (blue).
(c) The same points with a 6th-order polynomial fit (green). As before, the linear fit is shown in blue.
We’ll talk a little more about this in Chapters 4 and 5.
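As a quick numerical illustration of the overfitting issue above (made-up, roughly linear data; numpy's polyfit is used for both fits), the 6th-order polynomial achieves a smaller in-sample error but typically gives a much worse prediction even slightly outside the observed range:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 10, 11)
    y = 6 + 0.5 * x + rng.normal(scale=0.5, size=x.size)   # roughly linear data

    linear = np.polyfit(x, y, deg=1)     # degree-1 (line) coefficients
    wiggly = np.polyfit(x, y, deg=6)     # degree-6 polynomial coefficients

    sse_linear = np.sum((y - np.polyval(linear, x)) ** 2)
    sse_wiggly = np.sum((y - np.polyval(wiggly, x)) ** 2)
    print(f"in-sample SSE: linear = {sse_linear:.2f}, degree-6 = {sse_wiggly:.2f}")

    # Extrapolating just beyond the data shows the danger of the richer model.
    print(f"at x = 12: linear predicts {np.polyval(linear, 12):.1f}, "
          f"degree-6 predicts {np.polyval(wiggly, 12):.1f}")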
We’ll represent our input data in matrix form as X, an n × p matrix where each row corresponds to a data point and each column corresponds to a feature. Since each output yi is just a single number, we’ll represent the collection as an n-element column vector y. Then our linear model can be expressed as
    y = X\beta + \varepsilon \qquad (3.15)

where β is a p-element vector of coefficients, and ε is an n-element vector where each element, like εi earlier, is normal with mean 0 and variance σ^2. Notice that in this version, we haven’t explicitly written out a constant term like β0 from before. We’ll often add a column of 1s to the matrix X to accomplish this (try multiplying things out and making sure you understand why this solves the problem). The software you use might do this automatically, so it’s something worth checking in the documentation.

^4 Machine Learning A Cappella, Udacity. https://www.youtube.com/watch?v=DQWI1kvmwRg
This leads to the following optimization problem:
    \min_{\beta} \sum_{i=1}^{n} (y_i - X_i \beta)^2, \qquad (3.16)

where \min_{\beta} just means “find values of β that minimize the following”, and Xi refers to row i of the matrix X.
We can use some basic linear algebra to solve this problem and find the optimal estimates:
    \hat{\beta} = (X^T X)^{-1} X^T y, \qquad (3.17)
which most computer programs will do for you. Once we have this, what conclusions can we make with the help of statistics? We can obtain confidence intervals and/or hypothesis tests for each coefficient, which most statistical software will do for you. The test statistics are very similar to their counterparts for simple linear regression.
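Here is a minimal sketch (made-up data) of Equation (3.17) in action: build X with a column of 1s for the intercept, solve the normal equations, and check the result against numpy's least-squares routine. (Solving the linear system directly is preferred numerically over forming the inverse explicitly.)

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 50, 3                                        # two features plus an intercept
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    beta_true = np.array([1.0, 2.0, -0.5])              # made-up "true" coefficients
    y = X @ beta_true + rng.normal(scale=0.3, size=n)

    # Equation (3.17): beta_hat = (X^T X)^{-1} X^T y, solved without an explicit inverse.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Cross-check with numpy's least-squares solver.
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    assert np.allclose(beta_hat, beta_lstsq)
    print(beta_hat)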
It’s important not to blindly test whether all the coefficients are different from zero: since this involves doing multiple comparisons, we’d need to correct appropriately using Bonferroni correction or FDR correction as described in the last chapter. But before even doing that, it’s often smarter to measure whether the model even explains a significant amount of the variability in the data: if it doesn’t, then it isn’t even worth testing any of the coefficients individually. Typically, we’ll use an analysis of variance (ANOVA) test to measure this. If the ANOVA test determines that the model explains a significant portion of the variability in the data, then we can consider testing each of the hypotheses and correcting for multiple comparisons.
We can also ask which features have the most effect: if a feature’s coefficient is 0 or close to 0, then that feature has little to no impact on the final result. We need to avoid the effect of scale: for example, if one feature is measured in feet and another in inches, even if they represent the same quantity, the coefficient for the feet feature will be twelve times larger. In order to avoid this problem, we’ll usually look at the standardized coefficients \hat{\beta}_k / s_{\hat{\beta}_k}.
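As a sketch of this scale effect (made-up data; the coefficient standard errors are computed with the usual formula s_β̂k = σ̂ √[(X^T X)^{-1}]_kk, which isn't spelled out above), refitting the same feature in inches and then in feet changes the raw coefficient by a factor of 12 but leaves the standardized coefficient unchanged:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 60
    height_in = rng.normal(loc=66, scale=4, size=n)            # a feature in inches
    y = 2.0 + 0.1 * height_in + rng.normal(scale=0.5, size=n)  # made-up response

    def fit_and_standardize(feature):
        X = np.column_stack([np.ones(n), feature])
        beta = np.linalg.solve(X.T @ X, X.T @ y)
        resid = y - X @ beta
        sigma2 = np.sum(resid ** 2) / (n - X.shape[1])
        s_beta = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
        return beta[1], beta[1] / s_beta[1]     # raw and standardized coefficient

    raw_in, std_in = fit_and_standardize(height_in)
    raw_ft, std_ft = fit_and_standardize(height_in / 12.0)     # same feature, in feet
    print(f"inches: raw = {raw_in:.3f}, standardized = {std_in:.1f}")
    print(f"feet:   raw = {raw_ft:.3f}, standardized = {std_ft:.1f}")  # raw is 12x larger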
How can we measure the performance of our model? Suppose for a moment that every point yi was very close to the mean ȳ: this would mean that each yi wouldn’t depend on xi, and that there wasn’t much random error in the value either. Since we expect that this shouldn’t be the case, we can try to understand how much of the variability comes from the prediction based on xi and how much comes from random error. The total variability splits into a piece explained by the model and a residual piece:

    \underbrace{\sum_i (y_i - \bar{y})^2}_{\text{SST (total)}} = \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\text{SSM (model)}} + \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\text{SSE (error)}},

and the ratio SSM/SST equals r^2,
where we note that r^2 is precisely the coefficient of determination mentioned earlier. Here, we see why r^2 can be interpreted as the fraction of variability in the data that is explained by the model.
One way we might evaluate a model’s performance is to compare the ratio SSM/SSE. We’ll do this with a slight tweak: we’ll instead consider the mean values, MSM = SSM/(p − 1) and MSE = SSE/(n − p), where the denominators correspond to the degrees of freedom. These new variables MSM and MSE have χ^2 distributions, and their ratio
    f = \frac{\text{MSM}}{\text{MSE}}
has what’s known as an F distribution with parameters p − 1 and n − p. The widely used ANOVA test for categorical data, which we’ll see in Chapter 6, is based on this F statistic: it’s a way of measuring how much of the variability in the data is from the model and how much is from random error, and comparing the two.
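Finally, a rough sketch (made-up data; scipy is assumed for the F distribution) of computing this F statistic and its p-value by hand, the way statistical software does internally:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, p = 40, 3                                    # p counts the intercept column too
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(scale=0.5, size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ beta_hat

    ssm = np.sum((y_hat - y.mean()) ** 2)           # variability explained by the model
    sse = np.sum((y - y_hat) ** 2)                  # residual (error) variability

    msm = ssm / (p - 1)
    mse = sse / (n - p)
    f_stat = msm / mse
    p_value = stats.f.sf(f_stat, p - 1, n - p)      # upper tail of the F distribution
    print(f"F = {f_stat:.2f}, p-value = {p_value:.4g}")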