1. Linear Regression
Linear regression uses past information about the relationship between variables to make predictions. If we find that two variables are linearly related, we can use the slope equation to predict one variable (the criterion variable) from the other (the predictor variable).
The line of best fit (a.k.a., least-squares regression line) minimizes the sum of the squared vertical distances from each $(x, y)$ coordinate to the line. If you remember back to elementary school, the slope-intercept equation of a line is:
$y = mx + b$
The regression equation is similarly calculated as:
$Y^\prime = b_{Y}X + a_{Y}$
Where
- $Y^\prime$ is the predicted $Y$
- $b_{Y}$ is the slope (the change in $y$ per change in $x$)
- $X$ is the predictor variable
- $a_{Y}$ is the y-intercept (where the regression line crosses the y-axis)
Often, we'll be asked to figure out the regression line equation (a quick numeric sketch follows this list):
- Calculate the slope (one of two ways):
  - Based on the correlation coefficient and the standard deviations of $x$ and $y$:
    $\displaystyle b_{Y} = r_{xy} \left(\frac{S_{y}}{S_{x}}\right) $
  - Based on the cross product divided by the $x$ sum of squares:
    $\displaystyle b_{Y} = \frac{SS_{xy}}{SS_{x}} = \frac{\sum (X - \bar X)(Y - \bar Y)}{\sum (X - \bar X)^2} $
- Calculate the y-intercept by rearranging the original regression line formula, using the means of $x$ and $y$:
$\displaystyle a_{Y} = \bar Y - b_{Y} \bar X$
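To make the two steps concrete, here is a minimal Python sketch (plain Python, no libraries; the five $(x, y)$ pairs are made up purely for illustration) that computes the slope via the cross-product method and then the intercept:

```python
# Made-up example data (five observations).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope: b_Y = SS_xy / SS_x (cross product over the x sum of squares).
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_x = sum((xi - x_bar) ** 2 for xi in x)
b = ss_xy / ss_x

# Intercept: a_Y = y_bar - b_Y * x_bar.
a = y_bar - b * x_bar

print(f"Y' = {b:.2f}X + {a:.2f}")  # Y' = 0.60X + 2.20
```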
1.1. Unexplained and Explained Variation around the Regression Line
Once we have our regression line equation, we can determine how good our predictions are by partitioning the total variation into explained variation plus unexplained variation.
total variation = explained variation + unexplained variation
$SS_{tot} = SS_{reg} + SS_{error}$
$\displaystyle \sum (Y - \bar Y)^2 = \sum (Y^\prime - \bar Y)^2 + \sum (Y - Y^\prime)^2$
To make this clear, you have three values for each observation when you calculate the regression equation: the actual $x$, the actual $y$, and the predicted $y$ ($Y^\prime$). A quick numeric check follows the list below.
- The total variation ($SS_{tot}$) is the sum of the squared deviations of the actual $y$ values from the mean of $y$.
- The explained variation ($SS_{reg}$) is the sum of the squared deviations of the predicted $y$ values from the mean of $y$.
- The unexplained variation ($SS_{error}$) is the sum of the squared deviations of the actual $y$ values from the predicted $y$ values.
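Sticking with the made-up data (and the `b` and `a` from the sketch above), here is a quick check that the explained and unexplained pieces add back up to the total:

```python
# Predicted y for each actual x, using the b and a computed earlier.
y_pred = [b * xi + a for xi in x]

ss_tot = sum((yi - y_bar) ** 2 for yi in y)                   # total variation
ss_reg = sum((ypi - y_bar) ** 2 for ypi in y_pred)            # explained variation
ss_err = sum((yi - ypi) ** 2 for yi, ypi in zip(y, y_pred))   # unexplained variation

print(round(ss_tot, 6), round(ss_reg + ss_err, 6))  # 6.0 6.0 -> SS_tot = SS_reg + SS_err
```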
1.2. Coefficient of Determination
If we need to know the coefficient of determination, or the proportion of variance in one variable that can be explained by another variable, we simply need to divide the explained variation in $y$ by the total variation:
$\displaystyle r^2 = \frac{SS_{reg}}{SS_{tot}} = \frac{\sum (Y^\prime - \bar Y)^2}{\sum (Y - \bar Y)^2} $
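With the sums of squares from the running example above, the coefficient of determination is just the explained share of the total:

```python
r_squared = ss_reg / ss_tot
print(round(r_squared, 3))  # 0.6 -> 60% of the variance in y is explained by x
```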
1.3. Coefficient of Non-Determination
If we need to know the coefficient of non-determination (a.k.a., alienation), or the proportion of variance in one variable that cannot be explained by another variable, we simply need to divide the unexplained variation in $y$ by the total variation:
$\displaystyle 1 - r^2 = \frac{SS_{err}}{SS_{tot}} = \frac{\sum (Y - Y^\prime)^2}{\sum (Y - \bar Y)^2} $
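And the coefficient of non-determination is the leftover, unexplained share (again reusing the running example):

```python
non_det = ss_err / ss_tot
print(round(non_det, 3), round(1 - r_squared, 3))  # 0.4 0.4 -> same value either way
```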
1.4. Standard Error of Estimate
Because the regression line minimizes the sum of squared errors of prediction, we can use the square root of the average squared error of prediction to determine the accuracy of the prediction. This gives us the standard deviation of the points around the regression line. This is known as the standard error of estimate and is calculated as:
$\displaystyle S_{Y.X} = \sqrt{\frac{SS_{err}}{N}} = \sqrt{\frac{\sum (Y - Y^\prime)^2}{N}} $
If we just want to know the variance of the points around the line, we would calculate it without the square root:
$\displaystyle S_{Y.X}^2 = \frac{SS_{err}}{N} = \frac{\sum (Y - Y^\prime)^2}{N} $
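Reusing `ss_err` from the running example, a quick sketch of both quantities:

```python
from math import sqrt

s_yx_sq = ss_err / n   # variance of the points around the regression line
s_yx = sqrt(s_yx_sq)   # standard error of estimate
print(round(s_yx, 3), round(s_yx_sq, 3))  # 0.693 0.48
```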
1.5. Estimated Population Error Variance
We can estimate the population error variance by dividing the unexplained variation by the degrees of freedom, $N - 2$, instead of $N$; taking the square root gives the estimated population standard error of estimate (note that $\hat S$ has a hat, which means "estimate"):
$\displaystyle \hat S_{Y.X} = \sqrt{\frac{\sum (Y - Y^\prime)^2}{N - 2}} $
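The same calculation with $N - 2$ in the denominator, continuing the running example:

```python
from math import sqrt

s_yx_hat = sqrt(ss_err / (n - 2))  # divide by degrees of freedom (N - 2) instead of N
print(round(s_yx_hat, 3))          # 0.894
```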
1.6. Linear Regression with Z-Scores
If we are working with standardized z-scores, we can simplify our equation a little. Because the y-intercept is always 0 with z-scores, we simply multiply the covariance of the z-scores (which is the correlation coefficient, $r_{xy}$) by the predictor z-score $Z_{X}$:
$\displaystyle Z_{Y}^\prime = r_{xy}Z_{X} $
This expands to:
$\displaystyle Z_{Y}^\prime = \left(\frac{\sum Z_{X}Z_{Y}}{N}\right)Z_{X} $
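A short sketch of the z-score version, standardizing the running example's $x$ and $y$ by hand (dividing by $N$, i.e., population standard deviations, so the mean cross product matches the formula above):

```python
from math import sqrt

# Population standard deviations (divide by N) so the z-score slope works out to r_xy.
s_x = sqrt(ss_x / n)
s_y = sqrt(ss_tot / n)
z_x = [(xi - x_bar) / s_x for xi in x]
z_y = [(yi - y_bar) / s_y for yi in y]

# The z-score slope is the correlation coefficient: the mean cross product of the z-scores.
r_xy = sum(zx * zy for zx, zy in zip(z_x, z_y)) / n
z_y_pred = [r_xy * zx for zx in z_x]  # predicted z-scores of y
print(round(r_xy, 3))  # 0.775
```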