Saturday, July 22, 2017

PSYC 2002: Linear Regression

1. Linear Regression

Linear regression uses past information about the relationship between variables to make predictions. If we find that two variables are linearly related, we can use the slope equation to predict one variable (the criterion variable) from the other (the predictor variable).

The line of best fit (a.k.a., least-squares regression line) minimizes the sum of squared vertical distances from each $(x, y)$ coordinate to the line. If you remember back to elementary school, the slope equation of a line is calculated as:

$y = mx + b$

The regression equation is similarly calculated as:

$Y^\prime = b_{Y}X + a_{Y}$

Where

  • $Y^\prime$ is the predicted $Y$
  • $b_{Y}$ is the slope (the change in $y$ per change in $x$)
  • $X$ is the predictor variable
  • $a_{Y}$ is the y-intercept (where the regression line crosses the y-axis)

Often, we'll be asked to figure out the regression line equation (a numeric sketch follows the steps below):

  1. Calculate slope (use 1 of 2 ways):
    1. Based on the correlation coefficient and the standard deviation of $x$ and $y$:

      $\displaystyle b_{Y} = r_{xy} (\frac{S_{y}}{S_{x}}) $

    2. Based on the quotient of the cross product divided by the $x$ sum of squares:

      $\displaystyle b_{Y} = \frac{SS_{xy}}{SS_{x}} $

      $\displaystyle b_{Y} = \frac{\sum (X - \bar X)(Y - \bar Y)}{\sum (X - \bar X)^2} $

  2. Calculate the y-intercept by rearranging the original regression line formula, using the means of $x$ and $y$:

    $\displaystyle a_{Y} = \bar Y - b_{Y} \bar X$
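
For example, here's a minimal sketch in Python (with made-up data) of both steps, using the sum-of-squares form of the slope:

```python
# Minimal sketch with made-up data: compute the regression slope and intercept
# using b_Y = SS_xy / SS_x and a_Y = mean(Y) - b_Y * mean(X).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)

b_y = ss_xy / ss_x           # slope
a_y = mean_y - b_y * mean_x  # y-intercept

print(f"Y' = {b_y:.2f}X + {a_y:.2f}")  # Y' = 0.60X + 2.20
```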

1.1. Unexplained and Explained Variation around the Regression Line

Once we have our regression line equation, we can determine how awesome our predictions are by partitioning the total variation into explained and unexplained variation:

total variation = explained variation + unexplained variation

$SS_{tot} = SS_{reg} + SS_{error}$

$\displaystyle \sum (Y - \bar Y)^2 = \sum (Y^\prime - \bar Y)^2 + \sum (Y - Y^\prime)^2$

To make this clear, you work with three values when you calculate the regression equation: the actual $x$, the actual $y$, and the predicted $y$ (a quick check of this partition follows the list below).

  • The total error (total variation) is the sum of squared deviations of the actual $y$ values from the mean of $y$.
  • The regression error (explained variation) is the sum of squared deviations of the predicted $y$ values from the mean of $y$.
  • The unexplained error (unexplained variation) is the sum of squared deviations of the actual $y$ values from the predicted $y$ values.
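
Here's a small sketch (reusing the made-up data from the earlier example) that checks the partition numerically:

```python
# Sketch with made-up data: verify SS_tot = SS_reg + SS_err for a fitted line.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
b_y = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
      sum((xi - mean_x) ** 2 for xi in x)
a_y = mean_y - b_y * mean_x

y_pred = [b_y * xi + a_y for xi in x]  # predicted Y (Y')

ss_tot = sum((yi - mean_y) ** 2 for yi in y)               # total variation
ss_reg = sum((yp - mean_y) ** 2 for yp in y_pred)          # explained variation
ss_err = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))  # unexplained variation

print(round(ss_tot, 3), round(ss_reg + ss_err, 3))  # 6.0 6.0: the partition holds
print(round(ss_reg / ss_tot, 3))                    # 0.6, which is r_xy squared (next section)
```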

1.2. Coefficient of Determination

If we need to know the coefficient of determination, or the proportion of variance in one variable that can be explained by another variable, we simply need to divide the explained error in $y$ by the total error:

$\displaystyle r^2 = \frac{SS_{reg}}{SS_{tot}} = \frac{\sum (Y^\prime - \bar Y)^2}{\sum (Y - \bar Y)^2} $

1.3. Coefficient of Non-Determination

If we need to know the coefficient of non-determination (a.k.a., alienation), or the proportion of variance in one variable that cannot be explained by another variable, we simply need to divide the unexplained error in $y$ by the total error:

$\displaystyle 1 - r^2 = \frac{SS_{err}}{SS_{tot}} = \frac{\sum (Y - Y^\prime)^2}{\sum (Y - \bar Y)^2} $

1.4. Standard Error of Estimate

Because the regression line minimizes the sum of squared errors of prediction, we can use the square root of the average squared error of prediction to determine the accuracy of the prediction. This gives us the standard deviation of the points around the regression line. This is known as the standard error of estimate and is calculated as:

$\displaystyle S_{Y.X} = \sqrt{\frac{SS_{err}}{N}} = \sqrt{\frac{\sum (Y - Y^\prime)^2}{N}} $

If we just want to know the variance of the points around the line, we would calculate it without the square root:

$\displaystyle S_{Y.X}^2 = \frac{SS_{err}}{N} = \frac{\sum (Y - Y^\prime)^2}{N} $

1.5. Estimated Population Error Variance

We can calculate the estimated population error variance by dividing the unexplained variation by its degrees of freedom ($N - 2$) instead of $N$; taking the square root, as below, gives the estimated standard error of estimate. (Note that the $S$ has a hat, which means "estimate".)

$\displaystyle \hat S_{Y.X} = \sqrt{\frac{\sum (Y - Y^\prime)^2}{N - 2}} $
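
A quick sketch (made-up data again) comparing the descriptive standard error of estimate (divide by $N$) with the estimated population version (divide by $N - 2$):

```python
import math

# Made-up data: compare S_{Y.X} (divide by N) with the estimate that divides by N - 2.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
b_y = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
      sum((xi - mean_x) ** 2 for xi in x)
a_y = mean_y - b_y * mean_x

ss_err = sum((yi - (b_y * xi + a_y)) ** 2 for xi, yi in zip(x, y))

s_yx = math.sqrt(ss_err / n)            # standard error of estimate (descriptive)
s_yx_hat = math.sqrt(ss_err / (n - 2))  # estimated population version (df = N - 2)

print(round(s_yx, 3), round(s_yx_hat, 3))  # 0.693 0.894
```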

1.6. Linear Regression with Z-Scores

If we are working with standardized z-scores, we can simplify our equation a little. Because the y-intercept with z-scores is always 0, we just multiply the z-score covariance (which is $r_{xy}$) by the predictor variable $Z_{X}$:

$\displaystyle Z_{Y}^\prime = r_{xy}Z_{X} $

This expands to:

$\displaystyle Z_{Y}^\prime = \left(\frac{\sum Z_{x}Z_{y}}{N}\right)Z_{X} $
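
A short sketch (made-up data) showing that, once the scores are standardized, the prediction is just $r_{xy}$ times $Z_X$:

```python
import math

# Made-up data: standardize X and Y, then predict Z_Y' = r_xy * Z_X.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)

z_x = [(xi - mean_x) / s_x for xi in x]
z_y = [(yi - mean_y) / s_y for yi in y]

r_xy = sum(zx * zy for zx, zy in zip(z_x, z_y)) / n  # correlation = mean cross product of z-scores

z_y_pred = [r_xy * zx for zx in z_x]                 # predicted standardized Y
print(round(r_xy, 3))                                # 0.775
print([round(z, 2) for z in z_y_pred])               # [-1.1, -0.55, 0.0, 0.55, 1.1]
```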

Thursday, March 06, 2014

PSYC 2002: Correlation

1. Linear Correlation

Correlational designs allow us to measure each participant on each variable and look for a relationship between two (or more) variables, which is assessed using a correlation coefficient and a scatterplot. Remember, you cannot infer causation.

The correlation coefficient, $r$, is a statistic used to quantitatively express the extent to which two variables are related, and the direction of that relationship (know when to use these):

  • (Nominal - same) Phi coefficient uses two dichotomous, nominal variables.
  • (Nominal - not same) Bi-serial $r$ uses nominal variables (they don't have to be dichotomous).
  • (Ordinal - ranks) Spearman $r$ uses ordinal variables and must be expressed as ranks.
  • (Ordinal - allows ties) Kendall's Tau uses ordinal data when the data allows you to have ties.
  • (Interval-ratio) Pearson's $r$ is for interval-ratio data (Pearson product moment correlation coefficient).

1.1. Covariance and Linear Correlation - Formulas Explained!

The cross product sum of squares (or "sum of cross product" or "cross product") is equal to the sum of the product of the $x$ and $y$ deviation scores. This indicates the degree to which two random variables ($x$ and $y$) covary. This is represented as:

$\displaystyle {SS}_{xy} = \sum{(X - \bar{X})(Y - \bar{Y})}$

Because the cross product varies depending on the scale (e.g., you would get a different result if you used a 1-10 scale versus a 1-100 scale), you can adjust the cross product by dividing it by the sample size, $N$. Dividing by $N$ normalizes the result.

This gives you the unstandardized covariance of $x$ and $y$:

$\displaystyle S_{xy} = \frac{\sum{(X - \bar{X})(Y - \bar{Y})}}{N}$
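
A tiny sketch (invented scores) of the cross product and the covariance:

```python
# Invented scores: compute the cross product SS_xy and the covariance S_xy.
x = [2, 4, 6, 8]
y = [1, 3, 2, 6]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xy = ss_xy / n  # covariance: the cross product divided by N

print(ss_xy, s_xy)  # 14.0 3.5
```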

A linear correlation is the standardized version of covariance. This value allows us to compare covariances for different $x$ and $y$ pairs (much like how z-scores allow us to compare raw scores from different distributions). Just like covariance, the linear correlation indicates a linear relationship between two variables - the extent to which a change in one variable ($x$, for example) is linearly related to a change in another variable ($y$, for example).

$\displaystyle r_{xy} = \frac{S_{xy}}{S_x S_y}$

Which, when expanded, looks like this:

$\displaystyle r_{xy} = \frac{\dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{N}}{\sqrt{\dfrac{\sum (X - \bar{X})^2}{N}} \sqrt{\dfrac{\sum (Y - \bar{Y})^2}{N}}}$


OR in other words:

$\displaystyle \text{linear correlation} = \frac{\text{covariance of X and Y}}{\sqrt{\text{X variance}} \sqrt{\text{Y variance}}}$

We can further factor out the $N$ (same thing, just faster to calculate) and we get the linear correlation as:

$\displaystyle r_{xy} = \frac{\sum{(X - \bar{X})(Y - \bar{Y})}}{\sqrt{\sum{(X - \bar{X})^2} \sum{(Y - \bar{Y})^2}}}$

And (even better!) if you happen to have the Z-scores already calculated, you can compute the covariance between the z-scores using a simplified formula:

$\displaystyle r_{xy} = \frac{\sum{Z_x Z_y}}{N}$
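
Here's a small check (same invented scores as above) that the covariance form, the factored form, and the z-score form of $r_{xy}$ all agree:

```python
import math

# Invented scores: compute r_xy three ways and confirm they match.
x = [2, 4, 6, 8]
y = [1, 3, 2, 6]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
dev_x = [xi - mean_x for xi in x]
dev_y = [yi - mean_y for yi in y]

# 1) covariance divided by the product of the standard deviations
s_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y)) / n
s_x = math.sqrt(sum(dx ** 2 for dx in dev_x) / n)
s_y = math.sqrt(sum(dy ** 2 for dy in dev_y) / n)
r1 = s_xy / (s_x * s_y)

# 2) cross product divided by sqrt(SS_x * SS_y) -- the N's cancel out
r2 = sum(dx * dy for dx, dy in zip(dev_x, dev_y)) / \
     math.sqrt(sum(dx ** 2 for dx in dev_x) * sum(dy ** 2 for dy in dev_y))

# 3) mean of the products of the z-scores
r3 = sum((dx / s_x) * (dy / s_y) for dx, dy in zip(dev_x, dev_y)) / n

print(round(r1, 4), round(r2, 4), round(r3, 4))  # all three print 0.8367
```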

1.2. Understanding $r$

Once we have calculated the correlation coefficient, $r$, we have to interpret it:

  1. The direction can be upward ($+r$), downward ($-r$), or non-existent ($r = 0$).
  2. The nature (form) of the relationship: linear (Pearson's $r$) or curved (Spearman's $\rho$).
  3. The consistency or strength. The closer $r$ is to 1, the more consistent the relationship (i.e., the closer the data points appear on the scatter plot).
    • Even if $r = 0$, there may be a relationship - it may just be curvilinear.

Be careful! There are 3 factors that can affect $r_{xy}$:

  1. Nonlinearity: if the relationship between two variables is curvilinear, $r$ may be low. Statistical computation is not a substitute for actually looking at the graph.
  2. Outliers: a single outlier can have a major impact on the size of the correlation (just as outliers can have a major impact on your mean and standard deviation, it should be no surprise that they affect a statistic that uses the standard deviation and the mean in its formula).
  3. Range Restriction: if the range of the $x$ or $y$ values is restricted, the 'true' correlation may be larger than what you computed.

We can use the coefficient of determination ($r^2$ or $r_{xy}^2$) to determine the strength of the relationship between two variables. This is the proportion of variance in one variable that can be explained by the variance in another variable. Naturally, any squared value must be non-negative, so $0 \leq r_{xy}^2 \leq 1$.

  • E.g., if we have a correlation coefficient $r = 0.80$, we calculate $r_{xy}^2 = 0.64$ and finally $1 - r^2 = 1 - 0.64 = 0.36$.
    This says that 36% of the variability in one variable cannot be explained by the variability in another.

In the social sciences, we normally use this as a guide (but it's a loose guide! In some populations, like criminals, even a 0.30 correlation is viewed as a major find - so check with your peers and their findings first).

  • Small correlation: $0 \leq r < 0.30$
  • Medium correlation: $0.30 \leq r < 0.50$
  • Large correlation: $0.50 \leq r \leq 1$

PSYC 2002: Descriptive Statistics

1. Measures of Central Tendency

Measures of central tendency describe the most typical value in the distribution. There are three measures:

  1. Mode: indicates the most frequently occurring category or the category that contains the most data values. Used primarily for nominal data but can be used with others.
    • Unimodal: 1 mode.
    • Bimodal: 2 modes.
    • Multimodal: many modes.
    • Rectangle/Uniform: no modes (all the scores have the same frequency).
  2. Median: indicates the data point that falls in the middle of a ranked data set (i.e., the 50th percentile, $X_{.50}$, or the second quartile, $Q_2$). Used primarily for ordinal data. Can be useful with skewed interval and ratio data.
    • Order the numbers from the smallest to largest and find the middle number. If there are two middle numbers, calculate the middle of the two.
    • Equation:
      $\displaystyle \text{median location} = \frac{N+1}{2}, \text{ where } N \text{ is the number of scores in your sample}$
  3. Mean: indicates the average value for a data set. Used for interval and ratio data.
    • Equation: where x-bar is equal to the sum of each value ($X$), divided by the number of values.
      $\displaystyle \bar{X} = \frac{\sum X}{N}$
      • Summed Deviations: the summed deviations (the distance from any point to the mean) will always equal zero $\sum(X - \bar{X}) = 0$.
      • The Sum of the Squared Deviations (a.k.a., concept of least squares) gives you the unadjusted measure of dispersion: $\sum(X - \bar{X})^2$ (This is ${SS}_x$... you'll use this in the Linear Regression post.)
    • Weighted Mean: calculated as the overall mean for two or more different samples, where each sample mean is weighted by its sample size (so the data set with more scores counts for more). $\bar{X}$ is equal to the sum of each sample mean ($X_i$) multiplied by its weight ($a_i$), divided by the sum of the weights (see the sketch after this list).
      $\displaystyle \bar{X} = \frac{\sum_{i=1}^k a_i X_i}{\sum_{i=1}^k a_i}$
    • Mean, median and mode relationships: in single-peaked, symmetrical distributions, the mean, median and mode are all in the same place (the middle). In skewed distributions, the mode is at the peak, the median is in the middle (duhh), and the mean is pulled towards the skew (i.e., the tail).
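
A quick sketch (toy data) of the mode, the median location, the mean, and a weighted mean of two hypothetical sample means:

```python
from collections import Counter
import statistics

# Toy data: mode, median location, median, and mean.
scores = [2, 3, 3, 5, 7, 8, 9]

mode = Counter(scores).most_common(1)[0][0]  # most frequent score
median_location = (len(scores) + 1) / 2      # (N + 1) / 2 = the 4th ranked score
median = statistics.median(scores)
mean = sum(scores) / len(scores)

print(mode, median_location, median, round(mean, 2))  # 3 4.0 5 5.29

# Weighted mean of two hypothetical sample means, weighted by their sample sizes.
means = [70, 80]  # sample means (X_i)
sizes = [10, 30]  # sample sizes, used as the weights (a_i)
weighted_mean = sum(a * xi for a, xi in zip(sizes, means)) / sum(sizes)
print(weighted_mean)  # 77.5
```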

2. Measures of Variability (Dispersion or Spread)

There are four measures of variability:

  1. Range (R): the difference between the highest and lowest scores.
    • Issues: sensitive to extreme scores and doesn't provide much information.
  2. Interquartile Range (IQR or IR): describes the spread of the data set while ignoring the extreme scores. It is calculated as the difference between the 75th percentile ($Q_3$) and the 25th percentile ($Q_1$). $Q_2$ is the median (the 50th percentile).
    • $IQR = Q_3 - Q_1$
      $\text{Where } Q_3 = \text{value of } X_{.75}\text{, located at position } (N + 1)(.75)$
      $\text{And } Q_1 = \text{value of } X_{.25}\text{, located at position } (N + 1)(.25)$
    • Issue: removes the extreme scores, which may be important.
  3. Variance Estimate: because $SS_x$ is affected by $N$, we can correct for this difference by calculating the variance.
    $\displaystyle S_{x}^2 = \frac{SS_x}{N} = \frac{\sum (X - \bar{X})^2}{N}$
    • Unbiased Variance Estimate: the sample variance $S_x^2$ is a biased estimate of the population variance (it systematically underestimates it). In order to obtain an unbiased estimate of the population variance, divide $SS_x$ by the degrees of freedom instead of $N$.
      $\displaystyle \hat{S}_x^2 = \frac{SS_x}{df} = \frac{SS_x}{N-1} = \frac{\sum (X - \bar{X})^2}{N-1}$
  4. Standard Deviation ($S_X$): calculate the standard deviation by taking the square root of the variance (a numeric sketch follows this list).
    $\displaystyle \text{Standard Deviation} = \sqrt{\text{Variance}}$

    $\displaystyle S_x = \sqrt{S_x^2} = \sqrt{\frac{{SS}_x}{N}} = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$
    • Deviation Score, or distance of a score from the mean, is equal to $X - \bar{X}$; the sum of these scores is always zero.
    • Sum of Squares is equal to ${SS}_x = \sum (X - \bar{X})^2$
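
A short sketch (toy data, with $N = 7$ so the quartile locations land on whole positions) of the four measures:

```python
import math

# Toy data: range, IQR, variance, unbiased variance estimate, and standard deviation.
scores = sorted([2, 3, 3, 5, 7, 8, 9])
n = len(scores)

data_range = max(scores) - min(scores)  # R = 7
q1 = scores[int((n + 1) * 0.25) - 1]    # value at position (N + 1)(.25) = 2nd score = 3
q3 = scores[int((n + 1) * 0.75) - 1]    # value at position (N + 1)(.75) = 6th score = 8
iqr = q3 - q1                           # 5

mean = sum(scores) / n
ss_x = sum((x - mean) ** 2 for x in scores)
variance = ss_x / n                     # S_x^2 (divide by N)
variance_unbiased = ss_x / (n - 1)      # divide by df = N - 1
sd = math.sqrt(variance)                # S_x

print(data_range, iqr, round(variance, 2), round(variance_unbiased, 2), round(sd, 2))
# 7 5 6.49 7.57 2.55
```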

These are the rules to follow in order to select a measure of variability:

  1. When mean is used as the measure of central tendency, standard deviation should be used as the measure of variability.
  2. When median is used as the measure of central tendency, interquartile range should be used as the measure of variability.

3. Transformations

3.1. Linear Transformations

Linear transformations apply a constant to every score, which can change the mean and standard deviation, but not the shape of the distribution. There are two ways to manipulate a dataset linearly (a quick numeric check follows the list):

  • Add/Subtract: add (or subtract) a constant, $b$, to all the scores.
    • The mean changes by the same constant:
      if $X + b \text{, then } \bar{X} + b$
    • The standard deviation does not change.
  • Multiply/Divide: multiply (or divide) all the scores by a constant, $m$.
    • The mean changes by the same factor:
      if $X \times m \text{, then } \bar{X} \times m$
    • The standard deviation changes by the same factor:
      if $X \times m \text{, then } S_x \times m$
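
A quick numeric check (toy data) of how these constants move the mean and standard deviation:

```python
import math

# Toy data: adding a constant shifts the mean but not the SD;
# multiplying by a constant scales both the mean and the SD.
scores = [1, 2, 3, 4, 5]

def mean_sd(xs):
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return round(m, 2), round(sd, 2)

print(mean_sd(scores))                    # (3.0, 1.41)
print(mean_sd([x + 10 for x in scores]))  # (13.0, 1.41) -- mean shifts, SD unchanged
print(mean_sd([x * 2 for x in scores]))   # (6.0, 2.83)  -- both scale by the factor
```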

3.2. Non-Linear Transformations

Non-linear transformations use a square root or a $log$, which changes the mean, standard deviation, and the shape. The scores are affected non-uniformly.

  • Useful to reduce the skew of the data, which allows you to manipulate and interpret your data more. However, some argue that it distorts the scale.

4. Standard Normal Distribution

The Standard Normal Distribution is a mesokurtic curve made up of z-scores. It has a mean of 0 and a standard deviation of 1. It allows us to calculate the proportion of area below specific sections of the curve (namely, $Q_1$ to $Q_4$).

  • We know that 50% of the scores are above and 50% of the scores are below the mean.
  • -1 to +1 standard deviations from the mean is 68.26% of the scores (34.13% above and below).
  • -2 to +2 standard deviations from the mean is 95.44% of the scores (the above values plus 13.59% above and below).
  • -3 to +3 standard deviations from the mean is 99.74% of the scores (the above values plus 2.15% above and below).
  • The rest makes up the last 0.26% (0.13% above and below).
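
These percentages can be checked against the standard normal curve; here's a small sketch using the error function from Python's math module (the tiny differences from the figures above are just table rounding):

```python
import math

# Area under the standard normal curve between -k and +k standard deviations,
# using the CDF Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
def area_within(k):
    phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return phi(k) - phi(-k)

for k in (1, 2, 3):
    print(k, round(100 * area_within(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```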

4.1. Standard Scores (Z-Scores)

Z-Score: describes the number of standard deviations any score is above or below the mean. It is calculated by dividing the deviation score (score minus mean) by the standard deviation.
E.g., A Z-score of -1.1 means that the score is 1.1 standard deviations below the mean.

$\displaystyle \text{Sample: } z = \frac{X - \bar{X}}{S_x}$

$\displaystyle \text{Population: } z = \frac{X - \mu_x}{\sigma_x}$
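
For example, a quick sketch (toy sample) of converting a raw score to a z-score:

```python
import math

# Toy sample: convert a raw score to a z-score using the sample mean and standard deviation.
scores = [4, 6, 8, 10, 12]
x = 6  # the raw score we want to standardize

mean = sum(scores) / len(scores)
sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))

z = (x - mean) / sd
print(round(z, 2))  # -0.71: the score sits 0.71 standard deviations below the mean
```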

Z-scores are useful for two reasons:

  • Allow you to describe the exact location of a score within a single distribution.
  • Allow you to compare z-scores in one distribution to z-scores in another distribution. It gives you a way to compare across different scales.

There are three characteristics of z-scores:

  • The mean of z-scores will always equal zero.
  • The variance and standard deviation of z-scores will always equal 1.
  • Converting the observed scores (X scores) to Z scores does not change the shape of the distribution.

4.2. Percentile Rank

The percentile rank tells us the percentage of scores that fall below a certain point. The standard normal curve allows us to find the percentages because we know the percent of area that falls under the standard normal curve.

There are a few z-score tables online. The important thing to understand is what's meant by "area beyond z" and "area between mean and z". With simple rearranging, we can calculate pretty much anything by referring to the graph.

  • Note 1: Column 3 + Column 5 = .5000 (the area between the mean and z plus the area beyond z covers half the curve)
  • Note 2: The area under the entire curve = 1
  • Note 3: Because the standard normal distribution curve is symmetrical, the values for the above table are the same for their negative counterparts.
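
To tie the notes together, here's a sketch that computes the "area between mean and z", the "area beyond z", and the percentile rank for an example z-score:

```python
import math

# Standard normal CDF via the error function.
def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 1.1  # example z-score

area_between_mean_and_z = phi(z) - 0.5  # "area between mean and z"
area_beyond_z = 1 - phi(z)              # "area beyond z"
percentile_rank = 100 * phi(z)          # percentage of scores falling below this z

print(round(area_between_mean_and_z, 4))  # 0.3643
print(round(area_beyond_z, 4))            # 0.1357
print(round(percentile_rank, 2))          # 86.43 -- and 0.3643 + 0.1357 = .5000 (Note 1)
```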