
Thursday, March 06, 2014

PSYC 2002: Correlation

1. Linear Correlation

Correlational designs allow us to measure each participant on each variable and look for a relationship between two (or more) variables, which is quantified with a correlation coefficient and visualized with a scatterplot. Remember, you cannot infer causation from a correlational design.

The correlation coefficient, $r$, is a statistic used to quantitatively express the extent to which two variables are related, and the direction of that relationship. Know when to use each of the following (a quick code sketch follows the list):

  • (Nominal - same) Phi coefficient uses two dichotomous, nominal variables.
  • (Nominal - not same) Point-biserial $r$ is used when one variable is dichotomous (nominal) and the other is interval-ratio (so the variables don't both have to be dichotomous).
  • (Ordinal - ranks) Spearman $r$ uses ordinal variables; the data must be expressed as ranks.
  • (Ordinal - allows ties) Kendall's Tau uses ordinal data when the data allows you to have ties.
  • (Interval-ratio) Pearson's $r$ is for interval-ratio data (Pearson product moment correlation coefficient).
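Assuming SciPy is available, here is a minimal sketch (with made-up data and variable names) of standard functions for most of the coefficients above:

```python
import numpy as np
from scipy import stats

# Two dichotomous (0/1) variables and two interval-ratio variables (made up).
a = np.array([0, 1, 1, 0, 1, 0, 1, 1])
b = np.array([1, 1, 0, 0, 1, 0, 1, 0])
scores = np.array([3.1, 4.0, 2.5, 3.8, 4.4, 2.9, 4.1, 3.3])
heights = np.array([1.0, 2.2, 0.9, 2.0, 2.6, 1.1, 2.3, 1.4])

# Phi: Pearson's r applied to two dichotomous variables coded 0/1.
phi, _ = stats.pearsonr(a, b)

# Point-biserial: one dichotomous variable, one interval-ratio variable.
rpb, _ = stats.pointbiserialr(a, scores)

# Spearman's rho: ordinal data as ranks (ranks are computed internally).
rho, _ = stats.spearmanr(scores, b)

# Kendall's tau: ordinal data, tolerates ties.
tau, _ = stats.kendalltau(scores, b)

# Pearson's r: interval-ratio data.
r, _ = stats.pearsonr(scores, heights)
```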

1.1. Covariance and Linear Correlation - Formulas Explained!

The cross product sum of squares (or "sum of cross product" or "cross product") is equal to the sum of the product of the $x$ and $y$ deviation scores. This indicates the degree to which two random variables ($x$ and $y$) covary. This is represented as:

$\displaystyle {SS}_{xy} = \sum{(X - \bar{X})(Y - \bar{Y})}$
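As a quick illustration (assuming NumPy; the numbers are made up), the cross product is just the deviation scores multiplied pairwise and summed:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Sum of the products of the x and y deviation scores.
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
print(ss_xy)  # positive when x and y tend to move together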

Because the cross product grows as you add more observations, you can adjust it by dividing by the sample size, $N$. Dividing by $N$ normalizes the result to an average per pair. (The scale problem - e.g., getting a different result on a 1-10 scale versus a 1-100 scale - is handled by the standardization step below.)

This gives you the unstandardized covariance of $x$ and $y$:

$\displaystyle S_{xy} = \frac{\sum{(X - \bar{X})(Y - \bar{Y})}}{N}$
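A minimal check with the same made-up data as above: dividing by $N$ matches what NumPy calls the biased covariance (np.cov divides by $N - 1$ unless you pass bias=True):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Covariance: the cross product divided by the sample size N.
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / len(x)

# np.cov divides by N-1 by default; bias=True divides by N to match.
assert np.isclose(s_xy, np.cov(x, y, bias=True)[0, 1])
```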

A linear correlation is the standardized version of covariance. This value allows us to compare covariances for different $x$ and $y$ pairs (much like how z-scores allow us to compare raw scores from different distributions). Just like covariance, the linear correlation indicates a linear relationship between two variables - the extent to which a change in one variable ($x$, for example) is linearly related to a change in another variable ($y$, for example).

$\displaystyle r_{xy} = \frac{S_{xy}}{S_x S_y}$

Which, when expanded, looks like this:

$\displaystyle r_{xy} = \frac{\dfrac{\sum{(X - \bar{X})(Y - \bar{Y})}}{N}}{\sqrt{\dfrac{\sum{(X - \bar{X})^2}}{N}} \sqrt{\dfrac{\sum{(Y - \bar{Y})^2}}{N}}}$


Or, in other words:

$\displaystyle \text{linear correlation} = \frac{\text{covariance of X and Y}}{\sqrt{\text{X variance}} \sqrt{\text{Y variance}}}$
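Here is that ratio computed directly and checked against NumPy's built-in correlation (same made-up data; note np.std divides by $N$ by default, matching the formulas here):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
n = len(x)

# Covariance of x and y over the product of the standard deviations.
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n
r_xy = cov_xy / (x.std() * y.std())

# Same value from NumPy's built-in correlation matrix.
assert np.isclose(r_xy, np.corrcoef(x, y)[0, 1])
```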

We can further factor out the $N$s (same answer, faster to calculate) and we get the linear correlation as:

$\displaystyle r_{xy} = \frac{\sum{(X - \bar{X})(Y - \bar{Y})}}{\sqrt{\sum{(X - \bar{X})^2} \sum{(Y - \bar{Y})^2}}}$

And (even better!) if you happen to have the Z-scores already calculated, you can compute the covariance between the z-scores using a simplified formula:

$\displaystyle r_{xy} = \frac{\sum{Z_x Z_y}}{N}$
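A quick sketch of the z-score version (scipy.stats.zscore uses the population standard deviation, ddof = 0, by default, which matches the $N$ in the formulas above):

```python
import numpy as np
from scipy.stats import zscore

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# r is the mean of the products of the paired z-scores.
r_xy = np.mean(zscore(x) * zscore(y))

assert np.isclose(r_xy, np.corrcoef(x, y)[0, 1])
```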

1.2. Understanding $r$

Once we have calculated the correlation coefficient, $r$, we have to interpret it:

  1. The direction can be upward ($+r$), downward ($-r$), or non-existent ($r = 0$).
  2. The nature (form) of the relationship: linear (what Pearson's $r$ captures) or monotonic/curved (better captured by Spearman's $\rho$).
  3. The consistency or strength. The closer $|r|$ is to 1, the more consistent the relationship (i.e., the more tightly the data points cluster on the scatterplot).
    • Even if $r = 0$, there may still be a relationship - it may just be curvilinear (see the sketch below).
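To see that caveat in action, here is a perfect curvilinear relationship where Pearson's $r$ comes out at essentially zero (made-up data):

```python
import numpy as np

# A perfect quadratic relationship: y depends entirely on x, but not linearly.
x = np.linspace(-3, 3, 101)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(round(r, 10))  # ~0.0: no *linear* relationship despite a perfect curve
```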

Be careful! There are 3 factors that can affect $r_{xy}$:

  1. Nonlinearity: if the relationship between two variables is curvilinear, $r$ may be low. Statistical computation is not a substitute for actually looking at the graph.
  2. Outliers: a single outlier can have a major impact on the size of the correlation (just as outliers can have a major impact on your mean and standard deviation, it should be no surprise that they affect a statistic that uses the standard deviation and mean in its formula; a quick demonstration follows this list).
  3. Range restriction: if the range of the $x$ or $y$ values is restricted, the 'true' correlation may be larger than what you computed.
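Here is a quick demonstration of the outlier problem: a single extreme point drags $r$ away from zero on otherwise unrelated, randomly generated data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = rng.normal(size=30)  # unrelated, so r should be near 0

r_before = np.corrcoef(x, y)[0, 1]

# Add one extreme outlier to both variables and recompute.
x_out = np.append(x, 10.0)
y_out = np.append(y, 10.0)
r_after = np.corrcoef(x_out, y_out)[0, 1]

print(r_before, r_after)  # one point can push r far from 0
```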

We can use the coefficient of determination ($r^2$ or $r_{xy}^2$) to determine the strength of the relationship between two variables. This is the proportion of variance in one variable that can be explained by the variance in another variable. Naturally, any squared value must be non-negative, so $0 \leq r_{xy}^2 \leq 1$.

  • E.g., if we have a correlation coefficient $r = 0.80$, we calculate $r_{xy}^2 = 0.64$ and finally $1 - r^2 = 1 - 0.64 = 0.36$.
    This says that 36% of the variability in one variable cannot be explained by the variability in the other.
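As a quick sanity check in code:

```python
r = 0.80
r2 = r ** 2            # 0.64: proportion of variance explained
unexplained = 1 - r2   # 0.36: proportion left unexplained
print(r2, unexplained)
```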

In the social sciences, we normally use the following as a guide (but it's a loose guide! In some research areas, such as studies of criminal populations, even a 0.30 correlation is viewed as a major finding - so check with your peers and their findings first).

  • Small correlation: $0 \leq |r| < 0.30$
  • Medium correlation: $0.30 \leq |r| < 0.50$
  • Large correlation: $0.50 \leq |r| \leq 1$
