

Thursday, March 06, 2014

PSYC 2002: Correlation

1. Linear Correlation

Correlational designs allow us to measure each participant on each variable and look for a relationship between two (or more) variables, which is quantified with a correlation coefficient and visualized with a scatterplot. Remember, you cannot infer causation.

The correlation coefficient, $r$, is a statistic used to quantitatively express the extent to which two variables are related, and the direction of that relationship (know when to use each of these):

  • (Nominal - same) Phi coefficient uses two dichotomous, nominal variables.
  • (Nominal - not same) Bi-serial $r$ uses nominal variables (they don't have to be dichotomous).
  • (Ordinal - ranks) Spearman $r$ uses ordinal variables and must be expressed as ranks.
  • (Ordinal - allows ties) Kendall's Tau uses ordinal data when the data allows you to have ties.
  • (Interval-ratio) Pearson's $r$ is for interval-ratio data (Pearson product moment correlation coefficient).

1.1. Covariance and Linear Correlation - Formulas Explained!

The cross product sum of squares (or "sum of cross product" or "cross product") is equal to the sum of the product of the $x$ and $y$ deviation scores. This indicates the degree to which two random variables ($x$ and $y$) covary. This is represented as:

$\displaystyle {SS}_{xy} = \sum{(X - \bar{X})(Y - \bar{Y})}$

Because the cross product varies depending on the scale (e.g., you would get a different result if you used a 1-10 scale versus a 1-100 scale), you can adjust the cross product by dividing it by the sample size, $N$. Dividing by the sample size normalizes the result.

This gives you the unstandardized covariance of $x$ and $y$:

$\displaystyle S_{xy} = \frac{\sum{(X - \bar{X})(Y - \bar{Y})}}{N}$

A linear correlation is the standardized version of covariance. This value allows us to compare covariances for different $x$ and $y$ pairs (much like how z-scores allow us to compare raw scores from different distributions). Just like covariance, the linear correlation indicates a linear relationship between two variables - the extent to which a change in one variable ($x$, for example) is linearly related to a change in another variable ($y$, for example).

$\displaystyle r_{xy} = \frac{S_{xy}}{S_x S_y}$

Which, when expanded, looks like this:

$\displaystyle r_{xy} = \frac{\dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{N}}{\sqrt{\dfrac{\sum (X - \bar{X})^2}{N}} \sqrt{\dfrac{\sum (Y - \bar{Y})^2}{N}}}$


OR in other words:

$\displaystyle \text{linear correlation} = \frac{\text{covariance of X and Y}}{\sqrt{\text{X variance}} \sqrt{\text{Y variance}}}$

The $N$s cancel out (same thing, just faster to calculate) and we get the linear correlation as:

$\displaystyle r_{xy} = \frac{\sum{(X - \bar{X})(Y - \bar{Y})}}{\sqrt{\sum{(X - \bar{X})}^2 \sum{(Y - \bar{Y})}^2}}$

And (even better!) if you happen to have the z-scores already calculated, $r$ is simply the average of the products of the paired z-scores:

$\displaystyle r_{xy} = \frac{\sum{Z_x Z_y}}{N}$
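
Here's a quick Python sketch (not from the lecture - the data are invented) showing that the covariance form and the z-score form of Pearson's $r$ give the same answer:

```python
# Minimal sketch: two equivalent ways to compute Pearson's r, with made-up data.
import math

x = [2, 4, 5, 7, 9]
y = [1, 3, 4, 8, 10]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Cross-product sum of squares: SS_xy = sum of (X - X-bar)(Y - Y-bar)
ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Covariance and standard deviations (dividing by N)
cov_xy = ss_xy / n
sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)

# Form 1: covariance divided by the product of the standard deviations
r1 = cov_xy / (sd_x * sd_y)

# Form 2: average of the products of the paired z-scores
zx = [(xi - mean_x) / sd_x for xi in x]
zy = [(yi - mean_y) / sd_y for yi in y]
r2 = sum(a * b for a, b in zip(zx, zy)) / n

print(round(r1, 4), round(r2, 4))  # both forms agree
```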

1.2. Understanding $r$

Once we have calculated the correlation coefficient, $r$, we have to interpret it:

  1. The direction can be upward ($+r$), downward ($-r$), or non-existent ($r = 0$).
  2. The nature (form) of the relationship: linear (Pearson's $r$) or curved/monotonic (Spearman's $\rho$).
  3. The consistency or strength. The closer $r$ is to 1, the more consistent the relationship (i.e., the closer the data points appear on the scatter plot).
    • Even if $r = 0$, there may be a relationship - it may just be curvilinear.

Be careful! There are 3 factors that can affect $r_{xy}$:

  1. Nonlinearity: if the relationship between two variables is curvilinear, $r$ may be low. Statistical computation is not a substitute for actually looking at the graph.
  2. Outliers: a single outlier can have a major impact on the size of the correlation (just like outliers can have a major impact on your mean and standard deviation, it should be no surprise that they affect a statistic that uses the standard deviation and mean in its formula).
  3. Range Restriction: if the range of the $x$ or $y$ values are restricted, the 'true' correlation may be larger than what you computed.

We can use the coefficient of determination ($r^2$ or $r_{xy}^2$) to determine the strength of the relationship between two variables. This is the proportion of variance in one variable that can be explained by the variance in another variable. Naturally, any squared value must be non-negative, so $0 \leq r_{xy}^2 \leq 1$.

  • E.g., if we have a correlation coefficient $r = 0.80$, we calculate $r_{xy}^2 = 0.64$ and finally $1 - r^2 = 1 - 0.64 = 0.36$.
    This says that 36% of the variability in one variable cannot be explained by the variability in the other.

In social sciences, we normally use this as a guide (but it's a loose guide! In some areas - for example, research on criminal populations - even a 0.30 correlation is viewed as a major finding, so check with your peers and their findings first).

  • Small correlation: $0 \leq r \leq 0.30$
  • Medium correlation: $0.30 \leq r \leq 0.50$
  • Large correlation: $0.50 \leq r \leq 1$

PSYC 2002: Descriptive Statistics

1. Measures of Central Tendency

Measures of central tendency describe the most typical value in the distribution. There are three measures:

  1. Mode: indicates the most frequently occurring category or the category that contains the most data values. Used primarily for nominal data but can be used with others.
    • Unimodal: 1 mode.
    • Bimodal: 2 modes.
    • Multimodal: many modes.
    • Rectangle/Uniform: no modes (all the scores have the same frequency).
  2. Median: indicates the data point that falls in the middle of a ranked data set (i.e., the 50th percentile, $X_{.50}$, or the second quartile, $Q_2$). Used primarily for ordinal data. Can be useful with skewed interval and ratio data.
    • Order the numbers from the smallest to largest and find the middle number. If there are two middle numbers, calculate the middle of the two.
    • Equation:
      $\displaystyle \text{median location} = \frac{N+1}{2}, \text{where N is the number of your sample}$
  3. Mean: indicates the average value for a data set. Used for interval and ratio data.
    • Equation: where x-bar is equal to the sum of each value ($X$), divided by the number of values.
      $\displaystyle \bar{X} = \frac{\sum X}{N}$
      • Summed Deviations: the summed deviations (the distance from any point to the mean) will always equal zero $\sum(X - \bar{X}) = 0$.
      • The Sum of the Squared Deviations (a.k.a., concept of least squares) gives you the unadjusted measure of dispersion: $\sum(X - \bar{X})^2$ (This is ${SS}_x$... you'll use this in the Linear Regression post.)
    • Weighted Mean: the overall mean for two or more different samples, where each sample mean is weighted by the number of scores in that sample. X-bar is equal to the sum of each sample mean ($X_i$) multiplied by its weight ($a_i$), divided by the sum of the weights.
      $\displaystyle \bar{X} = \frac{\sum_{i=1}^k a_i X_i}{\sum_{i=1}^k a_i}$
    • Mean, median and mode relationships: in single-peaked, symmetrical distributions, the mean, median and mode are all in the same place (the middle). In skewed distributions, the mode is at the peak, the median is in the middle (duhh), and the mean is pulled towards the skew (i.e., the tail).
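
Here's a small Python sketch of the three measures (plus the weighted mean); the scores and sample sizes are invented just for illustration:

```python
# Sketch of the measures of central tendency described above, with made-up data.
from statistics import mean, median, mode

scores = [2, 3, 3, 5, 6, 8, 9]

print(mode(scores))    # most frequent score -> 3
print(median(scores))  # middle score of the ranked data -> 5
print(mean(scores))    # sum of scores divided by N -> 5.142857...

# Summed deviations are always zero (up to floating-point error)
xbar = mean(scores)
print(sum(x - xbar for x in scores))  # ~0.0

# Weighted mean of two sample means (means and sample sizes are hypothetical)
means = [70, 80]     # X_i: mean of each sample
weights = [10, 30]   # a_i: number of scores in each sample
weighted_mean = sum(a * x for a, x in zip(weights, means)) / sum(weights)
print(weighted_mean)  # 77.5
```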

2. Measures of Variability (Dispersion or Spread)

There are four measures of variability:

  1. Range (R): the difference between the highest and lowest scores.
    • Issues: sensitive to extreme scores and doesn't provide much information.
  2. Interquartile Range (IQR or IR): describes the spread of the data set by removing the extremes. It is calculated as the difference between the 75th percentile ($Q_3$) and the 25th percentile ($Q_1$). $Q_2$ is the median (the 50th percentile).
    • $IQR = Q_3 - Q_1$
      $\text{Where } Q_3 = \text{the score at location } X_{.75} = (N + 1)(.75)$
      $\text{And } Q_1 = \text{the score at location } X_{.25} = (N + 1)(.25)$
    • Issue: removes the extreme scores, which may be important.
  3. Variance Estimate: because $SS_x$ is affected by $N$, we can correct for this difference by calculating the variance.
    $\displaystyle S_{x}^2 = \frac{SS_x}{N} = \frac{\sum (X - \bar{X})^2}{N}$
    • Unbiased Variance Estimate: the sample variance above (dividing by $N$) systematically underestimates the population variance. In order to obtain an unbiased estimate of the population variance, divide $SS_x$ by the degrees of freedom instead of $N$.
      $\displaystyle \hat{S}_x^2 = \frac{SS_x}{df} = \frac{SS_x}{N-1} = \frac{\sum (X - \bar{X})^2}{N-1}$
  4. Standard Deviation ($S_X$): calculate the standard deviation by taking the square root of the variance.
    $\displaystyle \text{Standard Deviation} = \sqrt{\text{Variance}}$

    $\displaystyle S_x = \sqrt{S_x^2} = \sqrt{\frac{{SS}_x}{N}} = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$
    • Deviation Score, or distance of a score from the mean, is equal to $X - \bar{X}$; the sum of these scores is always zero.
    • Sum of Squares is equal to ${SS}_x = \sum (X - \bar{X})^2$
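
A quick Python sketch of the four measures of variability (scores invented; the quartile locations use the $(N+1)(p)$ rule from above, with linear interpolation between ranks as one reasonable convention):

```python
# Sketch of range, IQR, variance (biased and unbiased), and standard deviation.
import math

scores = sorted([4, 6, 7, 7, 8, 9, 12, 15])
n = len(scores)
xbar = sum(scores) / n

# 1. Range
r = max(scores) - min(scores)

# 2. Interquartile range, using the (N + 1) * p location rule
def location_value(data, p):
    loc = (len(data) + 1) * p          # 1-indexed position of the percentile
    lo = int(math.floor(loc)) - 1
    frac = loc - math.floor(loc)
    if lo + 1 >= len(data):
        return data[-1]
    return data[lo] + frac * (data[lo + 1] - data[lo])

iqr = location_value(scores, 0.75) - location_value(scores, 0.25)

# 3. Variance: biased (divide by N) and unbiased (divide by N - 1)
ss_x = sum((x - xbar) ** 2 for x in scores)
var_biased = ss_x / n
var_unbiased = ss_x / (n - 1)

# 4. Standard deviation: square root of the variance
sd = math.sqrt(var_biased)

print(r, iqr, var_biased, var_unbiased, sd)
```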

These are the rules to follow in order to select a measure of variability:

  1. When mean is used as the measure of central tendency, standard deviation should be used as the measure of variability.
  2. When median is used as the measure of central tendency, interquartile range should be used as the measure of variability.

3. Transformations

3.1. Linear Transformations

Linear transformations use a constant, which can change the mean and standard deviation but not the shape of the distribution. There are two ways to manipulate a dataset linearly (see the sketch after this list):

  • Add/Subtract: add (or subtract) a constant, $b$, to every score.
    • The mean changes by the same constant:
      if $X + b \text{, then } \bar{X} + b$
    • The standard deviation does not change.
  • Multiply/Divide: multiply (or divide) every score by a constant, $m$.
    • The mean changes by the same factor:
      if $X \times m \text{, then } \bar{X} \times m$
    • The standard deviation changes by the same factor:
      if $X \times m \text{, then } S_x \times m$
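
A quick Python check of these rules (scores and constants invented):

```python
# Sketch: adding a constant shifts the mean but not the SD;
# multiplying by a constant scales both.
import math

def mean_sd(data):
    m = sum(data) / len(data)
    sd = math.sqrt(sum((x - m) ** 2 for x in data) / len(data))
    return m, sd

scores = [2, 4, 6, 8, 10]
b, m = 5, 3

print(mean_sd(scores))                   # (6.0, 2.828...)
print(mean_sd([x + b for x in scores]))  # mean + 5, SD unchanged
print(mean_sd([x * m for x in scores]))  # mean * 3, SD * 3
```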

3.2. Non-Linear Transformations

Non-linear transformations use a square root or a $log$, which changes the mean, standard deviation, and the shape. The scores are affected non-uniformly.

  • Useful to reduce the skew of the data, which makes the data easier to manipulate and interpret. However, some argue that it distorts the scale.

4. Standard Normal Distribution

The Standard Normal Distribution is a mesokurtic curve made up of z-scores. It has a mean of 0 and a standard deviation of 1. It allows us to calculate the proportion of area below specific sections of the curve (namely, $Q_1$ to $Q_4$).

  • We know that 50% of the scores are above and 50% of the scores are below the mean.
  • -1 to +1 standard deviations from the mean is 68.26% of the scores (34.13% above and below).
  • -2 to +2 standard deviations from the mean is 95.44% of the scores (the above values plus 13.59% above and below).
  • -3 to +3 standard deviations from the mean is 99.74% of the scores (the above values plus 2.15% above and below).
  • The rest makes up the last 0.26% (0.13% above and below).

4.1. Standard Scores (Z-Scores)

Z-Score: describes the number of standard deviations any score is above or below the mean. It is calculated by dividing the deviation score (score minus mean) by the standard deviation.
E.g., A Z-score of -1.1 means that the score is 1.1 standard deviations below the mean.

$\displaystyle \text{Sample: } z = \frac{X - \bar{X}}{S_x}$

$\displaystyle \text{Population: } z = \frac{X - \mu_x}{\sigma_x}$

Z-scores are useful for two reasons:

  • Allow you to describe the exact location of a score within a single distribution.
  • Allow you to compare z-scores in one distribution to z-scores in another distribution. It gives you a way to compare across different scales.

There are three characteristics of z-scores:

  • The mean of the z-scores will always equal zero.
  • The variance and standard deviation of the z-scores will always equal 1.
  • Converting the observed scores (X scores) to Z scores does not change the shape of the distribution.
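
A small Python sketch (invented scores) that converts raw scores to z-scores and checks the first two characteristics:

```python
# Sketch: z-scores of made-up raw scores have mean ~0 and SD ~1.
import math

scores = [3, 5, 7, 7, 8, 12]
n = len(scores)
xbar = sum(scores) / n
sd = math.sqrt(sum((x - xbar) ** 2 for x in scores) / n)

z = [(x - xbar) / sd for x in scores]

z_mean = sum(z) / n
z_sd = math.sqrt(sum((zi - z_mean) ** 2 for zi in z) / n)
print(round(z_mean, 10), round(z_sd, 10))  # ~0.0 and 1.0
```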

4.2. Percentile Rank

The percentile rank tells us the percentage of scores that fall below a certain point. The standard normal curve allows us to find the percentages because we know the percent of area that falls under the standard normal curve.

There are a few z-score tables online. The important thing to understand is what's meant by "area beyond z" and "area between mean and z". With simple rearranging, we can calculate pretty much anything by referring to the table.

  • Note 1: For any z, "area between mean and z" + "area beyond z" = .5000
  • Note 2: The area under the entire curve = 1
  • Note 3: Because the standard normal distribution curve is symmetrical, the values for the above table are the same for their negative counterparts.
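
If you don't have a printed z-table handy, the areas can be computed directly from the standard normal cumulative distribution function, $\Phi(z) = \frac{1}{2}\left(1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right)$. A minimal Python sketch:

```python
# Sketch: "area below z" for the standard normal curve, reproducing the kind
# of values a printed z-table gives.
import math

def area_below(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(area_below(0))                    # 0.5 -> 50% of scores fall below the mean
print(area_below(1) - area_below(-1))   # ~0.6827, scores within +/- 1 SD
print(1 - area_below(1.1))              # "area beyond z" for z = 1.1
print(area_below(1.1) - 0.5)            # "area between mean and z" for z = 1.1
```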

Wednesday, March 05, 2014

PSYC 2002: Math!

The wonderful world of Math! If you are confident with the material below, feel free to skip this post. I'm not great at math, so I find quick reviews helpful for priming my brain to be in math-mode.

1. Review of Basic Math

1.1 Real Limits

The precision of measurement depends on the unit of measurement, which is typically selected by the researcher.

Real Limits are the numbers that establish the upper and lower limits within which the true value is contained. To determine the real limits, divide the unit of measurement in half, then both add it to AND subtract it from $X$.

$\text{Real limits of } X = X \pm \frac{1}{2}(\text{Unit of Measurement})$

E.g., with a unit of measurement of 1, a score of $X = 5$ has real limits of 4.5 and 5.5.

1.2 Order of Operations

The order of operations must be completed in PEDMAS:

  • Parentheses;
  • Exponents, square roots;
  • Division, multiplication; and
  • Addition, subtraction.

1.3 Summary of Notation

Here's a quick summary of common notation:

  • $x$ or $y$: variables used to represent a set of observations or data.
  • $N$ = number of scores or observations in a population data set.
  • $n$ = number of people in each group.
  • $X_i$ = score on $X$ for person $i$.
  • $Y_i$ = score on $Y$ for person $i$.
  • $\sum$ (sum): add together everything that is represented by the variable.
    • E.g. $\sum X$ means "add all $X$ values together."
    • $\sum_{i=\text{here is where you start}}^{\text{here is where you end}} \text{here are the values you calculate}$

Here are the rules of summation involving constants, where $N$ is the sample size and $C$ is a constant (demonstrated in the sketch after the list).

  1. When you are calculating the sum of a constant, multiply the sample size $N$ by the constant.
    $\displaystyle \sum_{i=1}^N C = NC$
  2. If you multiply the sum of $X$ by a constant, you can remove the constant from the summation and multiply the result by the constant.
    $\displaystyle \sum_{i=1}^N CX_i = C \sum_{i=1}^N X_i$
  3. If you are calculating the sum of $X$ plus the sum of $Y$, you can calculate the sum separately and add the results together.
    $\displaystyle \sum_{i=1}^N (X_i + Y_i) = \sum_{i=1}^N X_i + \sum_{i=1}^N Y_i$
  4. However, this only works with additions and subtractions, not multiplications! For example,
    $\displaystyle \sum_{i=1}^N X_i Y_i \neq \sum_{i=1}^N X_i \sum_{i=1}^N Y_i$
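
A quick numeric check of the four rules in Python (tiny invented data sets):

```python
# Sketch: verifying the summation rules numerically.
x = [1, 2, 3, 4]
y = [5, 6, 7, 8]
C = 10
N = len(x)

# Rule 1: the sum of a constant equals N * C
print(sum(C for _ in range(N)) == N * C)                         # True

# Rule 2: a constant can be pulled out of the sum
print(sum(C * xi for xi in x) == C * sum(x))                     # True

# Rule 3: the sum distributes over addition
print(sum(xi + yi for xi, yi in zip(x, y)) == sum(x) + sum(y))   # True

# Rule 4: but NOT over multiplication
print(sum(xi * yi for xi, yi in zip(x, y)) == sum(x) * sum(y))   # False
```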

1.4 Rounding

Round at the last step only (i.e., only round your result). Round calculations to TWO more digits than the unit of measurement. Here are two rules for rounding:

  1. If the remainder is less than 5, drop the remainder. For example, $8.33333 \rightarrow 8.33$
  2. If the remainder is greater than or equal to 5, increase the last digit by 1. For example, $5.66666 \rightarrow 5.67$
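
A small Python sketch of the two rules. Note that Python's built-in round() uses round-half-to-even, so the sketch uses decimal with ROUND_HALF_UP to match the "5 or more rounds up" convention:

```python
# Sketch: rounding a result to two decimal places, half-up.
from decimal import Decimal, ROUND_HALF_UP

def round_two(value):
    return Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(round_two(8.33333))  # 8.33  (remainder below 5 is dropped)
print(round_two(5.66666))  # 5.67  (remainder of 5 or more rounds up)
```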

2. Exploratory Data Analyses

Using exploratory data analyses, we can determine the best number that represents our data, the variance (spread) of the data set, how individual values compare to the entire set, whether there exists a systematic relationship between the variables being studied, and the shape of the distributions.

  1. Skew: the relative symmetry of a distribution of scores.
  2. Kurtosis: the degree to which values cluster versus being distributed across or throughout the range of values.

We can use stem and leaf graphs to display the original values in our data set. The stem (left of the line) represents the interval whereas the leaf (right of the line) represents the last digit of the values within the interval; the frequency of each score is represented by the repeated values.

Stem | Leaf
9 | 8 5 5 3 2 1
8 | 9 6 5 5 3 1
7 | 7 6 6 5 4 1
6 | 9 8 8 6
5 | 8
4 | 6
3 | 8

We can use frequency distributions to arrange scores in order of magnitude. The distribution shows the number of times each score occurs (i.e., the frequency of the scores). It is composed of two elements: a class and a frequency.

  • Class: a grouping of values that are similar to each other.
  • Frequency: the number of times a score, $X$, occurs in a data set. $f(X)$

A frequency distribution table is created by listing all the $X$ scores from the highest to lowest and translating the frequencies into a table.

Score ($X$) | Occurrences
11 | 11 11 11 11 11 11
10 | 10 10 10 10 10 10 10 10 10 10
9 | 9 9 9 9 9 9 9 9
8 | 8 8 8 8 8 8
7 | 7 7 7 7 7
6 | 6 6 6

$\uparrow$ turns into $\downarrow$

Score ($X$) | $f(X)$
11 | 6
10 | 10
9 | 8
8 | 6
7 | 5
6 | 3

Grouped frequency distributions are used when the scores are spread out, especially when individual scores have low frequencies. For example, salary ranges are normally displayed in grouped frequency distribution tables.

The cumulative frequency is the total frequency (or total number of scores) at or below that score - that is, the total frequency of scores below the upper true limit of the class interval. The cumulative frequency distribution is where the scores are arranged in order of magnitude and the distribution shows the total frequency at or below each score.

Exam Score ($X$) | $f(X)$ | Cumulative $f(X)$
30-39  | 4   | 4
40-49  | 6   | 10
50-59  | 10  | 20
60-69  | 160 | 180
70-79  | 10  | 190
80-89  | 6   | 196
90-100 | 4   | 200

Relative Frequency (a.k.a. proportion) is where we divide the frequency of a score or class by $N$ (the total number of scores).

$\displaystyle \text{Relative Frequency} = \frac{f(X)}{N}$

The relative frequency distribution is where the scores are arranged in order of magnitude and the distribution shows the relative frequency for each score.

The cumulative relative frequency is the relative frequency of scores at or below that point. The cumulative relative frequency distribution is where the scores are arranged in order of magnitude and the distribution shows the relative frequency at or below each score.

Exam Score ($X$) | $f(X)$ | Cumulative $f(X)$ | Relative $f(X)$ | Cumulative Relative $f(X)$
30-39  | 4   | 4   | 0.02 | 0.02
40-49  | 6   | 10  | 0.03 | 0.05
50-59  | 10  | 20  | 0.05 | 0.10
60-69  | 160 | 180 | 0.80 | 0.90
70-79  | 10  | 190 | 0.05 | 0.95
80-89  | 6   | 196 | 0.03 | 0.98
90-100 | 4   | 200 | 0.02 | 1.00

Percent is where you multiply the relative frequency by 100.

$\displaystyle \text{Percent} = \frac{f(X)}{N} \times 100$

The percent distribution displays the scores arranged in order of magnitude and the distribution shows the percent for each score.

The cumulative percentage is the percentage of scores at or below that point, or the percentage of scores below the upper true limit of the class interval. The cumulative percent distribution is where the scores are arranged in order of magnitude and the distribution shows the total percent at or below each score.

Exam Score ($X$) | $f(X)$ | Cumulative $f(X)$ | Relative $f(X)$ | Cumulative Relative $f(X)$ | % ($X$) | Cumulative % ($X$)
30-39  | 4   | 4   | 0.02 | 0.02 | 2  | 2
40-49  | 6   | 10  | 0.03 | 0.05 | 3  | 5
50-59  | 10  | 20  | 0.05 | 0.10 | 5  | 10
60-69  | 160 | 180 | 0.80 | 0.90 | 80 | 90
70-79  | 10  | 190 | 0.05 | 0.95 | 5  | 95
80-89  | 6   | 196 | 0.03 | 0.98 | 3  | 98
90-100 | 4   | 200 | 0.02 | 1.00 | 2  | 100

Converting frequencies to proportions or percentages allows us to compare two or more groups when the size of the groups differs.
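
Here's a Python sketch that rebuilds the relative, cumulative, and percent columns from the exam-score frequencies in the table above:

```python
# Sketch: computing the derived columns of the exam-score frequency table.
classes = ["30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90-100"]
freqs   = [4, 6, 10, 160, 10, 6, 4]
N = sum(freqs)  # 200

cum = 0
for label, f in zip(classes, freqs):
    cum += f
    rel = f / N               # relative frequency (proportion)
    cum_rel = cum / N         # cumulative relative frequency
    pct = rel * 100           # percent
    cum_pct = cum_rel * 100   # cumulative percent
    print(label, f, cum, rel, cum_rel, pct, cum_pct)
```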

3. Graphs

There are five fundamental characteristics of a graph:

  1. Two axes are drawn at a right angle.
  2. The horizontal axis is the x-axis (abscissa) and the vertical axis is the y-axis (ordinate).
  3. The IV is plotted along the x-axis and the DV is plotted along the y-axis.
  4. The variables must be clearly labelled on both the x- and y-axes.
  5. The graph should contain all of the information needed to understand the data and nothing more.

Bar graphs have spaces between the bars because they represent discrete, qualitative (nominal) data. The categories are plotted along the x-axis.

In contrast, histograms have no spaces between them. They are used to represent ordinal, interval, or ratio data. Quantitative values are plotted along the x-axis.

  • For continuous variables, a class interval is a category of numbers with specified limits. Each number on the x-axis represents the midpoint of the class interval.
    • Height corresponds to the frequency for the category.
  • For interval and ratio data, the width of the bar extends to the real limits for the category. E.g., for a value of 3 that falls equally between 2 and 4, the real limits would be 2.5 to 3.5.

Polygons and histograms are similar. If you add a dot at the top of each category's midpoint and connect the lines, you get a polygon. Make sure to join the outer limits of the polygon line to the x-axis at zero (to indicate frequency = 0).
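
A minimal plotting sketch (matplotlib is assumed here; it's not part of the notes) showing a histogram with touching bars for quantitative scores versus a bar graph with gaps for nominal categories; the data are invented:

```python
# Sketch: histogram (continuous/quantitative data) vs. bar graph (nominal data).
import matplotlib.pyplot as plt

scores = [3, 4, 4, 5, 5, 5, 6, 6, 7, 8]          # interval/ratio data
categories, counts = ["A", "B", "C"], [5, 9, 3]  # nominal data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: bars touch; bin edges span the real limits of each score
ax1.hist(scores, bins=range(3, 10))
ax1.set_xlabel("Score")
ax1.set_ylabel("Frequency")

# Bar graph: spaces between bars mark the categories as discrete
ax2.bar(categories, counts, width=0.6)
ax2.set_xlabel("Category")
ax2.set_ylabel("Frequency")

plt.tight_layout()
plt.show()
```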

PSYC 2002: Introduction to Statistics in Psychology

1. Statistical Terminology

Statistics: the process of collecting data in a systematic way and making decisions based on probability. They allow us to make educated, mathematical decisions based on probabilities. There are two ways statistics are used:

  • Descriptive Statistics: used to describe a group or particular data sample.
  • Inferential Statistics: used to make inferences about population parameters based on sample statistics (i.e., where we generalize from a sample to a population)

Let's review some important definitions:

  • Population: the entire group of individuals you could possibly measure on a variable.
    • A parameter is a summary value that describes a population (e.g., the average score for a population).
    • Population number ($N$).
    • Standard deviation ($\sigma$) (sigma).
    • Mean ($\mu$) (mu).
    Population notation uses uppercase letters and Greek letters.
  • Sample: the group of individuals or scores that you select from a population and measure in your experiment. They have the same characteristics as the population you're interested in studying.
    • A statistic is a summary value that describes a sample (e.g., the average score for a sample). These are used to estimate population parameters but they're not perfect because of sampling errors.
    • Sample number ($n$).
    • Standard deviation ($s$).
    • Mean ($\bar{x}$) (x-bar)
    Sample notation uses lowercase Roman (Latin) letters.
  • Random Sample: a sample where each individual in the population has an equal chance of being selected.
  • Random Assignment: where each individual has an equal chance to be placed in each treatment condition.
  • Sampling Error: a naturally occurring difference between a sample statistic and its corresponding population parameter.

2. Scientific Research Methods

Research Design: the systematic process of collecting data in order to answer specific questions. There are two major classifications:

  1. Experimental: tests your hypotheses on the causal effects of the Independent Variable (IV) on the Dependent Variable (DV). The IV is actively manipulated and has at least two levels. Your participants must be randomly assigned to conditions and you need to specify your procedures for testing your hypothesis. Also, you have to control for major threats to internal validity.
  2. Non-Experimental:
    1. Descriptive Research: describes a group using numerical scores on variables.
      1. Use: often a starting point in research.
      2. Statistics: mean, median, and standard deviation.
    2. Correlation Research: where we look for relationships between two or more variables. We measure each participant on each variable, but no causal inferences can be made.
      1. Use: for testing theories and help form predictions; often as a secondary analysis.
      2. Statistics: often requires a linear relationship which is evaluated on strength and direction.
    3. Quasi-Experimental Research: where we test our hypotheses about relationships between the IV and DV but we have limited ability to actively manipulate the IV or randomly assign participants to conditions. We must still include specific procedures for testing the hypotheses and still control for the major threats to internal validity. We cannot make true causal inferences.

Variables: a characteristic or condition that changes or has different values for different individuals; any measure or characteristic that we use in research.

There are three types of experimental variables:

  1. Independent Variables (IV): what we manipulate.
    1. Experimental: where the experiment manipulates the variable.
    2. Subject: where the variable is predefined (e.g., gender, smoker versus non-smoker).
  2. Dependent Variables (DV): what we analyze or measure.
  3. Extraneous Variables (EV): any other variable (confound or not) that is not an IV or a DV.

There are two ways to describe variables:

  1. Qualitative (categorical): descriptive qualities that describe something (e.g., male/female, University attended).
    • Discontinuous (discrete): values that can only be whole numbers (no number in between). Measurements are exact (e.g., you have a whole person, not half of a person).
  2. Quantitative (numerical): quantifiable qualities that describe something (e.g., age, height)
    • Continuous (infinite): can take on any value between two numbers (infinitely many possible values).
    • Discontinuous (discrete).

In other words, qualitative variables are always discontinuous whereas quantitative variables can be either continuous or discontinuous.

3. Scales of Measurement

The level of measurement determines the types of questions we can ask, the types of analyses we can perform, and the conclusions we can draw. There are four ways to measure variables (NOIR):

  1. Nominal: categorizing variables in no particular order.
    • E.g., by gender (male/female/other), or religion (Jewish, Muslim, Hindu)
    • Statistics: chi-square; proportions, percentages, and mode.
  2. Ordinal: categorizing variables unequally along a continuum. (We can talk about magnitude.)
    • E.g., by satisfaction (1 through 7).
    • Statistics: Mann-Whitney U; proportions, percentages, mode, and median.
  3. Interval: categorizing variables equally along a continuum. There is no true zero.
    • E.g., (temperature) it can be 10°C today and 20°C tomorrow but that doesn’t mean that tomorrow will be twice as hot as today.
    • E.g., (IQ) if you have an IQ of zero, you don’t have an absence of intelligence.
    • Statistics: t-test or ANOVA; median, mean and standard deviation.
  4. Ratio: categorizing variables equally along a continuum with a true zero.
    • E.g., length, height, and reaction times can all have zero values. A 10 second reaction time is twice as fast as a 5 second reaction time.
    • Statistics: t-test or ANOVA; median, mean and standard deviation.

In Psychology, we normally use interval scales in tests and measurements.

4. Experimental Designs

4.1. Experimental Design (True Experiments)

As summarized above, you are testing your hypothesis on the causal effects of the IV on the DV. Your goal is to determine if there exists a cause-and-effect relationship.

The general process is to randomly select individuals from the population; randomly assign them to different treatment conditions; expose subjects to different levels of the IV; and compare DV differences between conditions.

We must control our confounding variables only when:

  • A variable changes systematically with the IV; or
  • A variable influences the DV.

4.2. Non-Experimental Research

4.2.1. Descriptive Research

You are trying to obtain an accurate description of a population with numerical values. You cannot draw a conclusion - only describe!

The general process is to collect data on variables of interest and use descriptive statistics to summarize the data that you collect.

4.2.2. Correlational Research

As summarized above, you are trying to determine if there exists a relationship between two or more variables. You cannot determine causality or directionality.

The general process is to randomly select individuals from the population (using something like simple random sampling); collect information on the variables of interest; and calculate a correlation coefficient ($r$).

4.2.3. Comparing Intact Groups

This is where we are trying to determine if there exists a relationship between a grouping variable and some other variable. You can find significant differences between groups but you still cannot determine causality and direction.

The general process is to randomly select individuals from the population; place individuals into a group based on a person-variable (e.g., smoker versus non-smoker); and compare the group differences. (Think: non-equivalent group, pre-post group, and developmental designs.)