
Thursday, March 06, 2014

PSYC 2002: Descriptive Statistics

1. Measures of Central Tendency

Measures of central tendency describe the most typical value in the distribution. There are three measures:

  1. Mode: indicates the most frequently occurring category or the category that contains the most data values. Used primarily for nominal data but can be used with others.
    • Unimodal: 1 mode.
    • Bimodal: 2 modes.
    • Multimodal: many modes.
    • Rectangle/Uniform: no modes (all the scores have the same frequency).
  2. Median: indicates the data point that falls in the middle of a ranked data set (i.e., the 50th percentile, $X_{.50}$, or the second quartile, $Q_2$). Used primarily for ordinal data. Can also be useful with skewed interval and ratio data.
    • Order the numbers from smallest to largest and find the middle number. If there are two middle numbers, take the average of the two.
    • Equation:
      $\displaystyle \text{median location} = \frac{N+1}{2}\text{, where } N \text{ is the number of scores in the sample}$
  3. Mean: indicates the average value for a data set. Used for interval and ratio data.
    • Equation: $\bar{X}$ (x-bar) is equal to the sum of all values ($X$) divided by the number of values ($N$).
      $\displaystyle \bar{X} = \frac{\sum X}{N}$
      • Summed Deviations: the summed deviations (the distance from any point to the mean) will always equal zero $\sum(X - \bar{X}) = 0$.
      • The Sum of the Squared Deviations (a.k.a. the concept of least squares) gives you the unadjusted measure of dispersion: $\sum(X - \bar{X})^2$ (This is ${SS}_x$... you'll use this in the Linear Regression post.)
    • Weighted Mean: the overall mean calculated for two or more samples combined. Each sample mean is weighted, typically by the number of scores in that sample, so larger samples count more. $\bar{X}$ is equal to the sum of each sample mean ($X_i$) multiplied by its weight ($a_i$), divided by the sum of the weights.
      $\displaystyle \bar{X} = \frac{\sum_{i=1}^k a_i X_i}{\sum_{i=1}^k a_i}$
    • Mean, median and mode relationships: in single-peaked, symmetrical distributions, the mean, median and mode are all in the same place (the middle). In skewed distributions, the mode is at the peak, the median is in the middle (duhh), and the mean is pulled towards the skew (i.e., the tail).
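
A minimal sketch of the three measures in Python, using the built-in statistics module and made-up scores:

```python
import statistics

scores = [2, 3, 3, 4, 5, 6, 6, 6, 9]

print(statistics.mode(scores))    # 6 -- the most frequent value
print(statistics.median(scores))  # 5 -- the middle value of the ranked data
print(statistics.mean(scores))    # ≈ 4.89 -- sum of the scores divided by N

# The summed deviations from the mean are (up to rounding) always zero.
mean = statistics.mean(scores)
print(sum(x - mean for x in scores))  # ~0.0
```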

2. Measures of Variability (Dispersion or Spread)

There are four measures of variability:

  1. Range (R): the difference between the highest and lowest scores.
    • Issues: sensitive to extreme scores and doesn't provide much information.
  2. Interquartile Range (IQR or IR): describes the spread of the middle 50% of the data set, ignoring the extreme scores. It is calculated as the difference between the 75th percentile ($Q_3$) and the 25th percentile ($Q_1$). $Q_2$ is the median (the 50th percentile).
    • $IQR = Q_3 - Q_1$
      $\text{where } Q_3 = \text{the value of } X_{.75}\text{, at location } (N + 1)(.75)$
      $\text{and } Q_1 = \text{the value of } X_{.25}\text{, at location } (N + 1)(.25)$
    • Issue: removes the extreme scores, which may be important.
  3. Variance Estimate: because $SS_x$ is affected by $N$, we can correct for this difference by calculating the variance.
    $\displaystyle S_{x}^2 = \frac{SS_x}{N} = \frac{\sum (X - \bar{X})^2}{N}$
    • Unbiased Variance Estimate: the sample variance above systematically underestimates the population variance. To obtain an unbiased estimate of the population variance, divide $SS_x$ by the degrees of freedom ($N - 1$) instead of $N$.
      $\displaystyle \hat{S}_x^2 = \frac{SS_x}{df} = \frac{SS_x}{N-1} = \frac{\sum (X - \bar{X})^2}{N-1}$
  4. Standard Deviation ($S_X$): calculate the standard deviation by taking the square root of the variance.
    $\displaystyle \text{Standard Deviation} = \sqrt{\text{Variance}}$

    $\displaystyle S_x = \sqrt{S_x^2} = \sqrt{\frac{{SS}_x}{N}} = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$
    • Deviation Score, or distance of a score from the mean, is equal to $X - \bar{X}$; the sum of these scores is always zero.
    • Sum of Squares is equal to ${SS}_x = \sum (X - \bar{X})^2$
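
A minimal numpy sketch of the four measures, using the same made-up scores as above; note that numpy's percentile method interpolates, so its quartiles can differ slightly from the $(N + 1)$ rule given earlier:

```python
import numpy as np

scores = np.array([2, 3, 3, 4, 5, 6, 6, 6, 9], dtype=float)
n = len(scores)

data_range = scores.max() - scores.min()      # R = highest - lowest
q1, q3 = np.percentile(scores, [25, 75])      # 25th and 75th percentiles
iqr = q3 - q1                                 # IQR = Q3 - Q1

ss_x = np.sum((scores - scores.mean()) ** 2)  # SS_x, the sum of squared deviations
variance = ss_x / n                           # S_x^2: divide by N
unbiased = ss_x / (n - 1)                     # unbiased estimate: divide by df = N - 1
sd = np.sqrt(variance)                        # standard deviation = sqrt(variance)

print(data_range, iqr, variance, unbiased, sd)
```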

These are the rules to follow in order to select a measure of variability:

  1. When mean is used as the measure of central tendency, standard deviation should be used as the measure of variability.
  2. When median is used as the measure of central tendency, interquartile range should be used as the measure of variability.

3. Transformations

3.1. Linear Transformations

Linear transformations add, subtract, multiply, or divide every score by a constant. They can change the mean and the standard deviation, but not the shape of the distribution. There are two ways to manipulate a dataset linearly (see the sketch after this list):

  • Add/Subtract: add (or subtract) a constant, $b$, to all the scores.
    • The mean changes by the same constant:
      if $X + b \text{, then } \bar{X} + b$
    • The standard deviation does not change.
  • Multiply/Divide: multiply (or divide) all the scores by a constant, $m$.
    • The mean changes by the same factor:
      if $X \times m \text{, then } \bar{X} \times m$
    • The standard deviation changes by the same factor:
      if $X \times m \text{, then } S_x \times m$
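
A minimal numpy sketch (made-up scores, arbitrary constants $b$ and $m$) showing both effects:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
b, m = 10, 3  # arbitrary constants chosen for illustration

print(x.mean(), x.std())              # 5.0   ~2.236
print((x + b).mean(), (x + b).std())  # mean shifts by b (15.0); SD is unchanged
print((x * m).mean(), (x * m).std())  # mean (15.0) and SD (~6.708) are both multiplied by m
```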

3.2. Non-Linear Transformations

Non-linear transformations use a square root or a $log$, which changes the mean, standard deviation, and the shape. The scores are affected non-uniformly.

  • Useful for reducing the skew of the data, which makes the data easier to manipulate and interpret. However, some argue that it distorts the scale.
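
A minimal numpy sketch of a log transform applied to a positively skewed, made-up data set; before the transform the mean is pulled toward the tail, and afterward the mean and median sit much closer together:

```python
import numpy as np

skewed = np.array([1, 1, 2, 2, 3, 4, 8, 20, 55], dtype=float)
logged = np.log(skewed)  # natural log; a square root would also reduce the skew

print(np.mean(skewed), np.median(skewed))  # ~10.67 vs 3.0  -- mean pulled toward the tail
print(np.mean(logged), np.median(logged))  # ~1.44 vs ~1.10 -- much closer together
```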

4. Standard Normal Distribution

The Standard Normal Distribution is a mesokurtic curve made up of z-scores. It has a mean of 0 and a standard deviation of 1. It allows us to calculate the proportion of area below specific sections of the curve (namely, $Q_1$ to $Q_4$).

  • We know that 50% of the scores are above and 50% of the scores are below the mean.
  • -1 to +1 standard deviations from the mean is 68.26% of the scores (34.13% above and below).
  • -2 to +2 standard deviations from the mean is 95.44% of the scores (the above values plus 13.59% above and below).
  • -3 to +3 standard deviations from the mean is 99.74% of the scores (the above values plus 2.15% above and below).
  • The rest makes up the last 0.26% (0.13% above and below).
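
These areas can be checked against the standard normal CDF; a minimal scipy sketch (the last decimal differs slightly from the rounded figures above):

```python
from scipy.stats import norm  # standard normal: mean 0, SD 1

for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)  # proportion of scores within ±k SDs of the mean
    print(k, round(area * 100, 2))     # 68.27, 95.45, 99.73
```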

4.1. Standard Scores (Z-Scores)

Z-Score: describes the number of standard deviations any score is above or below the mean. It is calculated by dividing the deviation score (score minus mean) by the standard deviation.
E.g., A Z-score of -1.1 means that the score is 1.1 standard deviations below the mean.

$\displaystyle \text{Sample: } z = \frac{X - \bar{X}}{S_x}$

$\displaystyle \text{Population: } z = \frac{X - \mu_x}{\sigma_x}$

Z-scores are useful for two reasons:

  • Allow you to describe the exact location of a score within a single distribution.
  • Allow you to compare z-scores in one distribution to z-scores in another distribution. It gives you a way to compare across different scales.

There are three characteristics of z-scores:

  • The mean z-scores will always equal zero.
  • The variance and standard deviation of z-scores will always equal 1.
  • Converting the observed scores (X scores) to Z scores does not change the shape of the distribution.
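
A minimal numpy sketch of the sample formula (made-up scores), which also illustrates the three characteristics above:

```python
import numpy as np

x = np.array([2, 3, 3, 4, 5, 6, 6, 6, 9], dtype=float)
z = (x - x.mean()) / x.std()  # deviation score divided by the standard deviation

print(z.round(2))           # each score's distance from the mean in SD units
print(round(z.mean(), 10))  # 0.0 -- the mean of z-scores is always zero
print(round(z.std(), 10))   # 1.0 -- the SD (and variance) of z-scores is always one
```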

4.2. Percentile Rank

The percentile rank tells us the percentage of scores that fall below a certain point. The standard normal curve allows us to find the percentages because we know the percent of area that falls under the standard normal curve.

There are a few z-score tables online. The important thing to understand is what's meant by "area beyond z" and "area between mean and z". With simple rearranging, we can calculate pretty much anything by referring to the graph.

  • Note 1: Column 3 + Column 5 = .5000
  • Note 2: The area under the entire curve = 1
  • Note 3: Because the standard normal distribution curve is symmetrical, the values for the above table are the same for their negative counterparts.
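
A minimal scipy sketch of looking these areas up without a printed table, using the z-score of -1.1 from the example above:

```python
from scipy.stats import norm

z = -1.1
percentile_rank = norm.cdf(z) * 100      # percentage of scores below z -> ~13.57
area_beyond_z = 1 - norm.cdf(abs(z))     # "area beyond z" for z = 1.1
area_mean_to_z = norm.cdf(abs(z)) - 0.5  # "area between mean and z"

print(round(percentile_rank, 2))
print(round(area_beyond_z + area_mean_to_z, 4))  # 0.5, as in Note 1
```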
