🔗 Correlation
Pearson correlation + scatter plot + regression
What Is Correlation Analysis?
Correlation measures the strength and direction of the linear relationship between two continuous variables. It answers: "When variable X increases, does variable Y tend to increase, decrease, or stay the same?" The correlation coefficient, denoted r, ranges from −1 to +1.
- r = +1: Perfect positive linear relationship (both variables increase together)
- r = 0: No linear relationship
- r = −1: Perfect negative linear relationship (one increases as the other decreases)
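The definition above can be sketched directly from the formula: r is the covariance of X and Y divided by the product of their standard deviations. A minimal pure-Python version (in practice you would use a library routine such as `scipy.stats.pearsonr`, which also returns a p-value; the data here is made up for illustration):

```python
import math

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the
    product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])   # perfectly linear: r = 1.0
pearson_r([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])    # perfectly reversed: r = -1.0
```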
Correlation is one of the most intuitive statistical concepts, but it is also one of the most frequently misinterpreted. Understanding its nuances is essential for responsible data analysis.
Pearson vs. Spearman: Choosing the Right Coefficient
Pearson's r (Parametric)
Pearson's correlation coefficient measures the strength of the linear relationship between two continuous variables. Its significance test assumes approximate bivariate normality, and the coefficient itself assumes the relationship is linear and free of extreme outliers. It is the most commonly reported correlation in scientific papers.
Pearson's r is sensitive to outliers. A single extreme data point can dramatically inflate or deflate the correlation. Always visualize your data with a scatter plot before interpreting r.
Spearman's rho (Non-parametric)
Spearman's rank correlation converts your data to ranks before computing the correlation. This makes it robust to outliers and appropriate for ordinal data or non-linear monotonic relationships. If one variable consistently increases as the other increases (even non-linearly), Spearman's rho will capture this.
Use Spearman when: your data is ordinal (e.g., survey ratings), the relationship is monotonic but not linear, or your data contains outliers.
| Feature | Pearson's r | Spearman's rho |
|---|---|---|
| Data type | Continuous, normal | Ordinal or continuous |
| Relationship type | Linear | Monotonic |
| Outlier sensitivity | High | Low |
| Power (when assumptions met) | Higher | Slightly lower |
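The "monotonic, not linear" distinction is easy to demonstrate. For data without tied values, Spearman's rho reduces to the classic formula 1 − 6Σd²/(n(n² − 1)), where d is the difference in ranks. A sketch under that no-ties assumption (library routines such as `scipy.stats.spearmanr` handle ties via average ranks):

```python
def spearman_rho(x, y):
    """Spearman's rho for data WITHOUT ties:
    rank both variables, then 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    rx = {v: i + 1 for i, v in enumerate(sorted(x))}  # value -> rank (no ties)
    ry = {v: i + 1 for i, v in enumerate(sorted(y))}
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# y = x^3 is non-linear but perfectly monotonic: rho = 1.0,
# while Pearson's r on the same data would be below 1.
spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])
```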
Interpreting Correlation Strength
While interpretation depends on context and field, a commonly used guideline (adapted from Cohen, 1988) is:
| \|r\| | Interpretation |
|---|---|
| 0.00 – 0.10 | Negligible |
| 0.10 – 0.30 | Small (weak) |
| 0.30 – 0.50 | Medium (moderate) |
| 0.50 – 0.70 | Large (strong) |
| 0.70 – 1.00 | Very large (very strong) |
Note: In some fields (e.g., psychometrics), r = 0.3 is considered meaningful, while in physics or engineering, r = 0.9 might be considered barely adequate. Always consider domain norms.
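Guidelines like the table above are simple threshold lookups, which makes them easy to automate. A small hypothetical helper (the cutoffs mirror the table; adjust them to your field's norms):

```python
def interpret_r(r):
    """Map |r| onto the Cohen-style labels from the table above."""
    a = abs(r)
    for cutoff, label in [(0.10, "negligible"), (0.30, "small"),
                          (0.50, "medium"), (0.70, "large")]:
        if a < cutoff:
            return label
    return "very large"

interpret_r(-0.05)  # "negligible"
interpret_r(0.62)   # "large"
```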
Correlation Does Not Imply Causation
This is perhaps the single most important principle in statistics. A strong correlation between X and Y can arise from at least four scenarios:
- X causes Y. (The relationship you hope for.)
- Y causes X. (Reverse causation.)
- A third variable Z causes both X and Y. (Confounding.)
- Pure coincidence. (Spurious correlation.)
The famous example: ice cream sales and drowning deaths are positively correlated — not because ice cream causes drowning, but because both increase during hot summer months (confounding variable: temperature). Establishing causation requires controlled experiments, not just correlation.
The Coefficient of Determination: r²
Squaring the correlation coefficient gives you r², which represents the proportion of variance in Y that is explained by X. If r = 0.7, then r² = 0.49, meaning X explains about 49% of the variability in Y. The remaining 51% is due to other factors. This is a more intuitive measure of practical significance than r alone.
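The arithmetic is a one-liner, but expressing it explicitly makes reports easier to read:

```python
def variance_explained(r):
    """Coefficient of determination: the share of Y's variance
    accounted for by its linear relationship with X."""
    return r ** 2

print(f"{variance_explained(0.7):.0%} explained, "
      f"{1 - variance_explained(0.7):.0%} due to other factors")
```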
Common Pitfalls
- Restricted range. If you only sample a narrow range of X values, the correlation will be artificially low. For example, correlating IQ and academic performance only among gifted students will underestimate the true relationship.
- Non-linear relationships. Pearson's r only captures linear associations. A perfect U-shaped relationship can yield r = 0. Always check the scatter plot.
- Outliers. A single outlier can create or destroy an apparent correlation. Robust methods or Spearman's rho help mitigate this.
- Multiple testing. Computing correlations among 10 variables produces 45 pairwise comparisons. Without correction, you expect about 2.25 false positives at alpha = 0.05.
Frequently Asked Questions
What sample size do I need for reliable correlation?
As a rule of thumb, you need at least 20–30 paired observations for a meaningful correlation estimate. For detecting a medium correlation (r = 0.3) with 80% power at alpha = 0.05, you need approximately 85 observations. Use a power analysis calculator for precise planning.
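The "approximately 85" figure comes from the standard approximation based on Fisher's z transform, n = ((z_α + z_β) / atanh(r))² + 3. A sketch using only the standard library (`statistics.NormalDist` needs Python 3.8+); dedicated power-analysis tools give essentially the same answer:

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size to detect a true correlation r with a
    two-sided test, via Fisher's z: n = ((z_a + z_b)/atanh(r))^2 + 3."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_b = NormalDist().inv_cdf(power)           # quantile for desired power
    return math.ceil(((z_a + z_b) / math.atanh(r)) ** 2 + 3)

n_for_correlation(0.3)   # medium effect, 80% power, alpha = .05 -> 85
```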
Can I correlate a continuous variable with a binary variable?
Yes, this is called the point-biserial correlation, which is mathematically equivalent to Pearson's r when one variable is dichotomous (0/1). It is commonly used to correlate test item scores (correct/incorrect) with total test scores.
How should I report correlation results?
APA style: "There was a significant positive correlation between study hours and exam scores, r(48) = .62, p < .001." Include the sample size (as degrees of freedom, n−2), the r value, and the p-value. A scatter plot is highly recommended.
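Formatting conventions like dropping the leading zero are fiddly to get right by hand, so they are a natural candidate for a tiny helper. This is a hypothetical function, not part of any library:

```python
def apa_correlation(r, n, p):
    """Format a correlation APA-style: r(df) = .xx, p ...; df = n - 2.
    APA drops the leading zero because |r| cannot exceed 1."""
    r_txt = f"{r:.2f}".replace("0.", ".")
    p_txt = "p < .001" if p < 0.001 else f"p = {p:.3f}".replace("0.", ".")
    return f"r({n - 2}) = {r_txt}, {p_txt}"

apa_correlation(0.62, 50, 0.0004)   # "r(48) = .62, p < .001"
```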
My correlation is significant but small (r = 0.15). Is it meaningful?
Statistical significance and practical significance are different things. With a large sample (n > 500), even tiny correlations become "significant." Consider r²: at r = 0.15, only 2.25% of variance is shared. Whether this matters depends on your research context.
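The sample-size effect can be checked directly with the test statistic t = r·√(n−2)/√(1−r²). The sketch below uses a normal approximation to the t distribution (acceptable for large n, since the standard library has no t CDF); scipy's `pearsonr` computes the exact p-value:

```python
import math
from statistics import NormalDist

def p_value_for_r(r, n):
    """Two-sided p for H0: rho = 0, using t = r*sqrt(n-2)/sqrt(1-r^2)
    with a normal approximation to the t distribution (fine for large n)."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return 2 * (1 - NormalDist().cdf(abs(t)))

p_value_for_r(0.15, 500)   # well below .01: "significant", yet r^2 is 2.25%
p_value_for_r(0.15, 30)    # the same r is nowhere near significant at n = 30
```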
This tool is free forever. If it saved you time, consider buying me a coffee.
☕ Buy me a coffee