🔗 Correlation

Pearson correlation + scatter plot + regression

What Is Correlation Analysis?

Correlation measures the strength and direction of the linear relationship between two continuous variables. It answers: "When variable X increases, does variable Y tend to increase, decrease, or stay the same?" The correlation coefficient, denoted r, ranges from −1 to +1.

Correlation is one of the most intuitive statistical concepts, but it is also one of the most frequently misinterpreted. Understanding its nuances is essential for responsible data analysis.

Pearson vs. Spearman: Choosing the Right Coefficient

Pearson's r (Parametric)

Pearson's correlation coefficient measures the strength of the linear relationship between two continuous variables. Its standard significance test assumes that both variables are approximately normally distributed, and the coefficient is only a good summary when the relationship is linear and there are no extreme outliers. It is the most commonly reported correlation in scientific papers.

Pearson's r is sensitive to outliers. A single extreme data point can dramatically inflate or deflate the correlation. Always visualize your data with a scatter plot before interpreting r.
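To make the outlier warning concrete, here is a minimal sketch using only the Python standard library. The data values and the helper name `pearson_r` are made up for illustration; the function simply applies the textbook definition of r.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative data with essentially no linear trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5, 3, 6, 4, 5, 6, 4, 5, 3, 6]
r_without = pearson_r(x, y)

# The same data plus one extreme point far away from the rest.
r_with = pearson_r(x + [50], y + [40])

print(f"r without outlier: {r_without:.2f}")
print(f"r with outlier:    {r_with:.2f}")
```

In this toy data set the single added point moves r from close to zero to close to one, which is the kind of distortion a scatter plot would reveal immediately.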

Spearman's rho (Non-parametric)

Spearman's rank correlation converts your data to ranks before computing the correlation. This makes it much less sensitive to outliers and appropriate for ordinal data or for monotonic relationships that are not linear. If one variable consistently increases as the other increases (even non-linearly), Spearman's rho will capture this.

Use Spearman when: your data is ordinal (e.g., survey ratings), the relationship is monotonic but not linear, or your data contains outliers.
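The difference between the two coefficients can be sketched on a relationship that is perfectly monotonic but far from linear. The helpers below (`pearson_r`, `average_ranks`, `spearman_rho`) are illustrative names, not part of any library; Spearman's rho is computed here as Pearson's r on average ranks, which is the usual definition when ties are possible.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# y grows exponentially with x: monotonic, but strongly non-linear.
x = list(range(1, 11))
y = [math.exp(v) for v in x]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r:    {r:.2f}")
print(f"Spearman rho: {rho:.2f}")
```

Because the ordering of y matches the ordering of x exactly, rho is 1, while Pearson's r is noticeably lower since the points do not lie on a straight line.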

Feature                        Pearson's r          Spearman's rho
Data type                      Continuous, normal   Ordinal or continuous
Relationship type              Linear               Monotonic
Outlier sensitivity            High                 Low
Power (when assumptions met)   Higher               Slightly lower

Interpreting Correlation Strength

Interpretation depends on context and field, but a commonly used guideline (adapted from Cohen, 1988) is:

|r|            Interpretation
0.00 – 0.10    Negligible
0.10 – 0.30    Small (weak)
0.30 – 0.50    Medium (moderate)
0.50 – 0.70    Large (strong)
0.70 – 1.00    Very large (very strong)

Note: In some fields (e.g., psychometrics), r = 0.3 is considered meaningful, while in physics or engineering, r = 0.9 might be considered barely adequate. Always consider domain norms.
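The guideline can be expressed as a small lookup, sketched below. The function name `interpret_r` is hypothetical, and because the published ranges share their endpoints, this sketch assumes each boundary value belongs to the higher category (so |r| = 0.30 counts as medium). The labels remain a rule of thumb, not a substitute for domain norms.

```python
def interpret_r(r):
    """Map a correlation coefficient to the guideline's verbal label.

    Assumption: a value exactly on a boundary (e.g. 0.30) is assigned
    to the higher category.
    """
    size = abs(r)  # strength ignores the sign; direction is reported separately
    if size > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if size < 0.10:
        return "Negligible"
    if size < 0.30:
        return "Small (weak)"
    if size < 0.50:
        return "Medium (moderate)"
    if size < 0.70:
        return "Large (strong)"
    return "Very large (very strong)"

for value in (0.05, -0.15, 0.30, 0.62, -0.85):
    direction = "positive" if value > 0 else "negative"
    print(f"r = {value:+.2f}: {interpret_r(value)}, {direction}")
```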

Correlation Does Not Imply Causation

This is perhaps the single most important principle in statistics. A strong correlation between X and Y can arise from at least four scenarios:

  1. X causes Y. (The relationship you hope for.)
  2. Y causes X. (Reverse causation.)
  3. A third variable Z causes both X and Y. (Confounding.)
  4. Pure coincidence. (Spurious correlation.)

The famous example: ice cream sales and drowning deaths are positively correlated, not because ice cream causes drowning, but because both increase during hot summer months (confounding variable: temperature). Establishing causation generally requires a controlled experiment or another design that rules out the alternative explanations above; correlation alone is not enough.
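The confounding scenario can be simulated. In the sketch below, all numbers are invented: temperature drives both simulated ice cream sales and simulated drownings, and the two outcomes never influence each other. As one possible way to show the confounder's role, the code also computes a partial correlation that holds temperature constant; that statistic is an addition for illustration, not something the guideline above requires.

```python
import math
import random

def pearson_r(x, y):
    """Pearson's r computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the simulation is repeatable
n = 2000

# Simulated data: temperature is the only common cause (made-up parameters).
temperature = [rng.gauss(25, 5) for _ in range(n)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 2) for t in temperature]

r_xy = pearson_r(ice_cream, drownings)
r_xz = pearson_r(ice_cream, temperature)
r_yz = pearson_r(drownings, temperature)

# First-order partial correlation of X and Y, controlling for Z.
partial = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"ice cream vs drownings:        r = {r_xy:.2f}")
print(f"same, controlling temperature: r = {partial:.2f}")
```

The raw correlation is substantial even though neither variable affects the other, and it largely disappears once temperature is accounted for.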

The Coefficient of Determination: r²

Squaring the correlation coefficient gives you r², which represents the proportion of variance in Y that is explained by its linear relationship with X. If r = 0.7, then r² = 0.49, meaning X explains about 49% of the variability in Y. The remaining 51% is due to other factors. This is often a more intuitive measure of practical significance than r alone.
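The "proportion of variance explained" reading can be checked numerically. The sketch below, using made-up data, fits a least-squares line and confirms that r squared equals 1 minus the ratio of residual to total sum of squares. This identity holds for simple linear regression with one predictor and an intercept.

```python
import math

# Illustrative data (not from any real study).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)  # total sum of squares of y

# Pearson's r from its definition.
r = sxy / math.sqrt(sxx * syy)

# Least-squares regression line y = intercept + slope * x.
slope = sxy / sxx
intercept = my - slope * mx

# Variance left unexplained by the line.
ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
r_squared_from_fit = 1 - ss_res / syy

print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
print(f"1 - SS_res / SS_tot = {r_squared_from_fit:.3f}")
```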

Common Pitfalls

Frequently Asked Questions

What sample size do I need for reliable correlation?

As a rule of thumb, you need at least 20–30 paired observations for a meaningful correlation estimate. To detect a medium correlation (r = 0.3) with 80% power at a two-sided alpha of 0.05, you need approximately 85 observations. Use a power analysis calculator for precise planning.
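The figure of roughly 85 observations can be reproduced with the Fisher z approximation, sketched below with the standard library. The function name `n_for_correlation` is hypothetical, and the formula is an approximation for a two-sided test; dedicated power-analysis software may give slightly different numbers.

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a correlation of size r.

    Uses the Fisher z approximation for a two-sided test:
        n = ((z_{1 - alpha/2} + z_{power}) / atanh(r))^2 + 3
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

for effect in (0.1, 0.3, 0.5):
    print(f"r = {effect}: about {n_for_correlation(effect)} observations")
```

Smaller correlations require far larger samples, because the required n grows roughly with the inverse square of the effect size.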

Can I correlate a continuous variable with a binary variable?

Yes, this is called the point-biserial correlation, which is mathematically equivalent to Pearson's r when one variable is dichotomous (0/1). It is commonly used to correlate test item scores (correct/incorrect) with total test scores.
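The equivalence can be verified on a small example. In this sketch the item results and total scores are invented; the point-biserial value is computed from its own formula (difference in group means, scaled by the population standard deviation and the group proportions) and compared with Pearson's r on the same data.

```python
import math
from statistics import mean, pstdev

# Illustrative data: 1 = item answered correctly, 0 = incorrectly.
item_correct = [0, 0, 0, 1, 1, 1, 1, 0]
total_score = [52, 48, 60, 71, 66, 80, 75, 55]

# Point-biserial formula: (M1 - M0) / s * sqrt(p * q),
# where s is the population standard deviation of all scores.
group1 = [s for s, c in zip(total_score, item_correct) if c == 1]
group0 = [s for s, c in zip(total_score, item_correct) if c == 0]
p = len(group1) / len(total_score)
q = 1 - p
r_pb = (mean(group1) - mean(group0)) / pstdev(total_score) * math.sqrt(p * q)

# Pearson's r on the same data, treating the 0/1 variable as numeric.
mx, my = mean(item_correct), mean(total_score)
sxy = sum((a - mx) * (b - my) for a, b in zip(item_correct, total_score))
sxx = sum((a - mx) ** 2 for a in item_correct)
syy = sum((b - my) ** 2 for b in total_score)
r_pearson = sxy / math.sqrt(sxx * syy)

print(f"point-biserial: {r_pb:.4f}")
print(f"Pearson's r:    {r_pearson:.4f}")
```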

How should I report correlation results?

APA style: "There was a significant positive correlation between study hours and exam scores, r(48) = .62, p < .001." Report the degrees of freedom in parentheses (n − 2, so 48 corresponds to n = 50), the r value, and the p-value. A scatter plot is highly recommended.
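A small formatter along these lines can help keep reports consistent. This is only a sketch: the function name `apa_correlation` is hypothetical, the p-value is assumed to come from whatever test you ran, and the formatting choices (no leading zero, two decimals for r, three for p) follow the example above.

```python
def apa_correlation(r, n, p):
    """Format a correlation as 'r(df) = .xx, p = .xxx' with df = n - 2."""
    # Values bounded by 1 are written without a leading zero.
    r_text = f"{r:.2f}".replace("0.", ".", 1)
    if p < 0.001:
        p_text = "p < .001"
    else:
        p_text = "p = " + f"{p:.3f}".replace("0.", ".", 1)
    return f"r({n - 2}) = {r_text}, {p_text}"

print(apa_correlation(0.62, 50, 0.00001))
print(apa_correlation(-0.21, 120, 0.021))
```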

My correlation is significant but small (r = 0.15). Is it meaningful?

Statistical significance and practical significance are different things. With a large sample (n > 500), even tiny correlations become "significant." Consider r²: at r = 0.15, only 2.25% of variance is shared. Whether this matters depends on your research context.
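The effect of sample size on significance can be sketched as follows. The p-values here use the Fisher z normal approximation rather than the exact t-test, so they are approximate; the function name `approx_p_value` is made up for this example.

```python
import math
from statistics import NormalDist

def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: rho = 0 (Fisher z method)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

r = 0.15
print(f"shared variance: {r ** 2:.2%}")
for n in (50, 200, 500, 2000):
    print(f"n = {n:>4}: approximate p = {approx_p_value(r, n):.4f}")
```

The correlation and its shared variance stay the same in every row; only the sample size changes, and with it whether the result crosses the conventional 0.05 threshold.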
