A simple plot for correlation between response and explanatory variables
Introduction
In this post, I'll explain very briefly how to build a bar correlation plot
using ggplot2 and dplyr. Such graphs are useful for examining the relationship
between an outcome of interest and explanatory variables, which is an important
task that precedes a regression analysis. It is important to remember that the
Pearson correlation measures only linear association. Consequently, a low linear
correlation doesn't necessarily imply that two variables are unrelated, because
the relationship can follow a non-linear pattern. Another remark worth keeping
in mind is the famous quote "Correlation is not causation!".
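To make the point about non-linear patterns concrete, here is a minimal sketch with made-up data (the vectors x and y are hypothetical and not part of the post's dataset). The variable y is completely determined by x, yet the Pearson correlation over the whole range is essentially zero.

```r
# Hypothetical data: y depends on x exactly, but not linearly.
x <- seq(-1, 1, length.out = 101)
y <- x^2

# Over the symmetric range the linear correlation is essentially zero,
# even though y is a deterministic function of x.
cor(x, y)

# On the half where the relationship is monotone, the correlation is high.
cor(x[x >= 0], y[x >= 0])
```

A scatter plot of the two variables is usually the quickest way to spot this kind of pattern.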
The data used is the well-known mtcars dataset.
Straight to the point
Let's say that we are interested in verifying which variables are highly
correlated with mpg. To do so, use the following piece of code to compute the
correlation matrix for the data, select just the first row of the matrix (the
one corresponding to the variable of interest), and then drop the first element
of the resulting vector, which is the correlation of mpg with itself.
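library(dplyr)     # tibble(), mutate() and %>%
library(ggplot2)   # plotting
library(magrittr)  # %<>% (assignment pipe), used later on

# First row of the correlation matrix (mpg), without mpg's correlation with itself
correlation <- cor(mtcars)[1, ][-1]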
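The positional indexing above works because mpg is the first column of mtcars. As a sketch of a slightly more defensive variant, the same vector can be selected by name; the object names cor_matrix and correlation_by_name are introduced here only for illustration.

```r
# Select the response row by name instead of by position.
cor_matrix <- cor(mtcars)
correlation_by_name <- cor_matrix["mpg", colnames(cor_matrix) != "mpg"]

# Same result as the positional version used in the post.
all.equal(correlation_by_name, cor(mtcars)[1, ][-1])

# The covariates that remain after dropping the response.
names(correlation_by_name)
```

Selecting by name keeps working if the columns are reordered or if a different response variable is chosen.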
As a matter of fact, the Pearson correlation is not adequate for measuring the association between variables that are not continuous. However, this is just a toy example to show how to create a specific plot.
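If the discrete columns are a concern, one possible alternative (not used in the rest of this post) is a rank-based coefficient such as Spearman's, which base R's cor supports through its method argument. The sketch below only puts the two coefficients side by side; whether Spearman is appropriate for a given analysis is a separate question.

```r
# Pearson (as in the post) and Spearman (rank-based) correlations with mpg.
pearson  <- cor(mtcars)["mpg", -1]
spearman <- cor(mtcars, method = "spearman")["mpg", -1]

# Side-by-side comparison, rounded for readability.
round(cbind(pearson, spearman), 2)
```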
Now that the correlations have been computed, store them in a tibble
(this is not necessary, but it makes the code more organized). Note that the
levels of the Covariates variable are reordered by the correlation. As a
consequence, when the data is plotted, the bars will be ordered by their
correlation coefficient with the response variable.
df <- tibble(Covariates = names(correlation),
             Correlation = correlation) %>%
  # Factor levels follow decreasing correlation, which fixes the bar order
  mutate(Covariates = factor(Covariates,
                             levels = Covariates[order(.$Correlation,
                                                       decreasing = TRUE)]))
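To check that the factor levels really encode the intended bar order, the levels can be inspected directly. The sketch below uses base R only and also shows that stats::reorder gives the same level order; the names by_order and by_reorder are introduced here for illustration.

```r
correlation <- cor(mtcars)[1, ][-1]

# Level order built explicitly with order(), as in the post.
by_order <- factor(names(correlation),
                   levels = names(correlation)[order(correlation, decreasing = TRUE)])

# Level order built with reorder(); negating gives decreasing correlation.
by_reorder <- reorder(names(correlation), -correlation)

# Both approaches agree (there are no tied correlations in this data).
identical(levels(by_order), levels(by_reorder))

# The bars will appear in this order, from left to right.
levels(by_order)
```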
In the following piece of code, the argument fill = as.factor(sign(Correlation))
gives different colours to negative and positive correlations; these colours are
specified with scale_fill_manual. Also, stat = 'identity' is necessary because
geom_bar counts observations by default. In geom_text, vjust is multiplied by
sign(Correlation) because the bars take both positive and negative values, and
the label needs to be shifted in opposite directions in these two cases.
ggplot(df) +
  geom_bar(aes(x = Covariates, y = Correlation, fill = as.factor(sign(Correlation))),
           stat = 'identity', width = .6) +
  geom_text(aes(x = Covariates, y = Correlation,
                label = formatC(Correlation, digits = 2)),
            vjust = sign(df$Correlation) * 1.6,  # push each label inside its bar
            color = 'white',
            size = 3.5) +
  scale_fill_manual(values = c('#9b1d1d', '#129127')) +
  guides(fill = 'none') +  # hide the legend (fill = F is deprecated in recent ggplot2)
  labs(y = expression(rho(X[i], Y)),
       x = expression(X[i])) +
  theme_bw()
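A small note on the labels: with its default format for doubles, formatC treats digits as the number of significant digits rather than the number of decimal places. The sketch below illustrates the difference on a few made-up values and shows sprintf as one possible alternative when a fixed number of decimals is preferred.

```r
# Made-up values, chosen only to illustrate the formatting behaviour.
values <- c(-0.8677, 0.681, 0.0123)

# Significant digits: small values keep more decimal places.
formatC(values, digits = 2)

# Fixed decimal places: every label has exactly two decimals.
sprintf("%.2f", values)
```

For the correlations plotted here the two approaches give the same labels, since none of them is close to zero.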
Note that the bars in the plot above can be ordered in another interesting way.
If we order the variable Covariates by the absolute value of Correlation, the
plot is ordered by the strength of the correlation, regardless of its sign.
Sometimes this approach is more informative, and it is shown in the next piece
of code. There is just one subtle alteration: apply the function abs to
.$Correlation.
# Reorder the levels by the absolute value of the correlation
df %<>% mutate(Covariates = factor(Covariates,
                                   levels = Covariates[order(abs(.$Correlation),
                                                             decreasing = TRUE)]))

ggplot(df) +
  geom_bar(aes(x = Covariates, y = Correlation, fill = as.factor(sign(Correlation))),
           stat = 'identity', width = .6) +
  geom_text(aes(x = Covariates, y = Correlation,
                label = formatC(Correlation, digits = 2)),
            vjust = sign(df$Correlation) * 1.6,  # push each label inside its bar
            color = 'white',
            size = 3.5) +
  scale_fill_manual(values = c('#9b1d1d', '#129127')) +
  guides(fill = 'none') +  # hide the legend (fill = F is deprecated in recent ggplot2)
  labs(y = expression(rho(X[i], Y)),
       x = expression(X[i])) +
  theme_bw()
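Since the two plots differ only in how the bars are ordered, the steps can be wrapped in a small function. The sketch below is one way to do it, not the post's original code: the function name plot_correlation and its arguments are hypothetical, geom_col stands in for geom_bar with stat = 'identity', vjust is mapped inside aes, and the colours are named so that each sign always gets the same colour.

```r
library(ggplot2)

# Hypothetical helper: bar plot of the correlations between `response`
# and every other (numeric) column of `data`.
plot_correlation <- function(data, response, by_strength = FALSE) {
  corr <- cor(data)[response, colnames(data) != response]

  # Order by the absolute value (strength) or by the signed value.
  key <- if (by_strength) abs(corr) else corr
  df <- data.frame(
    Covariates  = factor(names(corr),
                         levels = names(corr)[order(key, decreasing = TRUE)]),
    Correlation = unname(corr)
  )

  ggplot(df, aes(x = Covariates, y = Correlation)) +
    geom_col(aes(fill = factor(sign(Correlation))),
             width = 0.6, show.legend = FALSE) +
    # vjust is mapped per row, so each label is pushed inside its own bar
    geom_text(aes(label = formatC(Correlation, digits = 2),
                  vjust = sign(Correlation) * 1.6),
              color = "white", size = 3.5) +
    # Named values: negative is always red and positive is always green
    scale_fill_manual(values = c("-1" = "#9b1d1d", "1" = "#129127")) +
    labs(y = expression(rho(X[i], Y)), x = expression(X[i])) +
    theme_bw()
}

p_signed   <- plot_correlation(mtcars, "mpg")
p_strength <- plot_correlation(mtcars, "mpg", by_strength = TRUE)
p_strength
```

Naming the colours matters when all correlations share the same sign: with unnamed values, the first colour would be assigned to whichever sign happens to be present.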
So, that's it. I hope you've enjoyed this short read.