A simple plot for correlation between response and explanatory variables

Introduction

In this post, I’ll explain very briefly how to build a bar plot of correlations using ggplot2 and dplyr. Such plots are useful for examining the relationship between an outcome of interest and the explanatory variables, an important task that precedes a regression analysis. It is important to remember that the Pearson correlation measures only linear association. Consequently, a low Pearson correlation doesn’t necessarily imply that two variables are unrelated, because the relationship may follow a non-linear pattern. Another remark worth keeping in mind is the famous quote “Correlation is not causation!”.
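To make the remark about linearity concrete, here is a minimal sketch with made-up data (the vectors x and y are illustrative and are not part of mtcars). Here y is fully determined by x, yet the Pearson correlation is essentially zero because the relationship is symmetric and non-linear.

```r
# Illustrative data: y depends on x exactly, but not linearly
x <- seq(-1, 1, length.out = 101)
y <- x^2

# The Pearson correlation is numerically zero despite the perfect dependence
pearson <- cor(x, y)
print(pearson)
```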

The data used is the well-known mtcars data set.

Straight to the point

Let’s say that we are interested in checking which variables are highly correlated with mpg. To do so, use the following piece of code to compute the correlation matrix of the data and select only the first row of the matrix, which corresponds to the variable of interest (mpg is the first column of mtcars). Then exclude the first position of the resulting vector, since it holds the correlation of mpg with itself.

library(dplyr)
library(ggplot2)
library(magrittr)

correlation <- cor(mtcars)[1,][-1]
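The positional indexing above relies on mpg being the first column. As a sketch of an alternative that does not depend on column order, the row can be selected by name instead; the names cor_matrix, response, by_name and by_position are only illustrative.

```r
# Correlation matrix of all variables in mtcars
cor_matrix <- cor(mtcars)

# Pick the row of the response by name and drop its self-correlation
response <- "mpg"
by_name <- cor_matrix[response, colnames(cor_matrix) != response]

# Compare with the positional version used in the post
by_position <- cor_matrix[1, ][-1]
all.equal(by_name, by_position)
```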

As a matter of fact, the Pearson correlation is not adequate for measuring the association between variables that are not continuous. However, this is just a toy example to show how to create a specific plot.
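If the non-continuous variables are a concern, one possible option (not the one used in this post) is a rank-based coefficient. The sketch below swaps in the Spearman correlation through the method argument of cor; the name correlation_spearman is illustrative, and whether Spearman is appropriate depends on the data at hand.

```r
# Rank-based (Spearman) correlation of mpg with the other variables;
# this is a different coefficient from the Pearson one used in the post
correlation_spearman <- cor(mtcars, method = "spearman")[1, ][-1]
print(round(correlation_spearman, 2))
```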

Now that the correlations have been computed, store them in a tibble (this is not necessary, but it makes the code more organized). Note that the Covariates variable is converted to a factor whose levels are ordered by the correlation. As a consequence, when the data is plotted, the bars will be ordered by their correlation coefficient with the response variable.

df <- tibble(Covariates = names(correlation),
             Correlation = correlation) %>% 
  mutate(Covariates = factor(Covariates, 
                             levels = Covariates[order(.$Correlation, 
                                                       decreasing = TRUE)]
  )
  )
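The same ordering can probably be expressed more compactly with base R's reorder, which sorts factor levels by increasing score. The sketch below is an alternative, not the post's method; df_alt is an illustrative name, and the correlation is negated so that the highest values come first.

```r
library(dplyr)

correlation <- cor(mtcars)[1, ][-1]

# reorder() sorts the levels by increasing score, so negate the correlation
df_alt <- tibble(Covariates = names(correlation),
                 Correlation = correlation) %>%
  mutate(Covariates = reorder(Covariates, -Correlation))

# Levels from the highest to the lowest correlation with mpg
levels(df_alt$Covariates)
```

Printing the levels is also a quick way to confirm the order in which the bars will appear.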

In the following piece of code, the argument fill = as.factor(sign(Correlation)) gives different colours to negative and positive correlations; the colours themselves are specified with scale_fill_manual. Also, stat = 'identity' is necessary because geom_bar counts observations by default. In geom_text, the argument vjust needs to be multiplied by sign(Correlation): the bars take both positive and negative values, and the vertical adjustment has to point in opposite directions in these two cases.

ggplot(df) +
  geom_bar(aes(x = Covariates, y = Correlation, fill = as.factor(sign(Correlation))),
           stat = 'identity', width = .6) +
  geom_text(aes(x = Covariates, y = Correlation, 
                label = formatC(Correlation, digits = 2)),
            vjust = sign(df$Correlation)*1.6,
            color = 'white', 
            size = 3.5)+
  scale_fill_manual(values = c('#9b1d1d', '#129127')) +
  guides(fill = "none") +
  labs(y = expression(rho(X[i], Y)),
       x = expression(X[i])) +
  theme_bw()
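As a side note on stat = 'identity': ggplot2 also provides geom_col, which plots the y values as given. The sketch below rebuilds the data with a plain data.frame so that it runs on its own; df_col and p are illustrative names, and the text labels are left out for brevity.

```r
library(ggplot2)

correlation <- cor(mtcars)[1, ][-1]
df_col <- data.frame(
  Covariates = factor(names(correlation),
                      levels = names(correlation)[order(correlation,
                                                        decreasing = TRUE)]),
  Correlation = correlation
)

# geom_col() uses the y values directly, so stat = 'identity' is not needed
p <- ggplot(df_col, aes(x = Covariates, y = Correlation)) +
  geom_col(aes(fill = as.factor(sign(Correlation))), width = .6) +
  scale_fill_manual(values = c('#9b1d1d', '#129127')) +
  guides(fill = "none") +
  theme_bw()
p
```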

Note that the bars in the plot above can be ordered in another interesting way. If we order the variable Covariates by the absolute value of Correlation, the plot will be ordered by the strength of the correlation, regardless of its sign. Sometimes this approach is more informative, and it is shown in the next piece of code. There is just one subtle change: apply the function abs to .$Correlation.

df %<>% mutate(Covariates = factor(Covariates, 
                                   levels = Covariates[order(abs(.$Correlation), 
                                                             decreasing = TRUE)]))

ggplot(df) +
  geom_bar(aes(x = Covariates, y = Correlation, fill = as.factor(sign(Correlation))),
           stat = 'identity', width = .6) +
  geom_text(aes(x = Covariates, y = Correlation, 
                label = formatC(Correlation, digits = 2)),
            vjust = sign(df$Correlation)*1.6,
            color = 'white', 
            size = 3.5)+
  scale_fill_manual(values = c('#9b1d1d', '#129127')) +
  guides(fill = "none") +
  labs(y = expression(rho(X[i], Y)),
       x = expression(X[i])) +
  theme_bw()
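Since the two plots differ only in how the bars are ordered, the steps could be wrapped in a small function. The sketch below is a hypothetical helper, not something from a package: plot_correlation_bars and its arguments data, response and by_abs are names made up for illustration. It assumes all columns are numeric and that no correlation is exactly zero, since only two fill colours are supplied.

```r
library(ggplot2)

# Hypothetical helper (not from the original post): bar plot of the
# correlation between one response column and every other column of `data`
plot_correlation_bars <- function(data, response, by_abs = FALSE) {
  cor_matrix <- cor(data)
  correlation <- cor_matrix[response, colnames(cor_matrix) != response]

  # Order the bars by the signed or by the absolute correlation
  ranking <- if (by_abs) abs(correlation) else correlation
  df <- data.frame(
    Covariates = factor(names(correlation),
                        levels = names(correlation)[order(ranking,
                                                          decreasing = TRUE)]),
    Correlation = correlation
  )

  ggplot(df, aes(x = Covariates, y = Correlation)) +
    geom_bar(aes(fill = as.factor(sign(Correlation))),
             stat = 'identity', width = .6) +
    geom_text(aes(label = formatC(Correlation, digits = 2)),
              vjust = sign(df$Correlation) * 1.6,
              color = 'white', size = 3.5) +
    scale_fill_manual(values = c('#9b1d1d', '#129127')) +
    guides(fill = "none") +
    labs(y = expression(rho(X[i], Y)), x = expression(X[i])) +
    theme_bw()
}

# Bars ordered by signed correlation, then by absolute correlation
p_signed <- plot_correlation_bars(mtcars, "mpg")
p_abs <- plot_correlation_bars(mtcars, "mpg", by_abs = TRUE)
```

Printing p_signed or p_abs draws the corresponding plot.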

So, that’s it. I hope you liked it.

Lucas Godoy
PhD Candidate / TA /GA

I’m a PhD Candidate in Stats interested in R, Open Data, and the most diverse applications of statistics.
