# A simple plot for correlation between response and explanatory variables

## Introduction

In this post, I’ll explain very briefly how to build a correlation bar plot
using `ggplot2` and `dplyr`. Such graphs are useful for examining the relationship
between an outcome of interest and explanatory variables, which is an important
task that precedes a regression analysis. It is important to remember that the
Pearson correlation measures linear association only. Consequently, a low linear
correlation doesn’t necessarily imply that there is no relationship between two
variables, because the relationship can follow a non-linear pattern. Another point
worth keeping in mind is the famous quote “Correlation is not causation!”.
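
To make the point about linearity concrete, here is a minimal sketch with made-up
data (the vectors `x` and `y` are illustrative and not part of `mtcars`): `y` is
fully determined by `x`, yet the Pearson correlation is essentially zero because
the relationship is quadratic.

```r
# Illustrative data only: y depends on x exactly, but not linearly
x <- seq(-1, 1, length.out = 101)
y <- x^2

# Pearson correlation is (numerically) zero despite the perfect dependence
pearson <- cor(x, y)
print(round(pearson, 10))
```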

The data used is the well-known `mtcars` dataset.

## Straight to the point

Let’s say that we are interested in checking which variables are highly correlated
with `mpg`. To do so, use the following piece of code to compute the correlation
matrix for the data, then select just the first row of the matrix, which corresponds
to the variable of interest, and finally exclude the first element of the resulting
vector (the correlation of `mpg` with itself).

```
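library(dplyr)     # tibble(), mutate(), %>%
library(ggplot2)   # plotting
library(magrittr)  # %<>% assignment pipe used later

# Keep the mpg row of the correlation matrix and drop mpg's correlation with itself
correlation <- cor(mtcars)[1, ][-1]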
```
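
As a possible alternative to positional indexing, the same vector can be picked
out by name, which might be easier to read and would not depend on `mpg` being the
first column. This is only a sketch; `mpg_row`, `by_name`, `others` and `direct`
are names introduced here for illustration.

```r
# Select the mpg row by name, then drop mpg's correlation with itself
mpg_row <- cor(mtcars)["mpg", ]
by_name <- mpg_row[names(mpg_row) != "mpg"]

# Or correlate every other column with mpg directly, skipping the full matrix
others <- setdiff(names(mtcars), "mpg")
direct <- cor(mtcars[, others], mtcars$mpg)[, 1]

# Both should match the positional version used in the post
all.equal(by_name, cor(mtcars)[1, ][-1])
all.equal(direct, by_name)
```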

As a matter of fact, the Pearson correlation is not adequate for measuring the association between variables that are not continuous. However, this is just a toy example to show how to create a specific plot.
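
One commonly used alternative for ordinal or otherwise non-continuous variables is
a rank-based coefficient such as Spearman’s. The post itself sticks with Pearson;
the sketch below only shows how the same vector could be computed with
`method = "spearman"`, and `correlation_spearman` is a name introduced here.

```r
# Rank-based (Spearman) correlations of every other column with mpg
correlation_spearman <- cor(mtcars, method = "spearman")["mpg", -1]

# Compare side by side with the Pearson values used in the rest of the post
round(cbind(pearson  = cor(mtcars)["mpg", -1],
            spearman = correlation_spearman), 2)
```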

Now that the correlations have been computed, store them in a `tibble` (this is
not necessary, but it makes the code more organized). Note that the `Covariates`
variable is converted to a factor whose levels are ordered by the correlation. As a
consequence, when the data is plotted, the bars will be ordered by their correlation
coefficient with the response variable.

```
# Order the factor levels by decreasing correlation so the bars come out sorted
df <- tibble(Covariates = names(correlation),
             Correlation = correlation) %>%
  mutate(Covariates = factor(Covariates,
                             levels = Covariates[order(.$Correlation,
                                                       decreasing = TRUE)]))
```
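
To check that the reordering did what was intended, the factor levels can be
inspected. The sketch below rebuilds the same factor in base R (`covariates` is a
name introduced here) so that it runs on its own; with the `tibble` above,
`levels(df$Covariates)` should give the same result.

```r
correlation <- cor(mtcars)[1, ][-1]

# Same idea as the mutate() call: levels sorted by decreasing correlation
covariates <- factor(names(correlation),
                     levels = names(correlation)[order(correlation,
                                                       decreasing = TRUE)])

# The first level has the largest correlation with mpg, the last the smallest
levels(covariates)
```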

In the following piece of code, the argument `fill = as.factor(sign(Correlation))`
gives different colours to negative and positive correlations; these colours are
specified with `scale_fill_manual`. Also, `stat = 'identity'` is necessary because
`geom_bar` counts observations by default. In `geom_text`, the `vjust` value needs
to be multiplied by `sign(Correlation)` since the bars take both positive and
negative values and the label offset needs to point in the opposite direction in
each of these two cases.

```
ggplot(df) +
  # stat = 'identity' makes the bar height equal to the Correlation value
  geom_bar(aes(x = Covariates, y = Correlation,
               fill = as.factor(sign(Correlation))),
           stat = 'identity', width = .6) +
  # push each label inside its bar: down for positive, up for negative values
  geom_text(aes(x = Covariates, y = Correlation,
                label = formatC(Correlation, digits = 2)),
            vjust = sign(df$Correlation) * 1.6,
            color = 'white',
            size = 3.5) +
  scale_fill_manual(values = c('#9b1d1d', '#129127')) +
  guides(fill = "none") +
  labs(y = expression(rho(X[i], Y)),
       x = expression(X[i])) +
  theme_bw()
```
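
A possibly tidier variant: `geom_col()` is the `ggplot2` shorthand for
`geom_bar(stat = 'identity')`, and `show.legend = FALSE` is one way to hide the
legend. The sketch below rebuilds the data with a plain `data.frame` so that it is
self-contained; the object name `p` is introduced here, the axis labels are left at
their defaults, and the rest of the styling is copied from the plot above.

```r
library(ggplot2)

correlation <- cor(mtcars)[1, ][-1]
df <- data.frame(
  Covariates  = factor(names(correlation),
                       levels = names(sort(correlation, decreasing = TRUE))),
  Correlation = correlation
)

p <- ggplot(df, aes(x = Covariates, y = Correlation)) +
  # geom_col() uses the y values as bar heights, so no stat argument is needed
  geom_col(aes(fill = as.factor(sign(Correlation))),
           width = .6, show.legend = FALSE) +
  # positive vjust puts the label below the bar end, negative vjust above it,
  # so in both cases the label ends up inside the bar
  geom_text(aes(label = formatC(Correlation, digits = 2)),
            vjust = sign(df$Correlation) * 1.6,
            color = 'white', size = 3.5) +
  scale_fill_manual(values = c('#9b1d1d', '#129127')) +
  theme_bw()

p
```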

Note that the plot above can have another interesting ordering for the bars. If we
order the variable `Covariates` by the absolute value of `Correlation`, then we’ll
have the plot ordered by the strength of the correlation, independently of its
sign. Sometimes this approach is more informative, and it is shown in the next
piece of code. There is just one subtle alteration: apply the function `abs` to
`.$Correlation`.

```
# Re-level the factor by decreasing absolute correlation
df %<>% mutate(Covariates = factor(Covariates,
                                   levels = Covariates[order(abs(.$Correlation),
                                                             decreasing = TRUE)]))

ggplot(df) +
  geom_bar(aes(x = Covariates, y = Correlation,
               fill = as.factor(sign(Correlation))),
           stat = 'identity', width = .6) +
  geom_text(aes(x = Covariates, y = Correlation,
                label = formatC(Correlation, digits = 2)),
            vjust = sign(df$Correlation) * 1.6,
            color = 'white',
            size = 3.5) +
  scale_fill_manual(values = c('#9b1d1d', '#129127')) +
  guides(fill = "none") +
  labs(y = expression(rho(X[i], Y)),
       x = expression(X[i])) +
  theme_bw()
```
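
To see the effect of `abs` on the ordering without drawing the plot, the covariate
names can be sorted both ways. This is a small base-R sketch; `by_value` and
`by_strength` are names introduced here.

```r
correlation <- cor(mtcars)[1, ][-1]

# Ordered by signed value: positive correlations first, negative ones last
by_value <- names(correlation)[order(correlation, decreasing = TRUE)]

# Ordered by absolute value: strongest first, regardless of sign
by_strength <- names(correlation)[order(abs(correlation), decreasing = TRUE)]

by_value
by_strength
```

In `mtcars`, `wt` has the most negative correlation with `mpg`, so it should move
from the last position in `by_value` to the first in `by_strength`.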

So, that’s it. Hope that you’ve enjoyed this short read.