I have a table of data in which each column holds the values of one lab test and each row is a study subject.

I want to generate a series of histograms showing the distribution of values for each lab test (i.e. column). Each set of lab values would ideally have a different bin width (some are integers with a range of hundreds, some are numeric with a range of 2-3).

How do I do that?

veldhoen
  • I just came across the [multi.hist() function from the psych package](https://www.personality-project.org/r/html/multi.hist.html). It allows you to quickly plot histograms by specific columns and looks like you can set different breaks for each column. – kitkat Jul 27 '18 at 14:23

2 Answers

If you combine the tidyr and ggplot2 packages, you can use facet_wrap to make a quick set of histograms of each variable in your data.frame.

You need to reshape your data to long form with tidyr::gather, so that you have key and value columns like this:

library(tidyr)
library(ggplot2)
# or `library(tidyverse)`

mtcars %>% gather() %>% head()
#>   key value
#> 1 mpg  21.0
#> 2 mpg  21.0
#> 3 mpg  22.8
#> 4 mpg  21.4
#> 5 mpg  18.7
#> 6 mpg  18.1

Using this as our data, we can map value as our x variable, and use facet_wrap to separate by the key column:

ggplot(gather(mtcars), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

Setting scales = 'free_x' is necessary unless all of your variables are on a similar scale.

You can replace bins = 10 with anything that evaluates to a number, which may allow you to set them somewhat individually with some creativity. Alternatively, you can set binwidth, which may be more practical, depending on what your data looks like. Regardless, binning will take some finesse.
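As a rough sketch of the "set them individually" idea: recent versions of ggplot2 document that binwidth can also be a function that computes the width from the x values, and it is evaluated separately for each group within each facet. The helper name fd_binwidth below is made up for illustration, the Freedman-Diaconis rule is just one possible choice, and pivot_longer (tidyr 1.0.0 or later) is used here in place of gather, which tidyr now marks as superseded.

```r
library(tidyr)
library(ggplot2)

# Hypothetical helper: Freedman-Diaconis bin width, with fallbacks so the
# width is never zero (e.g. for binary or constant columns).
fd_binwidth <- function(x) {
  x <- x[is.finite(x)]
  if (length(x) == 0) return(1)
  width <- 2 * IQR(x) / length(x)^(1/3)
  if (width <= 0) width <- diff(range(x)) / 30  # IQR is zero
  if (width <= 0) width <- 1                    # all values identical
  width
}

# Long form: one row per (variable, value) pair
long <- pivot_longer(mtcars, everything(),
                     names_to = "key", values_to = "value")

p <- ggplot(long, aes(value)) +
  geom_histogram(binwidth = fd_binwidth) +  # width computed per facet
  facet_wrap(~key, scales = "free")
p
```

Whether the rule gives sensible bins for every lab test is something you would still need to check by eye; for columns that need hand-picked widths, the linked approach in the comments may be a better fit.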

alistaire
  • [Here's an approach for setting different binwidths](http://stackoverflow.com/a/17286264/4497050), though it's a bit more complicated. – alistaire Feb 12 '16 at 23:20
  • I really appreciate the help, I was really misunderstanding how the faceting worked. I'll work with the binwidths article you posted, as they are so different between the values. Thank you very much. – veldhoen Feb 12 '16 at 23:27

You could generate the plots in a for loop with something like this, if your data frame is named "df" and you want to generate histograms starting with column 2 (if column 1 is your id):

# Loop over every column except the first (assumed to be the subject id)
for (col in 2:ncol(df)) {
    hist(df[,col])
}

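The hist function automatically chooses reasonable breaks (Sturges' rule by default), or you can request roughly the same number of bins for all histograms by adding the breaks argument. Note that a single number is only a suggestion, because hist rounds the breakpoints to "pretty" values: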

hist(df[,col], breaks=10)
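Since the question asks for a different bin width per lab test, here is a minimal sketch of one way to do that in base R. It assumes you keep a named list of per-column settings (breaks_by_col is a made-up name, and the values in it are arbitrary examples on mtcars, not recommendations). The breaks argument accepts a suggested count, a vector of explicit breakpoints, or the name of a rule such as "FD".

```r
# Hypothetical per-column settings; columns not listed use the default rule.
breaks_by_col <- list(
  mpg = "FD",                 # Freedman-Diaconis rule
  hp  = 20,                   # suggested number of bins
  wt  = seq(1, 6, by = 0.5)   # explicit breakpoints, i.e. a bin width of 0.5
)

cols <- c("mpg", "hp", "wt", "qsec")

old_par <- par(mfrow = c(2, 2))  # show the four histograms on one page
results <- lapply(cols, function(col) {
  brk <- breaks_by_col[[col]]
  if (is.null(brk)) brk <- "Sturges"
  hist(mtcars[[col]], breaks = brk, main = col, xlab = col)
})
par(old_par)                     # restore the previous layout
names(results) <- cols
```

Explicit breakpoints must span the full range of the column, otherwise hist stops with an error, so a seq() built from the column's own minimum and maximum is probably safer for real data.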

If you use RStudio, all your plots will automatically be kept in the Plots pane, where you can page back through them. If not, you will need to save each plot to a separate file inside the loop, as explained here: http://www.r-bloggers.com/automatically-save-your-plots-to-a-folder/
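For the save-to-file route, something along these lines should work. It is only a sketch: the output folder (a "histograms" directory under tempdir()) and the file naming are assumptions, and mtcars stands in for your data frame, so there is no id column to skip here.

```r
df <- mtcars  # stand-in for your own data frame

# Hypothetical output folder; change it to wherever you want the files
out_dir <- file.path(tempdir(), "histograms")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

for (col in names(df)) {
  # Open a PNG device, draw one histogram into it, then close the device
  png(file.path(out_dir, paste0(col, ".png")), width = 800, height = 600)
  hist(df[[col]], main = col, xlab = col)
  dev.off()
}

saved <- list.files(out_dir, pattern = "\\.png$")
```

Each file is only written once dev.off() is called, so the call needs to stay inside the loop.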

KTWillow
  • That is great, thank you. The automatic plot saving is a great addition to my understanding. – veldhoen Feb 12 '16 at 23:28
  • You could add `par(mfrow = c(x, y))` to display them in one plot. Or perhaps make the code wait somehow so that user has the time to look at the plot and proceed to next. Or perhaps add a sleep timer to display each image for a predefined time period. Or, just use something akin to what @alistaire did. :) – Roman Luštrik Feb 14 '16 at 11:19
  • I suggest adding `main = names(df[col])` as an additional argument to the hist function. It'll label each histogram with the name of the column in question. – Devin Lamothe Apr 30 '18 at 23:29