
Exploring a new data set: What is the easiest, quickest way to visualise many (all) variables?

Ideally, the output shows the histograms next to each other with minimal clutter and maximum information. Key to this question is flexibility and stability to deal with large and different data sets. I'm using RStudio and usually deal with large and messy survey data.

One command that works out of the box with Hmisc and does quite well here is:

library(ggplot2)
str(mpg)

library(Hmisc)
hist.data.frame(mpg)

Unfortunately, with other data I run into problems with data labels (Error in plot.new() : figure margins too large). It also crashed on a data set larger than mpg, and I haven't figured out how to control binning. Moreover, I'd prefer a flexible solution in ggplot2. Note that I just started learning R and am used to the comfortable solutions provided by commercial software.

More questions on this topic:

R histogram - too many variables

...?

Rico
  • Making a graph for every variable in a data set is fine for a small data set, but is simply a terrible idea if you have 3000 variables. The correct answer in that case is "Don't do that". – joran Jun 27 '12 at 14:26
  • Of course not; that was just an example for "messy". – Rico Jun 27 '12 at 14:34
  • I appreciate the effort you've gone to here, but your question simply isn't describing a concrete, specific programming problem. Instead, it feels very much like something that will lead to rambling answers with various recommendations, rather than a clear answer. Indeed, when I read your answer I'm more confused about what your criteria are than before. – joran Jun 27 '12 at 14:38

1 Answer


There may be three broad approaches:

  1. Commands from packages such as hist.data.frame()
  2. Looping over variables or similar macro constructs
  3. Stacking variables and using facets

Packages

Other commands available that may be helpful:

library(plyr)   # provides mutate(), used in the stacking example
library(psych)
# multi.hist(mpg)  # fails: mpg contains non-numeric columns
multi.hist(mpg[, sapply(mpg, is.numeric)])  # numeric columns only

or perhaps multhist from plotrix, which I haven't explored. Neither of them offers the flexibility I was looking for.

Loops

When I started with R, everyone advised me to stay away from loops. So I did, but perhaps they are worth a try here. Any suggestions are very welcome. Perhaps you could comment on how to combine the graphs into one file.
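One way the loop idea might look, as a minimal sketch rather than a tested recipe: build one ggplot2 histogram per numeric variable with lapply(), then print each plot to the same PDF device so every variable gets its own page. The object names (num_vars, plots), the output file name and the choice of 30 bins are assumptions for illustration, and the .data pronoun assumes a reasonably recent ggplot2.

```r
library(ggplot2)

# Names of the numeric columns in mpg (assumption: only these get histograms)
num_vars <- names(mpg)[sapply(mpg, is.numeric)]

# One histogram per variable; bins is set explicitly to control binning
plots <- lapply(num_vars, function(v) {
  ggplot(mpg, aes(x = .data[[v]])) +
    geom_histogram(bins = 30) +
    ggtitle(v)
})

# Printing several plots to one open PDF device gives one page per plot
pdf("mpg-loop-histograms.pdf")
for (p in plots) print(p)
dev.off()
```

Because each plot is a separate object, titles, bin counts or themes can be adjusted per variable before printing.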

Stacking

My first suspicion was that stacking variables might get out of hand. However, it might be the best strategy for a reasonable set of variables.

One example I came up with uses the melt function.

library(reshape2)
mpgid <- mutate(mpg, id=as.numeric(rownames(mpg)))
mpgstack <- melt(mpgid, id="id")
pp <- qplot(value, data=mpgstack) + facet_wrap(~variable, scales="free")
# pp + stat_bin(geom="text", aes(label=..count.., vjust=-1))
ggsave("mpg-histograms.pdf", pp, scale=2)

(As you can see I tried to put value labels on the bars for more information density, but that didn't go so well. The labels on the x-axis are also less than ideal.)
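A possible variant of the stacking approach, sketched under the assumption that only the numeric columns are of interest: dropping the character columns before melting keeps value numeric, so geom_histogram() can bin it and the bins argument controls the binning. The names num_mpg, num_stack and pp_num, the file name and the 20 bins are illustrative choices, not part of the original example.

```r
library(ggplot2)
library(reshape2)

# Keep numeric columns only, as a plain data frame
num_mpg <- as.data.frame(mpg[, sapply(mpg, is.numeric)])

# Stack all columns into variable/value pairs (no id variables needed here)
num_stack <- melt(num_mpg, measure.vars = names(num_mpg))

# One facet per variable; free scales because the ranges differ
pp_num <- ggplot(num_stack, aes(x = value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ variable, scales = "free")

ggsave("mpg-numeric-histograms.pdf", pp_num, scale = 2)
```

Categorical variables would need a separate plot (for example bar charts), since they cannot share a binned x-axis with the numeric ones.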

No solution here is perfect and there won't be a one-size-fits-all command. But perhaps we can get closer to making the exploration of a new data set easier.

Rico