Writing a loop to create ggplot figures with different data sources and titles

Question

I have no experience with loops, but it looks like I will need some to analyze my data properly. Could you show how to write a simple loop around the code I have already written? Let's use a loop to produce some plots:

pdf(file = sprintf("complex I analysis", tbl_comp_abu1), paper='A4r')

ggplot(df_tbl_data1_comp1, aes(Size_Range, Abundance, group=factor(Gene_Name))) +
  theme(legend.title=element_blank()) +
  geom_line(aes(color=factor(Gene_Name))) +
  ggtitle("Data1 - complex I")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(df_tbl_data2_comp1, aes(Size_Range, Abundance, group=factor(Gene_Name))) +
  theme(legend.title=element_blank()) +
  geom_line(aes(color=factor(Gene_Name))) +
  ggtitle("Data2 - complex I")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))


ggplot(df_tbl_data3_comp1, aes(Size_Range, Abundance, group=factor(Gene_Name))) +
  theme(legend.title=element_blank()) +
  geom_line(aes(color=factor(Gene_Name))) +
  ggtitle("Data3 - complex I")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

dev.off()

Now to what I would like to achieve. I have about 10 complexes to analyze, so 10 pdf files should be created; the example above shows the plots from three different data sets for complex one. To do this properly, the number in comp1 (from df_tbl_dataX_comp1) has to change from 1 to 10, depending on which complex we want to plot. The loop also has to change the name of the pdf file and the title of each graph. Is it possible to write such a loop?

Data:

structure(list(Size_Range = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 
3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 
8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L, 11L, 11L, 11L, 12L, 12L, 12L, 
13L, 13L, 13L, 14L, 14L, 14L, 15L, 15L, 15L, 16L, 16L, 16L, 17L, 
17L, 17L, 18L, 18L, 18L, 19L, 19L, 19L, 20L, 20L, 20L), .Label = c("10", 
"34", "59", "84", "110", "134", "165", "199", "234", "257", "362", 
"433", "506", "581", "652", "733", "818", "896", "972", "1039"
), class = "factor"), Abundance = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 142733.475, 108263.525, 98261.11, 649286.165, 
3320759.803, 3708515.148, 6691260.945, 30946562.92, 180974.3725, 
4530005.805, 21499827.89, 0, 15032198.54, 4058060.583, 0, 3842964.97, 
2544030.857, 0, 1640476.977, 286249.1775, 0, 217388.5675, 1252965.433, 
0, 1314666.05, 167467.8825, 0, 253798.15, 107244.9925, 0, 207341.1925, 
15755.485, 0, 71015.85, 14828.5075, 0, 25966.2325, 0, 0, 0, 0, 
0, 0), Gene_Name = c("AT1G01080", "AT1G01090", "AT1G01320", "AT1G01420", 
"AT1G01470", "AT1G01560", "AT1G01800", "AT1G02150", "AT1G02500", 
"AT1G02560", "AT1G02780", "AT1G02880", "AT1G02920", "AT1G02930", 
"AT1G03030", "AT1G03090", "AT1G03110", "AT1G03130", "AT1G03220", 
"AT1G03230", "AT1G03330", "AT1G03475", "AT1G03630", "AT1G03680", 
"AT1G03870", "ATCG00420", "ATCG00470", "ATCG00480", "ATCG00490", 
"ATCG00500", "ATCG00650", "ATCG00660", "ATCG00670", "ATCG00740", 
"ATCG00750", "ATCG00842", "ATCG01100", "ATCG01030", "ATCG01114", 
"ATCG01665", "ATCG00770", "ATCG00780", "ATCG00800", "ATCG00810", 
"ATCG00820", "ATCG00722", "ATCG00744", "ATCG00855", "ATCG00853", 
"ATCG00888", "ATCG00733", "ATCG00766", "ATCG00812", "ATCG00821", 
"ATCG00856", "ATCG00830", "ATCG00900", "ATCG01060", "ATCG01110", 
"ATCG01120")), .Names = c("Size_Range", "Abundance", "Gene_Name"
), row.names = c(NA, -60L), class = "data.frame")

You might check out: http://stackoverflow.com/questions/23439266/list-for-multiple-plots-from-loop-ggplot2-list-elements-being-overwritten or http://stackoverflow.com/questions/11357139/r-saving-ggplot2-plots-in-a-list?rq=1 – Iris, Oct 21 '15 at 13:19
Are your data very large? You could consider creating a named list of dataframes (or even one large one) and using `lapply` or something similar. – Heroka, Oct 21 '15 at 19:47
They are not so big. I could easily do that if I knew how... – Shaxi Liver, Oct 22 '15 at 08:32
Another method (if separate files are not essential) would be to save the plots in a list and then write the list to a single pdf, which gives you one page per graph: `p = as.list(1:3)`, `p[[1]] = ggplot(...) + ...`, `p[[2]] = ...` etc., then `pdf("plots.pdf", paper = "A4r"); p; dev.off()`. – Akhil Nair, Oct 25 '15 at 21:09
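
The list-of-plots idea from the last comment could look roughly like the sketch below. The three small data frames are made-up stand-ins (only the column names Size_Range, Abundance and Gene_Name come from the question), and the output goes to a temporary directory so nothing in the working directory is overwritten.

```r
library(ggplot2)

# Made-up stand-in data: three small data frames in a named list
set.seed(1)
datasets <- lapply(1:3, function(i) {
  data.frame(Size_Range = factor(rep(c(10, 34, 59), times = 2)),
             Abundance  = runif(6),
             Gene_Name  = rep(c("geneA", "geneB"), each = 3))
})
names(datasets) <- paste0("Data", 1:3)

# Build one plot per data set and keep the plots in a list
plots <- lapply(names(datasets), function(nm) {
  ggplot(datasets[[nm]], aes(Size_Range, Abundance, group = Gene_Name)) +
    geom_line(aes(color = Gene_Name)) +
    ggtitle(paste0(nm, " - complex I"))
})

# Write all plots to a single pdf, one page per plot
out_file <- file.path(tempdir(), "plots.pdf")
pdf(out_file, paper = "A4r")
for (p in plots) print(p)
dev.off()
```

Keeping the data frames in a named list also means the titles can be derived from the list names instead of being typed by hand.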

maRtin · Accepted Answer · 2015-10-27T19:34:11.360

3

This might do the trick: Initiate two loops, one for the complex iteration and a second for the dataset iteration. Then use paste0() or paste() to generate the correct filenames and headings.

PS: I didn't test the code, since I don't have your data, but it should give you an idea.

library(ggplot2)

# loop over complexes
for (c in 1:10) {

    # create one pdf for every complex
    pdf(file = paste0("complex", c, "analysis.pdf"), paper='A4r')

    # loop over datasets
    for (d in 1:3) {

        # get() looks up the data frame by its name, e.g. df_tbl_data1_comp1
        p <- ggplot(get(paste0("df_tbl_data",d,"_comp",c)), aes(Size_Range, Abundance, group=factor(Gene_Name))) +
          theme(legend.title=element_blank()) +
          geom_line(aes(color=factor(Gene_Name))) +
          ggtitle(paste0("Data",d," - complex ",c))+
          theme(axis.text.x = element_text(angle = 90, hjust = 1))

        # inside a loop a ggplot object is not drawn automatically,
        # so print it explicitly; otherwise the pdf stays empty
        print(p)
    }
    dev.off()

}
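
As an alternative to looking up separately named data frames with get(), one long data frame with extra columns for data set and complex can be subset inside the loops. The sketch below is only an illustration under assumptions: make_fake() is a hypothetical helper that invents data, the columns dataset and complex are not in the original data, and only 2 complexes are generated to keep it short.

```r
library(ggplot2)

# Hypothetical helper: fakes one data set (d) for one complex (comp)
make_fake <- function(d, comp) {
  data.frame(dataset = d, complex = comp,
             Size_Range = factor(rep(c(10, 34, 59), times = 2)),
             Abundance  = runif(6),
             Gene_Name  = rep(c("geneA", "geneB"), each = 3))
}

set.seed(42)
combos  <- expand.grid(d = 1:3, comp = 1:2)
long_df <- do.call(rbind, Map(make_fake, combos$d, combos$comp))

out_dir   <- tempdir()
pdf_files <- character(0)

for (comp in unique(long_df$complex)) {
  pdf_file <- file.path(out_dir, paste0("complex", comp, "analysis.pdf"))
  pdf(file = pdf_file, paper = "A4r")
  for (d in unique(long_df$dataset)) {
    sub_df <- long_df[long_df$complex == comp & long_df$dataset == d, ]
    p <- ggplot(sub_df, aes(Size_Range, Abundance, group = Gene_Name)) +
      geom_line(aes(color = Gene_Name)) +
      ggtitle(paste0("Data", d, " - complex ", comp))
    print(p)  # explicit print: plots are not drawn automatically inside loops
  }
  dev.off()
  pdf_files <- c(pdf_files, pdf_file)
}
```

With real data, the long data frame could be built by adding the two columns to each existing table and combining them with rbind().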

edited Oct 27 '15 at 19:34

answered Oct 21 '15 at 13:37

It creates files but without any extension (I mean without "pdf" extension). Even if I change it manually to pdf it doesn't open the file. – Shaxi Liver Oct 22 '15 at 08:32
@ShaxiLiver what is `tbl_comp_abu1` ? – maRtin Oct 22 '15 at 09:32
It's just a data frame which was used for plotting something else. Is it important? It works for me when I apply the code from the first post. – Shaxi Liver Oct 22 '15 at 09:58
@ShaxiLiver I made a small change. Hope it works now. I can't really say what the problem is, because I don't have your data. Are there any error messages? – maRtin Oct 22 '15 at 11:53
It doesn't give any error message while running the code. This time the pdf files were created but I can't open them... I put part of my data in the first post. I couldn't do the `dput`. – Shaxi Liver Oct 22 '15 at 12:28
@ShaxiLiver try `dput(df_tbl_data1_comp1[1:20, ])`. That would help a lot – maRtin Oct 22 '15 at 13:05
Isn't it easiest to use ggsave within your loop? You do have to give your plot a name, though. That worked fine for my plot, with something like this: ggsave(filename=paste("complex",c,"analysis.pdf",sep=""), plot=myplot) – Marinka Oct 26 '15 at 16:47
Try enclosing the `ggplot` line in a `print` statement – jMathew Oct 29 '15 at 02:17

score 2 · Answer 2 · answered Oct 28 '15 at 05:03

So after making my answer, I realized it doesn't address the actual question about loops. However, I hope it shows you a different way of approaching your root problem (a.k.a I didn't want the work to go to waste).

I couldn't get your plot to work with the data you posted. There are 60 unique gene names in a 60-row data frame, so when you make a geom_line grouped by gene (aes(group=Gene_Name)), each line has only one point. You need at least two points to draw a line.
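
A quick way to spot this before plotting is to count the rows per gene. The data frame below is a made-up four-row stand-in with the same shape as the posted data (the Abundance values are invented), so treat it as a sketch of the check rather than a result.

```r
# Stand-in with the same shape as the posted data: every gene appears once
df <- data.frame(Size_Range = factor(c(10, 34, 59, 84)),
                 Abundance  = c(0, 5, 3, 1),
                 Gene_Name  = c("AT1G01080", "AT1G01090",
                                "AT1G01320", "AT1G01420"))

# Count rows per gene and list the genes that cannot form a line
points_per_gene    <- table(df$Gene_Name)
single_point_genes <- names(points_per_gene)[points_per_gene < 2]
length(single_point_genes)
```

If every gene ends up in single_point_genes, geom_line has nothing to connect and the panel comes out empty.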

I made up some data and did an analysis.

library(ggplot2)    # used for the plots further down
library(dplyr)      # %>%, group_by(), do()
library(truncnorm)  # rtruncnorm()

# Function to generate random data
generate_data = function() {

  gene_names = LETTERS[1:20]
  n_genes = length(gene_names)
  size_ranges = c(10, 34, 59, 84, 110, 134, 165, 199, 
                  234, 257, 362, 433, 506, 581, 652, 
                  733, 818, 896, 972, 1039)
  gene_size_means = rtruncnorm(n_genes, 10, 1000, 550, 300)
  genes_in_complex = rbinom(n_genes, 1, 0.3)
  true_variance = 50
  gene_size_variances = rchisq(n_genes, n_genes-1) * (true_variance/(n_genes-1))
  df = data.frame(gene_name=gene_names, 
                  gene_mean=gene_size_means, 
                  gene_var=gene_size_variances,
                  in_complex=genes_in_complex)
  df = df %>% group_by(gene_name) %>% 
    do(data.frame(size_ranges, 
                  abundance=dnorm(size_ranges, .$gene_mean, .$gene_var)*.$in_complex))
  return(df)
}

# Generate a list of tables. Each table is for one data set for one complex
data_tables = list()
n_comps = 3
for( complex_i in 1:2 ) {
  for( comp_j in 1:n_comps ) {
    loop_df = generate_data()
    loop_df$comp = comp_j
    loop_df$complex = complex_i
    data_tables = c(data_tables, list(loop_df))
  }
}

# Concatenate the tables into a larger data frame
dat = do.call(rbind, data_tables)

# Make a plot for each data set for complex 1
dat_complex1 = subset(dat, complex==1)
p = ggplot(dat_complex1, aes(x=size_ranges, y=abundance, color=gene_name, group=gene_name)) +
  geom_line() + 
  facet_wrap(~comp, ncol=1)
print(p)

# Make a plot with many subpanels for all complexes and data sets
p %+% dat + facet_grid(comp~complex)

So you're studying protein complexes in Arabidopsis? In case someone is familiar with your domain, a sentence of background might help them answer your question. Alternatively, a picture of the desired output could help. Also, more complete example data and/or screenshots might generate more interest in your future posts.

score 1 · Answer 3 · answered Oct 30 '15 at 09:57

Have a look at this approach. It depends on a data.frame (dat) that contains the names of your datasets, the graph titles, as well as the file names.

First I create a function that builds the plot and saves it; then I call that function in a for loop and, alternatively, with apply (the two do the same job here; apply is more compact, but don't expect it to be noticeably faster).

The code looks like this:

# create a custom function for ggplot, 
# which creates the plot and then saves it as a pdf
custom_ggplot_function <- function(input.data.name, graph.title, f.name){
  # get(input.data.name) gets you the variable which is stored as a string in
  # input.data.name

  p <- ggplot(get(input.data.name), aes(Size_Range, Abundance, group=factor(Gene_Name))) +
    theme(legend.title=element_blank()) +
    geom_line(aes(color=factor(Gene_Name))) +
    ggtitle(graph.title)+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

  ggsave(filename = paste0(f.name, ".pdf"), plot = p)
  NULL
}

# dat contains the names of your datasets, the titles of the graphs and filenames
# (stringsAsFactors = FALSE keeps the names as character strings, which get() needs)
dat <- data.frame(df.names = c("df_tbl_data1_comp1",
                               "df_tbl_data2_comp1"),
                  graph.titles = c("Data1 - Complex I",
                                   "Data2 - Complex I"),
                  file.names = c("file1", "file2"),
                  stringsAsFactors = FALSE)
# If you create your data.frame dat, you can also say 
# df.names  = paste0("df_tbl_data", 1:3, "_comp1") and
# graph.titles = paste0("Data", 1:3, " - Complex I")


# loop through the rows of dat
for (i in 1:nrow(dat)) {
  custom_ggplot_function(input.data.name = dat[i, "df.names"],
                         graph.title = dat[i, "graph.titles"], 
                         f.name = dat[i, "file.names"])
}

# or using the apply function
apply(dat, 1, function(row.el) {
  custom_ggplot_function(input.data.name = row.el["df.names"], 
                         graph.title = row.el["graph.titles"], 
                         f.name = row.el["file.names"])
})
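
For the full case from the question (3 data sets times 10 complexes), the lookup table does not have to be typed by hand. The sketch below builds it with expand.grid(); the name dat_all and the file name pattern are assumptions made for illustration, and the Roman numerals in the titles follow the question's "complex I" style.

```r
# All combinations of 3 data sets and 10 complexes (counts from the question)
grid <- expand.grid(dataset = 1:3, complex = 1:10)

# One row per plot: data frame name, title, and (hypothetical) file name
dat_all <- data.frame(
  df.names     = paste0("df_tbl_data", grid$dataset, "_comp", grid$complex),
  graph.titles = paste0("Data", grid$dataset, " - complex ",
                        as.character(utils::as.roman(grid$complex))),
  file.names   = paste0("data", grid$dataset, "_complex", grid$complex),
  stringsAsFactors = FALSE
)

head(dat_all, 3)
```

A table like this can then be passed row by row to a plotting function in the same way as the two-row example.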
