
I have an R script that creates a data frame with 61 columns. The data.frame is made by reading a bunch of csv files into a list of data.frames, then merging the list such that commonly named columns in each data.frame in the list populate the same column in the resulting data.frame.

Some of the columns that should be combined have been inconsistently named in the csv files (e.g. date.received vs received.on.date vs date.sample.received), and I was wondering what the best approach to combining them would be.

I had a couple of ideas:

  • rename the columns before merging, in a big lapply over the list.
  • combine the columns that should be the same once I have my data.frame, using whichever column has a value in that row.

Is the second approach possible (and if so, how), or is there a better way?

Camden Narzt

1 Answer


The second approach is possible, and it is easy with rbind_all from the dplyr package (superseded by bind_rows in later dplyr releases). Here is how:

First of all, if you know which column names should be stacked together, I suggest fixing them before stacking, like this:

colnames_synonymous <- c("date.received", "received.on.date", "date.sample.received")

list_of_dfs <- lapply(list_of_dfs, function(df) {
  names(df)[names(df) %in% colnames_synonymous] <- "date_received"
  return(df)
})

Now you are good to go:

dplyr::bind_rows(list_of_dfs)  # rbind_all() in older dplyr; it was deprecated in favour of bind_rows()
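To make the rename-then-stack idea concrete, here is a minimal end-to-end sketch on made-up data. The toy data frames, the `synonyms` lookup, and the `rename_synonyms` helper are all hypothetical names introduced for illustration. It uses only base R for the stacking step (padding missing columns with NA before `rbind`), as a stand-in for the dplyr call, so it runs without extra packages.

```r
# Hypothetical toy data standing in for the data frames read from the csv files
list_of_dfs <- list(
  data.frame(id = 1:2, date.received = c("2014-09-01", "2014-09-02"),
             stringsAsFactors = FALSE),
  data.frame(id = 3L, received.on.date = "2014-09-03", site = "A",
             stringsAsFactors = FALSE),
  data.frame(id = 4L, date.sample.received = "2014-09-04",
             stringsAsFactors = FALSE)
)

# Hypothetical lookup: canonical column name -> names seen in the files.
# More groups of synonyms can be added as further list elements.
synonyms <- list(
  date_received = c("date.received", "received.on.date", "date.sample.received")
)

# Rename every synonymous column in one data frame to its canonical name
rename_synonyms <- function(df, synonyms) {
  for (canonical in names(synonyms)) {
    names(df)[names(df) %in% synonyms[[canonical]]] <- canonical
  }
  df
}

renamed <- lapply(list_of_dfs, rename_synonyms, synonyms = synonyms)

# Base R stand-in for row binding with unequal column sets:
# add each missing column as NA, put columns in a common order, then rbind
all_cols <- unique(unlist(lapply(renamed, names)))
padded <- lapply(renamed, function(df) {
  for (col in setdiff(all_cols, names(df))) df[[col]] <- NA
  df[all_cols]
})
combined <- do.call(rbind, padded)
```

Keeping the synonyms in a lookup list means that fixing another badly named column is a one-line change to `synonyms` rather than a change to the function.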

You may need some further adjustments before all the columns stack correctly, but every one of them can be made by changing the function passed to lapply. I find this easier than transforming columns after rbinding.
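For comparison, the other option from the question (combining the columns after the data frames have been merged, taking whichever column has a value in each row) could be sketched roughly as below. The `merged` data frame and the `coalesce_columns` helper are hypothetical, and the sketch assumes character columns; `ifelse` drops attributes, so Date or factor columns would need converting first.

```r
# Hypothetical merged data frame in which the synonymous columns were NOT renamed first,
# so each row has a value in only one of them
merged <- data.frame(
  id = 1:3,
  date.received = c("2014-09-01", NA, NA),
  received.on.date = c(NA, "2014-09-02", NA),
  date.sample.received = c(NA, NA, "2014-09-03"),
  stringsAsFactors = FALSE
)

colnames_synonymous <- c("date.received", "received.on.date", "date.sample.received")

# Row by row, take the first non-missing value across the given columns
coalesce_columns <- function(df, cols) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), df[cols])
}

merged$date_received <- coalesce_columns(merged, colnames_synonymous)

# Drop the original, inconsistently named columns
merged[colnames_synonymous] <- NULL
```

If dplyr is available, its `coalesce()` function does the same first-non-missing selection and could replace the helper.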

Athos
  • That's actually how I would implement the first approach. What makes the second approach more difficult? – Camden Narzt Sep 26 '14 at 18:43
  • Actually, I do not have a conclusive answer to this question, but I'll argue for the first approach (which I called "second" in my answer, sorry about that) =P. First of all, with the second approach you could end up with an unnecessarily large data frame, causing memory issues. Depending on how many dfs you are dealing with, this could be challenging, since you would have to do the repairs and other manipulations on this big data.frame. Also, one advantage of the first approach is that, to get the columns right, all you need to do is improve the first lapply. – Athos Sep 27 '14 at 18:03