
I have many files and I am trying to find the most efficient way of reading them into data frames and finding the common values in one column.

So far, I read the list of files using:

files = c("test1.txt", "test2.txt", "test3.txt")
my.data <- lapply(files, read.table, header=T)

Each file contains columns like these:

df1 = data.frame(id=c("a", "b", "c"), v = c(1:3), c=c(10:12))
df2 = data.frame(id=c("x", "b", "c"), v = c(2:4), c=c(13:15))
df3 = data.frame(id=c("a", "n", "c"), v = c(4:6), c=c(16:18))

my.data = list(df1, df2, df3)

Now I am trying to subset the list of data frames so that I get back the same list, with each data frame containing only the rows whose first column "id" is common to all of them, e.g.

In this case the result for df1, df2, and df3 would be a list keeping only the "id" values common to all read files, i.e. only the rows with id "c":
intersect(intersect(df1$id, df2$id), df3$id);
list(df1[3,], df2[3,], df3[3,])

But I can't figure out a way to do this working with the list directly. Maybe this is a longer/more difficult process than reading all the files, merging them all first by the common column "id", and then splitting the result back into a list of data frames? Does anybody have any insight into the most efficient way? Thank you!
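For reference, the merge-first route I was considering looks roughly like this (it gives one wide data frame, and splitting that back into per-file data frames is the part I can't work out):

merged <- Reduce(function(x, y) merge(x, y, by = "id"), my.data)
merged
#   id v.x c.x v.y c.y v  c
# 1  c   3  12   4  15 6 18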

user971102
    `Reduce(intersect, Map("[[", my.data, "id"))`. But it's not a list, just the intersection common to all ids. – Rich Scriven Aug 03 '16 at 20:30
  • Hi Richard, thank you. this would give the common "id", but I am trying to return a list of data frames with those common ids... – user971102 Aug 03 '16 at 20:33
  • Do you mean like `lapply(my.data, function(x) x[x$id %in% Reduce(intersect, Map("[[", my.data, "id")), ])`? – Rich Scriven Aug 03 '16 at 20:34
  • Yes, exactly! Is this just a crazy way of doing this? Would it be better doing a "read.table", then "merge" all data frames, then split all the data frames into a list with "split"... – user971102 Aug 03 '16 at 20:38
  • You aren't really splitting anything though. You are subsetting rows. – Rich Scriven Aug 03 '16 at 20:39
  • Yes, the split is just because at the end I want a list of data frames and not 1 single data frame. I am just wondering whether I should do the subsetting/merging from one data frame or if it is better to read in as a list and do the subsetting from a list already... – user971102 Aug 03 '16 at 20:41
  • You asked about efficiency, so maybe `library(data.table); rbindlist(my.data, id = "src")[, if (.N == length(my.data)) .SD, by=id]` assuming each id appears at most once per table. That stacks your data into one table, but that's probably the way to go if performance matters at all. – Frank Aug 03 '16 at 21:00

1 Answer


To find the "id" values common to all data frames, you can use

common <- Reduce(intersect, Map("[[", my.data, "id"))

Then we can use that to subset the list elements.

lapply(my.data, function(x) x[x$id %in% common, ])
# [[1]]
#   id v  c
# 3  c 3 12
#
# [[2]]
#   id v  c
# 3  c 4 15
#
# [[3]]
#   id v  c
# 3  c 6 18
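
If this comes up often, the two steps can be wrapped into a small helper; subset_common is just an illustrative name, and it assumes every element of the list has the column you pass in.

subset_common <- function(lst, col = "id") {
  # values of `col` present in every data frame
  common <- Reduce(intersect, Map("[[", lst, col))
  # keep only the rows whose `col` value is in that intersection
  lapply(lst, function(x) x[x[[col]] %in% common, ])
}

subset_common(my.data)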
Rich Scriven