Creating a count matrix from factor level occurrences in a list of dataframes

Question

Since I cannot share example data, here are two small text files containing the first 5 lines of two of my input files:

https://www.dropbox.com/sh/s0rmi2zotb3dx3o/AAAq0G3LbOokfN8MrYf7jLofa?dl=0

I read all textfiles in the working directory into a list, cut some columns, set new names and subset by a numerical cutoff in the third column:

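# List every .txt file in the working directory (escaped dot, anchored at the end)
all.files <- list.files(pattern = "\\.txt$")
data.list <- lapply(all.files, function(x) read.table(x, sep = "\t"))
names(data.list) <- all.files
# Keep only the first three columns
data.list <- lapply(data.list, function(x) x[, 1:3])

new.names <- c("query", "sbjct", "ident")

data.list <- lapply(data.list, setNames, new.names)
# Keep rows whose identity exceeds the cutoff
new.list <- lapply(data.list, function(x) subset(x, ident > 99))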

I end up with a list of dataframes, each consisting of three columns.

Now, I want to

count the occurrences of factors in the column "sbjct" in all dataframes in the list, and
build a matrix from the counts, in which rows = factor levels of "sbjct" and columns = occurrences in each dataframe.

For each dataframe in the list, a new object with two columns (sbjct/counts) should be created named according to the original dataframe in the original list. In the end, all the new objects should be merged with cbind (for example), and empty cells (data absent) should be filled with zeros, resulting in a "sbjct x counts" matrix.

For example, if I had a single dataframe, dplyr would help me like this:

library(dplyr)
some.object <- some.dataframe %>% 
                  group_by(sbjct) %>%
                    summarise(counts = length(sbjct))

>some.object
Source: local data frame [5 x 2]

            sbjct counts
1 AB619702.1.1454       1
2 EU287121.1.1497       1
3 HM062118.1.1478       1
4 KC437137.1.1283       1
5        Yq2He155       1

But it seems it cannot be applied to lists of dataframes.

If your data.frames all have identical columns, you can first bind them into a single data.frame using something like `DF <- do.call(rbind,new.list)` and then apply the code you provided. – agstudy, Mar 10 '15 at 09:08
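
The per-dataframe counting described in the question can also be sketched without stacking anything. This is only a minimal base R illustration: the list contents and file names below are made up, and `all.levels` and `count.matrix` are hypothetical names, not objects from the question.

```r
# Hypothetical stand-in for the filtered list from the question
new.list <- list(
  "fileA.txt" = data.frame(sbjct = c("s1", "s2", "s2"), stringsAsFactors = FALSE),
  "fileB.txt" = data.frame(sbjct = c("s2", "s3"), stringsAsFactors = FALSE)
)

# Collect every sbjct value that occurs in any dataframe
all.levels <- sort(unique(unlist(lapply(new.list, function(x) as.character(x$sbjct)))))

# Count each dataframe against the shared levels; levels absent from a
# dataframe are counted as 0, so the columns line up
count.matrix <- sapply(new.list, function(x) table(factor(x$sbjct, levels = all.levels)))
```

Here the rows of `count.matrix` are the sbjct values and the columns carry the list names, so no separate merge or zero-filling step is needed.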

Lalit Sachan · Answer 1 · 2015-03-10T09:42:49.487

Add a column to each data set that acts as an indicator (let's name it Ndata) of which data set each observation comes from. Then rbind all these data sets.

When you then make a cross table of sbjct x Ndata, you get the matrix that you are looking for.

Here is some code to clarify:

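# Toy data: three data frames, each with a random number of rows
lvls <- c("a", "b", "c", "d", "e", "f")
set.seed(10)
d1 <- data.frame(sbjt = sample(lvls, sample(20, 1), replace = TRUE))
d2 <- data.frame(sbjt = sample(lvls, sample(20, 1), replace = TRUE))
d3 <- data.frame(sbjt = sample(lvls, sample(20, 1), replace = TRUE))

# Indicator column naming the data set each row comes from
d1$Ndata <- rep("d1", nrow(d1))
d2$Ndata <- rep("d2", nrow(d2))
d3$Ndata <- rep("d3", nrow(d3))

# Stack the data sets, then cross-tabulate sbjt against Ndata
all <- rbind(d1, d2, d3)

ct <- table(all$sbjt, all$Ndata)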

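ct is a contingency table with one row per value of sbjt and one column per data set (d1, d2, d3); combinations that never occur are filled with 0.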
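
Because the sampled data above depends on the random number generator, here is a small deterministic sketch of the same idea. The three data frames are made-up toy values, and `stacked` is a hypothetical name used in place of `all`.

```r
# Hypothetical toy data with known contents
d1 <- data.frame(sbjt = c("a", "a", "b"))
d2 <- data.frame(sbjt = c("b", "c"))
d3 <- data.frame(sbjt = c("a", "c", "c", "d"))

# Indicator column: which data set each row came from
d1$Ndata <- "d1"
d2$Ndata <- "d2"
d3$Ndata <- "d3"

# Stack and cross-tabulate
stacked <- rbind(d1, d2, d3)
ct <- table(stacked$sbjt, stacked$Ndata)

ct["a", "d1"]  # 2: "a" occurs twice in d1
ct["a", "d2"]  # 0: "a" never occurs in d2, so the cell is zero-filled
ct["c", "d3"]  # 2
```

The table has four rows (a to d) and three columns (d1 to d3); the zero for "a" in d2 shows that absent combinations need no extra handling.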

edited Mar 10 '15 at 09:42

answered Mar 10 '15 at 09:24


I would need to paste the name of each dataframe into the additional column. I am working with 360 dataframes, so doing that one by one would be nasty. – nouse Mar 10 '15 at 09:50
For that step use `all<-do.call(rbind,Map(function(x,y) {x$Ndata<-y;x},data.list,1:length(data.list)))` and then follow the steps above. – nicola Mar 10 '15 at 09:54
Thank you! Two more questions: 1. The 1:length argument is not ideal, because the dataframe names are not sequential. 2. How can the rbind be adapted to all dataframes in a list? I could use list2env first. – nouse Mar 10 '15 at 10:03
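
The two follow-up points in the comments could be handled roughly as sketched below, assuming the list elements are named after their files as in the question. The list contents and file names here are invented for illustration, and `tagged`, `stacked` and `count.matrix` are hypothetical names.

```r
# Hypothetical stand-in for the filtered list; names are deliberately
# non-sequential, like real file names
new.list <- list(
  "sampleA.txt" = data.frame(query = c("q1", "q2"),
                             sbjct = c("AB619702.1.1454", "EU287121.1.1497"),
                             ident = c(99.5, 100)),
  "sampleX.txt" = data.frame(query = c("q3", "q4", "q5"),
                             sbjct = c("EU287121.1.1497", "EU287121.1.1497", "Yq2He155"),
                             ident = c(99.8, 99.9, 100))
)

# 1. Tag every dataframe with its own list name instead of a running index
tagged <- Map(function(df, nm) { df$Ndata <- nm; df }, new.list, names(new.list))

# 2. do.call() passes all list elements to rbind at once,
#    however many dataframes the list holds
stacked <- do.call(rbind, tagged)

# Cross-tabulate: rows are sbjct values, columns are the original names
count.matrix <- table(stacked$sbjct, stacked$Ndata)
```

Since `do.call(rbind, ...)` works on the list directly, `list2env` is not needed for this step. If a plain integer matrix is preferred over a table object, `unclass(count.matrix)` is one way to get it.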
