9

I have a set of .stat files in the /tmp directory.

sample:

a.stat=>

abc,10

abc,20

abc,30

b.stat=>

xyz,10

xyz,30

xyz,70

and so on

I need a summary of all the .stat files. Currently I am using:

filelist<-list.files(path="/tmp/",pattern=".stat")

data<-sapply(paste("/tmp/",filelist,sep=''), read.csv, header=FALSE)

However, I need to apply summary to every file that is read. Put simply, for n .stat files I need the summary of the 2nd column of each.

Using

data<-sapply(paste("/tmp/",filelist,sep=''), summary, read.csv, header=FALSE)

does not work: it gives me a summary with class character, which is not what I intend.

sapply(filelist, function(filename){df <- read.csv(filename, header=F);print(summary(df[,2]))})

works fine. However, my overall objective is to find values that are more than 2 standard deviations away from the mean on either side (outliers). So I use sd, but I also need to check whether all values in the file currently being read fall within the 2 SD range.
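One possible way to do the 2 SD check per file is sketched below. It assumes the second column is numeric and that "outlier" means more than 2 sample standard deviations from the file's own mean. The helper `find_outliers`, the demo directory, and the extra values in the demo `b.stat` are made up for illustration and are not from the original files.

```r
# Hypothetical helper: return the values lying more than k standard
# deviations away from the mean of x (k = 2 matches the question).
find_outliers <- function(x, k = 2) {
  x[abs(x - mean(x)) > k * sd(x)]
}

# Demo files in a temporary directory so the sketch is self-contained.
stat_dir <- file.path(tempdir(), "stat_outlier_demo")
dir.create(stat_dir, showWarnings = FALSE)
writeLines(c("abc,10", "abc,20", "abc,30"), file.path(stat_dir, "a.stat"))
writeLines(c("xyz,10", "xyz,11", "xyz,9", "xyz,10", "xyz,12", "xyz,10", "xyz,500"),
           file.path(stat_dir, "b.stat"))

# full.names = TRUE returns complete paths, so no paste() is needed;
# the regex is anchored so only names ending in ".stat" match.
files <- list.files(stat_dir, pattern = "\\.stat$", full.names = TRUE)

# Read each file, then check its second column.
outliers <- lapply(files, function(f) {
  df <- read.csv(f, header = FALSE)
  find_outliers(df[[2]])
})
names(outliers) <- basename(files)

# TRUE for a file when every value is within the 2 SD range.
all_within <- lengths(outliers) == 0
```

Here `outliers` holds the offending values per file, and `all_within` answers the "do all values fall within 2 SD" part of the question in one logical vector.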

pythonRcpp
  • `sapply(filelist, function(filename){df <- read.csv(filename, header=F);print(summary(df[,2]))})` ? – Cath May 05 '15 at 12:44
  • If you need the summary of 2nd column, `summary(sapply(lst, "[[", 2))` – akrun May 05 '15 at 12:44

3 Answers

12

To apply multiple functions at once:

f <- function(x){
  list(sum(x),mean(x))
}
sapply(x, f)

In your case you want to apply them sequentially, so first read csv data then do summary:

sapply(lapply(paste("/tmp/",filelist,sep=''), read.csv, header=FALSE), summary)

To run summary on one particular column of each dataset, change the function passed to the outer sapply from summary to function(x) summary(x[[2]]).
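Put together, the read-then-summarise approach could look like the sketch below. It writes the sample data from the question into a temporary directory (a made-up location, used only so the example runs anywhere) and passes header = FALSE because the files have no header row.

```r
# Recreate the sample files from the question in a temporary directory.
stat_dir <- file.path(tempdir(), "stat_summary_demo")
dir.create(stat_dir, showWarnings = FALSE)
writeLines(c("abc,10", "abc,20", "abc,30"), file.path(stat_dir, "a.stat"))
writeLines(c("xyz,10", "xyz,30", "xyz,70"), file.path(stat_dir, "b.stat"))

files <- list.files(stat_dir, pattern = "\\.stat$", full.names = TRUE)

# Step 1: read every file into a list of data frames.
tables <- lapply(files, read.csv, header = FALSE)
names(tables) <- basename(files)

# Step 2: summarise only the second column of each data frame.
# sapply() simplifies the six-number summaries into a matrix
# with one column per file.
col2_summary <- sapply(tables, function(x) summary(x[[2]]))
```

Individual statistics can then be picked out by name, for example `col2_summary["Mean", "a.stat"]`.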

jangorecki
  • This mostly worked for me, but why am I getting an extra row? See 11 and 81 in the example below: [[1]] abc 11 [[2]] Min. 1st Qu. Median Mean 3rd Qu. Max. 2267000 2267000 3253000 2805000 3253000 3253000 [[3]] xyz 81 [[4]] Min. 1st Qu. Median Mean 3rd Qu. Max. 348000 645900 665200 649800 665200 963200 – pythonRcpp May 05 '15 at 12:53
  • @user1977867 Because this does not apply the functions sequentially, it applies the first function to produce the first row and the second function to produce the second row. – JaredS Oct 18 '20 at 10:47
4

For short functions that you don't want to save in the environment, you can define the function inline within the sapply call. For @flxflks's example:

sapply(df, function(x) c(min = min(x), avg = mean(x)))
veghokstvd
1

Adding to @jangorecki's answer: I changed the function to return a vector instead of a list. Only then did it work for me. I am unsure why my version worked and the other did not.

f <- function(x){
  c(min = min(x), avg = mean(x))
}
sapply(df, f)

I found the solution at https://www.r-bloggers.com/applying-multiple-functions-to-data-frame/
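A likely explanation for the difference, sketched with a made-up data frame: when the function returns a list, sapply still builds a matrix, but each cell is a list element, so the result cannot be used directly as numbers. When the function returns a named numeric vector, the result is an ordinary numeric matrix with those names as row names. The names `f_list`, `f_vec`, and the demo `df` are illustrative only.

```r
# Small demo data frame (values invented for illustration).
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

# List-returning version, as in the first answer.
f_list <- function(x) list(sum(x), mean(x))

# Vector-returning version, as in this answer.
f_vec <- function(x) c(min = min(x), avg = mean(x))

res_list <- sapply(df, f_list)  # 2 x 2 matrix whose cells are list elements
res_vec  <- sapply(df, f_vec)   # 2 x 2 numeric matrix with row names

typeof(res_list)     # "list": arithmetic on the cells needs unlist() first
typeof(res_vec)      # "double"
res_vec["avg", "b"]  # mean of column b
```

If the list form is needed for some reason, `lapply(df, f_list)` keeps the results as a plain nested list and avoids the list-matrix altogether.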

flxflks