R: how to find and select files in a folder by matching a specific column title

Question

Sorry for the generic question. I'm looking for pointers on sorting out a data folder that contains numerous .txt files. All of them have different titles, and the vast majority have the same dimensions, that is, the same number of columns. The pain is that some of the files, despite having the same number of columns, have different column names, because other variables were measured in those files.

I want to weed out these files, and I cannot do so simply by comparing column counts. Is there a way to pass a column name and check which files in the directory have that column, so that I can move them into a different folder?

UPDATE:

I have created a dummy folder with files that reflect the problem; please see the link below to access the files on my Google Drive. In this folder, I have included 4 files that have the problem columns.

https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3RvwlLE3FeMQc38taD?usp=sharing

The problem is that the code seems to find files matching the selection criterion, i.e. the actual names of the problem columns, but I cannot extract the real index of those files in the list. Any pointers?

library(data.table)

#read in the example file that have the problem column content
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT", header = T, sep = "\t")

#read in a file that I want to use as reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT", header = T, sep = "\t")

#get the names of columns of each file
standar.names <- names(df_standard)
var.names <- names(df_var)

same.titles <- var.names %in% standar.names

dff.titles <- !var.names %in% standar.names

#confirm the only 3 columns of problem is column 129,130 and 131 
mismatched.names <- colnames(df_var[129:131])

#visual check the names of the problematic columns
mismatched.names


# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                         sep = "\t",
                         header = T,
                         nrows = 2)
}

# get column names of all files
column_names <- lapply(l_files, names)

# get unique names of the problem columns
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to remove
# here "to_keep" returns an integer vector that I don't understand
# I thought the numbers should represent the ID/index of the elements
# but I have fewer than 10 files, and the numbers in to_keep are around 1000
# this is probably because it's matching the actual index in the unlisted list
# but if I use to_keep <- which(column_names %in% unique_names[1]) it returns an empty vector

to_keep <- which(unlist(column_names)%in% unique_names[1])


#now if I slice the file list using to_keep, files_to_keep returns NA NA NA
files_to_keep <- files_in_wd[to_keep]

#once I have a list of targeted files, I can move them into a new folder by using file.move
library(filesstrings)
file.move(files_to_keep, "C:/Users/mli/Desktop/weeding/need to reanalysis" )

Have a look at the fs package: https://www.tidyverse.org/blog/2018/01/fs-1.0.0/ – mharinga, Oct 25 '20 at 10:50
Hi @mharinga, I didn't seem to find tools in that package for this particular issue – ML33M, Oct 26 '20 at 21:24

tester · Answer 1 · 2020-10-26T22:55:38.983

1

If you can distinguish the files you'd like to keep from those you'd like to drop depending on the column names, you could use something along these lines:

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = ';',
                             header = T,
                             nrows = 2)
}

# get column names of all files
column_names <- lapply(l_files, names)
# get the unique sets of column names
unique_names <- unique(column_names)
# decide which files to keep
to_keep <- which(column_names %in% unique_names[1])

files_to_keep <- files_in_wd[to_keep]

If you have many files you should probably avoid the loop or just read in the header of the corresponding file.
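Reading just the header could look something like the sketch below. It assumes tab-separated files whose first line holds the column names; `read_header` and the demo files are made-up names for illustration, not part of the original code.

```r
# Hypothetical helper: return the column names of a delimited text file
# by reading only its first line instead of parsing the whole file.
read_header <- function(path, sep = "\t") {
  first_line <- readLines(path, n = 1, warn = FALSE)
  if (length(first_line) == 0) return(character(0))  # empty file
  strsplit(first_line, split = sep, fixed = TRUE)[[1]]
}

# Self-contained demo on two small temporary files (made-up content)
demo_dir <- file.path(tempdir(), "header_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("id\tdose\tresponse", "1\t0.5\t10"), file.path(demo_dir, "a.txt"))
writeLines(c("id\tdose\tviability", "1\t0.5\t90"), file.path(demo_dir, "b.txt"))

demo_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
headers <- lapply(demo_files, read_header)
names(headers) <- basename(demo_files)
headers
```

Unlike read.delim, this does not strip quotes or sanitise names, and strsplit drops a trailing empty field, so the result may differ slightly from names() on a parsed data frame.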

edit after your comment:

By adding nrows = 2, the code reads only the header plus the first 2 rows.
I assume that the first file in the folder has the structure that you'd like to keep; that's why column_names is checked against unique_names[1].
files_to_keep contains the names of the files you'd like to keep.
You could try to run that on a subset of your data, see if it works, and worry about efficiency later. A vectorized approach might work better, I think.
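A vectorized version of the check might look like the sketch below. It uses toy data standing in for `column_names` and `files_in_wd`; the column names and file names are invented for illustration.

```r
# Toy stand-in for `column_names`: one character vector of names per file
column_names <- list(
  c("id", "dose", "response"),
  c("id", "dose", "viability"),
  c("id", "dose", "response")
)
files_in_wd <- c("a.txt", "b.txt", "c.txt")  # hypothetical file names

# Compare every header with the reference header (file 1): one result per file
reference <- column_names[[1]]
same_header <- vapply(column_names, identical, logical(1), y = reference)

# Alternatively, flag the files that contain one specific column name
has_column <- vapply(column_names, function(nms) "viability" %in% nms, logical(1))

files_in_wd[same_header]  # "a.txt" "c.txt"
files_in_wd[has_column]   # "b.txt"
```

Both logical vectors have exactly one element per file, so they can index files_in_wd directly. By contrast, which(unlist(column_names) %in% ...) returns positions within the flattened vector of all column names, which is why those indices can exceed the number of files and yield NA when used on the file list.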

edit: This code works with your dummy-data.

library(filesstrings)

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = "\t",
                             header = T,
                             nrows = 2,
                             encoding = "UTF-8",
                             check.names = FALSE
                            )
}

# get column names of all files
column_names <- lapply(l_files, names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok

# check if the other files have the same header:
df_filehelper <- data.frame('fileindex' = seq_along(files_in_wd),
  'filename' = files_in_wd,
  'keep' = NA)

for(i in 2:length(files_in_wd)){
  df_filehelper$keep[i] <- identical(to_keep, column_names[[i]])
}

df_filehelper$keep[1] <- TRUE # keep the original file used for selecting the right columns

# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects files that are not to be kept

file.move(files_to_move, "C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")
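If installing filesstrings is not an option, the move step can probably be done with base R as well. The following is only a sketch run in a temporary folder; the folder and file names are invented, and `files_to_move` here stands in for the vector built above.

```r
# Self-contained demo in a temporary folder (all paths here are made up)
src_dir  <- file.path(tempdir(), "move_demo")
dest_dir <- file.path(src_dir, "needs_reanalysis")
dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)

writeLines("x", file.path(src_dir, "good.txt"))
writeLines("x", file.path(src_dir, "bad.txt"))

# Stand-in for df_filehelper$filename[!df_filehelper$keep]
files_to_move <- "bad.txt"

# file.rename() is vectorised and returns TRUE/FALSE for each file
moved <- file.rename(from = file.path(src_dir, files_to_move),
                     to   = file.path(dest_dir, files_to_move))
moved
list.files(dest_dir)
```

file.rename() can fail when source and destination are on different drives or file systems; in that case file.copy() followed by file.remove() is a common fallback.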

edited Oct 26 '20 at 22:55

answered Oct 23 '20 at 21:01


I think your idea will work. And like you pointed out, the folder is huge, 200000K files, a few TB in size. How do I just read the header? And if I understand correctly, the last line of code, to_keep, only returns a list in which each element is a file that matches the name selection. How do I output them into a folder? – ML33M Oct 23 '20 at 23:03
I really hope that I could test this on the weekend but we have a stupid server shutdown due to power testing in the hospital. I might have to test this on Monday and get back to you if this works or not. apologies. – ML33M Oct 23 '20 at 23:09
I'm testing the code you suggested. I have done some preprocessing work to identify the differing columns. All the files have the same 131 columns; the difference lies in the last 3 columns, 129, 130 and 131, which have different names. I have created a small dummy folder; however, in the last 2 lines of code, to_keep <- which() always returns an empty vector, no matter whether I'm matching all 3 column names or just 1. But I know for a fact that all files in the folder contain all 3 headers... – ML33M Oct 26 '20 at 17:21
Saw some posts and tested; I think we have to unlist() column_names. – ML33M Oct 26 '20 at 17:30
I have changed it to to_keep <- which(unlist(column_names)%in% unique_names[1]); this returns to_keep as int [1:8], so I have 8 files that are unique. However, the last line, which selects the elements based on this to_keep, returns NA. I have tried as.vector(to_keep), which still won't work. Any ideas? – ML33M Oct 26 '20 at 18:30
I have updated the question and made a dummy data folder so the error is easier to be replicated. – ML33M Oct 26 '20 at 18:49
Probably not the most elegant solution, but the code above should work. If you face performance issues, a vectorized approach is the way to go. – tester Oct 26 '20 at 22:58
The code indeed works on the dummy set. But the reality is a little more annoying. Some files in the pile have 132 columns instead of 131; the extra column is essentially empty, created by some parsing problem in R. That is why, in my code, I was trying: #confirm the only 3 columns of problem is column 129,130 and 131; mismatched.names <- colnames(df_var[129:131]); #visual check the names of the problematic columns; mismatched.names. – ML33M Oct 27 '20 at 00:17
So all the differences are in these 3 columns. I want to, say, check whether a file is to be kept or removed by matching the name of column 129 – ML33M Oct 27 '20 at 00:18
Your code breaks down at for(i in 2:length(files_in_wd)){ df_filehelper$keep[i] <- identical(to_keep, column_names[[i]]) }; the error is in column_names[[i]] : subscript out of bounds. I think this is because some files have 131 columns and others have 132 – ML33M Oct 27 '20 at 00:19
Yes, I'm able to get around the problem of different column numbers by adding [,1:131] after the read.delim. But in my test on a slightly larger dataset, all the files are removed, including the one I know is a target (I took the same file from the dummy set and added it to the larger set for testing). I think we have to match a particular header to be safe. – ML33M Oct 27 '20 at 00:37

I'm sorry for being annoying, and frankly I just realized what a dumb rookie I am. In your read.delim loop, since I know all the problems happen in columns 129-131, all I need is to add [,129:131]; that got me around all the column number problems, and now the larger set works fine – ML33M Oct 27 '20 at 00:49
Thank you for putting up with me haha. Well, I feel a small win for making a small contribution. I will be able to award it in 3 hrs, as the system says – ML33M Oct 27 '20 at 16:16

jared_mamrot · Answer 2 · 2020-10-27T02:18:17.097

0

Due to the large number and size of files it might be worth looking at alternatives to R, e.g. in bash:

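# hash of the header line of the known-good file
# (md5 is the macOS/BSD tool; on Linux use md5sum instead)
ref="$(head -n 1 ctrl__S162465_20190111_T8__3S_2DG_3mM_none.txt | md5)"
for f in ctrl*.txt
do
  # quote "$f" so file names containing spaces are handled correctly
  if [[ "$ref" != "$(head -n 1 "$f" | md5)" ]]
    then echo "$f"
  fi
done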

This command compares the column names of the 'good file' to the column names of every file and prints out the names of files that do not match.

edited Oct 27 '20 at 02:18

answered Oct 27 '20 at 00:51



Thank you for your time. I have to say sorry first that I'm such a noob that basic R is all I know... I was playing with tester's answer, and by changing which columns to read and focus on I got it to work. But I will surely read into your suggestion. – ML33M Oct 27 '20 at 04:41
