I have several txt files that I have successfully converted into csv files and I now want to clean them all in the same manner, but my script is having issues reading the file names.

First I converted all txt files in my folder of interest into csv files:

import fnmatch
import os

import pandas as pd

files_dir = r'/Desktop/raw_data'
files = os.listdir(files_dir)

for file in files:
    if fnmatch.fnmatch(file, 'deseq2*'):
        extension = os.path.splitext(file)[1]
        if extension == '.txt':
            # Rebuild the full path; os.listdir() returns bare names
            filename = os.path.join(files_dir, file)
            df = pd.read_csv(filename, sep='|')
            new_filename = os.path.splitext(filename)[0] + '.csv'
            df.to_csv(new_filename, index=False)

I want to apply the following clean-up to each of the csv files that were created and then save the result. It takes a list of strings (genes) and keeps only the rows whose gene_name value is in that list.

cleaned = df[df['gene_name'].isin(genes)]

This is what I have attempted in order to do this to all of the files in my folder:

path = r'/Desktop/raw_data'
all_files = glob.glob(os.path.join(path, "*.csv")) #make list of paths

for file in all_files:
    # Getting the file name without extension
    file_name = os.path.splitext(os.path.basename(file))[0]
    df = pd.read_csv(file_name)
    cleaned = df[df['gene_name'].isin(genes)]
    df.to_csv(file_name)

I think I have identified that the issue is occurring at the following line of code:

 df = pd.read_csv(file_name)

I get the following error: [Errno 2] No such file or directory: 'example_file'

I thought that maybe I needed to have .csv in the file name so I tried the following but I also got an error.

df = pd.read_csv(file_name +'.csv')

[Errno 2] No such file or directory: 'example_file.csv'

I am confused as to what is going on because the file definitely exists in the folder that I am referencing. Any help is appreciated.

Code for applying data cleaning to all csv files taken from here.

Zach Young
Adriana
  • `file` was already the path to the file. So do `df = pd.read_csv(file)`. Is the intent to replace the existing file? Then don't generate `file_name` at all. It's just the base name of the file without path or extension. Notice you only got "example_file". Not useful unless you plan to reassemble a path in a different directory with a different extension. – tdelaney Apr 20 '23 at 17:46
  • @tdelaney I am a little confused as to what you are suggesting I try. Yes, the intent is to replace the original file. – Adriana Apr 20 '23 at 19:11
  • `os.listdir()` returns _only_ the filenames. You must use `os.path.join()` to reassemble each filename with the original source directory name. The first code sample does this correctly. – John Gordon Apr 21 '23 at 02:12
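
If the cleaned files were meant to go somewhere else instead of overwriting the originals, the stem does become useful. A minimal sketch, assuming a hypothetical output folder `cleaned_data` and a hypothetical helper `output_path_for` (neither appears in the question):

```python
from pathlib import Path


def output_path_for(source, out_dir, suffix=".csv"):
    """Build out_dir/<stem><suffix> from a source path (hypothetical helper)."""
    return Path(out_dir) / (Path(source).stem + suffix)


# Illustrative paths only; the cleaned_data folder is an assumption.
src = Path("/Desktop/raw_data") / "example_file.txt"
dst = output_path_for(src, "/Desktop/cleaned_data")

print(dst)  # same stem, new directory, new extension
```

For overwriting in place, none of this is needed: the path from the loop can be passed straight to the reader and the writer.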

1 Answer

You get the filename without path or extension (the stem) and then try to use that partial name to open the file. But you need the full file name to actually find it on disk, not just the stem. You could print(file, file_name) to see the difference.
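
To make the difference concrete, here is a small sketch. The path is made up for illustration, shaped after the folder and the `example_file` name shown in the question:

```python
import os

# Hypothetical path, shaped like the entries glob.glob() returns in the question.
file = os.path.join("/Desktop/raw_data", "example_file.csv")

# What the loop in the question computes: directory and extension are both dropped.
file_name = os.path.splitext(os.path.basename(file))[0]

print(file)       # full path, which pd.read_csv can open
print(file_name)  # bare stem, looked up relative to the current working directory
```

A relative name such as `example_file` is resolved against the current working directory, not against `path`, which is why appending `.csv` alone did not help.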

Since you want to replace the existing file, you can remove that processing completely. Also, make sure you write the scrubbed table, not the original.

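import glob
import os

import pandas as pd

path = r'/Desktop/raw_data'
all_files = glob.glob(os.path.join(path, "*.csv"))  # list of full paths

for file in all_files:
    df = pd.read_csv(file)  # file is already the full path
    cleaned = df[df['gene_name'].isin(genes)]  # genes: your list of gene names
    cleaned.to_csv(file, index=False)  # write the filtered table, without the index column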
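
To try the loop without touching real data, a self-contained sketch along these lines may help. The temporary folder, the file name `deseq2_example.csv`, the `genes` list, and the `log2fc` column are all made up for the illustration; `index=False` is used here so the row index is not written back as an extra column.

```python
import glob
import os
import tempfile

import pandas as pd

genes = ["BRCA1", "TP53"]  # hypothetical gene list

with tempfile.TemporaryDirectory() as path:
    # Create one small CSV to stand in for the converted files.
    sample = pd.DataFrame({
        "gene_name": ["BRCA1", "ACTB", "TP53"],
        "log2fc": [1.2, 0.1, -0.8],
    })
    sample.to_csv(os.path.join(path, "deseq2_example.csv"), index=False)

    # Same loop as above: read by full path, filter, overwrite in place.
    for file in glob.glob(os.path.join(path, "*.csv")):
        df = pd.read_csv(file)
        cleaned = df[df["gene_name"].isin(genes)]
        cleaned.to_csv(file, index=False)

    # Read the file back to confirm only the requested genes remain.
    result = pd.read_csv(os.path.join(path, "deseq2_example.csv"))

print(result)
```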
tdelaney