Reading file from directory based on the selected filename on the list

Question

I have a large number of files containing 2-dimensional data, named as listed below, from which I am calculating an XX parameter.

 '2019-10-12_17-43.csv',
 '2019-10-12_17-42.csv',
 '2019-10-12_17-41.csv',
 '2019-10-12_17-44.csv',
 '2019-10-12_17-40.csv',
 '2019-10-11_17-40.csv',
 ......................
 and so on...

I am able to create a list of filenames and calculate the XX parameter for each file. After subsequent calculations, I create a dataframe named YY which contains the parameter along with a column holding the filename from which it was calculated. For a certain value of the calculated XX parameter, I would like to plot all the 2-dimensional data that gives rise to it. I also create a list of filenames from that column of the dataframe. The code leading up to the XX parameter calculation is longer, but for reading the data from the selected filenames in the list I use the following code in the last block:

import fnmatch
import glob
import os

import pandas as pd

# arbitrary functions
def Aval (a, b):
   ..............

def Bval (a, b):
   ..............

file_path = r"C:\Users\Desktop\Data"
read_files = glob.glob(os.path.join(file_path,"*.csv"))

# generating the list of filenames

file_list = []
XYZ_array = []
ABC_array = []

for (root, dirs, files) in os.walk(file_path):
   for filenames in files:
       file_list.append(filenames)
       df= pd.read_csv(os.path.join(root, filenames), header=0)

       #Calculation from the files
       ABC = ..................
       XYZ = ..................
       ABC_array.append(ABC)
       XYZ_array.append(XYZ)


#creating a dataframe from the arrays        
newdf = pd.DataFrame ({'ABC': ABC_array, 'XYZ':XYZ_array, 'Filename':file_list })

The dataframe generated looks like this:

Timestamp          ABC        XYZ           Filename  

2019-10-11_07-52   1.934985   0.187962     2019-10-11_07-52.csv 
2019-10-11_07-53   1.926435   0.200828     2019-10-11_07-53.csv  
2019-10-11_07-54   1.922927   0.215204     2019-10-11_07-54.csv
2019-10-11_07-55   1.951818   0.216678     2019-10-11_07-55.csv
2019-10-11_07-56   1.922523   0.245144     2019-10-11_07-56.csv
...                ...        ...          ...                    
2019-10-13_18-21   2.028409   1.149067     2019-10-13_18-21.csv
2019-10-13_18-22   2.027896   1.015862     2019-10-13_18-22.csv
2019-10-13_18-23   2.013004   0.871320     2019-10-13_18-23.csv
2019-10-13_18-24   1.991576   0.755164     2019-10-13_18-24.csv
2019-10-13_18-25   1.908259   0.570786     2019-10-13_18-25.csv

The ABC values are binned into three bins, with bins = [1.76,1.86,1.96]:

Abc_sorted = newdf.sort_values('ABC')
Abc_sorted['Bin_names'] = pd.cut(Abc_sorted['ABC'], bins, labels=['1.76','1.86','1.96'])
T_df = Abc_sorted.sort_values(by=['Bin_names']).dropna()

results in a dataframe like:

Timestamp            ABC          XYZ       Filename              Bin_names
2019-10-12_17-43    1.769676    72.841836   2019-10-12_17-43.csv    1.76
2019-10-12_17-42    1.771429    74.583635   2019-10-12_17-42.csv    1.76
2019-10-12_17-41    1.774526    76.104981   2019-10-12_17-41.csv    1.76
2019-10-12_17-44    1.774678    68.314091   2019-10-12_17-44.csv    1.76
2019-10-12_17-40    1.779273    76.589191   2019-10-12_17-40.csv    1.76
... ... ... ... ... ... ... ... ... ...
2019-10-12_09-48    1.988249    85.279987   2019-10-12_09-48.csv    1.96
2019-10-13_09-04    1.988266    28.716690   2019-10-13_09-04.csv    1.96
2019-10-12_11-27    1.988597    76.978562   2019-10-12_11-27.csv    1.96
2019-10-11_16-19    1.985438    76.343396   2019-10-11_16-19.csv    1.96
2019-10-11_08-11    1.999933    0.251199    2019-10-11_08-11.csv    1.96

A new dataframe holding only the rows with Bin_names 1.76 is created, keeping the Filename and Bin_names columns, and a list of those filenames is built from it:

ndf = T_df.loc[T_df.Bin_names =='1.76'][['Filename', 'Bin_names']]
filename_list=ndf['Filename'].tolist()

which results in a dataframe like:

Filename             Bin_names
2019-10-12_17-43.csv    1.76
2019-10-12_17-42.csv    1.76
2019-10-12_17-41.csv    1.76
2019-10-12_17-44.csv    1.76
2019-10-12_17-40.csv    1.76

Now the main task is to import the files in filename_list from the main directory:

for i in range(len(filename_list)):
        print (filename_list[i])
for file in read_files:
    if fnmatch.fnmatch(file, filename_list[i]):
        print(file)

where read_files is the list of CSV paths found in the directory, file is one of those paths, and filename_list is the list of selected filenames. I have binned the data into 3 different values and I want to import only the files that give the ABC parameter value 1.76. But this doesn't seem to work: nothing is printed by the second loop. Could anyone help?

What is `i`? Nothing is returned because you are only printing. Probably going to need more information. Please read [mre], write a *minimal* toy example that replicates the problem. — wwii, May 01 '20 at 21:01
@wwii The question is now updated with the clearer text and representative code. Thanks in advance — Basant, May 01 '20 at 22:11
Presumably `ndf = T_df.loc...` is the line that is giving you trouble, however nobody can tell because we have no idea what is in `newdf ` and we have no idea of what `T_df` looks like or how you are *binning* the data. The purpose of the [mre] is to provide us with **everything** we need to recreate the problem including representative data - with emphasis on **minimal** (for the code and data) - sometimes making an mre for your question will highlight problems for you even before you get an answer. — wwii, May 02 '20 at 14:34
Would it be safe to say that a `newdf` can be made/mimicked with: `import numpy as np; import pandas as pd; import random,string; abc = np.random.default_rng().normal(1.5, .5, 1000); xyz = np.random.default_rng().normal(1.5, .5, 1000); fnames = [''.join(random.choices(string.ascii_letters,k=7)) for _ in range(1000)]; newdf = pd.DataFrame ({'ABC': abc, 'XYZ':xyz, 'Filename':fnames})` ?? if so, include that in your question (since we don't have any of those csv files all of that stuff is irrelevant). — wwii, May 02 '20 at 14:39
Then we definitely need to know how you made the Series that you assigned to `newdf['Bin names']` - that seems integral to your question but you just skipped over that. — wwii, May 02 '20 at 14:42
@wwii As I am a novice having started using python fairly recently, I wasn't able to properly communicate the issue. I have thus edited the question and hopefully, this makes it clearer this time. I want to import the files whose Bin_names are 1.76 from the total files in the main directory. Could you understand this issue this time :) :) — Basant, May 02 '20 at 21:52

wwii · Answer 1 · 2020-05-03T15:08:20.167

If ndf looks like this:

>>> ndf
               Filename  Bin_names
0  2019-10-12_17-43.csv       1.76
1  2019-10-12_17-42.csv       1.76
2  2019-10-12_17-41.csv       1.76
3  2019-10-12_17-44.csv       1.76
4  2019-10-12_17-40.csv       1.76

and filename_list looks like this:

>>> filename_list = ndf['Filename'].to_list()
>>> filename_list
['2019-10-12_17-43.csv', '2019-10-12_17-42.csv', '2019-10-12_17-41.csv', '2019-10-12_17-44.csv', '2019-10-12_17-40.csv']

and the files are located in

file_path = r"C:\Users\Desktop\Data"

Then the complete paths to all your files should be

>>> [os.path.join(file_path, name) for name in filename_list]
['C:\\Users\\Desktop\\Data\\2019-10-12_17-43.csv', 'C:\\Users\\Desktop\\Data\\2019-10-12_17-42.csv', 'C:\\Users\\Desktop\\Data\\2019-10-12_17-41.csv', 'C:\\Users\\Desktop\\Data\\2019-10-12_17-44.csv', 'C:\\Users\\Desktop\\Data\\2019-10-12_17-40.csv']
>>>
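
As a minimal sketch of why the loop in the question prints nothing, and of reading the selected files directly: fnmatch.fnmatch compares the pattern against the whole string, so a full path from glob never matches a bare filename. The temporary directory, the tiny CSV contents, and the names matches_full_path, matches_basename and frames are assumptions made up for this illustration; they are not from the question.

```python
import fnmatch
import glob
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real data directory and its CSV files.
file_path = tempfile.mkdtemp()
all_names = ["2019-10-12_17-43.csv", "2019-10-12_17-42.csv", "2019-10-11_07-52.csv"]
for name in all_names:
    pd.DataFrame({"x": [1, 2], "y": [3, 4]}).to_csv(
        os.path.join(file_path, name), index=False
    )

# The selected filenames, as they would come from ndf['Filename'].tolist().
filename_list = ["2019-10-12_17-43.csv", "2019-10-12_17-42.csv"]
read_files = glob.glob(os.path.join(file_path, "*.csv"))

# A full path does not match a bare filename used as the pattern,
# because fnmatch requires the pattern to match the entire string.
matches_full_path = [f for f in read_files if fnmatch.fnmatch(f, filename_list[0])]

# Comparing only the last path component does find the selected files.
matches_basename = [f for f in read_files if os.path.basename(f) in filename_list]

# Simplest option: build each path from the directory and the name, then read it.
frames = {}
for name in filename_list:
    frames[name] = pd.read_csv(os.path.join(file_path, name), header=0)
```

Each entry of frames is then an ordinary DataFrame that can be plotted or used in further calculations. Note also that in the question the second loop runs after the first has finished, so i only ever refers to the last entry of filename_list.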

You could also add the file path to the Filename column

>>> ndf.Filename.apply(lambda x: os.path.join(file_path,x))
0    C:\Users\Desktop\Data\2019-10-12_17-43.csv
1    C:\Users\Desktop\Data\2019-10-12_17-42.csv
2    C:\Users\Desktop\Data\2019-10-12_17-41.csv
3    C:\Users\Desktop\Data\2019-10-12_17-44.csv
4    C:\Users\Desktop\Data\2019-10-12_17-40.csv
Name: Filename, dtype: object
>>>

Or using pathlib

>>> import pathlib
>>> p = pathlib.PurePath(file_path)
>>> ndf.Filename.apply(p.joinpath)
0    C:\Users\Desktop\Data\2019-10-12_17-43.csv
1    C:\Users\Desktop\Data\2019-10-12_17-42.csv
2    C:\Users\Desktop\Data\2019-10-12_17-41.csv
3    C:\Users\Desktop\Data\2019-10-12_17-44.csv
4    C:\Users\Desktop\Data\2019-10-12_17-40.csv
Name: Filename, dtype: object
>>>

You used os.walk to find all the files then you appended the filename to a list but had to use os.path.join(root, filenames) to open the file with pandas. Maybe the files are in different directories and you should save the whole path when you make file_list - then you will be able to access the files using their absolute paths without searching for them.
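
A rough sketch of that suggestion, assuming a made-up directory tree in a temporary folder: the full path is stored next to the bare filename while walking, so no later search is needed. The names root_dir, sub_dir, path_list and path_by_name are hypothetical, and the dictionary lookup assumes filenames are unique across subdirectories.

```python
import os
import tempfile

# Hypothetical directory tree standing in for the real data folder.
root_dir = tempfile.mkdtemp()
sub_dir = os.path.join(root_dir, "day2")
os.makedirs(sub_dir)
for folder, name in [
    (root_dir, "2019-10-12_17-43.csv"),
    (sub_dir, "2019-10-13_18-21.csv"),
]:
    with open(os.path.join(folder, name), "w") as fh:
        fh.write("x,y\n1,2\n")

file_list = []  # bare filenames, as in the question
path_list = []  # full paths, saved at the same time

for root, dirs, files in os.walk(root_dir):
    for filename in files:
        if filename.endswith(".csv"):
            file_list.append(filename)
            path_list.append(os.path.join(root, filename))

# Look up the full path of any selected filename later on.
# Assumes each filename occurs only once in the tree.
path_by_name = dict(zip(file_list, path_list))
```

The path list could equally be stored as an extra column of newdf, so that it survives the sorting, binning and filtering steps along with the Filename column.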

Thanks for the help. The last segment of importing the selected files still displays nothing. Am I missing something here? — Basant, May 03 '20 at 12:02
`still displays nothing` - I don't know what that means. See edit. — wwii, May 03 '20 at 13:38
Suppose I want to read the contents of the files listed in ndf dataframe from the whole list of files and do calculations. The last segment was supposed to be that part, but that doesn't work. The second `for loop` is not properly functioning as nothing is displayed when this segment runs except for the filenames in the filename_list that was created from the `ndf` data frame. I have added the path for files in the `filename_list` by your previous suggestion. So what you mentioned on your edit should be covered by this. — Basant, May 03 '20 at 14:39
