Get the length of multiple txt files based on sorted filenames in Python

Question

I want to get the length of each txt file in a folder. All the files are in txt format and in the same directory. Each file name begins with a date in day-month-year format (for example 12 Aug 2020), followed by a news title that can contain upper- and lower-case letters, spaces, and characters such as '-' and ','.

folder_path = '/home/runner/Final-Project/folder1/12 Aug 2020 File Name With Different Format.txt'

I have first sorted the txt files chronologically by the date at the start of each file name, like below:

12 APR 2019 Nmae's something Something.txt

13 APR 2019 World's - as Countr something.txt

14 APR 2019 Name and location.txt

15 APR 2019 Name then location,for something.txt

and the code is below:

import re
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from datetime import datetime
import os
import glob

folder_path = '/home/runner/Final-Project/folder1'

results=[os.path.basename(filename) for filename in glob.glob(os.path.join(folder_path, '*.txt'))]

out_1=sorted(results, key=lambda file: datetime.strptime(' '.join(file.split()[:3]), '%d %b %Y'))

print(*out_1,sep='\n')

How do I get the length of each txt file, namely the word count of each text file, in this date-sorted order?

Jawand S. · Answer 1 · 2022-01-02T21:31:43.630

The way you're processing the files means that you're trying to open "3 MAR 2020 filename.txt", which isn't a file. You want to open just the actual filename, so you could use filename.split(" ")[-1] to take the last element, which should be the file name in this case.

Edit 2: This code should work

my_list1=[]
for filename in out_1:
    with open(filename.split(" ")[-1], 'r') as f:
        text = f.read()
        my_list1.append(len(text))
        print(len(text))

Another issue you would have faced is that you weren't actually appending anything to my_list1.
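Note that len(text) counts characters, not words. Also, if the names in out_1 are the complete file names on disk (the question builds them with os.path.basename, which drops the directory), open() needs the folder joined back on. A minimal sketch under those assumptions, and assuming UTF-8 files with whitespace-separated words; the helper name count_words is hypothetical and not part of the question's code:

```python
import os

def count_words(folder_path, filenames):
    """Return a list of (filename, word_count) pairs, in the order given."""
    counts = []
    for filename in filenames:
        # Rebuild the full path; the sorted list only holds base names.
        full_path = os.path.join(folder_path, filename)
        # Assumption: the files are UTF-8 encoded text.
        with open(full_path, 'r', encoding='utf-8') as f:
            text = f.read()
        # split() with no argument splits on any run of whitespace.
        counts.append((filename, len(text.split())))
    return counts
```

It could then be called as count_words(folder_path, out_1), which keeps the date-sorted order of out_1 because the list is walked in the order it is given.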

Edit: the second piece of code you posted isn't formatted correctly, so make sure to fix that so it's easy to reproduce/test the code you've posted.

Edit 3: If the filename has spaces, it will also be split into words. To address this, I would either add a separator like "||" that's unlikely to appear in a filename when you're joining the words (I think you do that in this line, so replace the space with ||):

out_1=sorted(onlyfiles, key=lambda file: datetime.strptime(' ||'.join(file.split()[:3]), '%d %b %Y'))

And then you can split on "||" as indicated by the code above. Alternatively, you can make a dictionary where the key is the formatted date/time and the value is the filename.txt. Then you can do the following:

with open(example_dict[filename], 'r') as f:
    text = f.read()
    my_list1.append(len(text))
    print(len(text))
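The dictionary idea could be sketched roughly as follows. This is only an illustration under assumptions: it keys the dictionary by the base file name rather than by the formatted date (so two files with the same date do not overwrite each other), it assumes UTF-8 text and an English month locale, and the names parse_date and word_counts_by_date are hypothetical.

```python
import glob
import os
from datetime import datetime

def parse_date(filename):
    """Parse the leading 'DD Mon YYYY' of a base file name into a datetime."""
    return datetime.strptime(' '.join(filename.split()[:3]), '%d %b %Y')

def word_counts_by_date(folder_path):
    """Map each file name to its word count, inserted in chronological order."""
    # Key: base file name; value: full path that open() can actually find.
    paths = {os.path.basename(p): p
             for p in glob.glob(os.path.join(folder_path, '*.txt'))}
    counts = {}
    for name in sorted(paths, key=parse_date):
        # Assumption: the files are UTF-8 encoded text.
        with open(paths[name], 'r', encoding='utf-8') as f:
            counts[name] = len(f.read().split())
    return counts
```

Because dictionaries keep insertion order in Python 3.7 and later, looping over word_counts_by_date(folder_path).items() yields the files from oldest to newest together with their word counts.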

In the future, I would recommend adding other relevant parts of your code.

edited Jan 02 '22 at 21:31

answered Jan 02 '22 at 01:45

Jawand S.


Thank you for the solution, this is helpful but the error still exists. The file name, for example 3 MAR 2020 filename.txt, contains the title of the article, so it has spaces and upper- and lower-case letters in it. I assume that by doing so the filename is split into words? – Maibaozi Jan 02 '22 at 10:36
Hopefully, this is addressed with edit 3. Normally, filenames should also avoid special characters or spaces – Jawand S. Jan 02 '22 at 21:32
Thank you! I didn't know this, I will try to fix it – Maibaozi Jan 03 '22 at 10:40
Hi, thank you. I have tried edit 3, but there is still an error. I think it is related to the filename format, and I did not state the question clearly. I have edited the original question. Please have a look if you can. Many thanks – Maibaozi Jan 03 '22 at 17:08
@Maibaozi Please share the error message, otherwise it's hard to debug. Additionally, if the above method doesn't work, what you can do is create a dictionary with the formatted filenames pointing to the filenames, and you can use the original filenames to open the files and count the words. The word count can then be added to another dictionary with the key being the formatted date/filename. You can then access and print the word counts with the formatted filename – Jawand S. Jan 03 '22 at 17:59
@Jawand S. Thank you very much. By following edit 2, I get this error: `File "main.py", line 52, in with open(filename.split(" ")[-1], 'r') as f: FileNotFoundError: [Errno 2] No such file or directory: 'Somthing.txt'`. The name in the error is the last word of the first file name. I also tried edit 3 and seem to get the same error, but at other places in the file name. I guess this is due to the spaces and the '-' or ',' in the file names. The file names are news titles, so the format varies. That is why I re-edited the question to describe the file name format more precisely. – Maibaozi Jan 03 '22 at 19:28
I do not follow the part where you say 'create a dictionary with the formatted filenames pointing to the filenames, and you can use the original filenames to open the files and count the words.' I get the approach, but with 'formatted filenames pointing to the filenames', do you mean something like my_dic = {}? Which part points to the filename? – Maibaozi Jan 03 '22 at 20:02
For example, my_dic['14 APR 2019 Name and location.txt'] = 'Name and location.txt'. Then you can do: with open(my_dic['14 APR 2019 Name and location.txt'], 'r') as f:. This way it'll find the correct file – Jawand S. Jan 03 '22 at 23:43
