-3

I have a .txt with hundreds of thousands of paths and I simply have to check if each line is a folder or a file. The hard drive is not with me so I can't use the module os with the os.path.isdir() function. I've tried the code below but it is just not perfect since some folders contains . at the end.

for row in files:
   if (row[-6:].find(".") < 0):
       folders_count += 1

It is just not worth testing if the ending of the string contains any known file format (.zip, .pdf, .doc ...) since there are dozens of different files format inside this HD. When my code reads the .txt, it stores each line as a string inside an array, so my code should work with the string format.

An example of a folder path:

'path1/path2/truckMV.34'

An example of a file path:

'path1/path2/certificates.pdf'
martineau
  • 119,623
  • 25
  • 170
  • 301
  • 3
    So who do **you** distinguish between the two? Is it about extension being numeric or not? – PM 77-1 Aug 21 '20 at 00:24
  • 1
    You can't actually tell whether a path is a file or folder path from just the path itself. – user2357112 Aug 21 '20 at 00:28
  • 1
    @user2357112supportsMonica: I believe you're mistaken, here's `os.path.isfile()` and `os.path.isdir()` that only have a path argument. – martineau Aug 21 '20 at 01:03
  • 1
    @martineau: That requires the path *and the original file system*, which we don't have here. – user2357112 Aug 21 '20 at 01:05
  • 1
    "The hard drive is not with me so I can't use the module os with the os.path.isdir() function." – user2357112 Aug 21 '20 at 01:06
  • 1
    @user2357112 supports Monica: Indeed…obviously I missed that. Sorry, and in that case the OP has no choice but to check for known file extensions. – martineau Aug 21 '20 at 01:08

1 Answers1

1

It's impossible for us to judge if it's a file or path just by the string since an extension is just an arbitrary agreeable string that programs choose to decode in a certain way.

Having said that, if I had the same problem I would do my best to estimate with the following pseudo code:

  1. Create a hash map (or a dictionary as you are in Python)
  2. For every line of the file, read the last bit and see if there's a "." in the last path
  3. Create a key for it on the hash map with a counter of how many times you have encountered the "possible extensions".
  4. After you go through all of the list you will have a collection of possible extensions and how many you have encountered them. Assume the ones with only 1 occurrence (or any other low arbitrary number) to be a path and not an extension.

The basis of this heuristic is that it's unlikely for a person to have a lot of unique extensions on their desktop - but that's just an assumption I came up with.