How to capture only specific file types within a directory using glob

Question

I am trying to use Python's `glob` to grab various files with a wildcard, but I want only the file names, not the path the files came from.

In this situation, I am trying to capture all files within a directory whose names begin with `file_`. In the future I may also need to grab files based on their extension (e.g. all `.csv` and `.log` files in a directory).

The script below is what I am using. It returns the full path of each matching file. I only want to glob the file name itself, not the path.

import os
import glob
import boto3
from botocore.client import Config

ACCESS_KEY_ID = 'some_key'
ACCESS_SECRET_KEY = 'some_key'
BUCKET_NAME = 'some_bucket'


s3 = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=ACCESS_SECRET_KEY,
    config=Config(signature_version='s3v4')
)

csv_files = glob.glob('/home/user/folder1/folder2/*.csv')
#json_files = glob.glob("/home/user/folder1/h_log_*.json")

for filename in csv_files:
    print("Putting %s" % filename)
    s3.upload_file(filename, BUCKET_NAME, 'new_folder' + '/' + filename)

#for filename in json_files:
#    print("Putting %s" % filename)
#    s3.upload_file(filename, BUCKET_NAME, filename)

print("All_Finished")

The line from the script that I would prefer to update is:

csv_files = glob.glob('/home/user/folder1/folder2/*.csv')


An example of a directory containing various files and file types:

Here I need to grab all files that end in `.csv`:
/home/user/Desktop/folder_example/
file_1.csv
file_1.csv
file_1.csv
file_1.csv

Here I need to grab all files that start with `file_`:
/home/user/Desktop/folder_example/
file_2.log
file_2.csv
file_2.log
file_2.csv

score 0 · Answer 1 · answered Oct 04 '19 at 04:46

How about using os.path.basename?

You can combine glob with this function to get what you want:

[os.path.basename(item) for item in glob.glob("/home/user/folder1/folder2/*.csv")]
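
For the upload loop in the question, both values are needed: the full path to read the file locally, and the base name to build the S3 key. Below is a minimal sketch of that idea. `build_upload_plan` is a hypothetical helper name, the `new_folder` prefix is taken from the question, and the actual upload call is left as a comment so the snippet runs without AWS credentials:

```python
import glob
import os


def build_upload_plan(directory, pattern, key_prefix):
    """Return (local_path, s3_key) pairs for files in `directory` matching `pattern`.

    `build_upload_plan` is a hypothetical helper, not part of glob or boto3.
    """
    plan = []
    for local_path in sorted(glob.glob(os.path.join(directory, pattern))):
        # The base name drops the directory part, so the key stays short.
        s3_key = key_prefix + '/' + os.path.basename(local_path)
        plan.append((local_path, s3_key))
    return plan


if __name__ == '__main__':
    for local_path, s3_key in build_upload_plan('/home/user/folder1/folder2', '*.csv', 'new_folder'):
        print("Putting %s as %s" % (local_path, s3_key))
        # With the client from the question, the upload would be:
        # s3.upload_file(local_path, BUCKET_NAME, s3_key)
```

The first element of each pair is what gets opened on disk, and the second is the object name in the bucket.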

ExplodingGayFish


Hey, let me update my code at the top as an edit to show the whole script, so you can see everything I am trying to accomplish from A to Z. I tried the code you suggested and was still having issues. I also tried this line and still ran into issues: `csv_files = glob.glob(os.path.basename('/home/user/Desktop/*.csv'))` @ExplodingGayFish – bobparker Oct 04 '19 at 19:11
I just updated the main code above. Any thoughts, given the new edits, on how I can structure the code? @ExplodingGayFish – bobparker Oct 04 '19 at 19:19
Also, using `csv_files = glob.glob(os.path.basename('/home/user/Desktop/*.csv'))` is wrong, since it is equivalent to `csv_files = glob.glob('*.csv')`, which searches the current working directory. – ExplodingGayFish Oct 05 '19 at 04:01
I think I see your problem. In the line `s3.upload_file(filename, BUCKET_NAME, 'new_folder' + '/' + filename)` you need to separate the two uses of the `filename` variable. The first one is the file's local path on your computer and the second one should be the base name only. Try this and see if it works: https://pastebin.com/9uT8SQwU – ExplodingGayFish Oct 05 '19 at 04:04
Hey, I just tried your suggestion from pastebin, and it still doesn't work :( . It does not throw an error, though. It completes and prints `All_Finished` at the end, but no files are sent to my S3 bucket. Any thoughts? @ExplodingGayFish – bobparker Oct 05 '19 at 05:24
AFAIK, it will raise `boto3.exceptions.S3UploadFailedError` if there is any error when uploading a file. You can also make a `head_object` (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.head_object) request to verify that the object looks like it should. This raises a `botocore.exceptions.ClientError` with the code `404` if the object does not exist. – ExplodingGayFish Oct 05 '19 at 06:09

score 0 · Answer 2 · answered Oct 04 '19 at 05:11

You could split each path returned by glob on the path separator (`/` on Linux and macOS, `\` on Windows, available as `os.sep`) and then keep the last part.

import glob
import os

target_path = r"/home/user/folder1/folder2"
fpaths = glob.glob(target_path + os.sep + '*.csv')
# Keep only the last path component, i.e. the file name
[fp.split(os.sep)[-1] for fp in fpaths]

Complete Example

Make Demo Folder and Demo Files

import glob, os

# Make Demo Files and a Demo Folder
target_path = os.path.join(os.getcwd(), 'temp_dump')
if not os.path.exists(target_path):
    os.makedirs(target_path)
print(os.listdir(os.getcwd()))

file_names = ['file_{}.{}'.format(fnum, fext) for fnum in range(5) for fext in ['csv', 'txt', 'log']]

for file_name in file_names:
    fpath = os.path.join(target_path, file_name)
    with open(fpath, 'w') as f:
        f.write(file_name)

print(sorted(os.listdir(target_path)))

Output:

['file_0.csv', 'file_0.log', 'file_0.txt', 
'file_1.csv', 'file_1.log', 'file_1.txt', 
'file_2.csv', 'file_2.log', 'file_2.txt', 
'file_3.csv', 'file_3.log', 'file_3.txt', 
'file_4.csv', 'file_4.log', 'file_4.txt']

Get File Names of `.csv` Files (No path, just name)

fpaths = glob.glob(target_path+os.sep+'*.csv')
[fp.split(os.sep)[-1] for fp in fpaths]

Output

['file_0.csv', 'file_3.csv', 'file_2.csv', 'file_1.csv', 'file_4.csv']
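
An alternative that never produces a path in the first place is to list the directory and filter the names. The sketch below uses `os.listdir` with `fnmatch.filter` instead of `glob`, so it is a swapped-in technique rather than what this answer does. `names_matching` is a hypothetical helper name:

```python
import fnmatch
import os


def names_matching(directory, pattern):
    """Return sorted bare names in `directory` that match the wildcard `pattern`."""
    # os.listdir yields names only, so there is no path to strip afterwards.
    return sorted(fnmatch.filter(os.listdir(directory), pattern))
```

For the demo folder above this would be called as `names_matching(target_path, 'file_*')` or `names_matching(target_path, '*.csv')`. Note that `os.listdir` also returns subdirectory names, so a matching subdirectory would appear in the result.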

score 0 · Answer 3 · answered Oct 04 '19 at 05:20

Since there are only two types of files in your folder, you can glob each type separately.

csv_files = glob.glob( os.path.join('/home/user/Desktop/folder_example/', '*.csv') )
log_files = glob.glob( os.path.join('/home/user/Desktop/folder_example/', '*.log') )
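
If the set of extensions may grow, the two calls can be folded into one loop. Below is a minimal sketch that also returns base names only, as the question asks. `names_by_extension` is a hypothetical helper name, and the extensions are assumed to be passed with their leading dot:

```python
import glob
import os


def names_by_extension(directory, extensions):
    """Return sorted base names of files in `directory` ending in any of `extensions`."""
    names = []
    for ext in extensions:
        # One glob call per extension, e.g. '*.csv' then '*.log'.
        for path in glob.glob(os.path.join(directory, '*' + ext)):
            names.append(os.path.basename(path))
    return sorted(names)
```

A call would look like `names_by_extension('/home/user/Desktop/folder_example/', ['.csv', '.log'])`.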

score 0 · Answer 4 · answered Jan 16 '21 at 17:57

You can use the pathlib library (available since Python 3.4). `Path.glob()` returns a generator that you can iterate over, and each item's `name` attribute is the file name without the path.

from pathlib import Path

path_generator = Path('/home/user/folder1/folder2').glob('*.csv')
[p.name for p in path_generator]

Output:

['file_0.csv', 
 'file_1.csv', 
 'file_2.csv', 
 'file_3.csv', 
 'file_4.csv']
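
Keeping the `Path` objects around gives both forms at once: `str(p)` for the local path and `p.name` for the bare name. Below is a small sketch for the `file_` prefix case from the question. `files_with_prefix` is a hypothetical helper name:

```python
from pathlib import Path


def files_with_prefix(directory, prefix):
    """Return sorted Path objects for regular files whose name starts with `prefix`."""
    # Path.glob yields results in arbitrary order, so sort for stable output.
    return sorted(p for p in Path(directory).glob(prefix + '*') if p.is_file())
```

A list of names is then `[p.name for p in files_with_prefix('/home/user/folder1/folder2', 'file_')]`.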
