I can't save the cleaned df to target directory

Question

I am trying to remove duplicates from large CSV files and save the cleaned files to a different directory. I ran the code below, but it saved the results in the root directory, overwriting the originals. I know that if I switch to `inplace=False` it won't overwrite the files in the root directory, but it doesn't write them to the target directory either, so that doesn't help.

Please advise and thank you! :)

import os
import pandas as pd
from glob import glob
import csv
from pathlib import Path

root = Path(r'C:\my root directory') 
target = Path(r'C:\my root directory\target')
file_list = root.glob("*.csv")

desired_columns = ['ZIP', 'COUNTY', 'COUNTYID']

for csv_file in file_list:
    df = pd.read_csv(csv_file)
    df.drop_duplicates(subset=desired_columns, keep="first", inplace=True)
    df.to_csv(os.path.join(target,csv_file))

Example:

ZIP     COUNTYID   COUNTY
32609   1          ALACHUA
32609   1          ALACHUA
32666   1          ALACHUA
32694   1          ALACHUA
32694   1          ALACHUA
32694   1          ALACHUA
32666   1          ALACHUA
32666   1          ALACHUA
32694   1          ALACHUA

If you're using `pathlib` why not commit fully by using [`.glob`](https://docs.python.org/3/library/pathlib.html#pathlib.Path.glob) and the `/` [operator](https://docs.python.org/3/library/pathlib.html#operators)? Also, why are you importing `csv`? — ddejohn, Apr 21 '22 at 16:49
Can you please explain in more detail what is happening? I don't see any reason in your code for the behavior you're describing. — ddejohn, Apr 21 '22 at 16:50
Please provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). — ddejohn, Apr 21 '22 at 16:53
@ddejohn I can't commit fully because I'm not sufficiently familiar with glob, but I can walk that path if there's a good solution with glob and the /. As far as importing csv, that's an oops! I copied that code from another task and forgot to delete it. — PhD Student FSU, Apr 21 '22 at 16:55
@ddejohn For your second comment (3rd question), I was expecting the final line to write the dataframe as a csv file into the target directory, but it only writes it into the root directory (which overwrites the files I'm reading). I will edit the above code to include the minimal reproducible example. — PhD Student FSU, Apr 21 '22 at 16:58
Can you print `csv_file` during your `for` loop? I wonder if it includes the full path, or only the filename? — ddejohn, Apr 21 '22 at 17:02
@ddejohn yes, when I print the **csv_file** it does show the full path. — PhD Student FSU, Apr 21 '22 at 17:06
Gotcha, so just pull the filename out and you should be good. — ddejohn, Apr 21 '22 at 17:14
You are joining two absolute paths... that just gives you the second absolute path, i.e. the original `csv_file`. You really should try to `print` the results of things to see what is happening — juanpa.arrivillaga, Apr 21 '22 at 17:25

ddejohn · Accepted Answer · 2022-04-21T17:27:47.867

This should work, while also reducing your dependencies:

import pandas as pd
import pathlib

root = pathlib.Path(r"C:\my root directory") 
target = root / "target"
file_list = root.glob("*.csv")

desired_columns = ["ZIP", "COUNTY", "COUNTYID"]
for csv_file in file_list:
    df = pd.read_csv(csv_file)
    df.drop_duplicates(subset=desired_columns, keep="first", inplace=True)
    # csv_file is a full path; .name keeps only the file name, so the file is written into target
    df.to_csv(target / csv_file.name)

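The key change is `csv_file.name`: `root.glob` yields full paths, and joining a full path onto `target` just gives back the original path, whereas joining only the file name writes into `target`. Note also that since `target` is inside your root directory, you can build it from `root` with the `/` operator.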
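To see why the original final line wrote back into the root directory, here is a small sketch of the two joins. It uses `PureWindowsPath` and `ntpath` (the Windows flavour of `os.path`) so that it behaves the same on any OS, and the file name `zips.csv` is a hypothetical stand-in for whatever `root.glob("*.csv")` yields.

```python
import ntpath
from pathlib import PureWindowsPath

# Pure paths never touch the file system, so this runs on any OS.
root = PureWindowsPath(r"C:\my root directory")
target = root / "target"
csv_file = root / "zips.csv"  # hypothetical name; glob yields full paths like this

# Joining onto an absolute path discards everything before it,
# so the "target" part is lost and the result is the source file itself.
broken = ntpath.join(str(target), str(csv_file))
print(broken)  # C:\my root directory\zips.csv

# Joining only the file name keeps the target directory.
fixed = target / csv_file.name
print(fixed)  # C:\my root directory\target\zips.csv
```

The same rule applies to `os.path.join` on Windows and to the `/` operator: an absolute right-hand side replaces the left-hand side, which is why only the file name should be joined.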

edited Apr 21 '22 at 17:27

answered Apr 21 '22 at 17:17

You should probably explain the key change here, i.e. `target` is a relative path. – juanpa.arrivillaga Apr 21 '22 at 17:26
Also note, `root / "target"` should work as well. – juanpa.arrivillaga Apr 21 '22 at 17:26
yep, a remnant of a previous iteration of my answer – ddejohn Apr 21 '22 at 17:28
This code worked, but right before I saw your answer, I had found a different solution. I changed the final line to: `df.to_csv(os.path.join(target,os.path.basename(csv_file)))` – PhD Student FSU Apr 21 '22 at 17:38
Sure, that's fine. The benefit of using my answer is that you only need `pathlib` for all the path operations you use in this script -- i.e., you can drop the `os` and `glob` imports. – ddejohn Apr 21 '22 at 17:39
Ohhh I'm definitely using **yours**!! I only wanted to point out that I had found another solution, but seemingly exactly as you were providing yours! :) – PhD Student FSU Apr 21 '22 at 17:54
