1

Imagine you have these file paths and want the filename without its extension for each one:

                       relfilepath
0                  20210322636.pdf
12              factuur-f23622.pdf
14                ingram micro.pdf
19    upfront.nl domein - Copy.pdf
21           upfront.nl domein.pdf
Name: relfilepath, dtype: object

I came up with the following, but the first item is turned into a number and comes out as '20210322636.0':

from pathlib import Path


for i, row in dffinalselection.iterrows():
    dffinalselection['xmlfilename'][i] = Path(dffinalselection['relfilepath'][i]).stem
    dffinalselection['xmlfilename'] = dffinalselection['xmlfilename'].astype(str)

This is wrong; it should be '20210322636'.

Please help!

It_is_Chris
Max
  • assuming the file extension is the last three characters . . . `df['relfilepath'].str[:-4]` – It_is_Chris Aug 18 '21 at 19:38
  • Thank you, but sometimes there are JPEGs involved, so that would not work – Max Aug 18 '21 at 19:39
  • Use the `.name` method on `Path` objects. Note, not sure how reasonable it is to put `Path` objects in a `pd.DataFrame`, although, if it is already using `object` dtype I guess it can't hurt. As an aside, you really shouldn't be iterating over a dataframe like this. Use `df.at[ ] = whatever` indexing... – juanpa.arrivillaga Aug 18 '21 at 19:39
  • thank you but I want the file w/o extensions – Max Aug 18 '21 at 19:41
  • Whoops, I meant `Path.stem`, which you are already using. – juanpa.arrivillaga Aug 18 '21 at 19:42

2 Answers

2

If the column values are always filenames or file paths, split each one from the right on `.` with `n=1` (at most one split) and take the first element of the result.

>>> df['relfilepath'].str.rsplit('.', n=1).str[0]

0                  20210322636
12              factuur-f23622
14                ingram micro
19    upfront.nl domein - Copy
21           upfront.nl domein
Name: relfilepath, dtype: object
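As a minimal sketch of how this behaves with mixed extensions, the snippet below builds a small Series and applies the same expression. The `scan.jpeg` entry is a made-up example, not one of the files from the question. Note that a name containing no `.` at all comes back unchanged.

```python
import pandas as pd

# Sample values; "scan.jpeg" is a hypothetical entry added to show a longer extension.
paths = pd.Series(
    ["20210322636.pdf", "upfront.nl domein - Copy.pdf", "scan.jpeg"],
    name="relfilepath",
)

# Split once from the right on "." and keep the part before the extension.
stems = paths.str.rsplit(".", n=1).str[0]

print(stems.tolist())
# ['20210322636', 'upfront.nl domein - Copy', 'scan']
```

Because only the last `.` is used as the split point, the dot inside `upfront.nl` is preserved.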
ThePyGuy
1

You were extracting the stem correctly, but your operation on the dataframe was incorrect.

from pathlib import Path


for i, row in dffinalselection.iterrows():
    dffinalselection['xmlfilename'][i] = Path(dffinalselection['relfilepath'][i]).stem # THIS WILL NOT RELIABLY MUTATE THE DATAFRAME
    dffinalselection['xmlfilename'] = dffinalselection['xmlfilename'].astype(str) # THIS OVERWROTE EVERYTHING

Instead, just do:

from pathlib import Path

dffinalselection['xmlfilename'] = ''
for row in dffinalselection.itertuples():
    dffinalselection.at[row.Index, 'xmlfilename'] = Path(row.relfilepath).stem  # itertuples exposes the row label as .Index

Or,

dffinalselection['xmlfilename'] = dffinalselection['relfilepath'].apply(lambda value: Path(value).stem)
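For a quick check of the column-wise approach, here is a self-contained sketch on a small frame shaped like the one in the question. The frame name `df` and the `.jpeg` entry are assumptions for illustration only.

```python
from pathlib import Path

import pandas as pd

# Hypothetical sample frame with a non-contiguous index, as in the question.
df = pd.DataFrame(
    {"relfilepath": ["20210322636.pdf", "factuur-f23622.pdf", "ingram micro.jpeg"]},
    index=[0, 12, 14],
)

# Path.stem drops only the final suffix, so the extension length does not matter.
df["xmlfilename"] = df["relfilepath"].map(lambda p: Path(p).stem)

print(df["xmlfilename"].tolist())
# ['20210322636', 'factuur-f23622', 'ingram micro']
```

Assigning the whole column in one step avoids chained indexing, and the values stay plain strings, so no `astype(str)` call is needed afterwards.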
juanpa.arrivillaga