9

There are several questions about string manipulation, but I can't find an answer that lets me do the following. I thought it would be simple...

I have a DataFrame which includes a column containing a filename and path.

The following produces a representative example DataFrame:

df = pd.DataFrame({
    'root': {'1': 'C:\\folder1\\folder2\\folder3\\folder4\\filename.csv'}
})
                                              root
1  C:\folder1\folder2\folder3\folder4\filename.csv

I want to end up with just the 'filename' part of the string. There is a large number of rows and the path is not constant, so I can't use str.replace.

I can strip out the rightmost '.csv' part like this:

df['root'] = df['root'].str.rstrip('.csv') 
                                          root
1  C:\folder1\folder2\folder3\folder4\filename

But I cannot get any of the methods I have read about to remove the path part on the left side of the string.

How can I return just the 'filename' part of this path (string), given that the preceding elements of the path can change from record to record?

Henry Ecker
rdh9
  • Depending on the form of your filenames, DSM's answer is the more robust, but if the presumptions I've made are true I would expect the `str`-based methods to be faster as they are vectorised – EdChum Aug 16 '14 at 21:53
  • Thanks, EdChum, so quick to answer and really helpful. Very difficult to know which answer to select, but I think the robustness you acknowledge in DSM's answer, plus the extra info concerning rstrip tips the scales... Appreciate the help nonetheless. – rdh9 Aug 16 '14 at 22:43

4 Answers

12

You can use the utilities in os.path to make this easier, namely splitext and basename:

>>> import os
>>> df["root"].apply(lambda x: os.path.splitext(os.path.basename(x))[0])
0    filename
Name: root, dtype: object

PS: rstrip doesn't work the way you think it does: it removes any trailing characters that appear in its argument, not that substring. For example:

>>> "a11_vsc.csv".rstrip(".csv")
'a11_'
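
One way to remove a literal suffix instead of a set of characters is to test for it explicitly. The following is a minimal sketch; `strip_suffix` is a hypothetical helper name, not part of pandas or the standard library, and the sample names are made up for illustration. On Python 3.9+ the built-in str.removesuffix should do the same job.

```python
def strip_suffix(text: str, suffix: str) -> str:
    """Remove `suffix` once from the end of `text`, if it is present."""
    if suffix and text.endswith(suffix):
        return text[: -len(suffix)]
    return text


# Illustrative inputs (assumed, not taken from the asker's data).
names = ["a11_vsc.csv", "filename.csv", "notes.txt"]

# Only a trailing ".csv" is removed; other characters are left alone.
stripped = [strip_suffix(name, ".csv") for name in names]
```

Applied to the column, this could look like df["root"].apply(lambda x: strip_suffix(x, ".csv")).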
Henry Ecker
DSM
  • This is working and does exactly what I asked. Thanks very much for the help and also for the extra info ref rstrip – rdh9 Aug 16 '14 at 22:48
3

For recent versions of Python, pathlib is recommended. The filename without its extension can be obtained with .stem, as follows. Since DataFrames usually have many rows, the examples below use pandas .apply.

from pathlib import Path 

df['root'].apply(lambda x: Path(x).stem)
# Out[1]:
# 1    filename
# Name: root, dtype: object

If you want to include the extension, you can get it by applying .name.

df['root'].apply(lambda x: Path(x).name)
# Out[2]:
# 1    filename.csv
# Name: root, dtype: object
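
One caveat: Path uses the path flavour of the machine running the code, so on Linux or macOS the backslashes in these Windows-style strings are not treated as separators. A minimal sketch, assuming the column always holds Windows-style paths, is to use PureWindowsPath, which parses them the same way on any OS. The second path below is a made-up example.

```python
from pathlib import PureWindowsPath

paths = [
    r"C:\folder1\folder2\folder3\folder4\filename.csv",
    r"D:\other\place\report.final.csv",  # hypothetical second row
]

# .stem drops only the last suffix; .name keeps the extension.
stems = [PureWindowsPath(p).stem for p in paths]
names = [PureWindowsPath(p).name for p in paths]
```

On the DataFrame this would be df['root'].apply(lambda x: PureWindowsPath(x).stem).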
Keiku
2

Presuming there is always at least one directory level in the path, we can split on the backslashes, take the last element and then call rstrip on it:

In [9]:

df.root.str.split('\\').str[-1].str.rstrip('.csv')
Out[9]:
1    filename
Name: root, dtype: object

EDIT: in light of what DSM has pointed out about rstrip, you could call split twice instead:

In [11]:

df.root.str.split('\\').str[-1].str.split('.').str[0]
Out[11]:
1    filename
Name: root, dtype: object
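
Splitting on '.' and taking the first piece cuts a name such as archive.2014.csv down to archive. A possible variation, sketched here under the assumption that only the last extension should go, is to split once from the right with str.rsplit. The second and third paths are invented for illustration.

```python
import pandas as pd

s = pd.Series(
    [
        "C:\\folder1\\folder2\\folder3\\folder4\\filename.csv",
        "C:\\data\\archive.2014.csv",  # hypothetical: dots inside the name
        "C:\\data\\README",            # hypothetical: no extension at all
    ]
)

# Split once from the right on the backslash and keep the last piece,
# then split once from the right on the dot and keep the first piece.
stems = (
    s.str.rsplit("\\", n=1).str[-1]
     .str.rsplit(".", n=1).str[0]
)
```

This stays within the vectorised str accessor, so no Python-level lambda is needed.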
EdChum
1

First, there is nothing whatsoever pandas-specific about this; it is basic path handling with os.path.

Second, Windows/DOS has been accepting / as a path separator for at least 10-15 years now, so you can and should write mypath = 'C:/folder1/folder2/folder3/folder4/filename.csv'. As you noticed, using backslashes makes your string-handling life difficult because they have to be escaped, which results in nastier code. Defining os.sep = r'\\' doesn't seem to work.

import os
os.path.basename(r'C:/folder1/folder2/folder3/folder4/filename.csv')
'filename.csv'

Now if you really want to insist on writing OS-specific code in your Python (though there's no reason at all to do this), you can use the little-known platform-specific versions of os.path:

import ntpath  # Windows/DOS-specific versions of os.path
ntpath.basename(r'C:\folder1\folder2\folder3\folder4\filename.csv')
'filename.csv'
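
When the paths come from data instead of being typed in, the same calls can be wrapped in a small function and applied per row. This is a minimal sketch; `file_stem` is a hypothetical helper name and the second path is made up. ntpath accepts both \ and / as separators, so it behaves the same on any OS.

```python
import ntpath  # Windows-flavoured os.path, importable on every platform


def file_stem(path: str) -> str:
    """Return the final path component without its last extension."""
    return ntpath.splitext(ntpath.basename(path))[0]


rows = [
    r"C:\folder1\folder2\folder3\folder4\filename.csv",
    "C:/folder1/folder2/filename.csv",  # hypothetical forward-slash variant
]
stems = [file_stem(p) for p in rows]
```

With a DataFrame, the equivalent call would be df['root'].apply(file_stem).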
smci
  • I wasn't aware of os.path until the previous answer -- I thought it was a pandas issue because I am dealing with a pandas DataFrame. Neither was I aware that the forward slash was acceptable as a separator -- I just never had cause to consider it. I'm not entirely sure I understand the usage of your suggestion, os.path.basename(r'C:/folder1/folder2/folder3/folder4/filename.csv'): the actual paths in my data vary and are the result of reading in a large number of csv files, so I cannot state the path. I'm not sure if that's what you meant. Thanks anyway for the useful info. – rdh9 Aug 16 '14 at 22:52